From mst at mellanox.co.il Tue Aug 1 01:27:53 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 1 Aug 2006 11:27:53 +0300
Subject: [openib-general] Fwd: issues in ipoib
In-Reply-To: 
References: 
Message-ID: <20060801082752.GR9411@mellanox.co.il>

Quoting Roland Dreier:
> Subject: Re: Fwd: issues in ipoib
> 
> > 1. pkey cache issues
> > http://thread.gmane.org/gmane.linux.drivers.openib/26684/focus=26692
> 
> I thought we fixed the P_Key cache issues by correcting the oversight
> in retrying the P_Key query?
> 
> > 3. ipoib race reported after code review by Eitan Rabin
> > http://openib.org/pipermail/openib-general/2006-June/022916.html
> 
> Yeah, might be a problem I guess.  Does it work to do
> netif_stop_queue() in ipoib_ib_dev_down()?

Hmm.  Since we are lockless, could ipoib_start_xmit run even after we
call netif_stop_queue?  Since interrupts are disabled anyway, can we
just take tx_lock?  How does the following look?

---

Prevent the flush task from freeing the ipoib_neigh pointer while
ipoib_start_xmit is accessing the ipoib_neigh through the pointer it
has loaded from the hardware address.

Signed-off-by: Michael S. Tsirkin

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index cf71d2a..31c4b05 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -336,7 +336,8 @@ void ipoib_flush_paths(struct net_device
 	struct ipoib_path *path, *tp;
 	LIST_HEAD(remove_list);
 
-	spin_lock_irq(&priv->lock);
+	spin_lock_irq(&priv->tx_lock);
+	spin_lock(&priv->lock);
 
 	list_splice(&priv->path_list, &remove_list);
 	INIT_LIST_HEAD(&priv->path_list);
@@ -352,7 +353,8 @@ void ipoib_flush_paths(struct net_device
 		path_free(dev, path);
 		spin_lock_irq(&priv->lock);
 	}
-	spin_unlock_irq(&priv->lock);
+	spin_unlock(&priv->lock);
+	spin_unlock_irq(&priv->tx_lock);
 }
 
 static void path_rec_completion(int status,

-- 
MST

From mst at mellanox.co.il Tue Aug 1 02:19:14 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 1 Aug 2006 12:19:14 +0300
Subject: [openib-general] hotplug support in mthca
Message-ID: <20060801091914.GV9411@mellanox.co.il>

Roland, what happens today if an mthca device is removed while a
userspace application still keeps a reference to it?

I understand uverbs remove_one will get called, but I don't see what
prevents it from exiting while userspace still has open resources.

-- 
MST

From eitan at mellanox.co.il Tue Aug 1 03:28:37 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: 01 Aug 2006 13:28:37 +0300
Subject: [openib-general] [PATCH] libibumad: nit on short mad read
Message-ID: <86psfk4une.fsf@mtl066.yok.mtl.com>

Hi Hal,

This was reported to me by Ishai R.

Consider function umad_recv, line 810:

	if ((n = read(port->dev_fd, umad, sizeof *mad + *length)) <=
	    sizeof *mad + *length) {
		DEBUG("mad received by agent %d length %d", mad->agent_id, n);
		*length = n - sizeof *mad;
		return mad->agent_id;
	}

	if (n == -EWOULDBLOCK) {
		if (!errno)
			errno = EWOULDBLOCK;
		return n;
	}

It seems that umad_recv in umad.c would never go through the second
"if", since if the read returns n < 0 it will be caught by the first
"if".  I have also noticed that a wrap-around of the returned length is
possible.  The patch fixes these issues.
Eitan

Signed-off-by: Eitan Zahavi

Index: libibumad/src/umad.c
===================================================================
--- libibumad/src/umad.c	(revision 8313)
+++ libibumad/src/umad.c	(working copy)
@@ -806,10 +806,13 @@ umad_recv(int portid, void *umad, int *l
 		return n;
 	}
 
-	if ((n = read(port->dev_fd, umad, sizeof *mad + *length)) <=
-	    sizeof *mad + *length) {
+	n = read(port->dev_fd, umad, sizeof *mad + *length);
+	if ((n >= 0) && (n <= sizeof *mad + *length)) {
 		DEBUG("mad received by agent %d length %d", mad->agent_id, n);
+		if (n > sizeof *mad)
 		*length = n - sizeof *mad;
+		else
+			*length = 0;
 		return mad->agent_id;
 	}

From svenar at simula.no Tue Aug 1 04:04:27 2006
From: svenar at simula.no (Sven-Arne Reinemo)
Date: Tue, 01 Aug 2006 13:04:27 +0200
Subject: [openib-general] A few questions about IBMgtSim
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3028D909B@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3028D909B@mtlexch01.mtl.com>
Message-ID: <44CF353B.2000001@simula.no>

Anno Domini 22-07-2006 20:25, Eitan Zahavi wrote:
> Hi Sven,
> 
>>> Currently there is no way to scale simulation time to real time.
>>> The main reason is that the time scale is mixed: * OpenSM
>>> calculation time is about the same (if you run the simulator on
>>> remote node)
>> So this means that the internal operation of OpenSM with the
>> simulator is identical to its operation with real hardware?
> [EZ] Yes, if the algorithmic stage is only computational (like the
> routing stage) the time it takes is the same as on real hardware.  But
> the entire fabric setup involves sending and receiving MADs and thus
> does not scale.
>> I have done some performance tests with IBMgtSim and OpenSM running
>> on separate machines and to me it looks like there is very little
>> concurrency between the two processes.  I.e. it looks like they
>> spend a lot of time waiting for each other.  Below are some results
>> from a few simulation runs; the observed CPU utilization seems
>> quite low.  I would have expected much higher CPU load for
>> IBMgtSim...  Any thoughts on how this matches your experience?
> [EZ] Yes - there is not much concurrency.  Actually it really depends
> on the number of MADs you allow on the wire.  Also, one of the major
> limitations I ran into (which made me split the processes onto 2
> machines) was memory availability for the 10K nodes case.

Is it possible to configure the number of MADs allowed on the wire?

> 
> I do not see what is the drive for doing these comparisons.  BTW: do
> you plan to run the OpenSM tests over the simulator?

Yes.  As a start we would like to experiment with OpenSM and possible
enhancements by using the simulator.  We have compared IBMgtSim with
some of our own simulation tools.  Our tools all do flit-level
simulations so they do not scale very well.  Neither do they support
IBA management traffic.  When doing these tests it was natural to look
for benefits of distributing the process (apart from the memory
requirements).
> 
>> OpenSM
>> #hosts²  #sw  #ports  elapsed¹  kernel¹  user¹  %cpu  mem
>>     288   36      24       585       109     99    35  410
>>     512   48      32       766       144    136    36  520
>>    1152   72      48      1161       218    211    36  741
>> 
>> IBMgtSim
>> #hosts²  #sw  #ports  elapsed¹  kernel¹  user¹  %cpu  mem
>>     288   36      24       586        87    221    52   92
>>     512   48      32       767       109    278    50  102
>>    1152   72      48      1161       169    432    51  132
>> 
>> ¹time in seconds
>> ²organized in a 3-stage Clos
>> 
>> Best regards, Sven-Arne
>> 
>> -- 
>> SAR ---- GnuPG public key - http://home.ifi.uio.no/~svenar/gpg.asc ----
>> "There are only 10 kinds of people in this world; those who know
>> binary and those who don't." -- Unknown

-- 
SAR ---- GnuPG public key - http://home.ifi.uio.no/~svenar/gpg.asc ----
"There are only 10 kinds of people in this world; those who know binary
and those who don't." -- Unknown

From hch at lst.de Tue Aug 1 04:25:44 2006
From: hch at lst.de (Christoph Hellwig)
Date: Tue, 1 Aug 2006 13:25:44 +0200
Subject: [openib-general] [PATCH 0/6] Transport Neutral Verbs Proposal.
In-Reply-To: 
References: <000001c6b40d$c2b7ae70$cdcc180a@amr.corp.intel.com>
	<1154357568.1066.4.camel@stevo-desktop>
	<20060731151538.GA23392@greglaptop.hsd1.ca.comcast.net>
	<1154359451.3078.4.camel@stevo-desktop>
	<20060731162333.GD23392@greglaptop.hsd1.ca.comcast.net>
	<20060731173742.GF1098@greglaptop.internal.keyresearch.com>
Message-ID: <20060801112544.GA20058@lst.de>

On Mon, Jul 31, 2006 at 10:45:39AM -0700, Roland Dreier wrote:
> > That's much better than rdma_, but do you really think the Linux folks
> > are going to be happy about OpenFabrics calls with a prefix that
> > doesn't look anything like "Open Fabrics"?
> 
> I don't think Linux folks care about Open Fabrics at all.
> 
> No other drivers have a brand name and it's pretty silly trying to
> brand IB/iWARP/RDMA/whatever drivers.

Exactly.  Please don't even try to put brand names (especially if
they're as stupid as this) in.  We don't call our wireless stack
centrino just because intel contributed to it either.

From hch at lst.de Tue Aug 1 04:26:29 2006
From: hch at lst.de (Christoph Hellwig)
Date: Tue, 1 Aug 2006 13:26:29 +0200
Subject: [openib-general] [PATCH 0/6] Transport Neutral Verbs Proposal.
In-Reply-To: <20060731182846.GP1098@greglaptop.internal.keyresearch.com>
References: <1154357568.1066.4.camel@stevo-desktop>
	<20060731151538.GA23392@greglaptop.hsd1.ca.comcast.net>
	<1154359451.3078.4.camel@stevo-desktop>
	<20060731162333.GD23392@greglaptop.hsd1.ca.comcast.net>
	<20060731173742.GF1098@greglaptop.internal.keyresearch.com>
	<20060731175857.GJ1098@greglaptop.internal.keyresearch.com>
	<20060731182846.GP1098@greglaptop.internal.keyresearch.com>
Message-ID: <20060801112629.GB20058@lst.de>

On Mon, Jul 31, 2006 at 11:28:46AM -0700, Greg Lindahl wrote:
> On Mon, Jul 31, 2006 at 11:18:16AM -0700, Roland Dreier wrote:
> 
> > My gut reaction is negative.  The whole idea of "verbs" is a bit of
> > technical jargon that makes no sense unless you've lived in the RDMA
> > world for a while,
> 
> Given the way you are defining RDMA, I'm not surprised at the
> conclusion you are coming to.  We have been calling these the
> transport neutral verbs, btw.
> 
> How about ofabric_ ?

No way.  This subsystem is about doing rdma-type operations so call it
something that includes rdma.
From halr at voltaire.com Tue Aug 1 04:27:10 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Aug 2006 07:27:10 -0400
Subject: [openib-general] A few questions about IBMgtSim
In-Reply-To: <44CF353B.2000001@simula.no>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3028D909B@mtlexch01.mtl.com>
	<44CF353B.2000001@simula.no>
Message-ID: <1154431629.17511.105495.camel@hal.voltaire.com>

On Tue, 2006-08-01 at 07:04, Sven-Arne Reinemo wrote:
> Anno Domini 22-07-2006 20:25, Eitan Zahavi wrote:
> > Hi Sven,
> > 
> >>> Currently there is no way to scale simulation time to real time.
> >>> The main reason is that the time scale is mixed: * OpenSM
> >>> calculation time is about the same (if you run the simulator on
> >>> remote node)
> >> So this means that the internal operation of OpenSM with the
> >> simulator is identical to its operation with real hardware?
> > [EZ] Yes, if the algorithmic stage is only computational (like the
> > routing stage) the time it takes is the same as on real hardware.  But
> > the entire fabric setup involves sending and receiving MADs and thus
> > does not scale.
> >> I have done some performance tests with IBMgtSim and OpenSM running
> >> on separate machines and to me it looks like there is very little
> >> concurrency between the two processes.  I.e. it looks like they
> >> spend a lot of time waiting for each other.  Below are some results
> >> from a few simulation runs; the observed CPU utilization seems
> >> quite low.  I would have expected much higher CPU load for
> >> IBMgtSim...  Any thoughts on how this matches your experience?
> > [EZ] Yes - there is not much concurrency.  Actually it really depends
> > on the number of MADs you allow on the wire.  Also, one of the major
> > limitations I ran into (which made me split the processes onto 2
> > machines) was memory availability for the 10K nodes case.
> 
> Is it possible to configure the number of MADs allowed on the wire?

Use the -maxsmps option to OpenSM:

-maxsmps
       This option specifies the number of VL15 SMP MADs allowed on the
       wire at any one time.  Specifying -maxsmps 0 allows unlimited
       outstanding SMPs.  Without -maxsmps, OpenSM defaults to a maximum
       of 4 outstanding SMPs.

or in /var/cache/osm/opensm.opts:

# Number of MADs sent in parallel
max_wire_smps 4

-- Hal

> > 
> > I do not see what is the drive for doing these comparisons.  BTW: do
> > you plan to run the OpenSM tests over the simulator?
> 
> Yes.  As a start we would like to experiment with OpenSM and possible
> enhancements by using the simulator.  We have compared IBMgtSim with
> some of our own simulation tools.  Our tools all do flit-level
> simulations so they do not scale very well.  Neither do they support
> IBA management traffic.  When doing these tests it was natural to look
> for benefits of distributing the process (apart from the memory
> requirements).
> 
> > 
> >> OpenSM
> >> #hosts²  #sw  #ports  elapsed¹  kernel¹  user¹  %cpu  mem
> >>     288   36      24       585       109     99    35  410
> >>     512   48      32       766       144    136    36  520
> >>    1152   72      48      1161       218    211    36  741
> >> 
> >> IBMgtSim
> >> #hosts²  #sw  #ports  elapsed¹  kernel¹  user¹  %cpu  mem
> >>     288   36      24       586        87    221    52   92
> >>     512   48      32       767       109    278    50  102
> >>    1152   72      48      1161       169    432    51  132
> >> 
> >> ¹time in seconds
> >> ²organized in a 3-stage Clos
> >> 
> >> Best regards, Sven-Arne
> >> 
> >> -- 
> >> SAR ---- GnuPG public key - http://home.ifi.uio.no/~svenar/gpg.asc ----
> >> "There are only 10 kinds of people in this world; those who know
> >> binary and those who don't." -- Unknown
> 
> 
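For illustration, the two equivalent forms described above would be used either on the OpenSM command line or in opensm.opts (the value 8 here is arbitrary, purely an example, not a recommendation):

    opensm -maxsmps 8

or

    # Number of MADs sent in parallel
    max_wire_smps 8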
From jackm at mellanox.co.il Tue Aug 1 06:10:02 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Tue, 1 Aug 2006 16:10:02 +0300
Subject: [openib-general] APM: QP migration state change when failover triggered by hw
In-Reply-To: <44CEBA3E.3060208@3leafnetworks.com>
References: <44B55981.6040408@3leafnetworks.com>
	<200607301005.32499.jackm@mellanox.co.il>
	<44CEBA3E.3060208@3leafnetworks.com>
Message-ID: <200608011610.03212.jackm@mellanox.co.il>

On Tuesday 01 August 2006 05:19, Venkatesh Babu wrote:
> Configuration2: Node1 and Node2 connected through two switches, one for
> each port.
> Node1, port1 -> switch1 -> Node2, port1
> Node1, port2 -> switch2 -> Node2, port2
> 
> Node 1:
> 1. Call ib_cm_listen() to wait for connection requests
> 2. When a REQ message arrives, create an RC QP and establish a connection
> 3. Set up callback handlers to receive packets.
> 4. Receive packets, verify them and drop them.
> 5. Event IB_MIG_MIGRATED received
> 6. Stopped receiving packets.
> 
> Node 2:
> 1. Create an RC QP
> 2. Send a REQ message to Node 1 to establish the connection (load both
> primary and alternate paths)
> 3. Continuously send some packets
> 4. Simulate the port failure by unplugging the IB cable
> 5. Event IB_MIG_MIGRATED received
> 
> But with Configuration2, the IB_EVENT_PORT_ERR event occurs on node1,
> and failover to the alternate path doesn't work.  The traffic stops,
> because node1 doesn't know when the IB_EVENT_PORT_ERR event occurred on
> Node2.

We have not seen these problems here.  We have regression tests which
check APM, and they have run without problems.  These tests have scripts
which bring the HCA port down (equivalent to pulling the cable) to check
that the migration occurs automatically.  (You should NOT need to do
ib_modify_qp for the migration to work in the case of a port error.)

Note, though, that these tests use the ibv_verbs layer directly.  We
have not checked out APM over the CM.  There may be a bug here regarding
setting up the alternate path properly when creating the connection
(although this does seem strange, since you indicate that the MIGRATED
event is received on both sides!).

Please send us your test code so that we may reproduce the problem here.

- Jack

From glebn at voltaire.com Tue Aug 1 06:17:56 2006
From: glebn at voltaire.com (glebn at voltaire.com)
Date: Tue, 1 Aug 2006 16:17:56 +0300
Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support
In-Reply-To: 
References: 
Message-ID: <20060801131756.GF4681@minantech.com>

On Mon, Jul 31, 2006 at 11:52:18AM -0700, Roland Dreier wrote:
> Here's an initial cut (based on Gleb Natapov's work) at using
> madvise(MADV_DONTFORK) to support fork() from libibverbs.  The main
> changes from Gleb's earlier work are:
> 
>  - I added code to handle doorbell pages in libmthca.  As far as I can
>    see this is necessary -- my tests don't work without it.  Gleb, did
>    you ever test your changes on memfree HCAs?
> 
Nope. And when I test it now, it doesn't work. Bringing your fix over
helps though :)

>  - I added a new API function, ibv_fork_init(), which must be called
>    before everything else if an app expects to do fork().  I did this
>    because I wanted a way for apps to know if fork() was expected to
>    work or not, and also because the vast majority of apps don't
>    fork() and probably don't want to pay the price of an extra system
>    call plus RB tree operation for every memory registration.
> > - And the bulk of this patch is converting memory.c over to use RB > trees -- I just couldn't bring myself to use an O(N) algorithm at > this stage... > That's excellent! > Comments welcome... > You forgot to include buf.c in the patch. > - R. > > > Index: libibverbs/include/infiniband/driver.h > =================================================================== > --- libibverbs/include/infiniband/driver.h (revision 8791) > +++ libibverbs/include/infiniband/driver.h (working copy) > @@ -135,6 +135,9 @@ int ibv_cmd_destroy_ah(struct ibv_ah *ah > int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); > int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); > > +int ibv_dontfork_range(void *base, size_t size); > +int ibv_dofork_range(void *base, size_t size); > + > /* > * sysfs helper functions > */ > Index: libibverbs/include/infiniband/verbs.h > =================================================================== > --- libibverbs/include/infiniband/verbs.h (revision 8791) > +++ libibverbs/include/infiniband/verbs.h (working copy) > @@ -285,6 +285,8 @@ struct ibv_pd { > struct ibv_mr { > struct ibv_context *context; > struct ibv_pd *pd; > + void *addr; > + size_t length; > uint32_t handle; > uint32_t lkey; > uint32_t rkey; > @@ -1016,6 +1018,14 @@ int ibv_attach_mcast(struct ibv_qp *qp, > */ > int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); > > +/** > + * ibv_fork_init - Prepare data structures so that fork() may be used > + * safely. If this function is not called or returns a non-zero > + * status, then libibverbs data structures are not fork()-safe and the > + * effect of an application calling fork() is undefined. > + */ > +int ibv_fork_init(void); > + > END_C_DECLS > > # undef __attribute_const > Index: libibverbs/ChangeLog > =================================================================== > --- libibverbs/ChangeLog (revision 8791) > +++ libibverbs/ChangeLog (working copy) > @@ -1,3 +1,29 @@ > +2006-07-26 Roland Dreier > + > + * src/verbs.c (ibv_reg_mr, ibv_dereg_mr): Add calls to > + ibv_dontfork_range() and ibv_dofork_range() for memory regions > + registered by library consumers. > + > + * include/infiniband/verbs.h: Add declaration of ibv_fork_init(). > + > + * include/infiniband/driver.h: Add declarations of > + ibv_dontfork_range() and ibv_dofork_range(). > + > + * src/memory.c: Rewrite to use a red-black tree instead of a > + linked list. Change from doing mlock()/munlock() to > + madvise(..., MADV_DONTFORK) and madvise(..., MADV_DOFORK), and > + change the name of the entry points to ibv_dontfork_range() and > + ibv_dofork_range(). Add ibv_fork_init() for applications to > + request fork-safe behavior. > + > + * src/ibverbs.h: Kill off unused declarations. > + > + * src/init.c (ibverbs_init): Get rid of call to ibv_init_mem_map(). > + > + * include/infiniband/verbs.h: Add addr and length field to struct > + ibv_mr so that memory regions can be madvised(). This changes the > + ABI, since the layout of struct ibv_mr is changed. 
> + > 2006-07-04 Roland Dreier > > * include/infiniband/arch.h: Fix typo in sparc mb() > Index: libibverbs/src/libibverbs.map > =================================================================== > --- libibverbs/src/libibverbs.map (revision 8791) > +++ libibverbs/src/libibverbs.map (working copy) > @@ -74,6 +74,9 @@ IBVERBS_1.0 { > mult_to_ibv_rate; > ibv_get_sysfs_path; > ibv_read_sysfs_file; > + ibv_fork_init; > + ibv_dontfork_range; > + ibv_dofork_range; > > local: *; > }; > Index: libibverbs/src/ibverbs.h > =================================================================== > --- libibverbs/src/ibverbs.h (revision 8791) > +++ libibverbs/src/ibverbs.h (working copy) > @@ -58,11 +58,7 @@ struct ibv_abi_compat_v2 { > > extern HIDDEN int abi_ver; > > -extern HIDDEN int ibverbs_init(struct ibv_device ***list); > - > -extern HIDDEN int ibv_init_mem_map(void); > -extern HIDDEN int ibv_lock_range(void *base, size_t size); > -extern HIDDEN int ibv_unlock_range(void *base, size_t size); > +HIDDEN int ibverbs_init(struct ibv_device ***list); > > #define IBV_INIT_CMD(cmd, size, opcode) \ > do { \ > Index: libibverbs/src/verbs.c > =================================================================== > --- libibverbs/src/verbs.c (revision 8791) > +++ libibverbs/src/verbs.c (working copy) > @@ -155,18 +155,32 @@ struct ibv_mr *ibv_reg_mr(struct ibv_pd > { > struct ibv_mr *mr; > > + if (ibv_dontfork_range(addr, length)) > + return NULL; > + > mr = pd->context->ops.reg_mr(pd, addr, length, access); > if (mr) { > mr->context = pd->context; > mr->pd = pd; > - } > + mr->addr = addr; > + mr->length = length; > + } else > + ibv_dofork_range(addr, length); > > return mr; > } > > int ibv_dereg_mr(struct ibv_mr *mr) > { > - return mr->context->ops.dereg_mr(mr); > + int ret; > + void *addr = mr->addr; > + size_t length = mr->length; > + > + ret = mr->context->ops.dereg_mr(mr); > + if (!ret) > + ibv_dofork_range(addr, length); > + > + return ret; > } > > static struct ibv_comp_channel *ibv_create_comp_channel_v2(struct ibv_context *context) > Index: libibverbs/src/init.c > =================================================================== > --- libibverbs/src/init.c (revision 8791) > +++ libibverbs/src/init.c (working copy) > @@ -205,9 +205,6 @@ HIDDEN int ibverbs_init(struct ibv_devic > > *list = NULL; > > - if (ibv_init_mem_map()) > - return 0; > - > find_drivers(default_path); > > /* > Index: libibverbs/src/memory.c > =================================================================== > --- libibverbs/src/memory.c (revision 8791) > +++ libibverbs/src/memory.c (working copy) > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. > + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -36,6 +37,7 @@ > # include > #endif /* HAVE_CONFIG_H */ > > +#include > #include > #include > #include > @@ -44,114 +46,424 @@ > #include "ibverbs.h" > > /* > - * We keep a linked list of page ranges that have been locked along with a > - * reference count to manage overlapping registrations, etc. > - * > - * Eventually we should turn this into an RB-tree or something similar > - * to avoid the O(n) cost of registering/unregistering memory. > + * Most distro's headers don't have these yet. 
> */ > +#ifndef MADV_DONTFORK > +#define MADV_DONTFORK 10 > +#endif > + > +#ifndef MADV_DOFORK > +#define MADV_DOFORK 11 > +#endif > > struct ibv_mem_node { > - struct ibv_mem_node *prev, *next; > - uintptr_t start, end; > - int refcnt; > + enum { > + IBV_RED, > + IBV_BLACK > + } color; > + struct ibv_mem_node *parent; > + struct ibv_mem_node *left, *right; > + uintptr_t start, end; > + int refcnt; > }; > > -static struct { > - struct ibv_mem_node *first; > - pthread_mutex_t mutex; > - uintptr_t page_size; > -} mem_map; > +static struct ibv_mem_node *mm_root; > +static pthread_mutex_t mm_mutex = PTHREAD_MUTEX_INITIALIZER; > +static int page_size; > +static int too_late; > > -int ibv_init_mem_map(void) > +int ibv_fork_init(void) > { > - struct ibv_mem_node *node = NULL; > - > - node = malloc(sizeof *node); > - if (!node) > - goto fail; > - > - node->prev = node->next = NULL; > - node->start = 0; > - node->end = UINTPTR_MAX; > - node->refcnt = 0; > + void *tmp; > > - mem_map.first = node; > + if (mm_root) > + return 0; > > - mem_map.page_size = sysconf(_SC_PAGESIZE); > - if (mem_map.page_size < 0) > - goto fail; > + if (too_late) > + return EINVAL; > > - if (pthread_mutex_init(&mem_map.mutex, NULL)) > - goto fail; > + page_size = sysconf(_SC_PAGESIZE); > + if (page_size < 0) > + return errno; > + > + if (posix_memalign(&tmp, page_size, page_size)) > + return ENOMEM; > + > + if (madvise(tmp, page_size, MADV_DONTFORK) || > + madvise(tmp, page_size, MADV_DOFORK)) > + return ENOSYS; > + > + free(tmp); > + > + mm_root = malloc(sizeof *mm_root); > + if (!mm_root) > + return ENOMEM; > + > + mm_root->parent = NULL; > + mm_root->left = NULL; > + mm_root->right = NULL; > + mm_root->color = IBV_BLACK; > + mm_root->start = 0; > + mm_root->end = UINTPTR_MAX; > + mm_root->refcnt = 0; > > return 0; > +} > > -fail: > - if (node) > - free(node); > +static struct ibv_mem_node *__mm_prev(struct ibv_mem_node *node) > +{ > + if (node->left) { > + node = node->left; > + while (node->right) > + node = node->right; > + } else { > + while (node->parent && node == node->parent->left) > + node = node->parent; > > - return -1; > + node = node->parent; > + } > + > + return node; > } > > -static struct ibv_mem_node *__mm_find_first(uintptr_t start, uintptr_t end) > +static struct ibv_mem_node *__mm_next(struct ibv_mem_node *node) > { > - struct ibv_mem_node *node = mem_map.first; > + if (node->right) { > + node = node->right; > + while (node->left) > + node = node->left; > + } else { > + while (node->parent && node == node->parent->right) > + node = node->parent; > > - while (node) { > - if ((node->start <= start && node->end >= start) || > - (node->start <= end && node->end >= end)) > - break; > - node = node->next; > + node = node->parent; > } > > return node; > } > > -static struct ibv_mem_node *__mm_prev(struct ibv_mem_node *node) > +static void __mm_rotate_right(struct ibv_mem_node *node) > { > - return node->prev; > + struct ibv_mem_node *tmp; > + > + tmp = node->left; > + > + node->left = tmp->right; > + if (node->left) > + node->left->parent = node; > + > + if (node->parent) { > + if (node->parent->right == node) > + node->parent->right = tmp; > + else > + node->parent->left = tmp; > + } else > + mm_root = tmp; > + > + tmp->parent = node->parent; > + > + tmp->right = node; > + node->parent = tmp; > } > > -static struct ibv_mem_node *__mm_next(struct ibv_mem_node *node) > +static void __mm_rotate_left(struct ibv_mem_node *node) > +{ > + struct ibv_mem_node *tmp; > + > + tmp = node->right; > + > + node->right = 
tmp->left; > + if (node->right) > + node->right->parent = node; > + > + if (node->parent) { > + if (node->parent->right == node) > + node->parent->right = tmp; > + else > + node->parent->left = tmp; > + } else > + mm_root = tmp; > + > + tmp->parent = node->parent; > + > + tmp->left = node; > + node->parent = tmp; > +} > + > +static int verify(struct ibv_mem_node *node) > +{ > + int hl, hr; > + > + if (!node) > + return 1; > + > + hl = verify(node->left); > + hr = verify(node->left); > + > + if (!hl || !hr) > + return 0; > + if (hl != hr) > + return 0; > + > + if (node->color == IBV_RED) { > + if (node->left && node->left->color != IBV_BLACK) > + return 0; > + if (node->right && node->right->color != IBV_BLACK) > + return 0; > + return hl; > + } > + > + return hl + 1; > +} > + > +static void __mm_add_rebalance(struct ibv_mem_node *node) > { > - return node->next; > + struct ibv_mem_node *parent, *gp, *uncle; > + > + while (node->parent && node->parent->color == IBV_RED) { > + parent = node->parent; > + gp = node->parent->parent; > + > + if (parent == gp->left) { > + uncle = gp->right; > + > + if (uncle && uncle->color == IBV_RED) { > + parent->color = IBV_BLACK; > + uncle->color = IBV_BLACK; > + gp->color = IBV_RED; > + > + node = gp; > + } else { > + if (node == parent->right) { > + __mm_rotate_left(parent); > + node = parent; > + parent = node->parent; > + } > + > + parent->color = IBV_BLACK; > + gp->color = IBV_RED; > + > + __mm_rotate_right(gp); > + } > + } else { > + uncle = gp->left; > + > + if (uncle && uncle->color == IBV_RED) { > + parent->color = IBV_BLACK; > + uncle->color = IBV_BLACK; > + gp->color = IBV_RED; > + > + node = gp; > + } else { > + if (node == parent->left) { > + __mm_rotate_right(parent); > + node = parent; > + parent = node->parent; > + } > + > + parent->color = IBV_BLACK; > + gp->color = IBV_RED; > + > + __mm_rotate_left(gp); > + } > + } > + } > + > + mm_root->color = IBV_BLACK; > } > > -static void __mm_add(struct ibv_mem_node *node, > - struct ibv_mem_node *new) > +static void __mm_add(struct ibv_mem_node *new) > { > - new->prev = node; > - new->next = node->next; > - node->next = new; > - if (new->next) > - new->next->prev = new; > + struct ibv_mem_node *node, *parent = NULL; > + > + node = mm_root; > + while (node) { > + parent = node; > + if (node->start < new->start) > + node = node->right; > + else > + node = node->left; > + } > + > + if (parent->start < new->start) > + parent->right = new; > + else > + parent->left = new; > + > + new->parent = parent; > + new->left = NULL; > + new->right = NULL; > + > + new->color = IBV_RED; > + __mm_add_rebalance(new); > } > > static void __mm_remove(struct ibv_mem_node *node) > { > - /* Never have to remove the first node, so we can use prev */ > - node->prev->next = node->next; > - if (node->next) > - node->next->prev = node->prev; > + struct ibv_mem_node *child, *parent, *sib, *tmp; > + int nodecol; > + > + if (node->left && node->right) { > + tmp = node->left; > + while (tmp->right) > + tmp = tmp->right; > + > + nodecol = tmp->color; > + child = tmp->left; > + tmp->color = node->color; > + > + if (tmp->parent != node) { > + parent = tmp->parent; > + parent->right = tmp->left; > + if (tmp->left) > + tmp->left->parent = parent; > + > + tmp->left = node->left; > + node->left->parent = tmp; > + } else > + parent = tmp; > + > + tmp->right = node->right; > + node->right->parent = tmp; > + > + tmp->parent = node->parent; > + if (node->parent) { > + if (node->parent->left == node) > + node->parent->left = tmp; > + else > + 
node->parent->right = tmp; > + } else > + mm_root = tmp; > + } else { > + nodecol = node->color; > + > + child = node->left ? node->left : node->right; > + parent = node->parent; > + > + if (child) > + child->parent = parent; > + if (parent) { > + if (parent->left == node) > + parent->left = child; > + else > + parent->right = child; > + } else > + mm_root = child; > + } > + > + free(node); > + > + if (nodecol == IBV_RED) > + return; > + > + while ((!child || child->color == IBV_BLACK) && child != mm_root) { > + if (parent->left == child) { > + sib = parent->right; > + > + if (sib->color == IBV_RED) { > + parent->color = IBV_RED; > + sib->color = IBV_BLACK; > + __mm_rotate_left(parent); > + sib = parent->right; > + } > + > + if ((!sib->left || sib->left->color == IBV_BLACK) && > + (!sib->right || sib->right->color == IBV_BLACK)) { > + sib->color = IBV_RED; > + child = parent; > + parent = child->parent; > + } else { > + if (!sib->right || sib->right->color == IBV_BLACK) { > + if (sib->left) > + sib->left->color = IBV_BLACK; > + sib->color = IBV_RED; > + __mm_rotate_right(sib); > + sib = parent->right; > + } > + > + sib->color = parent->color; > + parent->color = IBV_BLACK; > + if (sib->right) > + sib->right->color = IBV_BLACK; > + __mm_rotate_left(parent); > + child = mm_root; > + break; > + } > + } else { > + sib = parent->left; > + > + if (sib->color == IBV_RED) { > + parent->color = IBV_RED; > + sib->color = IBV_BLACK; > + __mm_rotate_right(parent); > + sib = parent->left; > + } > + > + if ((!sib->left || sib->left->color == IBV_BLACK) && > + (!sib->right || sib->right->color == IBV_BLACK)) { > + sib->color = IBV_RED; > + child = parent; > + parent = child->parent; > + } else { > + if (!sib->left || sib->left->color == IBV_BLACK) { > + if (sib->right) > + sib->right->color = IBV_BLACK; > + sib->color = IBV_RED; > + __mm_rotate_left(sib); > + sib = parent->left; > + } > + > + sib->color = parent->color; > + parent->color = IBV_BLACK; > + if (sib->left) > + sib->left->color = IBV_BLACK; > + __mm_rotate_right(parent); > + child = mm_root; > + break; > + } > + } > + } > + > + if (child) > + child->color = IBV_BLACK; > +} > + > +static struct ibv_mem_node *__mm_find_start(uintptr_t start, uintptr_t end) > +{ > + struct ibv_mem_node *node = mm_root; > + > + while (node) { > + if (node->start <= start && node->end >= start) > + break; > + > + if (node->start < start) > + node = node->right; > + else > + node = node->left; > + } > + > + return node; > } > > -int ibv_lock_range(void *base, size_t size) > +static int ibv_madvise_range(void *base, size_t size, int advice) > { > uintptr_t start, end; > struct ibv_mem_node *node, *tmp; > + int inc; > int ret = 0; > > if (!size) > return 0; > > - start = (uintptr_t) base & ~(mem_map.page_size - 1); > - end = ((uintptr_t) (base + size + mem_map.page_size - 1) & > - ~(mem_map.page_size - 1)) - 1; > + inc = advice == MADV_DONTFORK ? 
1 : -1; > + > + start = (uintptr_t) base & ~(page_size - 1); > + end = ((uintptr_t) (base + size + page_size - 1) & > + ~(page_size - 1)) - 1; > > - pthread_mutex_lock(&mem_map.mutex); > + pthread_mutex_lock(&mm_mutex); > > - node = __mm_find_first(start, end); > + node = __mm_find_start(start, end); > > if (node->start < start) { > tmp = malloc(sizeof *tmp); > @@ -165,11 +477,19 @@ int ibv_lock_range(void *base, size_t si > tmp->refcnt = node->refcnt; > node->end = start - 1; > > - __mm_add(node, tmp); > + __mm_add(tmp); > node = tmp; > + } else { > + tmp = __mm_prev(node); > + if (tmp && tmp->refcnt == node->refcnt + inc) { > + tmp->end = node->end; > + tmp->refcnt = node->refcnt; > + __mm_remove(node); > + node = tmp; > + } > } > > - while (node->start <= end) { > + while (node && node->start <= end) { > if (node->end > end) { > tmp = malloc(sizeof *tmp); > if (!tmp) { > @@ -182,13 +502,16 @@ int ibv_lock_range(void *base, size_t si > tmp->refcnt = node->refcnt; > node->end = end; > > - __mm_add(node, tmp); > + __mm_add(tmp); > } > > + node->refcnt += inc; > > - if (node->refcnt++ == 0) { > - ret = mlock((void *) node->start, > - node->end - node->start + 1); > + if ((inc == -1 && node->refcnt == 0) || > + (inc == 1 && node->refcnt == 1)) { > + ret = madvise((void *) node->start, > + node->end - node->start + 1, > + advice); > if (ret) > goto out; > } > @@ -196,63 +519,36 @@ int ibv_lock_range(void *base, size_t si > node = __mm_next(node); > } > > + if (node) { > + tmp = __mm_prev(node); > + if (tmp && node->refcnt == tmp->refcnt) { > + tmp->end = node->end; > + __mm_remove(node); > + } > + } > + > out: > - pthread_mutex_unlock(&mem_map.mutex); > + pthread_mutex_unlock(&mm_mutex); > > return ret; > } > > -int ibv_unlock_range(void *base, size_t size) > +int ibv_dontfork_range(void *base, size_t size) > { > - uintptr_t start, end; > - struct ibv_mem_node *node, *tmp; > - int ret = 0; > - > - if (!size) > + if (mm_root) > + return ibv_madvise_range(base, size, MADV_DONTFORK); > + else { > + too_late = 1; > return 0; > - > - start = (uintptr_t) base & ~(mem_map.page_size - 1); > - end = ((uintptr_t) (base + size + mem_map.page_size - 1) & > - ~(mem_map.page_size - 1)) - 1; > - > - pthread_mutex_lock(&mem_map.mutex); > - > - node = __mm_find_first(start, end); > - > - if (node->start != start) { > - ret = -1; > - goto out; > - } > - > - while (node && node->end <= end) { > - if (--node->refcnt == 0) { > - ret = munlock((void *) node->start, > - node->end - node->start + 1); > - } > - > - if (__mm_prev(node) && node->refcnt == __mm_prev(node)->refcnt) { > - __mm_prev(node)->end = node->end; > - tmp = __mm_prev(node); > - __mm_remove(node); > - node = tmp; > - } > - > - node = __mm_next(node); > - } > - > - if (node && node->refcnt == __mm_prev(node)->refcnt) { > - __mm_prev(node)->end = node->end; > - tmp = __mm_prev(node); > - __mm_remove(node); > } > +} > > - if (node->end != end) { > - ret = -1; > - goto out; > +int ibv_dofork_range(void *base, size_t size) > +{ > + if (mm_root) > + return ibv_madvise_range(base, size, MADV_DOFORK); > + else { > + too_late = 1; > + return 0; > } > - > -out: > - pthread_mutex_unlock(&mem_map.mutex); > - > - return ret; > } > Index: libmthca/configure.in > =================================================================== > --- libmthca/configure.in (revision 8791) > +++ libmthca/configure.in (working copy) > @@ -26,7 +26,7 @@ AC_C_CONST > AC_CHECK_SIZEOF(long) > > dnl Checks for library functions > -AC_CHECK_FUNCS(ibv_read_sysfs_file) > 
+AC_CHECK_FUNCS(ibv_read_sysfs_file ibv_dontfork_range ibv_dofork_range) > > AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, > if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then > Index: libmthca/src/memfree.c > =================================================================== > --- libmthca/src/memfree.c (revision 8791) > +++ libmthca/src/memfree.c (working copy) > @@ -46,8 +46,8 @@ > #define MTHCA_FREE_MAP_SIZE (MTHCA_DB_REC_PER_PAGE / (SIZEOF_LONG * 8)) > > struct mthca_db_page { > - unsigned long free[MTHCA_FREE_MAP_SIZE]; > - uint64_t *db_rec; > + unsigned long free[MTHCA_FREE_MAP_SIZE]; > + struct mthca_buf db_rec; > }; > > struct mthca_db_table { > @@ -91,7 +91,7 @@ int mthca_alloc_db(struct mthca_db_table > } > > for (i = start; i != end; i += dir) > - if (db_tab->page[i].db_rec) > + if (db_tab->page[i].db_rec.buf) > for (j = 0; j < MTHCA_FREE_MAP_SIZE; ++j) > if (db_tab->page[i].free[j]) > goto found; > @@ -101,18 +101,14 @@ int mthca_alloc_db(struct mthca_db_table > goto out; > } > > - { > - void *tmp; > - > - if (posix_memalign(&tmp, MTHCA_DB_REC_PAGE_SIZE, > - MTHCA_DB_REC_PAGE_SIZE)) { > - ret = -1; > - goto out; > - } > - db_tab->page[i].db_rec = tmp; > + if (mthca_alloc_buf(&db_tab->page[i].db_rec, > + MTHCA_DB_REC_PAGE_SIZE, > + MTHCA_DB_REC_PAGE_SIZE)) { > + ret = -1; > + goto out; > } > > - memset(db_tab->page[i].db_rec, 0, MTHCA_DB_REC_PAGE_SIZE); > + memset(db_tab->page[i].db_rec.buf, 0, MTHCA_DB_REC_PAGE_SIZE); > memset(db_tab->page[i].free, 0xff, sizeof db_tab->page[i].free); > > if (group == 0) > @@ -140,7 +136,7 @@ found: > j = MTHCA_DB_REC_PER_PAGE - 1 - j; > > ret = i * MTHCA_DB_REC_PER_PAGE + j; > - *db = (uint32_t *) &db_tab->page[i].db_rec[j]; > + *db = db_tab->page[i].db_rec.buf + j * 8; > > out: > pthread_mutex_unlock(&db_tab->mutex); > @@ -163,7 +159,7 @@ void mthca_free_db(struct mthca_db_table > page = db_tab->page + i; > > pthread_mutex_lock(&db_tab->mutex); > - page->db_rec[j] = 0; > + *(uint64_t *) (page->db_rec.buf + j * 8) = 0; > > if (i >= db_tab->min_group2) > j = MTHCA_DB_REC_PER_PAGE - 1 - j; > @@ -190,7 +186,7 @@ struct mthca_db_table *mthca_alloc_db_ta > db_tab->min_group2 = npages - 1; > > for (i = 0; i < npages; ++i) > - db_tab->page[i].db_rec = NULL; > + db_tab->page[i].db_rec.buf = NULL; > > return db_tab; > } > @@ -203,8 +199,8 @@ void mthca_free_db_tab(struct mthca_db_t > return; > > for (i = 0; i < db_tab->npages; ++i) > - if (db_tab->page[i].db_rec) > - free(db_tab->page[i].db_rec); > + if (db_tab->page[i].db_rec.buf) > + mthca_free_buf(&db_tab->page[i].db_rec); > > free(db_tab); > } > Index: libmthca/src/qp.c > =================================================================== > --- libmthca/src/qp.c (revision 8791) > +++ libmthca/src/qp.c (working copy) > @@ -58,12 +58,12 @@ static const uint8_t mthca_opcode[] = { > > static void *get_recv_wqe(struct mthca_qp *qp, int n) > { > - return qp->buf + (n << qp->rq.wqe_shift); > + return qp->buf.buf + (n << qp->rq.wqe_shift); > } > > static void *get_send_wqe(struct mthca_qp *qp, int n) > { > - return qp->buf + qp->send_wqe_offset + (n << qp->sq.wqe_shift); > + return qp->buf.buf + qp->send_wqe_offset + (n << qp->sq.wqe_shift); > } > > void mthca_init_qp_indices(struct mthca_qp *qp) > @@ -821,13 +821,14 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd > > qp->buf_size = qp->send_wqe_offset + (qp->sq.max << qp->sq.wqe_shift); > > - if (posix_memalign(&qp->buf, to_mdev(pd->context->device)->page_size, > - align(qp->buf_size, 
to_mdev(pd->context->device)->page_size))) { > + if (mthca_alloc_buf(&qp->buf, > + align(qp->buf_size, to_mdev(pd->context->device)->page_size), > + to_mdev(pd->context->device)->page_size)) { > free(qp->wrid); > return -1; > } > > - memset(qp->buf, 0, qp->buf_size); > + memset(qp->buf.buf, 0, qp->buf_size); > > if (mthca_is_memfree(pd->context)) { > struct mthca_next_seg *next; > Index: libmthca/src/verbs.c > =================================================================== > --- libmthca/src/verbs.c (revision 8791) > +++ libmthca/src/verbs.c (working copy) > @@ -188,11 +188,10 @@ struct ibv_cq *mthca_create_cq(struct ib > goto err; > > cqe = align_cq_size(cqe); > - cq->buf = mthca_alloc_cq_buf(to_mdev(context->device), cqe); > - if (!cq->buf) > + if (mthca_alloc_cq_buf(to_mdev(context->device), &cq->buf, cqe)) > goto err; > > - cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf, > + cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf, > cqe * MTHCA_CQ_ENTRY_SIZE, > 0, IBV_ACCESS_LOCAL_WRITE); > if (!cq->mr) > @@ -251,7 +250,7 @@ err_unreg: > mthca_dereg_mr(cq->mr); > > err_buf: > - free(cq->buf); > + mthca_free_buf(&cq->buf); > > err: > free(cq); > @@ -264,7 +263,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, > struct mthca_cq *cq = to_mcq(ibcq); > struct mthca_resize_cq cmd; > struct ibv_mr *mr; > - void *buf; > + struct mthca_buf buf; > int old_cqe; > int ret; > > @@ -280,17 +279,15 @@ int mthca_resize_cq(struct ibv_cq *ibcq, > goto out; > } > > - buf = mthca_alloc_cq_buf(to_mdev(ibcq->context->device), cqe); > - if (!buf) { > - ret = ENOMEM; > + ret = mthca_alloc_cq_buf(to_mdev(ibcq->context->device), &buf, cqe); > + if (ret) > goto out; > - } > > - mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf, > + mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf, > cqe * MTHCA_CQ_ENTRY_SIZE, > 0, IBV_ACCESS_LOCAL_WRITE); > if (!mr) { > - free(buf); > + mthca_free_buf(&buf); > ret = ENOMEM; > goto out; > } > @@ -303,14 +300,14 @@ int mthca_resize_cq(struct ibv_cq *ibcq, > ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd); > if (ret) { > mthca_dereg_mr(mr); > - free(buf); > + mthca_free_buf(&buf); > goto out; > } > > - mthca_cq_resize_copy_cqes(cq, buf, old_cqe); > + mthca_cq_resize_copy_cqes(cq, buf.buf, old_cqe); > > mthca_dereg_mr(cq->mr); > - free(cq->buf); > + mthca_free_buf(&cq->buf); > > cq->buf = buf; > cq->mr = mr; > @@ -336,8 +333,7 @@ int mthca_destroy_cq(struct ibv_cq *cq) > } > > mthca_dereg_mr(to_mcq(cq)->mr); > - > - free(to_mcq(cq)->buf); > + mthca_free_buf(&to_mcq(cq)->buf); > free(to_mcq(cq)); > > return 0; > @@ -389,7 +385,7 @@ struct ibv_srq *mthca_create_srq(struct > if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) > goto err; > > - srq->mr = __mthca_reg_mr(pd, srq->buf, srq->buf_size, 0, 0); > + srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0); > if (!srq->mr) > goto err_free; > > @@ -430,7 +426,7 @@ err_unreg: > > err_free: > free(srq->wrid); > - free(srq->buf); > + mthca_free_buf(&srq->buf); > > err: > free(srq); > @@ -469,7 +465,7 @@ int mthca_destroy_srq(struct ibv_srq *sr > > mthca_dereg_mr(to_msrq(srq)->mr); > > - free(to_msrq(srq)->buf); > + mthca_free_buf(&to_msrq(srq)->buf); > free(to_msrq(srq)->wrid); > free(to_msrq(srq)); > > @@ -507,7 +503,7 @@ struct ibv_qp *mthca_create_qp(struct ib > pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) > goto err_free; > > - qp->mr = __mthca_reg_mr(pd, qp->buf, qp->buf_size, 0, 0); > + qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0); > if (!qp->mr) > goto 
err_free; > > @@ -574,7 +570,7 @@ err_unreg: > > err_free: > free(qp->wrid); > - free(qp->buf); > + mthca_free_buf(&qp->buf); > > err: > free(qp); > @@ -655,8 +651,7 @@ int mthca_destroy_qp(struct ibv_qp *qp) > } > > mthca_dereg_mr(to_mqp(qp)->mr); > - > - free(to_mqp(qp)->buf); > + mthca_free_buf(&to_mqp(qp)->buf); > free(to_mqp(qp)->wrid); > free(to_mqp(qp)); > > Index: libmthca/src/mthca.h > =================================================================== > --- libmthca/src/mthca.h (revision 8791) > +++ libmthca/src/mthca.h (working copy) > @@ -112,6 +112,11 @@ struct mthca_context { > int qp_table_mask; > }; > > +struct mthca_buf { > + void *buf; > + size_t length; > +}; > + > struct mthca_pd { > struct ibv_pd ibv_pd; > struct mthca_ah_page *ah_list; > @@ -121,7 +126,7 @@ struct mthca_pd { > > struct mthca_cq { > struct ibv_cq ibv_cq; > - void *buf; > + struct mthca_buf buf; > pthread_spinlock_t lock; > struct ibv_mr *mr; > uint32_t cqn; > @@ -137,7 +142,7 @@ struct mthca_cq { > > struct mthca_srq { > struct ibv_srq ibv_srq; > - void *buf; > + struct mthca_buf buf; > void *last; > pthread_spinlock_t lock; > struct ibv_mr *mr; > @@ -174,7 +179,7 @@ struct mthca_wq { > > struct mthca_qp { > struct ibv_qp ibv_qp; > - void *buf; > + struct mthca_buf buf; > uint64_t *wrid; > int send_wqe_offset; > int max_inline_data; > @@ -259,6 +264,9 @@ static inline int mthca_is_memfree(struc > return to_mdev(ibctx->device)->hca_type == MTHCA_ARBEL; > } > > +int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size); > +void mthca_free_buf(struct mthca_buf *buf); > + > int mthca_alloc_db(struct mthca_db_table *db_tab, enum mthca_db_type type, > uint32_t **db); > void mthca_set_db_qn(uint32_t *db, enum mthca_db_type type, uint32_t qn); > @@ -290,7 +298,7 @@ void mthca_arbel_cq_event(struct ibv_cq > void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, > struct mthca_srq *srq); > void mthca_cq_resize_copy_cqes(struct mthca_cq *cq, void *buf, int new_cqe); > -void *mthca_alloc_cq_buf(struct mthca_device *dev, int cqe); > +int mthca_alloc_cq_buf(struct mthca_device *dev, struct mthca_buf *buf, int nent); > > struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, > struct ibv_srq_init_attr *attr); > Index: libmthca/src/cq.c > =================================================================== > --- libmthca/src/cq.c (revision 8791) > +++ libmthca/src/cq.c (working copy) > @@ -126,7 +126,7 @@ struct mthca_err_cqe { > > static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) > { > - return cq->buf + entry * MTHCA_CQ_ENTRY_SIZE; > + return cq->buf.buf + entry * MTHCA_CQ_ENTRY_SIZE; > } > > static inline struct mthca_cqe *cqe_sw(struct mthca_cq *cq, int i) > @@ -612,17 +612,16 @@ void mthca_cq_resize_copy_cqes(struct mt > get_cqe(cq, i & old_cqe), MTHCA_CQ_ENTRY_SIZE); > } > > -void *mthca_alloc_cq_buf(struct mthca_device *dev, int nent) > +int mthca_alloc_cq_buf(struct mthca_device *dev, struct mthca_buf *buf, int nent) > { > - void *buf; > int i; > > - if (posix_memalign(&buf, dev->page_size, > - align(nent * MTHCA_CQ_ENTRY_SIZE, dev->page_size))) > - return NULL; > + if (mthca_alloc_buf(buf, align(nent * MTHCA_CQ_ENTRY_SIZE, dev->page_size), > + dev->page_size)) > + return -1; > > for (i = 0; i < nent; ++i) > - ((struct mthca_cqe *) buf)[i].owner = MTHCA_CQ_ENTRY_OWNER_HW; > + ((struct mthca_cqe *) buf->buf)[i].owner = MTHCA_CQ_ENTRY_OWNER_HW; > > - return buf; > + return 0; > } > Index: libmthca/src/srq.c > =================================================================== > --- 
libmthca/src/srq.c (revision 8791) > +++ libmthca/src/srq.c (working copy) > @@ -47,7 +47,7 @@ > > static void *get_wqe(struct mthca_srq *srq, int n) > { > - return srq->buf + (n << srq->wqe_shift); > + return srq->buf.buf + (n << srq->wqe_shift); > } > > /* > @@ -292,13 +292,14 @@ int mthca_alloc_srq_buf(struct ibv_pd *p > > srq->buf_size = srq->max << srq->wqe_shift; > > - if (posix_memalign(&srq->buf, to_mdev(pd->context->device)->page_size, > - align(srq->buf_size, to_mdev(pd->context->device)->page_size))) { > + if (mthca_alloc_buf(&srq->buf, > + align(srq->buf_size, to_mdev(pd->context->device)->page_size), > + to_mdev(pd->context->device)->page_size)) { > free(srq->wrid); > return -1; > } > > - memset(srq->buf, 0, srq->buf_size); > + memset(srq->buf.buf, 0, srq->buf_size); > > /* > * Now initialize the SRQ buffer so that all of the WQEs are > Index: libmthca/src/ah.c > =================================================================== > --- libmthca/src/ah.c (revision 8791) > +++ libmthca/src/ah.c (working copy) > @@ -45,7 +45,7 @@ > > struct mthca_ah_page { > struct mthca_ah_page *prev, *next; > - void *buf; > + struct mthca_buf buf; > struct ibv_mr *mr; > int use_cnt; > unsigned free[0]; > @@ -60,14 +60,14 @@ static struct mthca_ah_page *__add_page( > if (!page) > return NULL; > > - if (posix_memalign(&page->buf, page_size, page_size)) { > + if (mthca_alloc_buf(&page->buf, page_size, page_size)) { > free(page); > return NULL; > } > > - page->mr = mthca_reg_mr(&pd->ibv_pd, page->buf, page_size, 0); > + page->mr = mthca_reg_mr(&pd->ibv_pd, page->buf.buf, page_size, 0); > if (!page->mr) { > - free(page->buf); > + mthca_free_buf(&page->buf); > free(page); > return NULL; > } > @@ -123,7 +123,7 @@ int mthca_alloc_av(struct mthca_pd *pd, > if (page->free[i]) { > j = ffs(page->free[i]); > page->free[i] &= ~(1 << (j - 1)); > - ah->av = page->buf + > + ah->av = page->buf.buf + > (i * 8 * sizeof (int) + (j - 1)) * sizeof *ah->av; > break; > } > @@ -172,7 +172,7 @@ void mthca_free_av(struct mthca_ah *ah) > pthread_mutex_lock(&pd->ah_mutex); > > page = ah->page; > - i = ((void *) ah->av - page->buf) / sizeof *ah->av; > + i = ((void *) ah->av - page->buf.buf) / sizeof *ah->av; > page->free[i / (8 * sizeof (int))] |= 1 << (i % (8 * sizeof (int))); > > if (!--page->use_cnt) { > @@ -184,7 +184,7 @@ void mthca_free_av(struct mthca_ah *ah) > page->next->prev = page->prev; > > mthca_dereg_mr(page->mr); > - free(page->buf); > + mthca_free_buf(&page->buf); > free(page); > } > > Index: libmthca/ChangeLog > =================================================================== > --- libmthca/ChangeLog (revision 8791) > +++ libmthca/ChangeLog (working copy) > @@ -1,3 +1,19 @@ > +2006-07-26 Roland Dreier > + > + * src/mthca.h, src/ah.c, src/cq.c, src/memfree.c, src/qp.c, > + src/srq.c, src/verbs.c: Convert internal allocations for AH pages > + (for non-memfree HCAs), CQ buffers, doorbell pages (for memfree > + HCAs), QP buffers and SRQ buffers to use the new buffer > + allocator. This makes libmthca fork()-clean when built against > + libibverbs 1.1. > + > + * src/buf.c (mthca_alloc_buf, mthca_free_buf): Add new functions > + to wrap up allocating page-aligned buffers. The new functions > + will call ibv_dontfork_range()/ibv_dofork_range() to do proper > + madvise()ing to handle fork(), if applicable. > + > + * configure.in: Check for ibv_dontfork_range() and ibv_dontfork_range(). 
> + > 2006-07-04 Dotan Barak > > * src/verbs.c (mthca_create_cq, mthca_resize_cq): Passing huge > Index: libmthca/Makefile.am > =================================================================== > --- libmthca/Makefile.am (revision 8791) > +++ libmthca/Makefile.am (working copy) > @@ -12,10 +12,9 @@ else > mthca_version_script = > endif > > -src_mthca_la_SOURCES = src/ah.c src/cq.c src/memfree.c src/mthca.c src/qp.c \ > - src/srq.c src/verbs.c > -src_mthca_la_LDFLAGS = -avoid-version -module \ > - $(mthca_version_script) > +src_mthca_la_SOURCES = src/ah.c src/buf.c src/cq.c src/memfree.c src/mthca.c \ > + src/qp.c src/srq.c src/verbs.c > +src_mthca_la_LDFLAGS = -avoid-version -module $(mthca_version_script) > > DEBIAN = debian/changelog debian/compat debian/control debian/copyright \ > debian/libmthca1.install debian/libmthca-dev.install debian/rules > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Gleb. From mohitka at hcl.in Tue Aug 1 06:56:46 2006 From: mohitka at hcl.in (Mohit Katiyar, Noida) Date: Tue, 1 Aug 2006 19:26:46 +0530 Subject: [openib-general] iSER Source and target code Message-ID: Hi all, I was looking at iSER initiator and target code. I noticed that iSER target code available is based on Datamover architecture but the initiator code is not based on Datamover Architecture. Why the iSER initiator code is not based on Datamover Architecture? What are the future plans for the iSER target driver code? With which iSCSI interface the iSER target driver code will be used? Thanks in advance Thanks and Regards Mohit Katiyar Lead Engineer TLS, HCL Noida "To dare is to lose one's footing momentarily. To not dare is to lose oneself." DISCLAIMER: ----------------------------------------------------------------------------------------------------------------------- The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. It shall not attach any liability on the originator or HCL or its affiliates. Any views or opinions presented in this email are solely those of the author and may not necessarily reflect the opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of this message without the prior written consent of the author of this e-mail is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately. Before opening any mail and attachments please check them for viruses and defect. ----------------------------------------------------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Aug 1 07:19:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 01 Aug 2006 07:19:04 -0700 Subject: [openib-general] hotplug support in mthca In-Reply-To: <20060801091914.GV9411@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 1 Aug 2006 12:19:14 +0300") References: <20060801091914.GV9411@mellanox.co.il> Message-ID: Michael> Roland, what happends today if an mthca device is removed Michael> while a userspace applcation still keeps a reference to Michael> it? Something bad. - R. 
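The new ibv_fork_init() entry point added by the fork-support patch above only helps if it is called before any other libibverbs function.  A minimal sketch of a forking consumer, under that assumption, might look like the following (device selection, error handling and the child's exec target are purely illustrative):

/*
 * Sketch only: illustrates the calling order implied by the patch
 * (ibv_fork_init() before any other libibverbs call); not a complete
 * program, error handling omitted.
 */
#include <stdlib.h>
#include <unistd.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **dev_list;
	struct ibv_context *ctx;
	struct ibv_pd *pd;
	struct ibv_mr *mr;
	void *buf;

	/* Must come before ibv_get_device_list()/ibv_open_device(),
	 * otherwise registered ranges are not marked MADV_DONTFORK. */
	if (ibv_fork_init())
		return 1;

	dev_list = ibv_get_device_list(NULL);
	ctx = ibv_open_device(dev_list[0]);
	pd  = ibv_alloc_pd(ctx);

	buf = malloc(4096);
	mr  = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);

	if (fork() == 0) {
		/* Child: the registered pages are not inherited, so it
		 * must not touch buf or any verbs objects; it simply
		 * execs something else. */
		execlp("/bin/true", "true", (char *) NULL);
		_exit(1);
	}

	/* ... parent keeps posting work requests that use mr ... */

	ibv_dereg_mr(mr);
	free(buf);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(dev_list);
	return 0;
}

As implemented in the patch, ibv_fork_init() returns a non-zero errno value (for example ENOSYS when madvise() does not accept MADV_DONTFORK on the running kernel), which is why the sketch checks its return value before doing anything else.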
From rdreier at cisco.com Tue Aug 1 07:21:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 01 Aug 2006 07:21:15 -0700 Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support In-Reply-To: <20060801131756.GF4681@minantech.com> (Gleb Natapov's message of "Tue, 1 Aug 2006 16:17:56 +0300") References: <20060801131756.GF4681@minantech.com> Message-ID: > You forgot to include buf.c in the patch. Oops, forgot to do svn add before generating the diff. Updated diff below: Index: libibverbs/include/infiniband/driver.h =================================================================== --- libibverbs/include/infiniband/driver.h (revision 8793) +++ libibverbs/include/infiniband/driver.h (working copy) @@ -135,6 +135,9 @@ int ibv_cmd_destroy_ah(struct ibv_ah *ah int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +int ibv_dontfork_range(void *base, size_t size); +int ibv_dofork_range(void *base, size_t size); + /* * sysfs helper functions */ Index: libibverbs/include/infiniband/verbs.h =================================================================== --- libibverbs/include/infiniband/verbs.h (revision 8793) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -285,6 +285,8 @@ struct ibv_pd { struct ibv_mr { struct ibv_context *context; struct ibv_pd *pd; + void *addr; + size_t length; uint32_t handle; uint32_t lkey; uint32_t rkey; @@ -1016,6 +1018,14 @@ int ibv_attach_mcast(struct ibv_qp *qp, */ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +/** + * ibv_fork_init - Prepare data structures so that fork() may be used + * safely. If this function is not called or returns a non-zero + * status, then libibverbs data structures are not fork()-safe and the + * effect of an application calling fork() is undefined. + */ +int ibv_fork_init(void); + END_C_DECLS # undef __attribute_const Index: libibverbs/ChangeLog =================================================================== --- libibverbs/ChangeLog (revision 8793) +++ libibverbs/ChangeLog (working copy) @@ -1,3 +1,29 @@ +2006-07-26 Roland Dreier + + * src/verbs.c (ibv_reg_mr, ibv_dereg_mr): Add calls to + ibv_dontfork_range() and ibv_dofork_range() for memory regions + registered by library consumers. + + * include/infiniband/verbs.h: Add declaration of ibv_fork_init(). + + * include/infiniband/driver.h: Add declarations of + ibv_dontfork_range() and ibv_dofork_range(). + + * src/memory.c: Rewrite to use a red-black tree instead of a + linked list. Change from doing mlock()/munlock() to + madvise(..., MADV_DONTFORK) and madvise(..., MADV_DOFORK), and + change the name of the entry points to ibv_dontfork_range() and + ibv_dofork_range(). Add ibv_fork_init() for applications to + request fork-safe behavior. + + * src/ibverbs.h: Kill off unused declarations. + + * src/init.c (ibverbs_init): Get rid of call to ibv_init_mem_map(). + + * include/infiniband/verbs.h: Add addr and length field to struct + ibv_mr so that memory regions can be madvised(). This changes the + ABI, since the layout of struct ibv_mr is changed. 
+ 2006-07-04 Roland Dreier * include/infiniband/arch.h: Fix typo in sparc mb() Index: libibverbs/src/libibverbs.map =================================================================== --- libibverbs/src/libibverbs.map (revision 8793) +++ libibverbs/src/libibverbs.map (working copy) @@ -74,6 +74,9 @@ IBVERBS_1.0 { mult_to_ibv_rate; ibv_get_sysfs_path; ibv_read_sysfs_file; + ibv_fork_init; + ibv_dontfork_range; + ibv_dofork_range; local: *; }; Index: libibverbs/src/ibverbs.h =================================================================== --- libibverbs/src/ibverbs.h (revision 8793) +++ libibverbs/src/ibverbs.h (working copy) @@ -58,11 +58,7 @@ struct ibv_abi_compat_v2 { extern HIDDEN int abi_ver; -extern HIDDEN int ibverbs_init(struct ibv_device ***list); - -extern HIDDEN int ibv_init_mem_map(void); -extern HIDDEN int ibv_lock_range(void *base, size_t size); -extern HIDDEN int ibv_unlock_range(void *base, size_t size); +HIDDEN int ibverbs_init(struct ibv_device ***list); #define IBV_INIT_CMD(cmd, size, opcode) \ do { \ Index: libibverbs/src/verbs.c =================================================================== --- libibverbs/src/verbs.c (revision 8793) +++ libibverbs/src/verbs.c (working copy) @@ -155,18 +155,32 @@ struct ibv_mr *ibv_reg_mr(struct ibv_pd { struct ibv_mr *mr; + if (ibv_dontfork_range(addr, length)) + return NULL; + mr = pd->context->ops.reg_mr(pd, addr, length, access); if (mr) { mr->context = pd->context; mr->pd = pd; - } + mr->addr = addr; + mr->length = length; + } else + ibv_dofork_range(addr, length); return mr; } int ibv_dereg_mr(struct ibv_mr *mr) { - return mr->context->ops.dereg_mr(mr); + int ret; + void *addr = mr->addr; + size_t length = mr->length; + + ret = mr->context->ops.dereg_mr(mr); + if (!ret) + ibv_dofork_range(addr, length); + + return ret; } static struct ibv_comp_channel *ibv_create_comp_channel_v2(struct ibv_context *context) Index: libibverbs/src/init.c =================================================================== --- libibverbs/src/init.c (revision 8793) +++ libibverbs/src/init.c (working copy) @@ -205,9 +205,6 @@ HIDDEN int ibverbs_init(struct ibv_devic *list = NULL; - if (ibv_init_mem_map()) - return 0; - find_drivers(default_path); /* Index: libibverbs/src/memory.c =================================================================== --- libibverbs/src/memory.c (revision 8793) +++ libibverbs/src/memory.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -36,6 +37,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include @@ -44,114 +46,424 @@ #include "ibverbs.h" /* - * We keep a linked list of page ranges that have been locked along with a - * reference count to manage overlapping registrations, etc. - * - * Eventually we should turn this into an RB-tree or something similar - * to avoid the O(n) cost of registering/unregistering memory. + * Most distro's headers don't have these yet. 
*/ +#ifndef MADV_DONTFORK +#define MADV_DONTFORK 10 +#endif + +#ifndef MADV_DOFORK +#define MADV_DOFORK 11 +#endif struct ibv_mem_node { - struct ibv_mem_node *prev, *next; - uintptr_t start, end; - int refcnt; + enum { + IBV_RED, + IBV_BLACK + } color; + struct ibv_mem_node *parent; + struct ibv_mem_node *left, *right; + uintptr_t start, end; + int refcnt; }; -static struct { - struct ibv_mem_node *first; - pthread_mutex_t mutex; - uintptr_t page_size; -} mem_map; +static struct ibv_mem_node *mm_root; +static pthread_mutex_t mm_mutex = PTHREAD_MUTEX_INITIALIZER; +static int page_size; +static int too_late; -int ibv_init_mem_map(void) +int ibv_fork_init(void) { - struct ibv_mem_node *node = NULL; - - node = malloc(sizeof *node); - if (!node) - goto fail; - - node->prev = node->next = NULL; - node->start = 0; - node->end = UINTPTR_MAX; - node->refcnt = 0; + void *tmp; - mem_map.first = node; + if (mm_root) + return 0; - mem_map.page_size = sysconf(_SC_PAGESIZE); - if (mem_map.page_size < 0) - goto fail; + if (too_late) + return EINVAL; - if (pthread_mutex_init(&mem_map.mutex, NULL)) - goto fail; + page_size = sysconf(_SC_PAGESIZE); + if (page_size < 0) + return errno; + + if (posix_memalign(&tmp, page_size, page_size)) + return ENOMEM; + + if (madvise(tmp, page_size, MADV_DONTFORK) || + madvise(tmp, page_size, MADV_DOFORK)) + return ENOSYS; + + free(tmp); + + mm_root = malloc(sizeof *mm_root); + if (!mm_root) + return ENOMEM; + + mm_root->parent = NULL; + mm_root->left = NULL; + mm_root->right = NULL; + mm_root->color = IBV_BLACK; + mm_root->start = 0; + mm_root->end = UINTPTR_MAX; + mm_root->refcnt = 0; return 0; +} -fail: - if (node) - free(node); +static struct ibv_mem_node *__mm_prev(struct ibv_mem_node *node) +{ + if (node->left) { + node = node->left; + while (node->right) + node = node->right; + } else { + while (node->parent && node == node->parent->left) + node = node->parent; - return -1; + node = node->parent; + } + + return node; } -static struct ibv_mem_node *__mm_find_first(uintptr_t start, uintptr_t end) +static struct ibv_mem_node *__mm_next(struct ibv_mem_node *node) { - struct ibv_mem_node *node = mem_map.first; + if (node->right) { + node = node->right; + while (node->left) + node = node->left; + } else { + while (node->parent && node == node->parent->right) + node = node->parent; - while (node) { - if ((node->start <= start && node->end >= start) || - (node->start <= end && node->end >= end)) - break; - node = node->next; + node = node->parent; } return node; } -static struct ibv_mem_node *__mm_prev(struct ibv_mem_node *node) +static void __mm_rotate_right(struct ibv_mem_node *node) { - return node->prev; + struct ibv_mem_node *tmp; + + tmp = node->left; + + node->left = tmp->right; + if (node->left) + node->left->parent = node; + + if (node->parent) { + if (node->parent->right == node) + node->parent->right = tmp; + else + node->parent->left = tmp; + } else + mm_root = tmp; + + tmp->parent = node->parent; + + tmp->right = node; + node->parent = tmp; } -static struct ibv_mem_node *__mm_next(struct ibv_mem_node *node) +static void __mm_rotate_left(struct ibv_mem_node *node) +{ + struct ibv_mem_node *tmp; + + tmp = node->right; + + node->right = tmp->left; + if (node->right) + node->right->parent = node; + + if (node->parent) { + if (node->parent->right == node) + node->parent->right = tmp; + else + node->parent->left = tmp; + } else + mm_root = tmp; + + tmp->parent = node->parent; + + tmp->left = node; + node->parent = tmp; +} + +static int verify(struct ibv_mem_node 
*node) +{ + int hl, hr; + + if (!node) + return 1; + + hl = verify(node->left); + hr = verify(node->left); + + if (!hl || !hr) + return 0; + if (hl != hr) + return 0; + + if (node->color == IBV_RED) { + if (node->left && node->left->color != IBV_BLACK) + return 0; + if (node->right && node->right->color != IBV_BLACK) + return 0; + return hl; + } + + return hl + 1; +} + +static void __mm_add_rebalance(struct ibv_mem_node *node) { - return node->next; + struct ibv_mem_node *parent, *gp, *uncle; + + while (node->parent && node->parent->color == IBV_RED) { + parent = node->parent; + gp = node->parent->parent; + + if (parent == gp->left) { + uncle = gp->right; + + if (uncle && uncle->color == IBV_RED) { + parent->color = IBV_BLACK; + uncle->color = IBV_BLACK; + gp->color = IBV_RED; + + node = gp; + } else { + if (node == parent->right) { + __mm_rotate_left(parent); + node = parent; + parent = node->parent; + } + + parent->color = IBV_BLACK; + gp->color = IBV_RED; + + __mm_rotate_right(gp); + } + } else { + uncle = gp->left; + + if (uncle && uncle->color == IBV_RED) { + parent->color = IBV_BLACK; + uncle->color = IBV_BLACK; + gp->color = IBV_RED; + + node = gp; + } else { + if (node == parent->left) { + __mm_rotate_right(parent); + node = parent; + parent = node->parent; + } + + parent->color = IBV_BLACK; + gp->color = IBV_RED; + + __mm_rotate_left(gp); + } + } + } + + mm_root->color = IBV_BLACK; } -static void __mm_add(struct ibv_mem_node *node, - struct ibv_mem_node *new) +static void __mm_add(struct ibv_mem_node *new) { - new->prev = node; - new->next = node->next; - node->next = new; - if (new->next) - new->next->prev = new; + struct ibv_mem_node *node, *parent = NULL; + + node = mm_root; + while (node) { + parent = node; + if (node->start < new->start) + node = node->right; + else + node = node->left; + } + + if (parent->start < new->start) + parent->right = new; + else + parent->left = new; + + new->parent = parent; + new->left = NULL; + new->right = NULL; + + new->color = IBV_RED; + __mm_add_rebalance(new); } static void __mm_remove(struct ibv_mem_node *node) { - /* Never have to remove the first node, so we can use prev */ - node->prev->next = node->next; - if (node->next) - node->next->prev = node->prev; + struct ibv_mem_node *child, *parent, *sib, *tmp; + int nodecol; + + if (node->left && node->right) { + tmp = node->left; + while (tmp->right) + tmp = tmp->right; + + nodecol = tmp->color; + child = tmp->left; + tmp->color = node->color; + + if (tmp->parent != node) { + parent = tmp->parent; + parent->right = tmp->left; + if (tmp->left) + tmp->left->parent = parent; + + tmp->left = node->left; + node->left->parent = tmp; + } else + parent = tmp; + + tmp->right = node->right; + node->right->parent = tmp; + + tmp->parent = node->parent; + if (node->parent) { + if (node->parent->left == node) + node->parent->left = tmp; + else + node->parent->right = tmp; + } else + mm_root = tmp; + } else { + nodecol = node->color; + + child = node->left ? 
node->left : node->right; + parent = node->parent; + + if (child) + child->parent = parent; + if (parent) { + if (parent->left == node) + parent->left = child; + else + parent->right = child; + } else + mm_root = child; + } + + free(node); + + if (nodecol == IBV_RED) + return; + + while ((!child || child->color == IBV_BLACK) && child != mm_root) { + if (parent->left == child) { + sib = parent->right; + + if (sib->color == IBV_RED) { + parent->color = IBV_RED; + sib->color = IBV_BLACK; + __mm_rotate_left(parent); + sib = parent->right; + } + + if ((!sib->left || sib->left->color == IBV_BLACK) && + (!sib->right || sib->right->color == IBV_BLACK)) { + sib->color = IBV_RED; + child = parent; + parent = child->parent; + } else { + if (!sib->right || sib->right->color == IBV_BLACK) { + if (sib->left) + sib->left->color = IBV_BLACK; + sib->color = IBV_RED; + __mm_rotate_right(sib); + sib = parent->right; + } + + sib->color = parent->color; + parent->color = IBV_BLACK; + if (sib->right) + sib->right->color = IBV_BLACK; + __mm_rotate_left(parent); + child = mm_root; + break; + } + } else { + sib = parent->left; + + if (sib->color == IBV_RED) { + parent->color = IBV_RED; + sib->color = IBV_BLACK; + __mm_rotate_right(parent); + sib = parent->left; + } + + if ((!sib->left || sib->left->color == IBV_BLACK) && + (!sib->right || sib->right->color == IBV_BLACK)) { + sib->color = IBV_RED; + child = parent; + parent = child->parent; + } else { + if (!sib->left || sib->left->color == IBV_BLACK) { + if (sib->right) + sib->right->color = IBV_BLACK; + sib->color = IBV_RED; + __mm_rotate_left(sib); + sib = parent->left; + } + + sib->color = parent->color; + parent->color = IBV_BLACK; + if (sib->left) + sib->left->color = IBV_BLACK; + __mm_rotate_right(parent); + child = mm_root; + break; + } + } + } + + if (child) + child->color = IBV_BLACK; +} + +static struct ibv_mem_node *__mm_find_start(uintptr_t start, uintptr_t end) +{ + struct ibv_mem_node *node = mm_root; + + while (node) { + if (node->start <= start && node->end >= start) + break; + + if (node->start < start) + node = node->right; + else + node = node->left; + } + + return node; } -int ibv_lock_range(void *base, size_t size) +static int ibv_madvise_range(void *base, size_t size, int advice) { uintptr_t start, end; struct ibv_mem_node *node, *tmp; + int inc; int ret = 0; if (!size) return 0; - start = (uintptr_t) base & ~(mem_map.page_size - 1); - end = ((uintptr_t) (base + size + mem_map.page_size - 1) & - ~(mem_map.page_size - 1)) - 1; + inc = advice == MADV_DONTFORK ? 
1 : -1; + + start = (uintptr_t) base & ~(page_size - 1); + end = ((uintptr_t) (base + size + page_size - 1) & + ~(page_size - 1)) - 1; - pthread_mutex_lock(&mem_map.mutex); + pthread_mutex_lock(&mm_mutex); - node = __mm_find_first(start, end); + node = __mm_find_start(start, end); if (node->start < start) { tmp = malloc(sizeof *tmp); @@ -165,11 +477,19 @@ int ibv_lock_range(void *base, size_t si tmp->refcnt = node->refcnt; node->end = start - 1; - __mm_add(node, tmp); + __mm_add(tmp); node = tmp; + } else { + tmp = __mm_prev(node); + if (tmp && tmp->refcnt == node->refcnt + inc) { + tmp->end = node->end; + tmp->refcnt = node->refcnt; + __mm_remove(node); + node = tmp; + } } - while (node->start <= end) { + while (node && node->start <= end) { if (node->end > end) { tmp = malloc(sizeof *tmp); if (!tmp) { @@ -182,13 +502,16 @@ int ibv_lock_range(void *base, size_t si tmp->refcnt = node->refcnt; node->end = end; - __mm_add(node, tmp); + __mm_add(tmp); } + node->refcnt += inc; - if (node->refcnt++ == 0) { - ret = mlock((void *) node->start, - node->end - node->start + 1); + if ((inc == -1 && node->refcnt == 0) || + (inc == 1 && node->refcnt == 1)) { + ret = madvise((void *) node->start, + node->end - node->start + 1, + advice); if (ret) goto out; } @@ -196,63 +519,36 @@ int ibv_lock_range(void *base, size_t si node = __mm_next(node); } + if (node) { + tmp = __mm_prev(node); + if (tmp && node->refcnt == tmp->refcnt) { + tmp->end = node->end; + __mm_remove(node); + } + } + out: - pthread_mutex_unlock(&mem_map.mutex); + pthread_mutex_unlock(&mm_mutex); return ret; } -int ibv_unlock_range(void *base, size_t size) +int ibv_dontfork_range(void *base, size_t size) { - uintptr_t start, end; - struct ibv_mem_node *node, *tmp; - int ret = 0; - - if (!size) + if (mm_root) + return ibv_madvise_range(base, size, MADV_DONTFORK); + else { + too_late = 1; return 0; - - start = (uintptr_t) base & ~(mem_map.page_size - 1); - end = ((uintptr_t) (base + size + mem_map.page_size - 1) & - ~(mem_map.page_size - 1)) - 1; - - pthread_mutex_lock(&mem_map.mutex); - - node = __mm_find_first(start, end); - - if (node->start != start) { - ret = -1; - goto out; - } - - while (node && node->end <= end) { - if (--node->refcnt == 0) { - ret = munlock((void *) node->start, - node->end - node->start + 1); - } - - if (__mm_prev(node) && node->refcnt == __mm_prev(node)->refcnt) { - __mm_prev(node)->end = node->end; - tmp = __mm_prev(node); - __mm_remove(node); - node = tmp; - } - - node = __mm_next(node); - } - - if (node && node->refcnt == __mm_prev(node)->refcnt) { - __mm_prev(node)->end = node->end; - tmp = __mm_prev(node); - __mm_remove(node); } +} - if (node->end != end) { - ret = -1; - goto out; +int ibv_dofork_range(void *base, size_t size) +{ + if (mm_root) + return ibv_madvise_range(base, size, MADV_DOFORK); + else { + too_late = 1; + return 0; } - -out: - pthread_mutex_unlock(&mem_map.mutex); - - return ret; } Index: libibverbs/README =================================================================== --- libibverbs/README (revision 8793) +++ libibverbs/README (working copy) @@ -101,12 +101,6 @@ necessary permissions to release your wo TODO ==== -1.0 series ----------- - - * Use the MADV_DONTFORK advice for madvise(2) to make applications - that use fork(2) work better. 
- 1.1 series ---------- Index: libmthca/configure.in =================================================================== --- libmthca/configure.in (revision 8793) +++ libmthca/configure.in (working copy) @@ -26,7 +26,7 @@ AC_C_CONST AC_CHECK_SIZEOF(long) dnl Checks for library functions -AC_CHECK_FUNCS(ibv_read_sysfs_file) +AC_CHECK_FUNCS(ibv_read_sysfs_file ibv_dontfork_range ibv_dofork_range) AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then Index: libmthca/src/memfree.c =================================================================== --- libmthca/src/memfree.c (revision 8793) +++ libmthca/src/memfree.c (working copy) @@ -46,8 +46,8 @@ #define MTHCA_FREE_MAP_SIZE (MTHCA_DB_REC_PER_PAGE / (SIZEOF_LONG * 8)) struct mthca_db_page { - unsigned long free[MTHCA_FREE_MAP_SIZE]; - uint64_t *db_rec; + unsigned long free[MTHCA_FREE_MAP_SIZE]; + struct mthca_buf db_rec; }; struct mthca_db_table { @@ -91,7 +91,7 @@ int mthca_alloc_db(struct mthca_db_table } for (i = start; i != end; i += dir) - if (db_tab->page[i].db_rec) + if (db_tab->page[i].db_rec.buf) for (j = 0; j < MTHCA_FREE_MAP_SIZE; ++j) if (db_tab->page[i].free[j]) goto found; @@ -101,18 +101,14 @@ int mthca_alloc_db(struct mthca_db_table goto out; } - { - void *tmp; - - if (posix_memalign(&tmp, MTHCA_DB_REC_PAGE_SIZE, - MTHCA_DB_REC_PAGE_SIZE)) { - ret = -1; - goto out; - } - db_tab->page[i].db_rec = tmp; + if (mthca_alloc_buf(&db_tab->page[i].db_rec, + MTHCA_DB_REC_PAGE_SIZE, + MTHCA_DB_REC_PAGE_SIZE)) { + ret = -1; + goto out; } - memset(db_tab->page[i].db_rec, 0, MTHCA_DB_REC_PAGE_SIZE); + memset(db_tab->page[i].db_rec.buf, 0, MTHCA_DB_REC_PAGE_SIZE); memset(db_tab->page[i].free, 0xff, sizeof db_tab->page[i].free); if (group == 0) @@ -140,7 +136,7 @@ found: j = MTHCA_DB_REC_PER_PAGE - 1 - j; ret = i * MTHCA_DB_REC_PER_PAGE + j; - *db = (uint32_t *) &db_tab->page[i].db_rec[j]; + *db = db_tab->page[i].db_rec.buf + j * 8; out: pthread_mutex_unlock(&db_tab->mutex); @@ -163,7 +159,7 @@ void mthca_free_db(struct mthca_db_table page = db_tab->page + i; pthread_mutex_lock(&db_tab->mutex); - page->db_rec[j] = 0; + *(uint64_t *) (page->db_rec.buf + j * 8) = 0; if (i >= db_tab->min_group2) j = MTHCA_DB_REC_PER_PAGE - 1 - j; @@ -190,7 +186,7 @@ struct mthca_db_table *mthca_alloc_db_ta db_tab->min_group2 = npages - 1; for (i = 0; i < npages; ++i) - db_tab->page[i].db_rec = NULL; + db_tab->page[i].db_rec.buf = NULL; return db_tab; } @@ -203,8 +199,8 @@ void mthca_free_db_tab(struct mthca_db_t return; for (i = 0; i < db_tab->npages; ++i) - if (db_tab->page[i].db_rec) - free(db_tab->page[i].db_rec); + if (db_tab->page[i].db_rec.buf) + mthca_free_buf(&db_tab->page[i].db_rec); free(db_tab); } Index: libmthca/src/qp.c =================================================================== --- libmthca/src/qp.c (revision 8793) +++ libmthca/src/qp.c (working copy) @@ -58,12 +58,12 @@ static const uint8_t mthca_opcode[] = { static void *get_recv_wqe(struct mthca_qp *qp, int n) { - return qp->buf + (n << qp->rq.wqe_shift); + return qp->buf.buf + (n << qp->rq.wqe_shift); } static void *get_send_wqe(struct mthca_qp *qp, int n) { - return qp->buf + qp->send_wqe_offset + (n << qp->sq.wqe_shift); + return qp->buf.buf + qp->send_wqe_offset + (n << qp->sq.wqe_shift); } void mthca_init_qp_indices(struct mthca_qp *qp) @@ -821,13 +821,14 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd qp->buf_size = qp->send_wqe_offset + (qp->sq.max << qp->sq.wqe_shift); - if 
(posix_memalign(&qp->buf, to_mdev(pd->context->device)->page_size, - align(qp->buf_size, to_mdev(pd->context->device)->page_size))) { + if (mthca_alloc_buf(&qp->buf, + align(qp->buf_size, to_mdev(pd->context->device)->page_size), + to_mdev(pd->context->device)->page_size)) { free(qp->wrid); return -1; } - memset(qp->buf, 0, qp->buf_size); + memset(qp->buf.buf, 0, qp->buf_size); if (mthca_is_memfree(pd->context)) { struct mthca_next_seg *next; Index: libmthca/src/verbs.c =================================================================== --- libmthca/src/verbs.c (revision 8793) +++ libmthca/src/verbs.c (working copy) @@ -188,11 +188,10 @@ struct ibv_cq *mthca_create_cq(struct ib goto err; cqe = align_cq_size(cqe); - cq->buf = mthca_alloc_cq_buf(to_mdev(context->device), cqe); - if (!cq->buf) + if (mthca_alloc_cq_buf(to_mdev(context->device), &cq->buf, cqe)) goto err; - cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf, + cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, 0, IBV_ACCESS_LOCAL_WRITE); if (!cq->mr) @@ -251,7 +250,7 @@ err_unreg: mthca_dereg_mr(cq->mr); err_buf: - free(cq->buf); + mthca_free_buf(&cq->buf); err: free(cq); @@ -264,7 +263,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, struct mthca_cq *cq = to_mcq(ibcq); struct mthca_resize_cq cmd; struct ibv_mr *mr; - void *buf; + struct mthca_buf buf; int old_cqe; int ret; @@ -280,17 +279,15 @@ int mthca_resize_cq(struct ibv_cq *ibcq, goto out; } - buf = mthca_alloc_cq_buf(to_mdev(ibcq->context->device), cqe); - if (!buf) { - ret = ENOMEM; + ret = mthca_alloc_cq_buf(to_mdev(ibcq->context->device), &buf, cqe); + if (ret) goto out; - } - mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf, + mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, 0, IBV_ACCESS_LOCAL_WRITE); if (!mr) { - free(buf); + mthca_free_buf(&buf); ret = ENOMEM; goto out; } @@ -303,14 +300,14 @@ int mthca_resize_cq(struct ibv_cq *ibcq, ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd); if (ret) { mthca_dereg_mr(mr); - free(buf); + mthca_free_buf(&buf); goto out; } - mthca_cq_resize_copy_cqes(cq, buf, old_cqe); + mthca_cq_resize_copy_cqes(cq, buf.buf, old_cqe); mthca_dereg_mr(cq->mr); - free(cq->buf); + mthca_free_buf(&cq->buf); cq->buf = buf; cq->mr = mr; @@ -336,8 +333,7 @@ int mthca_destroy_cq(struct ibv_cq *cq) } mthca_dereg_mr(to_mcq(cq)->mr); - - free(to_mcq(cq)->buf); + mthca_free_buf(&to_mcq(cq)->buf); free(to_mcq(cq)); return 0; @@ -389,7 +385,7 @@ struct ibv_srq *mthca_create_srq(struct if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) goto err; - srq->mr = __mthca_reg_mr(pd, srq->buf, srq->buf_size, 0, 0); + srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0); if (!srq->mr) goto err_free; @@ -430,7 +426,7 @@ err_unreg: err_free: free(srq->wrid); - free(srq->buf); + mthca_free_buf(&srq->buf); err: free(srq); @@ -469,7 +465,7 @@ int mthca_destroy_srq(struct ibv_srq *sr mthca_dereg_mr(to_msrq(srq)->mr); - free(to_msrq(srq)->buf); + mthca_free_buf(&to_msrq(srq)->buf); free(to_msrq(srq)->wrid); free(to_msrq(srq)); @@ -507,7 +503,7 @@ struct ibv_qp *mthca_create_qp(struct ib pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; - qp->mr = __mthca_reg_mr(pd, qp->buf, qp->buf_size, 0, 0); + qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0); if (!qp->mr) goto err_free; @@ -574,7 +570,7 @@ err_unreg: err_free: free(qp->wrid); - free(qp->buf); + mthca_free_buf(&qp->buf); err: free(qp); @@ -655,8 +651,7 @@ int mthca_destroy_qp(struct ibv_qp 
*qp) } mthca_dereg_mr(to_mqp(qp)->mr); - - free(to_mqp(qp)->buf); + mthca_free_buf(&to_mqp(qp)->buf); free(to_mqp(qp)->wrid); free(to_mqp(qp)); Index: libmthca/src/mthca.h =================================================================== --- libmthca/src/mthca.h (revision 8793) +++ libmthca/src/mthca.h (working copy) @@ -112,6 +112,11 @@ struct mthca_context { int qp_table_mask; }; +struct mthca_buf { + void *buf; + size_t length; +}; + struct mthca_pd { struct ibv_pd ibv_pd; struct mthca_ah_page *ah_list; @@ -121,7 +126,7 @@ struct mthca_pd { struct mthca_cq { struct ibv_cq ibv_cq; - void *buf; + struct mthca_buf buf; pthread_spinlock_t lock; struct ibv_mr *mr; uint32_t cqn; @@ -137,7 +142,7 @@ struct mthca_cq { struct mthca_srq { struct ibv_srq ibv_srq; - void *buf; + struct mthca_buf buf; void *last; pthread_spinlock_t lock; struct ibv_mr *mr; @@ -174,7 +179,7 @@ struct mthca_wq { struct mthca_qp { struct ibv_qp ibv_qp; - void *buf; + struct mthca_buf buf; uint64_t *wrid; int send_wqe_offset; int max_inline_data; @@ -259,6 +264,9 @@ static inline int mthca_is_memfree(struc return to_mdev(ibctx->device)->hca_type == MTHCA_ARBEL; } +int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size); +void mthca_free_buf(struct mthca_buf *buf); + int mthca_alloc_db(struct mthca_db_table *db_tab, enum mthca_db_type type, uint32_t **db); void mthca_set_db_qn(uint32_t *db, enum mthca_db_type type, uint32_t qn); @@ -290,7 +298,7 @@ void mthca_arbel_cq_event(struct ibv_cq void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq); void mthca_cq_resize_copy_cqes(struct mthca_cq *cq, void *buf, int new_cqe); -void *mthca_alloc_cq_buf(struct mthca_device *dev, int cqe); +int mthca_alloc_cq_buf(struct mthca_device *dev, struct mthca_buf *buf, int nent); struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr); Index: libmthca/src/cq.c =================================================================== --- libmthca/src/cq.c (revision 8793) +++ libmthca/src/cq.c (working copy) @@ -126,7 +126,7 @@ struct mthca_err_cqe { static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) { - return cq->buf + entry * MTHCA_CQ_ENTRY_SIZE; + return cq->buf.buf + entry * MTHCA_CQ_ENTRY_SIZE; } static inline struct mthca_cqe *cqe_sw(struct mthca_cq *cq, int i) @@ -612,17 +612,16 @@ void mthca_cq_resize_copy_cqes(struct mt get_cqe(cq, i & old_cqe), MTHCA_CQ_ENTRY_SIZE); } -void *mthca_alloc_cq_buf(struct mthca_device *dev, int nent) +int mthca_alloc_cq_buf(struct mthca_device *dev, struct mthca_buf *buf, int nent) { - void *buf; int i; - if (posix_memalign(&buf, dev->page_size, - align(nent * MTHCA_CQ_ENTRY_SIZE, dev->page_size))) - return NULL; + if (mthca_alloc_buf(buf, align(nent * MTHCA_CQ_ENTRY_SIZE, dev->page_size), + dev->page_size)) + return -1; for (i = 0; i < nent; ++i) - ((struct mthca_cqe *) buf)[i].owner = MTHCA_CQ_ENTRY_OWNER_HW; + ((struct mthca_cqe *) buf->buf)[i].owner = MTHCA_CQ_ENTRY_OWNER_HW; - return buf; + return 0; } Index: libmthca/src/srq.c =================================================================== --- libmthca/src/srq.c (revision 8793) +++ libmthca/src/srq.c (working copy) @@ -47,7 +47,7 @@ static void *get_wqe(struct mthca_srq *srq, int n) { - return srq->buf + (n << srq->wqe_shift); + return srq->buf.buf + (n << srq->wqe_shift); } /* @@ -292,13 +292,14 @@ int mthca_alloc_srq_buf(struct ibv_pd *p srq->buf_size = srq->max << srq->wqe_shift; - if (posix_memalign(&srq->buf, to_mdev(pd->context->device)->page_size, - 
align(srq->buf_size, to_mdev(pd->context->device)->page_size))) { + if (mthca_alloc_buf(&srq->buf, + align(srq->buf_size, to_mdev(pd->context->device)->page_size), + to_mdev(pd->context->device)->page_size)) { free(srq->wrid); return -1; } - memset(srq->buf, 0, srq->buf_size); + memset(srq->buf.buf, 0, srq->buf_size); /* * Now initialize the SRQ buffer so that all of the WQEs are Index: libmthca/src/ah.c =================================================================== --- libmthca/src/ah.c (revision 8793) +++ libmthca/src/ah.c (working copy) @@ -45,7 +45,7 @@ struct mthca_ah_page { struct mthca_ah_page *prev, *next; - void *buf; + struct mthca_buf buf; struct ibv_mr *mr; int use_cnt; unsigned free[0]; @@ -60,14 +60,14 @@ static struct mthca_ah_page *__add_page( if (!page) return NULL; - if (posix_memalign(&page->buf, page_size, page_size)) { + if (mthca_alloc_buf(&page->buf, page_size, page_size)) { free(page); return NULL; } - page->mr = mthca_reg_mr(&pd->ibv_pd, page->buf, page_size, 0); + page->mr = mthca_reg_mr(&pd->ibv_pd, page->buf.buf, page_size, 0); if (!page->mr) { - free(page->buf); + mthca_free_buf(&page->buf); free(page); return NULL; } @@ -123,7 +123,7 @@ int mthca_alloc_av(struct mthca_pd *pd, if (page->free[i]) { j = ffs(page->free[i]); page->free[i] &= ~(1 << (j - 1)); - ah->av = page->buf + + ah->av = page->buf.buf + (i * 8 * sizeof (int) + (j - 1)) * sizeof *ah->av; break; } @@ -172,7 +172,7 @@ void mthca_free_av(struct mthca_ah *ah) pthread_mutex_lock(&pd->ah_mutex); page = ah->page; - i = ((void *) ah->av - page->buf) / sizeof *ah->av; + i = ((void *) ah->av - page->buf.buf) / sizeof *ah->av; page->free[i / (8 * sizeof (int))] |= 1 << (i % (8 * sizeof (int))); if (!--page->use_cnt) { @@ -184,7 +184,7 @@ void mthca_free_av(struct mthca_ah *ah) page->next->prev = page->prev; mthca_dereg_mr(page->mr); - free(page->buf); + mthca_free_buf(&page->buf); free(page); } Index: libmthca/src/buf.c =================================================================== --- libmthca/src/buf.c (revision 0) +++ libmthca/src/buf.c (revision 0) @@ -0,0 +1,168 @@ +/* + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include + +#include "mthca.h" + +#if !(defined(HAVE_IBV_DONTFORK_RANGE) && defined(HAVE_IBV_DOFORK_RANGE)) + +/* + * If libibverbs isn't exporting these functions, then there's no + * point in doing it here, because the rest of libibverbs isn't going + * to be fork-safe anyway. + */ +static int ibv_dontfork_range(void *base, size_t size) +{ + return 0; +} + +static int ibv_dofork_range(void *base, size_t size) +{ + return 0; +} + +#endif /* HAVE_IBV_DONTFORK_RANGE && HAVE_IBV_DOFORK_RANGE */ + +int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size) +{ + int ret; + + ret = posix_memalign(&buf->buf, page_size, align(size, page_size)); + if (ret) + return ret; + + ret = ibv_dontfork_range(buf->buf, size); + if (ret) + free(buf->buf); + + if (!ret) + buf->length = size; + + return ret; +} + +void mthca_free_buf(struct mthca_buf *buf) +{ + ibv_dofork_range(buf->buf, buf->length); + free(buf->buf); +} +/* + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include + +#include "mthca.h" + +#if !(defined(HAVE_IBV_DONTFORK_RANGE) && defined(HAVE_IBV_DOFORK_RANGE)) + +/* + * If libibverbs isn't exporting these functions, then there's no + * point in doing it here, because the rest of libibverbs isn't going + * to be fork-safe anyway. 
+ */ +static int ibv_dontfork_range(void *base, size_t size) +{ + return 0; +} + +static int ibv_dofork_range(void *base, size_t size) +{ + return 0; +} + +#endif /* HAVE_IBV_DONTFORK_RANGE && HAVE_IBV_DOFORK_RANGE */ + +int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size) +{ + int ret; + + ret = posix_memalign(&buf->buf, page_size, align(size, page_size)); + if (ret) + return ret; + + ret = ibv_dontfork_range(buf->buf, size); + if (ret) + free(buf->buf); + + if (!ret) + buf->length = size; + + return ret; +} + +void mthca_free_buf(struct mthca_buf *buf) +{ + ibv_dofork_range(buf->buf, buf->length); + free(buf->buf); +} Index: libmthca/ChangeLog =================================================================== --- libmthca/ChangeLog (revision 8793) +++ libmthca/ChangeLog (working copy) @@ -1,3 +1,19 @@ +2006-07-26 Roland Dreier + + * src/mthca.h, src/ah.c, src/cq.c, src/memfree.c, src/qp.c, + src/srq.c, src/verbs.c: Convert internal allocations for AH pages + (for non-memfree HCAs), CQ buffers, doorbell pages (for memfree + HCAs), QP buffers and SRQ buffers to use the new buffer + allocator. This makes libmthca fork()-clean when built against + libibverbs 1.1. + + * src/buf.c (mthca_alloc_buf, mthca_free_buf): Add new functions + to wrap up allocating page-aligned buffers. The new functions + will call ibv_dontfork_range()/ibv_dofork_range() to do proper + madvise()ing to handle fork(), if applicable. + + * configure.in: Check for ibv_dontfork_range() and ibv_dontfork_range(). + 2006-07-04 Dotan Barak * src/verbs.c (mthca_create_cq, mthca_resize_cq): Passing huge Index: libmthca/Makefile.am =================================================================== --- libmthca/Makefile.am (revision 8793) +++ libmthca/Makefile.am (working copy) @@ -12,10 +12,9 @@ else mthca_version_script = endif -src_mthca_la_SOURCES = src/ah.c src/cq.c src/memfree.c src/mthca.c src/qp.c \ - src/srq.c src/verbs.c -src_mthca_la_LDFLAGS = -avoid-version -module \ - $(mthca_version_script) +src_mthca_la_SOURCES = src/ah.c src/buf.c src/cq.c src/memfree.c src/mthca.c \ + src/qp.c src/srq.c src/verbs.c +src_mthca_la_LDFLAGS = -avoid-version -module $(mthca_version_script) DEBIAN = debian/changelog debian/compat debian/control debian/copyright \ debian/libmthca1.install debian/libmthca-dev.install debian/rules From mst at mellanox.co.il Tue Aug 1 07:22:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 1 Aug 2006 17:22:42 +0300 Subject: [openib-general] hotplug support in mthca In-Reply-To: References: Message-ID: <20060801142242.GD9411@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: hotplug support in mthca > > Michael> Roland, what happends today if an mthca device is removed > Michael> while a userspace applcation still keeps a reference to > Michael> it? > > Something bad. How about fixing it by blocking remove_one in uverbs until all contexts are closed and device refcount drops to 0? -- MST From bardov at gmail.com Tue Aug 1 07:45:39 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Tue, 1 Aug 2006 17:45:39 +0300 Subject: [openib-general] iSER Source and target code In-Reply-To: References: Message-ID: Mohit hi, You are correct that the iser initiator code is now based on openIB stack and follows the open-iscsi APIs instead of the datamover API. If you dig in the iser-initiator history, you'll find that it too was originaly based on datamover API on top of kDAPL. 
The datamover API was a nice concept, but it did not carry well into the real-world implementation, and so was dropped. The ISER target code available was donated by Voltaire as a starting point for an open-source ISER/iSCSI target implementation. There is a small working group working on this project, in conjunction with the stgt project. In the future we expect it to loose the datamover API as well, and change a lot. If you're interested, I suggest you subscribe to stgt mailing list at http://developer.berlios.de/mail/?group_id=4492 Dan On 8/1/06, Mohit Katiyar, Noida wrote: > > Hi all, > > I was looking at iSER initiator and target code. I noticed that iSER > target code available is based on Datamover architecture but the initiator > code is not based on Datamover Architecture. Why the iSER initiator code is > not based on Datamover Architecture? What are the future plans for the iSER > target driver code? > > With which iSCSI interface the iSER target driver code will be used? > > > > Thanks in advance > > > > Thanks and Regards > > Mohit Katiyar > > Lead Engineer > > TLS, HCL Noida > > > > "To dare is to lose one's footing momentarily. To not dare is to lose > oneself." > > > DISCLAIMER: > > ----------------------------------------------------------------------------------------------------------------------- > > The contents of this e-mail and any attachment(s) are confidential and > intended for the named recipient(s) only. > It shall not attach any liability on the originator or HCL or its > affiliates. Any views or opinions presented in > this email are solely those of the author and may not necessarily reflect > the opinions of HCL or its affiliates. > Any form of reproduction, dissemination, copying, disclosure, > modification, distribution and / or publication of > this message without the prior written consent of the author of this > e-mail is strictly prohibited. If you have > received this email in error please delete it and notify the sender > immediately. Before opening any mail and > attachments please check them for viruses and defect. > > > ----------------------------------------------------------------------------------------------------------------------- > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jackm at mellanox.co.il Tue Aug 1 08:06:17 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 1 Aug 2006 18:06:17 +0300 Subject: [openib-general] hotplug support in mthca In-Reply-To: <20060801142242.GD9411@mellanox.co.il> References: <20060801142242.GD9411@mellanox.co.il> Message-ID: <200608011806.17150.jackm@mellanox.co.il> On Tuesday 01 August 2006 17:22, Michael S. Tsirkin wrote: > Quoting r. Roland Dreier : > > Subject: Re: hotplug support in mthca > > > > Michael> Roland, what happends today if an mthca device is removed > > Michael> while a userspace applcation still keeps a reference to > > Michael> it? > > > > Something bad. > > How about fixing it by blocking remove_one in uverbs until all contexts > are closed and device refcount drops to 0? I've got a patch ready which does exactly that. I'm currently verifying that it works properly (e.g., what if simultaneously the administrator tries to unload the module, etc.). We can send it to you tomorrow. 
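[To make the scheme concrete: the usual kernel idiom for "block until the reference count drops to zero" is a kref whose release function fires a completion that remove_one() then waits on. The sketch below is purely illustrative -- the structure and function names (uverbs_device, uverbs_device_get/put) are invented here, and this is not the patch referred to above:

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/completion.h>

struct uverbs_device {
	struct kref		ref;	/* one ref per open userspace context, plus one initial ref */
	struct completion	comp;	/* completed when the last reference is dropped */
	/* ... device state ... */
};

static void uverbs_device_release(struct kref *ref)
{
	struct uverbs_device *dev =
		container_of(ref, struct uverbs_device, ref);

	complete(&dev->comp);
}

/* Taken on every context open, dropped on every context release. */
static void uverbs_device_get(struct uverbs_device *dev)
{
	kref_get(&dev->ref);
}

static void uverbs_device_put(struct uverbs_device *dev)
{
	kref_put(&dev->ref, uverbs_device_release);
}

/* add_one() is assumed to do kref_init(&dev->ref) and init_completion(&dev->comp). */

static void uverbs_remove_one(struct uverbs_device *dev)
{
	/* Drop the initial reference taken in add_one() ... */
	uverbs_device_put(dev);

	/* ... and sleep until every open context has dropped its reference. */
	wait_for_completion(&dev->comp);

	/* Only now is it safe to tear down the device state. */
}

This only covers the context-refcount half; the concurrent module-unload case mentioned above still needs separate handling.]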
- Jack From sashak at voltaire.com Tue Aug 1 08:15:58 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 1 Aug 2006 18:15:58 +0300 Subject: [openib-general] [PATH TRIVIAL] opensm: management/Makefile: osm and diags build order reversal Message-ID: <20060801151558.GG24920@sashak.voltaire.com> opensm: management/Makefile: osm and diags build order reversal osm and diags build order reversal in management/Makefile. It is needed since saquery diag tool uses osm libraries. Signed-off-by: Sasha Khapyorsky --- Makefile | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/Makefile b/Makefile index 9c86916..770112a 100644 --- a/Makefile +++ b/Makefile @@ -6,7 +6,7 @@ OSM:=osm OSMLIBS:=complib libvendor DIAG:=diags -SUBDIRS=$(DIAG) $(OSM) +SUBDIRS=$(OSM) $(DIAG) all: BUILD_TARG=all all: libs_install subdirs From venkatesh.babu at 3leafnetworks.com Tue Aug 1 12:09:22 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Tue, 01 Aug 2006 12:09:22 -0700 Subject: [openib-general] APM: QP migration state change when failover triggered by hw In-Reply-To: <200608011610.03212.jackm@mellanox.co.il> References: <44B55981.6040408@3leafnetworks.com> <200607301005.32499.jackm@mellanox.co.il> <44CEBA3E.3060208@3leafnetworks.com> <200608011610.03212.jackm@mellanox.co.il> Message-ID: <44CFA6E2.8070509@3leafnetworks.com> I am testing APM with kernel module which directly interfaces with ib_verbs.ko and ib_cm.ko. Yes, I do receive IB_MIG_MIGRATED event, but the QP's mig_state is not actually changed to MIGRATED. So I had to do this from my module. It could be a bug with ib_cm code, which may not be transitioning the QP state correctly. But the HW may be thinking that it has migrated. I am not sure how exactly ib_cm should notice this event and should should transition the QP state. Any thoughts and suggestions are welcome. I can code it and test it. I don't have the test program which will specifically test this functionality. I am afraid if I can share the whole module. VBabu Jack Morgenstein wrote: >On Tuesday 01 August 2006 05:19, Venkatesh Babu wrote: > > >>Configuration2: Node1 and Node 2 conneected through two switches for >>each port. >> Node1, port1 -> switch1 -> Node2, port1 >> Node1, port2 -> switch2 -> Node2, port2 >> >>Node 1: >>1. Call ib_cm_listen() to wait for connection requests >>2. When a REQ message arrives create a RC QP and establish a connection >>3. Setup callback handlers to receive packets. >>4. Receive packets and verify it and drop it. >>5. Event IB_MIG_MIGRATED received >>6. Stopped receiving packets. >> >>Node 2: >>1. Create RC QP >>2. Send REQ message to Node 1 to establish the connection (Load both >>primary and alternate paths) >>3. Contineously send some packets >>4. Simulate the port failure by unplugging the IB cable >>5. Event IB_MIG_MIGRATED received >> >> But with >>Configuration2, IB_EVENT_PORT_ERR event occurrs on a node1, failover to >>the alternate path doesn't work. The traffic stops. Because node1 >>doesn't now when the IB_EVENT_PORT_ERR event occurred on Node2. >> >> > >We have not seen these problems here. We have regression tests which check >APM, and they have run without problems. These tests have scripts which >bring the HCA port down (equivalent to pulling the cable) to check that the >migration occurs automatically. >(You should NOT need to do ib_modify_qp for the migration to work in the case >of a port error). > >Note, though, that these tests use the ibv_verbs layer directly. We have not >checked out APM over the CM. 
There may be a bug here regarding setting up >the alternate path properly when creating the connection (although this does >seem strange, since you indicate that the MIGRATED event is received on both >sides!). > >Please send us your test code so that we may reproduce the problem here. > >- Jack > > > > > > > From sean.hefty at intel.com Tue Aug 1 11:55:44 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 1 Aug 2006 11:55:44 -0700 Subject: [openib-general] APM: QP migration state change when failover triggered by hw In-Reply-To: <44CFA6E2.8070509@3leafnetworks.com> Message-ID: <000201c6b59c$17ef1d30$29cc180a@amr.corp.intel.com> > I am testing APM with kernel module which directly interfaces with >ib_verbs.ko and ib_cm.ko. >Yes, I do receive IB_MIG_MIGRATED event, but the QP's mig_state is not >actually changed to MIGRATED. So I had to do this from my module. The ib_cm does not perform QP state transitions. That is left to the ULPs. (It's difficult to push this into the ib_cm because of the potential race between the IB CM modifying the QP at the same time a ULP tries to destroy it. The ULP may also want to post receives, or modify the QP size, etc. based on connection information.) >It could be a bug with ib_cm code, which may not be transitioning the QP >state correctly. But the HW may be thinking that it has migrated. I am >not sure how exactly ib_cm should notice this event and should should >transition the QP state. Any thoughts and suggestions are welcome. I can >code it and test it. There is a pending patch that was recently posted (dispatch communication establish event) that can be extended to pass path migration events to the ib_cm. The purpose of passing path migration events to the ib_cm would be limited to changing the path that future CM messages, and not related to QP transitions. - Sean From troy at scl.ameslab.gov Tue Aug 1 13:48:06 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Tue, 1 Aug 2006 15:48:06 -0500 Subject: [openib-general] making sense of dapl (and dat.conf) Message-ID: <20060801204806.GC13356@minbar-g5.scl.ameslab.gov> So, let's suppose I build ibverbs, libecha/libmthca, and dapl from subversion trunk.. what should my /etc/dat.conf file look like so things actually work? Right now I have: OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "10.40.4.56 0" "" OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "10.40.4.54 0" "" OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "p5l6.ib 0" "" OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "p5l4.ib 0" "" OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" Is there any way I can avoid having to have a different config file for each machine? Is this documented anywhere what all these fields actually mean in a way that makes sense to those of us who haven't read the DAT specification, or are familiar with ibverbs? From pw at osc.edu Tue Aug 1 14:34:16 2006 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 1 Aug 2006 17:34:16 -0400 Subject: [openib-general] rdma cm process hang Message-ID: <20060801213416.GA18941@osc.edu> Using the iwarp branch of r8688, with linux-2.6.17.7 on up to date x86_64 FC4 SMP with Ammasso cards, I can hang the client side during RDMA CM connection setup. 
The scenario is: start server side process on some other node start client process have server die after RDMA_CM_EVENT_CONNECT_REQUEST arrives, but before calling rdma_accept hit ctrl-C on client The last bits of the console log (from c2 debug) are: c2: c2_create_qp:248 c2: c2_query_pkey:110 c2: c2_qp_modify:145 qp=ffff81007fe3b980, IB_QPS_RESET --> IB_QPS_INIT c2: c2_qp_modify:243 qp=ffff81007fe3b980, cur_state=IB_QPS_INIT c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=1 c2: c2_connect:598 c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=2 The process is in S state before the ctrl-c, here's a traceback (waiting in rdma_get_cm_event): ardma-rdmacm S ffff81003d0f9e68 0 2914 2842 (NOTLB) ffff81003d0f9e68 0000000000000000 0000000000000000 00000000000007b4 ffff81003ef1d280 ffff81007fd06080 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Call Trace: {:rdma_ucm:ucma_get_event+268} {autoremove_wake_function+0} {:rdma_ucm:ucma_write+111} {vfs_write+189} {sys_write+83} {system_call+126} Then after ctrl-C, one more console log entry: c2: c2_destroy_qp:290 qp=ffff81007fe3b980,qp->state=1 and now the process is unkillable (but the node does not oops): ardma-rdmacm D ffff81003d0f9bf8 0 2914 2842 (L-TLB) ffff81003d0f9bf8 ffffc2000001ffff ffff81007eee5c80 0000000000009ee7 ffff81003ef1d280 ffff81003f060aa0 ffffffff80232646 ffff81000100c130 ffff81003ec3d140 ffff81003dc85800 Call Trace: {on_each_cpu+38} {__remove_vm_area+55} {:iw_c2:c2_free_qp+355} {autoremove_wake_function+0} {:iw_c2:c2_destroy_qp+52} {:ib_core:ib_destroy_qp+49} {:ib_uverbs:ib_uverbs_close+410} {__fput+178} {filp_close+104} {put_files_struct+122} {do_exit+596} {__dequeue_signal+495} {do_group_exit+216} {get_signal_to_deliver+1192} {do_signal+129} {:rdma_ucm:ucma_get_event+501} {:rdma_ucm:ucma_write+111} {vfs_write+189} {sysret_signal+28} {ptregscall_common+103} Once I figure out the bug in the server side code I will hopefully not have this problem anymore. But thought you'd like to see it. -- Pete From minich at ornl.gov Tue Aug 1 14:39:49 2006 From: minich at ornl.gov (Makia Minich) Date: Tue, 01 Aug 2006 17:39:49 -0400 Subject: [openib-general] xt3 troubles (with OFED 1.0.1) Message-ID: So, after flailing about with my IPOIB issue on the XT3, I decided that perhaps a firmware upgrade (from 3.3.3 to 3.4.0) might be in order. Prior to the upgrade, I was able to bring the entire stack online and see the infiniband network (provided I refrained from ifconfig'ing the ipoib interfaces). Now, on loading ib_mthca I get the following out of dmesg: ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) ib_mthca: Initializing 0000:04:00.0 PCI: Unable to reserve mem region #3:ffffffffffe00000 at ff000000 for device 0000:04:00.0 ib_mthca 0000:04:00.0: Cannot obtain PCI resources, aborting. ib_mthca: probe of 0000:04:00.0 failed with error -16 I saw an early-2005 thread about the P615 that seemed to have the same problem, but I couldn't find a resolution. Does anyone know what might be able to be done? Or perhaps where I should start looking? Thanks... 
-- Makia Minich National Center for Computation Science Oak Ridge National Laboratory From troy at scl.ameslab.gov Tue Aug 1 14:59:13 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Tue, 1 Aug 2006 16:59:13 -0500 Subject: [openib-general] xt3 troubles (with OFED 1.0.1) In-Reply-To: References: Message-ID: <20060801215913.GD13356@minbar-g5.scl.ameslab.gov> On Tue, Aug 01, 2006 at 05:39:49PM -0400, Makia Minich wrote: > So, after flailing about with my IPOIB issue on the XT3, I decided that > perhaps a firmware upgrade (from 3.3.3 to 3.4.0) might be in order. Prior > to the upgrade, I was able to bring the entire stack online and see the > infiniband network (provided I refrained from ifconfig'ing the ipoib > interfaces). Now, on loading ib_mthca I get the following out of dmesg: > > ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) > ib_mthca: Initializing 0000:04:00.0 > PCI: Unable to reserve mem region #3:ffffffffffe00000 at ff000000 for device > 0000:04:00.0 > ib_mthca 0000:04:00.0: Cannot obtain PCI resources, aborting. > ib_mthca: probe of 0000:04:00.0 failed with error -16 > > I saw an early-2005 thread about the P615 that seemed to have the same > problem, but I couldn't find a resolution. Does anyone know what might be > able to be done? Or perhaps where I should start looking? You could probably start by encourageing your linux vendor to post their kernel patches.. I bet they did something hokey in the PCI memory mapping. Can you post the output of lspci -v ? On the P615's, this was due to the firmware not allocating a big enough MMIO region allocated. From sean.hefty at intel.com Tue Aug 1 15:17:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 1 Aug 2006 15:17:21 -0700 Subject: [openib-general] [RFC] [PATCH 1/2] sa_query: add generic query interfaces capable of supporting RMPP Message-ID: <000001c6b5b8$42ab2480$67f9070a@amr.corp.intel.com> The following patch adds a generic interface to send MADs to the SA. The primary motivation of adding these calls is to expand the SA query interface to include RMPP responses for users wanting more than a single attribute returned from a query (e.g. multipath record queries). The design for retrieving attributes from an RMPP response was taken from that used by the local SA cache. The implementation of existing SA query routines was layered on top of the generic query interface. Signed-off-by: Sean Hefty --- Index: include/rdma/ib_sa.h =================================================================== --- include/rdma/ib_sa.h (revision 8647) +++ include/rdma/ib_sa.h (working copy) @@ -254,6 +254,73 @@ struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); +struct ib_sa_attr_cursor; + +/** + * ib_sa_create_cursor - Create a cursor that may be used to walk through + * a list of returned SA records. + * @mad_recv_wc: A received response from the SA. + * + * This call allocates a cursor that is used to walk through a list of + * SA records. Users must free the cursor by calling ib_sa_free_cursor. + */ +struct ib_sa_attr_cursor *ib_sa_create_cursor(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_sa_free_cursor - Release a cursor. + * @cursor: The cursor to free. + */ +void ib_sa_free_cursor(struct ib_sa_attr_cursor *cursor); + +/** + * ib_sa_get_next_attr - Retrieve the next SA attribute referenced by a cursor. + * @cursor: A reference to a cursor that points to the next attribute to + * retrieve. + * @attr: Buffer to copy attribute. 
+ * + * Returns non-zero if an attribute was returned, and copies the attribute + * into the provided buffer. Returns zero if all attributes have been + * retrieved from the cursor. + */ +int ib_sa_get_next_attr(struct ib_sa_attr_cursor *cursor, void *attr); + +/** + * ib_sa_send_mad - Send a MAD to the SA. + * @device:device to send query on + * @port_num: port number to send query on + * @method:MAD method to use in the send. + * @attr:Reference to attribute to send in MAD. + * @attr_id:Attribute type identifier. + * @comp_mask:component mask to send in MAD + * @timeout_ms:time to wait for response, if one is expected + * @retries:number of times to retry request + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send a message to the SA. If a response is expected (timeout_ms is + * non-zero), the callback function will be called when the query completes. + * Status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. Mad_recv_wc will reference any returned + * response from the SA. It is the responsibility of the caller to free + * mad_recv_wc by call ib_free_recv_mad() if it is non-NULL. + * + * If the return value of ib_sa_send_mad() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. + */ +int ib_sa_send_mad(struct ib_device *device, u8 port_num, + int method, void *attr, int attr_id, + ib_sa_comp_mask comp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context), + void *context, struct ib_sa_query **query); + int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 8647) +++ core/sa_query.c (working copy) @@ -72,30 +72,41 @@ struct ib_sa_device { }; struct ib_sa_query { - void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); - void (*release)(struct ib_sa_query *); + void (*callback)(int, struct ib_mad_recv_wc *, void *); struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; + void *context; int id; }; struct ib_sa_service_query { void (*callback)(int, struct ib_sa_service_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; }; struct ib_sa_path_query { void (*callback)(int, struct ib_sa_path_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; }; struct ib_sa_mcmember_query { void (*callback)(int, struct ib_sa_mcmember_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; +}; + +struct ib_sa_attr_cursor { + struct ib_mad_recv_wc *recv_wc; + struct ib_mad_recv_buf *recv_buf; + int attr_id; + int attr_size; + int attr_offset; + int data_offset; + int data_left; + u8 attr[0]; }; static void ib_sa_add_one(struct ib_device *device); @@ -504,9 +515,17 @@ EXPORT_SYMBOL(ib_init_ah_from_mcmember); int ib_sa_pack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { + case IB_SA_ATTR_SERVICE_REC: + ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), + src, dst); + break; case IB_SA_ATTR_PATH_REC: 
ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); break; + case IB_SA_ATTR_MC_MEMBER_REC: + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + src, dst); + break; default: return -EINVAL; } @@ -517,9 +536,17 @@ EXPORT_SYMBOL(ib_sa_pack_attr); int ib_sa_unpack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { + case IB_SA_ATTR_SERVICE_REC: + ib_unpack(service_rec_table, ARRAY_SIZE(service_rec_table), + src, dst); + break; case IB_SA_ATTR_PATH_REC: ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); break; + case IB_SA_ATTR_MC_MEMBER_REC: + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + src, dst); + break; default: return -EINVAL; } @@ -527,15 +554,20 @@ int ib_sa_unpack_attr(void *dst, void *s } EXPORT_SYMBOL(ib_sa_unpack_attr); -static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent, + int method, void *attr, int attr_id, + ib_sa_comp_mask comp_mask) { unsigned long flags; - memset(mad, 0, sizeof *mad); - mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(attr_id); + mad->sa_hdr.comp_mask = comp_mask; + + ib_sa_pack_attr(mad->data, attr, attr_id); spin_lock_irqsave(&tid_lock, flags); mad->mad_hdr.tid = @@ -589,26 +621,175 @@ retry: return ret ? ret : id; } -static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +/* Return size of SA attributes on the wire. */ +static int sa_mad_attr_size(int attr_id) +{ + int size; + + switch (attr_id) { + case IB_SA_ATTR_SERVICE_REC: + size = 176; + break; + case IB_SA_ATTR_PATH_REC: + size = 64; + break; + case IB_SA_ATTR_MC_MEMBER_REC: + size = 52; + break; + default: + size = 0; + break; + } + return size; +} + +struct ib_sa_attr_cursor *ib_sa_create_cursor(struct ib_mad_recv_wc *mad_recv_wc) { - struct ib_sa_path_query *query = - container_of(sa_query, struct ib_sa_path_query, sa_query); + struct ib_sa_attr_cursor *cursor; + struct ib_sa_mad *mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + int attr_id, attr_size, attr_offset; - if (mad) { - struct ib_sa_path_rec rec; + attr_id = be16_to_cpu(mad->mad_hdr.attr_id); + attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + attr_size = sa_mad_attr_size(attr_id); + if (!attr_size || attr_offset < attr_size) + return ERR_PTR(-EINVAL); - ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - query->callback(status, NULL, query->context); + cursor = kzalloc(sizeof *cursor + attr_size, GFP_KERNEL); + if (!cursor) + return ERR_PTR(-ENOMEM); + + cursor->data_left = mad_recv_wc->mad_len - IB_MGMT_SA_HDR; + cursor->recv_wc = mad_recv_wc; + cursor->recv_buf = &mad_recv_wc->recv_buf; + cursor->attr_id = attr_id; + cursor->attr_offset = attr_offset; + cursor->attr_size = attr_size; + return cursor; } +EXPORT_SYMBOL(ib_sa_create_cursor); -static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +void ib_sa_free_cursor(struct ib_sa_attr_cursor *cursor) { - kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); + kfree(cursor); +} +EXPORT_SYMBOL(ib_sa_free_cursor); + +int ib_sa_get_next_attr(struct ib_sa_attr_cursor *cursor, void *attr) +{ + struct ib_sa_mad *mad; + void *sa_attr = NULL; + int left, offset = 0; + + while (cursor->data_left >= 
cursor->attr_offset) { + while (cursor->data_offset < IB_MGMT_SA_DATA) { + mad = (struct ib_sa_mad *) cursor->recv_buf->mad; + + left = IB_MGMT_SA_DATA - cursor->data_offset; + if (left < cursor->attr_size) { + /* copy first piece of the attribute */ + sa_attr = &cursor->attr; + memcpy(sa_attr, &mad->data[cursor->data_offset], + left); + offset = left; + break; + } else if (offset) { + /* copy the second piece of the attribute */ + memcpy(sa_attr + offset, &mad->data[0], + cursor->attr_size - offset); + cursor->data_offset = cursor->attr_size - offset; + offset = 0; + } else { + sa_attr = &mad->data[cursor->data_offset]; + cursor->data_offset += cursor->attr_size; + } + + cursor->data_left -= cursor->attr_offset; + return !ib_sa_unpack_attr(attr, sa_attr, + cursor->attr_id); + } + cursor->data_offset = 0; + cursor->recv_buf = list_entry(cursor->recv_buf->list.next, + struct ib_mad_recv_buf, list); + } + return 0; +} +EXPORT_SYMBOL(ib_sa_get_next_attr); + +int ib_sa_send_mad(struct ib_device *device, u8 port_num, + int method, void *attr, int attr_id, + ib_sa_comp_mask comp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context), + void *context, struct ib_sa_query **query) +{ + struct ib_sa_query *sa_query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port; + struct ib_mad_agent *agent; + int ret; + + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + + sa_query = kmalloc(sizeof *query, gfp_mask); + if (!sa_query) + return -ENOMEM; + + sa_query->mad_buf = ib_create_send_mad(agent, 1, 0, 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, gfp_mask); + if (!sa_query->mad_buf) { + ret = -ENOMEM; + goto err1; + } + + sa_query->port = port; + sa_query->callback = callback; + sa_query->context = context; + + init_mad(sa_query->mad_buf->mad, agent, method, attr, attr_id, + comp_mask); + + ret = send_mad(sa_query, timeout_ms, retries, gfp_mask); + if (ret < 0) + goto err2; + + *query = sa_query; + return ret; + +err2: + ib_free_send_mad(sa_query->mad_buf); +err1: + kfree(query); + return ret; +} + +static void ib_sa_path_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) +{ + struct ib_sa_path_query *query = context; + + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_path_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } /** @@ -647,83 +828,47 @@ int ib_sa_path_rec_get(struct ib_device struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - query->callback = callback; query->context = context; - - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - 
query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; - query->sa_query.release = ib_sa_path_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = IB_MGMT_METHOD_GET; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); - mad->sa_hdr.comp_mask = comp_mask; - - ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + ret = ib_sa_send_mad(device, port_num, IB_MGMT_METHOD_GET, rec, + IB_SA_ATTR_PATH_REC, comp_mask, timeout_ms, + retries, gfp_mask, ib_sa_path_rec_callback, + query, &query->sa_query); if (ret < 0) - goto err2; + kfree(query); return ret; - -err2: - *sa_query = NULL; - ib_free_send_mad(query->sa_query.mad_buf); - -err1: - kfree(query); - return ret; } EXPORT_SYMBOL(ib_sa_path_rec_get); -static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +static void ib_sa_service_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) { - struct ib_sa_service_query *query = - container_of(sa_query, struct ib_sa_service_query, sa_query); - - if (mad) { - struct ib_sa_service_rec rec; + struct ib_sa_service_query *query = context; - ib_unpack(service_rec_table, ARRAY_SIZE(service_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - query->callback(status, NULL, query->context); -} - -static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) -{ - kfree(container_of(sa_query, struct ib_sa_service_query, sa_query)); + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_service_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + + ib_unpack(service_rec_table, ARRAY_SIZE(service_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } /** @@ -764,89 +909,47 @@ int ib_sa_service_rec_query(struct ib_de struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - - if (method != IB_MGMT_METHOD_GET && - method != IB_MGMT_METHOD_SET && - method != IB_SA_METHOD_DELETE) - return -EINVAL; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - query->callback = callback; query->context = context; - - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - query->sa_query.callback = callback ? 
ib_sa_service_rec_callback : NULL; - query->sa_query.release = ib_sa_service_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = method; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); - mad->sa_hdr.comp_mask = comp_mask; - - ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), - rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + ret = ib_sa_send_mad(device, port_num, method, rec, + IB_SA_ATTR_SERVICE_REC, comp_mask, timeout_ms, + retries, gfp_mask, ib_sa_service_rec_callback, + query, &query->sa_query); if (ret < 0) - goto err2; + kfree(query); return ret; - -err2: - *sa_query = NULL; - ib_free_send_mad(query->sa_query.mad_buf); - -err1: - kfree(query); - return ret; } EXPORT_SYMBOL(ib_sa_service_rec_query); -static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +static void ib_sa_mcmember_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) { - struct ib_sa_mcmember_query *query = - container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + struct ib_sa_mcmember_query *query = context; - if (mad) { - struct ib_sa_mcmember_rec rec; - - ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - query->callback(status, NULL, query->context); -} - -static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) -{ - kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_mcmember_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(mcmember_rec_table, + ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, @@ -861,61 +964,22 @@ int ib_sa_mcmember_rec_query(struct ib_d struct ib_sa_query **sa_query) { struct ib_sa_mcmember_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - query->callback = callback; query->context = context; - - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; - query->sa_query.release = ib_sa_mcmember_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = method; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); - mad->sa_hdr.comp_mask = comp_mask; - - ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + ret = ib_sa_send_mad(device, port_num, method, rec, + IB_SA_ATTR_MC_MEMBER_REC, comp_mask, timeout_ms, + retries, gfp_mask, ib_sa_mcmember_rec_callback, + query, &query->sa_query); if (ret < 0) - goto err2; + kfree(query); return ret; - -err2: - *sa_query = NULL; - ib_free_send_mad(query->sa_query.mad_buf); - -err1: - kfree(query); - return ret; } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); @@ -931,13 +995,13 @@ static void send_handler(struct ib_mad_a /* No callback -- already got recv */ break; case IB_WC_RESP_TIMEOUT_ERR: - query->callback(query, -ETIMEDOUT, NULL); + query->callback(-ETIMEDOUT, NULL, query->context); break; case IB_WC_WR_FLUSH_ERR: - query->callback(query, -EINTR, NULL); + query->callback(-EINTR, NULL, query->context); break; default: - query->callback(query, -EIO, NULL); + query->callback(-EIO, NULL, query->context); break; } @@ -947,7 +1011,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(mad_send_wc->send_buf); kref_put(&query->sm_ah->ref, free_sm_ah); - query->release(query); + kfree(query); } static void recv_handler(struct ib_mad_agent *mad_agent, @@ -959,17 +1023,11 @@ static void recv_handler(struct ib_mad_a mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; query = mad_buf->context[0]; - if (query->callback) { - if (mad_recv_wc->wc->status == IB_WC_SUCCESS) - query->callback(query, - mad_recv_wc->recv_buf.mad->mad_hdr.status ? - -EINVAL : 0, - (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); - else - query->callback(query, -EIO, NULL); - } - - ib_free_recv_mad(mad_recv_wc); + if (query->callback) + query->callback(mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, mad_recv_wc, query->context); + else + ib_free_recv_mad(mad_recv_wc); } static void ib_sa_add_one(struct ib_device *device) From sean.hefty at intel.com Tue Aug 1 15:22:24 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 1 Aug 2006 15:22:24 -0700 Subject: [openib-general] [RFC] [PATCH 2/2] local_sa: use new SA cursor routines to walk attributes in RMPP response In-Reply-To: <000001c6b5b8$42ab2480$67f9070a@amr.corp.intel.com> Message-ID: <000101c6b5b8$f6e99170$67f9070a@amr.corp.intel.com> Convert local SA to use the new SA cursor routines for walking a list of attributes in an RMPP response returned by the SA. This replaces a local SA specific implementation. 
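For orientation, a rough consumer sketch of how the cursor interface from patch 1/2 is meant to be driven. The ib_sa_create_cursor()/ib_sa_get_next_attr()/ib_sa_free_cursor() names come from that patch; the surrounding function and the printk are purely illustrative and not part of either patch:

#include <linux/err.h>
#include <linux/kernel.h>
#include <rdma/ib_mad.h>
#include <rdma/ib_sa.h>

/* Hypothetical consumer: walk every path record in an RMPP SA response. */
static void walk_path_records(struct ib_mad_recv_wc *mad_recv_wc)
{
	struct ib_sa_attr_cursor *cursor;
	struct ib_sa_path_rec rec;

	cursor = ib_sa_create_cursor(mad_recv_wc);
	if (IS_ERR(cursor))
		return;

	/* Each call copies out and unpacks one attribute into rec;
	 * a return of zero means the response has been exhausted. */
	while (ib_sa_get_next_attr(cursor, &rec))
		printk(KERN_DEBUG "path record: dlid 0x%x\n",
		       be16_to_cpu(rec.dlid));

	ib_sa_free_cursor(cursor);
}

The cursor only wraps the received MAD; the ib_mad_recv_wc itself still has to be released with ib_free_recv_mad() by whoever owns it, as the ib_sa_send_mad() documentation above spells out.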
Signed-off-by: Sean Hefty --- Index: core/local_sa.c =================================================================== --- core/local_sa.c (revision 8647) +++ core/local_sa.c (working copy) @@ -194,60 +194,34 @@ static int insert_attr(struct index_root static void update_path_rec(struct sa_db_port *port, struct ib_mad_recv_wc *mad_recv_wc) { - struct ib_mad_recv_buf *recv_buf; - struct ib_sa_mad *mad = (void *) mad_recv_wc->recv_buf.mad; + struct ib_sa_attr_cursor *cursor; struct ib_path_rec_info *path_info; - struct ib_path_rec ib_path, *path = NULL; - int i, attr_size, left, offset = 0; - attr_size = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; - if (attr_size < sizeof ib_path) + cursor = ib_sa_create_cursor(mad_recv_wc); + if (IS_ERR(cursor)) return; + path_info = kmalloc(sizeof *path_info, GFP_KERNEL); + if (!path_info) + goto free_cursor; + down_write(&lock); port->update++; - list_for_each_entry(recv_buf, &mad_recv_wc->rmpp_list, list) { - for (i = 0; i < IB_MGMT_SA_DATA;) { - mad = (struct ib_sa_mad *) recv_buf->mad; - - left = IB_MGMT_SA_DATA - i; - if (left < sizeof ib_path) { - /* copy first piece of the attribute */ - memcpy(&ib_path, &mad->data[i], left); - path = &ib_path; - offset = left; - break; - } else if (offset) { - /* copy the second piece of the attribute */ - memcpy((void*) path + offset, &mad->data[i], - sizeof ib_path - offset); - i += attr_size - offset; - offset = 0; - } else { - path = (void *) &mad->data[i]; - i += attr_size; - } - - if (!path->slid) - goto unlock; - - path_info = kmalloc(sizeof *path_info, GFP_KERNEL); - if (!path_info) - goto unlock; - - ib_sa_unpack_attr(&path_info->rec, path, - IB_SA_ATTR_PATH_REC); - - if (insert_attr(&port->index, port->update, - path_info->rec.dgid.raw, - &path_info->cursor)) { - kfree(path_info); - goto unlock; - } - } + while (ib_sa_get_next_attr(cursor, &path_info->rec)) { + if (insert_attr(&port->index, port->update, + path_info->rec.dgid.raw, + &path_info->cursor)) + break; + + path_info = kmalloc(sizeof *path_info, GFP_KERNEL); + if (!path_info) + break; } -unlock: + up_write(&lock); + kfree(path_info); +free_cursor: + ib_sa_free_cursor(cursor); } static void recv_handler(struct ib_mad_agent *mad_agent, From sean.hefty at intel.com Tue Aug 1 15:24:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 1 Aug 2006 15:24:23 -0700 Subject: [openib-general] rdma cm process hang In-Reply-To: <20060801213416.GA18941@osc.edu> Message-ID: <000201c6b5b9$3de03980$67f9070a@amr.corp.intel.com> >Using the iwarp branch of r8688, with linux-2.6.17.7 on up to date >x86_64 FC4 SMP with Ammasso cards, I can hang the client side during >RDMA CM connection setup. Do you know or have any way to check if this hang also occurs with IB devices? - Sean From rdreier at cisco.com Tue Aug 1 16:37:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 01 Aug 2006 16:37:55 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060801082752.GR9411@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 1 Aug 2006 11:27:53 +0300") References: <20060801082752.GR9411@mellanox.co.il> Message-ID: > Hmm. Since we are lockless, could ipoib_start_xmit run even after we call > netif_stop_queue? Since interrupts are disabled anyway, can we just just take > tx_lock? How does the following look? ipoib_start_xmit() can run after netif_stop_queue() but it will return immediately (before touching the skb) if the queue is stopped. But this fix is probably cleaner anyway. - R. 
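As an illustration of the ordering being discussed, here is one way a driver's hard_start_xmit can bail out before touching the skb once the queue has been stopped. This is a purely hypothetical sketch -- the priv layout and all names are invented, and it is not the actual ipoib_start_xmit():

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>

/* Invented private data for the sketch -- not ipoib_dev_priv. */
struct example_priv {
	spinlock_t tx_lock;
};

static int example_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct example_priv *priv = netdev_priv(dev);
	unsigned long flags;

	spin_lock_irqsave(&priv->tx_lock, flags);
	if (unlikely(netif_queue_stopped(dev))) {
		/* Queue stopped by a down/flush path: return before the
		 * skb or any neighbour state is dereferenced. */
		spin_unlock_irqrestore(&priv->tx_lock, flags);
		return NETDEV_TX_BUSY;
	}
	/* ... look up the destination and post the send under tx_lock ... */
	spin_unlock_irqrestore(&priv->tx_lock, flags);
	return NETDEV_TX_OK;
}

Taking tx_lock in the flush path, as proposed above, closes the same window even for a transmit that slipped in before the queue was stopped.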
From rdreier at cisco.com Tue Aug 1 16:39:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 01 Aug 2006 16:39:33 -0700 Subject: [openib-general] hotplug support in mthca In-Reply-To: <20060801142242.GD9411@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 1 Aug 2006 17:22:42 +0300") References: <20060801142242.GD9411@mellanox.co.il> Message-ID: Michael> How about fixing it by blocking remove_one in uverbs Michael> until all contexts are closed and device refcount drops Michael> to 0? That seems kind of horrible, because there's no guarantee that all the contexts will ever be freed. I think the real answer is to revoke all the contexts that userspace has. But that doesn't seem trivial to do, which is why I haven't implemented it yet. - R. From rdreier at cisco.com Tue Aug 1 16:50:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 01 Aug 2006 16:50:18 -0700 Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support In-Reply-To: (Roland Dreier's message of "Tue, 01 Aug 2006 07:21:15 -0700") References: <20060801131756.GF4681@minantech.com> Message-ID: I guess this is pretty uncontroversial, so I'll check this in tomorrow. But one last call for comments: in particular, does anyone object to libibverbs being fork-unsafe by default unless ibv_fork_init is called? > +/** > + * ibv_fork_init - Prepare data structures so that fork() may be used > + * safely. If this function is not called or returns a non-zero > + * status, then libibverbs data structures are not fork()-safe and the > + * effect of an application calling fork() is undefined. > + */ > +int ibv_fork_init(void); - R. From ralphc at pathscale.com Tue Aug 1 17:21:36 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 01 Aug 2006 17:21:36 -0700 Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support In-Reply-To: References: <20060801131756.GF4681@minantech.com> Message-ID: <1154478096.20325.200.camel@brick.pathscale.com> This is the sort of thing that needs to be clearly documented for the library. We currently don't have a useable set of documents (some collection of reading the code and the IB spec.). On Tue, 2006-08-01 at 16:50 -0700, Roland Dreier wrote: > I guess this is pretty uncontroversial, so I'll check this in > tomorrow. But one last call for comments: in particular, does anyone > object to libibverbs being fork-unsafe by default unless ibv_fork_init > is called? > > > +/** > > + * ibv_fork_init - Prepare data structures so that fork() may be used > > + * safely. If this function is not called or returns a non-zero > > + * status, then libibverbs data structures are not fork()-safe and the > > + * effect of an application calling fork() is undefined. > > + */ > > +int ibv_fork_init(void); > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Tue Aug 1 17:22:02 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 1 Aug 2006 17:22:02 -0700 Subject: [openib-general] [RFC] [PATCH 1/2] sa_query: add generic query interfaces capable of supporting RMPP In-Reply-To: Message-ID: <000301c6b5c9$adb349e0$67f9070a@amr.corp.intel.com> >I think I would rather see this called an "iterator". The word "cursor" >tends to mean that blinky thing on your screen these days. That's easy enough to change. 
> > +int ib_sa_get_next_attr(struct ib_sa_attr_cursor *cursor, void *attr); > >How does the consumer know how big the buffer has to be? My assumption was that the user would know based on the context of the query, or could use the attribute id field in the received MAD as a guide. I was having attr reference an unpacked struct ib_sa_path_rec, ib_sa_mcmember_rec, ib_sa_service_rec, etc. I will convert these to return a pointer to the packed structure instead. > ib_sa_iter_next -- bump the iterator > ib_sa_iter_last -- return true if the iter has been bumped past the end > ib_sa_iter_attr -- return pointer to attr that iter is pointing at > ib_sa_iter_size -- return size of attr that iter points at Note that an SA attribute can span across two different MADs, so some sort of intermediate buffer is needed in certain cases. I'll update the patch to use the first three calls. I'm not as sure about ib_sa_iter_size. The packed attribute size can be determined from the MAD attr_offset field, but I'm guessing that most users will need to unpack the attribute before use. - Sean From rdreier at cisco.com Tue Aug 1 17:29:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 01 Aug 2006 17:29:44 -0700 Subject: [openib-general] [RFC] [PATCH 1/2] sa_query: add generic query interfaces capable of supporting RMPP In-Reply-To: <000301c6b5c9$adb349e0$67f9070a@amr.corp.intel.com> (Sean Hefty's message of "Tue, 1 Aug 2006 17:22:02 -0700") References: <000301c6b5c9$adb349e0$67f9070a@amr.corp.intel.com> Message-ID: Sean> Note that an SA attribute can span across two different Sean> MADs, so some sort of intermediate buffer is needed in Sean> certain cases. Umm... good point. I guess the function to read from the iterator does need to do the copying out. I wonder if it's worth trying to get really tricky and do the unpacking in the same step? - R. From mst at mellanox.co.il Tue Aug 1 22:07:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 08:07:50 +0300 Subject: [openib-general] hotplug support in mthca In-Reply-To: References: Message-ID: <20060802050750.GE9411@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: hotplug support in mthca > > Michael> How about fixing it by blocking remove_one in uverbs > Michael> until all contexts are closed and device refcount drops > Michael> to 0? > > That seems kind of horrible, because there's no guarantee that all the > contexts will ever be freed. Hmm. Maybe that's an inherent limitation of user-space drivers? Isn't this what happens for example if a sysfs file is open? How about reporting an event to the application? Would that be sufficient? > I think the real answer is to revoke all the contexts that userspace > has. But that doesn't seem trivial to do, which is why I haven't > implemented it yet. Right, this revoking doesn't sound like 2.6.18 material. Isn't just blocking hotplug still better than letting bad things happen? -- MST From ogerlitz at voltaire.com Tue Aug 1 22:41:05 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 02 Aug 2006 08:41:05 +0300 Subject: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> Message-ID: <44D03AF1.8080300@voltaire.com> Sean Hefty wrote: > Just to make sure, here's my summary of the discussions so far. Anyone who > disagrees can jump in. > There was discussion on this being 2.6.18 material. 
The consensus is to let it > get wider testing in svn first. The fix / feature isn't critical enough to push > upstream immediately, so waiting until 2.6.19 seems reasonable. > There was also discussion on whether the CM should track local QPNs on both the > active and passive sides, or just the passive side. To properly track timewait > and eventually handle path migration, tracking on both sides is needed, which is > what the patch does. > Or asked about removing the ib_cm_establish() call from the API. It was left as > part of the API to avoid breaking the ABI, and still allow users to force > connection establishment in case they poll a receive WC before the COMM_EST > event occurs. > There was also some side discussion about the COMM_EST event in general and > trying to respond to request messages while the QP is still in the RTR state. A > discussion on possible changes to the spec spawned off from this... Sean, As all the points you mention here were agreed: + not 2.6.19 material to allow more testing + tracking local QPNs on both sides to allow for later usage in APM + leave ib_cm_establish in place And the only open issue over which there is not yet a consensus is the **side** discussion, I suggest you move forward with committing the patch and once the side discussion is done the next step (which i think is NO-OP per the IB stack, that is let the ULP handle this) would be implemented. Or. From ogerlitz at voltaire.com Tue Aug 1 22:45:34 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 02 Aug 2006 08:45:34 +0300 Subject: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <44D03AF1.8080300@voltaire.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> Message-ID: <44D03BFE.2010402@voltaire.com> Or Gerlitz wrote: > + not 2.6.19 material to allow more testing I meant to say, it not 2.6.18 but rather 2.6.19 material to allow for more testing. Or. From eeb at bartonsoftware.com Tue Aug 1 23:04:47 2006 From: eeb at bartonsoftware.com (Eric Barton) Date: Wed, 2 Aug 2006 07:04:47 +0100 (BST) Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D03AF1.8080300@voltaire.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> Message-ID: <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> I've had a report of rdma_connect() failing with a callback event type of RDMA_CM_EVENT_UNREACHABLE and status -ETIMEDOUT although the peer node was up and running at the time. It seems this can be reproduced as follows... 1. Establish a connection between nodes A and B 2. Reboot node A 3. Start establishing a new connection from node A to node B 4. After a timeout, the CM callback occurs as described. Could this happen with a buggy SM? Are there some good places in the OpenFabrics stack to add printks to help point the finger (or can some existing debug/trace code be enabled)? -- Cheers, Eric From sean.hefty at intel.com Tue Aug 1 23:13:12 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 1 Aug 2006 23:13:12 -0700 Subject: [openib-general] [RFC] [PATCH 1/2] sa_query: add generic query interfaces capable of supporting RMPP In-Reply-To: Message-ID: <000001c6b5fa$bc789b70$f1d8180a@amr.corp.intel.com> >I guess the function to read from the iterator does need to do the >copying out. I wonder if it's worth trying to get really tricky and >do the unpacking in the same step? 
Internally, I copy a split attribute into an intermediate buffer that is allocated as part of the cursor / iterator, then unpack into the final buffer. Ib_sa_iter_attr would simply return a pointer to the intermediate buffer, but this does prevent being really clever and unpacking in the same step. It's challenging enough for the code to handle different sized attributes arbitrarily split across two MADs. I can't imagine trying to unpack at the same time actually being faster than using an intermediate buffer. - Sean From ogerlitz at voltaire.com Tue Aug 1 23:44:05 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 02 Aug 2006 09:44:05 +0300 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> Message-ID: <44D049B5.6000505@voltaire.com> Eric Barton wrote: > I've had a report of rdma_connect() failing with a callback event type of > RDMA_CM_EVENT_UNREACHABLE and status -ETIMEDOUT although the peer node was > up and running at the time. > > It seems this can be reproduced as follows... > > 1. Establish a connection between nodes A and B > > 2. Reboot node A > > 3. Start establishing a new connection from node A to node B > > 4. After a timeout, the CM callback occurs as described. > > Could this happen with a buggy SM? Are there some good places in the > OpenFabrics stack to add printks to help point the finger (or can some > existing debug/trace code be enabled)? Eric, My guess this is related to the CM not the SM. I think there is a chance that the CM on node B does not treat the REQ sent by A after the reboot as "stale connection" situation and hence just **silently** dtop it, that is not REJ is sent. Adding prints in the if/else below within core/cm.c :: cm_match_req() would help you to figure out if the direction i suggest indeed is the one for you to hunt. I am not familiar enough with the generation of the CM IDs, but my basic thinking is that generating them randomly should solve it. In case the IDs are started from some seed value and then each new generated id is the current value plus one, having the initial seed being equal to jiffies instead of to some constant, should be fine. Or. if (timewait_info) { cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, timewait_info->work.remote_id); spin_unlock_irqrestore(&cm.lock, flags); if (cur_cm_id_priv) { cm_dup_req_handler(work, cur_cm_id_priv); cm_deref_id(cur_cm_id_priv); } else cm_issue_rej(work->port, work->mad_recv_wc, IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, NULL, 0); goto error; } From jackm at mellanox.co.il Wed Aug 2 00:43:16 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 2 Aug 2006 10:43:16 +0300 Subject: [openib-general] APM: QP migration state change when failover triggered by hw In-Reply-To: <000201c6b59c$17ef1d30$29cc180a@amr.corp.intel.com> References: <000201c6b59c$17ef1d30$29cc180a@amr.corp.intel.com> Message-ID: <200608021043.17017.jackm@mellanox.co.il> On Tuesday 01 August 2006 21:55, Sean Hefty wrote: > > I am testing APM with kernel module which directly interfaces with > >ib_verbs.ko and ib_cm.ko. > >Yes, I do receive IB_MIG_MIGRATED event, but the QP's mig_state is not > >actually changed to MIGRATED. So I had to do this from my module. 
> > > There is a pending patch that was recently posted (dispatch communication > establish event) that can be extended to pass path migration events to the > ib_cm. The purpose of passing path migration events to the ib_cm would be > limited to changing the path that future CM messages, and not related to QP > transitions. > > - Sean This could be a bit complicated. For example, say there are two possible paths. After migration has occurred the first time, there is no guarantee that the original path has become available again. There is also a race condition here in your proposal -- the new Alt Path data must be specified between the MIGRATED event and the communication-established event on the migrated path (so that the LAP message may be correctly sent to the remote node). Babu, regarding the migration event that you are seeing, are you sure that it is from the migration transition that does not occur? Possibly, the problematic transition is the second one, which occurs after specifying a new alternate path and rearming APM? It seems more likely to me that the first transition does occur, since you receive a MIG event on both sides, and since the alt path data is loaded by you during the initial bringup of the RC QP pair(either at init->rtr, or at rtr->rts). If you are receiving the MIGRATED event, the qp is already in the migrated state. However, after the first migration occurs, you need to do the following: 1. send a LAP packet to the remote node, containing the new alt path info. 2. load NEW alt path information (ib_modify_qp, rts->rts), including remote LID received in LAP packet. 3. Rearm path migration (ib_modify_qp, rts->rts) Are you certain that the above 3 steps have taken place? Note that 1. and 2. above are a separate phase from 3., since the IB Spec allows changing the alternate path while the QP is still armed, not just when it has migrated. - Jack From mst at mellanox.co.il Wed Aug 2 02:15:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 12:15:55 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060802091555.GM9411@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > > 1. pkey cache issues > > http://thread.gmane.org/gmane.linux.drivers.openib/26684/focus=26692 > > I thought we fixed the P_Key cache issues by correcting the oversight > in retrying the P_Key query? Hmm. Re-reading that thread: > > ipoib_main calls ipoib_pkey_dev_delay_open() before it tries > > ipoib_ib_dev_up(). So it should be OK if the P_Key isn't assigned > > yet. > > But ipoib_ib_dev_flush doesn't? And it still doesn't. Jack here also confirms that the problem still exits if SM clears P_Key table and then later readds a P_Key. Could you take a look please? -- MST From info at imagenarte.net Tue Aug 1 22:09:51 2006 From: info at imagenarte.net (IART) Date: Wed, 2 Aug 2006 05:09:51 +0000 Subject: [openib-general] Pregunte como mejorar su imagen. Message-ID: <200608020509.HSMQHBQXNE@Dynamic-IP-697918666.cable.net.co> Documento sin título
Advertising Design

The old idea of communication, which held that there was only a sender, a message, and a receiver, has been left behind. New technologies now give access to the logic of interactivity, in which we are no longer mere spectators of an event but also its creators.

Multimedia opens up channels of input and feedback for good communication, because with the dizzying pace of human activity the old communication models are becoming ever less effective. The mix of audio, video, animation, navigation, and visual impact makes multimedia a new element with countless qualities that can aid learning, intellectual development, and entertainment.

Newsletter

I-magen.net launches its new website

August 1: launch of the new i-magen.net website. The new site will focus on online service for its clients, will offer new products and services for every need, and with a logical, novel design it aims to become one of the most useful sites for companies that need advertising and graphic services.

I-magen.net: new name, new image

iart is the new name under which we intend to position ourselves in the market. With the launch of the website, an electronic brand-positioning campaign will begin; with a new name and a new image, i-magen.net aims for greater recall and specific differentiators within the industry, achieving deeper market penetration.

Telephone: 526-0867 Cell: 316 357-8997 e-mail: info at imagenarte.net

© 2006 i-magen.net (iart) All rights reserved

-------------- next part -------------- An HTML attachment was scrubbed... URL: From monil at voltaire.com Wed Aug 2 04:30:32 2006 From: monil at voltaire.com (Moni Levy) Date: Wed, 2 Aug 2006 14:30:32 +0300 Subject: [openib-general] Multicast traffic performace of OFED 1.0 ipoib Message-ID: <6a122cc00608020430k4a520d10xfebb256829feb752@mail.gmail.com> Hi, we are doing some performance testing of multicast traffic over ipoib. The tests are performed by using iperf on dual 1.6G AMD PCI-X servers with PCI-X Tavor cards with 3.4.FW. Below are the command the may be used to run the test. Iperf server: route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0 /home/qa/testing-tools/iperf-2.0.2/iperf -us -B 224.4.4.4 -i 1 Iperf client: route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0 /home/qa/testing-tools/iperf-2.0.2/iperf -uc 224.4.4.4 -i 1 -b 100M -t 400 -l 100 We are looking for the max PPT rate (100 byte packets size) without losses, by changing the BW parameter and looking at the point where we get no losses reported. The best results we received were around 50k PPS. I remember that we got some 120k-140k packets of the same size running without losses. We are going to look into it and try to see where is the time spent, but any ideas are welcome. Best regards, Moni From mst at mellanox.co.il Wed Aug 2 04:31:26 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 14:31:26 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060802113126.GA15340@mellanox.co.il> Quoting r. Roland Dreier : > > 4. module unloading races > > http://openib.org/pipermail/openib-general/2006-April/020397.html OK, I finally got around to working on this. Please review the following. -- Require registration with SA module, to prevent module text from going away while sa query callback is still running, and update all users. Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d6f99d5..bf668b3 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -60,6 +60,10 @@ static struct ib_client cma_client = { .remove = cma_remove_one }; +static struct ib_sa_client cma_sa_client = { + .name = "cma" +}; + static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); @@ -1140,7 +1144,7 @@ static int cma_query_ib_route(struct rdm path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; - id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->query_id = ib_sa_path_rec_get(&cma_sa_client, id_priv->id.device, id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, @@ -1910,6 +1914,8 @@ static int cma_init(void) ret = ib_register_client(&cma_client); if (ret) goto err; + + ib_sa_register_client(&cma_sa_client); return 0; err: @@ -1919,6 +1925,7 @@ err: static void cma_cleanup(void) { + ib_sa_unregister_client(&cma_sa_client); ib_unregister_client(&cma_client); destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index aeda484..a5ecb6a 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -78,6 +78,7 @@ struct ib_sa_query { struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; + struct ib_sa_client *client; int id; }; @@ -539,6 +540,7 @@ static void ib_sa_path_rec_callback(stru struct ib_sa_path_query *query = container_of(sa_query, struct ib_sa_path_query, sa_query); + down_read(&sa_query->client->sem); if (mad) { struct ib_sa_path_rec rec; @@ -547,6 +549,7 @@ static void ib_sa_path_rec_callback(stru query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + up_read(&sa_query->client->sem); } static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) @@ -556,6 +559,7 @@ static void ib_sa_path_rec_release(struc /** * ib_sa_path_rec_get - Start a Path get query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:Path Record to send in query @@ -578,7 +582,8 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. */ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -619,6 +624,7 @@ int ib_sa_path_rec_get(struct ib_device mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; query->sa_query.port = port; @@ -653,6 +659,7 @@ static void ib_sa_service_rec_callback(s struct ib_sa_service_query *query = container_of(sa_query, struct ib_sa_service_query, sa_query); + down_read(&sa_query->client->sem); if (mad) { struct ib_sa_service_rec rec; @@ -661,6 +668,7 @@ static void ib_sa_service_rec_callback(s query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + up_read(&sa_query->client->sem); } static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) @@ -670,6 +678,7 @@ static void ib_sa_service_rec_release(st /** * ib_sa_service_rec_query - Start Service Record operation + * @client:client object used to track the query * @device:device to send request on * @port_num: port number to send request on * @method:SA method - should be get, set, or delete @@ -694,7 +703,8 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. */ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -740,6 +750,7 @@ int ib_sa_service_rec_query(struct ib_de mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; query->sa_query.port = port; @@ -775,6 +786,7 @@ static void ib_sa_mcmember_rec_callback( struct ib_sa_mcmember_query *query = container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + down_read(&sa_query->client->sem); if (mad) { struct ib_sa_mcmember_rec rec; @@ -783,6 +795,7 @@ static void ib_sa_mcmember_rec_callback( query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + up_read(&sa_query->client->sem); } static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) @@ -790,7 +803,8 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -832,6 +846,7 @@ int ib_sa_mcmember_rec_query(struct ib_d mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; query->sa_query.port = port; @@ -866,6 +881,7 @@ static void send_handler(struct ib_mad_a struct ib_sa_query *query = mad_send_wc->send_buf->context[0]; unsigned long flags; + down_read(&query->client->sem); if (query->callback) switch (mad_send_wc->status) { case IB_WC_SUCCESS: @@ -881,6 +897,7 @@ static void send_handler(struct ib_mad_a query->callback(query, -EIO, NULL); break; } + up_read(&query->client->sem); spin_lock_irqsave(&idr_lock, flags); idr_remove(&query_idr, query->id); @@ -900,6 +917,7 @@ static void recv_handler(struct ib_mad_a mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; query = mad_buf->context[0]; + down_read(&query->client->sem); if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, @@ -909,6 +927,7 @@ static void recv_handler(struct ib_mad_a else query->callback(query, -EIO, NULL); } + up_read(&query->client->sem); ib_free_recv_mad(mad_recv_wc); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 474aa21..28a9f0f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -390,4 +390,5 @@ #define IPOIB_GID_RAW_ARG(gid) ((u8 *)(g #define IPOIB_GID_ARG(gid) IPOIB_GID_RAW_ARG((gid).raw) +extern struct ib_sa_client ipoib_sa_client; #endif /* _IPOIB_H */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cf71d2a..ca10724 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -91,6 +91,10 @@ static struct ib_client ipoib_client = { .remove = ipoib_remove_one }; +struct ib_sa_client ipoib_sa_client = { + .name = "ipoib" +}; + int ipoib_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -459,7 +463,7 @@ static int path_rec_start(struct net_dev init_completion(&path->done); path->query_id = - ib_sa_path_rec_get(priv->ca, priv->port, + ib_sa_path_rec_get(&ipoib_sa_client, priv->ca, priv->port, &path->pathrec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | @@ -1185,6 +1189,8 @@ static int __init ipoib_init_module(void if (ret) goto err_wq; + ib_sa_register_client(&ipoib_sa_client); + return 0; err_wq: @@ -1198,6 +1204,7 @@ err_fs: static void __exit ipoib_cleanup_module(void) { + ib_sa_unregister_client(&ipoib_sa_client); ib_unregister_client(&ipoib_client); ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index b5e6a7b..f688323 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -360,7 +360,7 @@ #endif init_completion(&mcast->done); - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | @@ -484,8 +484,8 @@ static void ipoib_mcast_join(struct net_ init_completion(&mcast->done); - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, - mcast->backoff * 1000, GFP_ATOMIC, + ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, + comp_mask, mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_join_complete, mcast, &mcast->query); @@ -680,7 +680,7 @@ static int ipoib_mcast_leave(struct net_ * Just make one shot at leaving and don't wait for a reply; * if we 
fail, too bad. */ - ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + ret = ib_sa_mcmember_rec_delete(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..0856d78 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -88,6 +88,10 @@ static struct ib_client srp_client = { .remove = srp_remove_one }; +static struct ib_sa_client srp_sa_client = { + .name = "srp" +}; + static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) { return (struct srp_target_port *) host->hostdata; @@ -259,7 +263,8 @@ static int srp_lookup_path(struct srp_ta init_completion(&target->done); - target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev->dev, + target->path_query_id = ib_sa_path_rec_get(&srp_sa_client, + target->srp_host->dev->dev, target->srp_host->port, &target->path, IB_SA_PATH_REC_DGID | diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index c99e442..7bf0449 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -37,6 +37,7 @@ #ifndef IB_SA_H #define IB_SA_H #include +#include #include #include @@ -250,11 +251,17 @@ struct ib_sa_service_rec { u64 data64[2]; }; +struct ib_sa_client { + char *name; + struct rw_semaphore sem; +}; + struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -264,7 +271,8 @@ int ib_sa_path_rec_get(struct ib_device void *context, struct ib_sa_query **query); -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -275,7 +283,8 @@ int ib_sa_mcmember_rec_query(struct ib_d void *context, struct ib_sa_query **query); -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, @@ -288,6 +297,7 @@ int ib_sa_service_rec_query(struct ib_de /** * ib_sa_mcmember_rec_set - Start an MCMember set query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -311,7 +321,8 @@ int ib_sa_service_rec_query(struct ib_de * cancel the query. 
*/ static inline int -ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_set(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -321,7 +332,7 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, timeout_ms, gfp_mask, callback, @@ -330,6 +341,7 @@ ib_sa_mcmember_rec_set(struct ib_device /** * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -353,7 +365,8 @@ ib_sa_mcmember_rec_set(struct ib_device * cancel the query. */ static inline int -ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_delete(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -363,7 +376,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, timeout_ms, gfp_mask, callback, @@ -378,4 +391,22 @@ int ib_init_ah_from_path(struct ib_devic struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); +/** + * ib_sa_register_client - register SA client object + * @client:client object used to track queries + */ +static inline void ib_sa_register_client(struct ib_sa_client *client) +{ + init_rwsem(&client->sem); +} + +/** + * ib_sa_unregister_client - unregister SA client object + * @client:client object used to track queries + */ +static inline void ib_sa_unregister_client(struct ib_sa_client *client) +{ + down_write(&client->sem); +} + #endif /* IB_SA_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h -- MST From glebn at voltaire.com Wed Aug 2 04:42:43 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Wed, 2 Aug 2006 14:42:43 +0300 Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support In-Reply-To: References: <20060801131756.GF4681@minantech.com> Message-ID: <20060802114242.GL4681@minantech.com> On Tue, Aug 01, 2006 at 04:50:18PM -0700, Roland Dreier wrote: > I guess this is pretty uncontroversial, so I'll check this in > tomorrow. Before you do it check libmthca/src/buf.c file please. It look funny at least in patch you've sent (as if somebody did cat buf.c >> buf.c) > But one last call for comments: in particular, does anyone > object to libibverbs being fork-unsafe by default unless ibv_fork_init > is called? > I am not sure about this one. The library like MPI will have to always call this anyway. On the other side if some library calls fork() without application knowing it? Suddenly programmer should care about such details. Perhaps opt out is better the opt in and libibverbs should skip ibv_fork_init() only if application ask this explicitly? > > +/** > > + * ibv_fork_init - Prepare data structures so that fork() may be used > > + * safely. If this function is not called or returns a non-zero > > + * status, then libibverbs data structures are not fork()-safe and the > > + * effect of an application calling fork() is undefined. 
> > + */ > > +int ibv_fork_init(void); > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Gleb. From ogerlitz at voltaire.com Wed Aug 2 05:32:19 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 2 Aug 2006 15:32:19 +0300 (IDT) Subject: [openib-general] some issues related to when/why IPoIB calls netif_carrier_on() etc Message-ID: Roland, I'd like to verify with you an issue related to the logic applied by ipoib about when/why it decided to join the multicast group used for IPv4 broadcast (eg ARP) and when/why its sets the carrier bit. Probing ipoib and ifconfing ib0 to be UP, i see that ib0 is RUNNING, has an IPv6 address and is joined to the IPv4 broascast group (ie to MGID ff12:401b:ffff:0:0:0:ffff:ffff), see the exact sequence of operations below. I wonder if you can shed some light on: 1) what is the exact reason that ib0 is running here, is it as of this "magic" configuration of the IPv6 addr that caused it to join to the IPv4 and IPv6 broascast groups? 2) is it well defined what conditions should hold s.t IPoIB will be RUNNING 3) just to make sure: RUNNING <--> ipoib called netif_carrier_on(), correct? i see that latter is called by ipoib_mcast_join_task(), is it when "joining everything we want to join to" holds or you can somehow refine the predicate? thanks, Or. $ modprobe ib_ipoib $ ifconfig ib0 up $ ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-04-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet6 addr: fe80::208:f104:396:51dd/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:35 errors:0 dropped:0 overruns:0 frame:0 TX packets:6 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:2760 (2.6 Kb) TX bytes:456 (456.0 b) $ ip addr show ib0 10: ib0: mtu 2044 qdisc pfifo_fast qlen 128 link/[32] 00:04:04:04:fe:80:00:00:00:00:00:00:00:08:f1:04:03:96:51:dd brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff inet6 fe80::208:f104:396:51dd/64 scope link valid_lft forever preferred_lft forever $ cat /sys/kernel/debug/ipoib/ib0_mcg GID: ff12:401b:ffff:0:0:0:ffff:ffff created: 4294982836 queuelen: 0 complete: yes send_only: no GID: ff12:601b:ffff:0:0:0:0:1 created: 4294982838 queuelen: 0 complete: yes send_only: no GID: ff12:601b:ffff:0:0:0:0:2 created: 4294983192 queuelen: 0 complete: yes send_only: yes GID: ff12:601b:ffff:0:0:0:0:16 created: 4294982840 queuelen: 0 complete: yes send_only: yes GID: ff12:601b:ffff:0:0:1:ff96:51dd created: 4294982838 queuelen: 0 complete: yes send_only: no $ dmesg ib0: bringing up interface ib0: starting multicast thread ADDRCONF(NETDEV_UP): ib0: link is not ready ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff ib0: restarting multicast task ib0: stopping multicast thread ib0: waiting for MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status -4) ib0: starting multicast thread ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0) ib0: Created ah ffff810034630380 ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV ffff810034630380, LID 0xc000, SL 0 ib0: successfully joined all multicast groups ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready ib0: restarting multicast task ib0: stopping multicast thread ib0: waiting for MGID 
ff12:401b:ffff:0000:0000:0000:ffff:ffff ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0001:ff96:51dd ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0000:0000:0001 ib0: starting multicast thread ib0: joining MGID ff12:601b:ffff:0000:0000:0001:ff96:51dd ib0: join completion for ff12:601b:ffff:0000:0000:0001:ff96:51dd (status 0) ib0: Created ah ffff810034630d00 ib0: MGID ff12:601b:ffff:0000:0000:0001:ff96:51dd AV ffff810034630d00, LID 0xc006, SL 0 ib0: joining MGID ff12:601b:ffff:0000:0000:0000:0000:0001 ib0: join completion for ff12:601b:ffff:0000:0000:0000:0000:0001 (status 0) ib0: Created ah ffff8100346305c0 ib0: MGID ff12:601b:ffff:0000:0000:0000:0000:0001 AV ffff8100346305c0, LID 0xc003, SL 0 ib0: successfully joined all multicast groups ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0000:0000:0016 ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting join ib0: Created ah ffff81003d669400 ib0: MGID ff12:601b:ffff:0000:0000:0000:0000:0016 AV ffff81003d669400, LID 0xc004, SL 0 ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0000:0000:0002 ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting join ib0: Created ah ffff81002009f7c0 ib0: MGID ff12:601b:ffff:0000:0000:0000:0000:0002 AV ffff81002009f7c0, LID 0xc005, SL 0 ib0: no IPv6 routers present ib0: neigh_destructor for ffffff ff12:601b:ffff:0000:0000:0000:0000:0002 ib0: neigh_destructor for ffffff ff12:601b:ffff:0000:0000:0001:ff96:51dd From mst at mellanox.co.il Wed Aug 2 06:04:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 16:04:56 +0300 Subject: [openib-general] [PATCH] RFC: srp filesystem data corruption problem/work-around In-Reply-To: References: Message-ID: <20060802130456.GA15769@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] RFC: srp filesystem data corruption problem/work-around > > > This is an oversight. It will be fixed in the next major version of the > > LSI Logic Storage controller firmware. The OUI after the fix will be > > A0B8. > > OK great. So let's respin this patch so it defaults on but only > triggers for targets with the Mellanox OUI -- that will catch current > LSI targets but I think that's acceptable. Here it is. Looks good to me. Roland? -- Add work-around for data corruption observed with Mellanox targets when VA != 0. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..6a8b286 100644 Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-07-31 16:52:26.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-07-31 18:37:58.000000000 +0300 @@ -77,6 +77,14 @@ MODULE_PARM_DESC(topspin_workarounds, static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; +static int mellanox_workarounds = 1; + +module_param(mellanox_workarounds, int, 0444); +MODULE_PARM_DESC(mellanox_workarounds, + "Enable workarounds for Mellanox SRP target bugs if != 0"); + +static const u8 mellanox_oui[3] = { 0x02, 0xc9, 0x02 }; + static void srp_add_one(struct ib_device *device); static void srp_remove_one(struct ib_device *device); static void srp_completion(struct ib_cq *cq, void *target_ptr); @@ -569,7 +577,7 @@ err: return ret; } -static int srp_map_fmr(struct srp_device *dev, struct scatterlist *scat, +static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat, int sg_cnt, struct srp_request *req, struct srp_direct_buf *buf) { @@ -579,10 +587,15 @@ static int srp_map_fmr(struct srp_device int page_cnt; int i, j; int ret; + struct srp_device *dev = target->srp_host->dev; if (!dev->fmr_pool) return -ENODEV; + if ((sg_dma_address(&scat[0]) & ~dev->fmr_page_mask) && + mellanox_workarounds && !memcmp(&target->ioc_guid, mellanox_oui, 3)) + return -EINVAL; + len = page_cnt = 0; for (i = 0; i < sg_cnt; ++i) { if (sg_dma_address(&scat[i]) & ~dev->fmr_page_mask) { @@ -685,7 +699,7 @@ static int srp_map_data(struct scsi_cmnd buf->va = cpu_to_be64(sg_dma_address(scat)); buf->key = cpu_to_be32(target->srp_host->dev->mr->rkey); buf->len = cpu_to_be32(sg_dma_len(scat)); - } else if (srp_map_fmr(target->srp_host->dev, scat, count, req, + } else if (srp_map_fmr(target, scat, count, req, (void *) cmd->add_data)) { /* * FMR mapping failed, and the scatterlist has more -- MST From mst at mellanox.co.il Wed Aug 2 06:13:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 16:13:31 +0300 Subject: [openib-general] hotplug support in mthca In-Reply-To: <20060802050750.GE9411@mellanox.co.il> References: <20060802050750.GE9411@mellanox.co.il> Message-ID: <20060802131331.GB15769@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: hotplug support in mthca > > Quoting r. Roland Dreier : > > Subject: Re: hotplug support in mthca > > > > Michael> How about fixing it by blocking remove_one in uverbs > > Michael> until all contexts are closed and device refcount drops > > Michael> to 0? > > > > That seems kind of horrible, because there's no guarantee that all the > > contexts will ever be freed. > > Hmm. Maybe that's an inherent limitation of user-space drivers? > Isn't this what happens for example if a sysfs file is open? > > How about reporting an event to the application? Would that be sufficient? > > > I think the real answer is to revoke all the contexts that userspace > > has. But that doesn't seem trivial to do, which is why I haven't > > implemented it yet. > > Right, this revoking doesn't sound like 2.6.18 material. > Isn't just blocking hotplug still better than letting bad things happen? The following helps avoids crash after hotplug remove. I think this is at least better that what we have now, especially if we add to this reporting an event to the application. Roland, what do you think? 
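In outline, the patch below uses the standard kref-plus-completion teardown idiom: the release callback fired by the last kref_put() signals a completion instead of freeing, and the remover waits on that completion before calling kfree(). A minimal, self-contained sketch of the idiom with invented names (example_dev and friends, not the actual uverbs symbols):

#include <linux/kref.h>
#include <linux/completion.h>
#include <linux/slab.h>

struct example_dev {
	struct kref		ref;	/* kref_init() done at add_one time */
	struct completion	comp;	/* init_completion() done at add_one time */
};

/* Runs when the last reference is dropped: signal the waiter, don't free. */
static void example_release(struct kref *ref)
{
	struct example_dev *dev = container_of(ref, struct example_dev, ref);

	complete(&dev->comp);
}

static void example_remove_one(struct example_dev *dev)
{
	kref_put(&dev->ref, example_release);	/* drop the remover's own reference */
	wait_for_completion(&dev->comp);	/* block until every other user is gone */
	kfree(dev);				/* nothing can reach dev any more */
}
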
--- Avoid crash on hotplug remove event by waiting until all users have closed the device context. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: src/drivers/infiniband/core/uverbs.h =================================================================== --- src.orig/drivers/infiniband/core/uverbs.h 2006-08-02 11:14:12.477572000 +0300 +++ src/drivers/infiniband/core/uverbs.h 2006-08-02 12:15:43.950309000 +0300 @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -69,6 +70,7 @@ struct ib_uverbs_device { struct kref ref; + struct completion comp; int devnum; struct cdev *dev; struct class_device *class_dev; Index: src/drivers/infiniband/core/uverbs_main.c =================================================================== --- src.orig/drivers/infiniband/core/uverbs_main.c 2006-08-02 11:14:12.449574000 +0300 +++ src/drivers/infiniband/core/uverbs_main.c 2006-08-02 12:19:32.537924000 +0300 @@ -122,7 +122,7 @@ static void ib_uverbs_release_dev(struct struct ib_uverbs_device *dev = container_of(ref, struct ib_uverbs_device, ref); - kfree(dev); + complete(&dev->comp); } void ib_uverbs_release_ucq(struct ib_uverbs_file *file, @@ -740,6 +740,7 @@ static void ib_uverbs_add_one(struct ib_ return; kref_init(&uverbs_dev->ref); + init_completion(&uverbs_dev->comp); spin_lock(&map_lock); uverbs_dev->devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES); @@ -793,6 +794,8 @@ err_cdev: err: kref_put(&uverbs_dev->ref, ib_uverbs_release_dev); + wait_for_completion(&uverbs_dev->comp); + kfree(uverbs_dev); return; } @@ -812,7 +815,10 @@ static void ib_uverbs_remove_one(struct spin_unlock(&map_lock); clear_bit(uverbs_dev->devnum, dev_map); + kref_put(&uverbs_dev->ref, ib_uverbs_release_dev); + wait_for_completion(&uverbs_dev->comp); + kfree(uverbs_dev); } static struct super_block *uverbs_event_get_sb(struct file_system_type *fs_type, int flags, -- MST From tziporet at mellanox.co.il Wed Aug 2 07:48:40 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 02 Aug 2006 17:48:40 +0300 Subject: [openib-general] Multicast traffic performace of OFED 1.0 ipoib In-Reply-To: <6a122cc00608020430k4a520d10xfebb256829feb752@mail.gmail.com> References: <6a122cc00608020430k4a520d10xfebb256829feb752@mail.gmail.com> Message-ID: <44D0BB48.5030400@mellanox.co.il> Moni Levy wrote: > > We are going to look into it and try to see where is the time spent, > but any ideas are welcome. > > Best regards, > Moni We will try to investigate it and let you know. Tziporet From swise at opengridcomputing.com Wed Aug 2 07:53:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 Aug 2006 09:53:25 -0500 Subject: [openib-general] rdma cm process hang In-Reply-To: <20060801213416.GA18941@osc.edu> References: <20060801213416.GA18941@osc.edu> Message-ID: <1154530405.32560.6.camel@stevo-desktop> This smells like an amso or iwcm problem. On Tue, 2006-08-01 at 17:34 -0400, Pete Wyckoff wrote: > Using the iwarp branch of r8688, with linux-2.6.17.7 on up to date > x86_64 FC4 SMP with Ammasso cards, I can hang the client side during > RDMA CM connection setup. 
> > The scenario is: > > start server side process on some other node > start client process > have server die after RDMA_CM_EVENT_CONNECT_REQUEST arrives, > but before calling rdma_accept > hit ctrl-C on client > > The last bits of the console log (from c2 debug) are: > > c2: c2_create_qp:248 > c2: c2_query_pkey:110 > c2: c2_qp_modify:145 qp=ffff81007fe3b980, IB_QPS_RESET --> IB_QPS_INIT > c2: c2_qp_modify:243 qp=ffff81007fe3b980, cur_state=IB_QPS_INIT > c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=1 > c2: c2_connect:598 > c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=2 > > The process is in S state before the ctrl-c, here's a traceback > (waiting in rdma_get_cm_event): > > ardma-rdmacm S ffff81003d0f9e68 0 2914 2842 (NOTLB) > ffff81003d0f9e68 0000000000000000 0000000000000000 00000000000007b4 > ffff81003ef1d280 ffff81007fd06080 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 > Call Trace: {:rdma_ucm:ucma_get_event+268} > {autoremove_wake_function+0} {:rdma_ucm:ucma_write+111} > {vfs_write+189} {sys_write+83} > {system_call+126} > > Then after ctrl-C, one more console log entry: > > c2: c2_destroy_qp:290 qp=ffff81007fe3b980,qp->state=1 > > and now the process is unkillable (but the node does not oops): > > ardma-rdmacm D ffff81003d0f9bf8 0 2914 2842 (L-TLB) > ffff81003d0f9bf8 ffffc2000001ffff ffff81007eee5c80 0000000000009ee7 > ffff81003ef1d280 ffff81003f060aa0 ffffffff80232646 ffff81000100c130 > ffff81003ec3d140 ffff81003dc85800 > Call Trace: {on_each_cpu+38} {__remove_vm_area+55} > {:iw_c2:c2_free_qp+355} {autoremove_wake_function+0} > {:iw_c2:c2_destroy_qp+52} {:ib_core:ib_destroy_qp+49} > {:ib_uverbs:ib_uverbs_close+410} {__fput+178} > {filp_close+104} {put_files_struct+122} > {do_exit+596} {__dequeue_signal+495} > {do_group_exit+216} {get_signal_to_deliver+1192} > {do_signal+129} {:rdma_ucm:ucma_get_event+501} > {:rdma_ucm:ucma_write+111} {vfs_write+189} > {sysret_signal+28} {ptregscall_common+103} > > Once I figure out the bug in the server side code I will hopefully > not have this problem anymore. But thought you'd like to see it. > > -- Pete > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From krause at cup.hp.com Wed Aug 2 07:54:09 2006 From: krause at cup.hp.com (Michael Krause) Date: Wed, 02 Aug 2006 07:54:09 -0700 Subject: [openib-general] Multicast traffic performace of OFED 1.0 ipoib In-Reply-To: <6a122cc00608020430k4a520d10xfebb256829feb752@mail.gmail.co m> References: <6a122cc00608020430k4a520d10xfebb256829feb752@mail.gmail.com> Message-ID: <6.2.0.14.2.20060802075013.0397ca08@esmail.cup.hp.com> Is the performance being measured on an identical topology and hardware set as before? Multicast by its very nature is sensitive to topology, hardware components used (buffer depth, latency, etc.) and workload occurring within the fabric. Loss occurs as a function of congestion or lack of forward progress resulting in a timeout and thus a toss of a packet. If the hardware is different or the settings chosen are changed, then the results would be expected to change. It is not clear what you hope to achieve with such tests as there will be other workloads flowing over the fabric which will create random HOL blocking which can result in packet loss. 
Multicast workloads should be tolerant of such loss. Mike At 04:30 AM 8/2/2006, Moni Levy wrote: >Hi, > we are doing some performance testing of multicast traffic over >ipoib. The tests are performed by using iperf on dual 1.6G AMD PCI-X >servers with PCI-X Tavor cards with 3.4.FW. Below are the commands that >may be used to run the test. > >Iperf server: >route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0 >/home/qa/testing-tools/iperf-2.0.2/iperf -us -B 224.4.4.4 -i 1 > >Iperf client: >route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0 >/home/qa/testing-tools/iperf-2.0.2/iperf -uc 224.4.4.4 -i 1 -b 100M -t >400 -l 100 > >We are looking for the max PPS rate (100-byte packet size) without >losses, by changing the BW parameter and looking at the point where we >get no losses reported. The best results we received were around 50k >PPS. I remember that we got some 120k-140k packets of the same size >running without losses. > >We are going to look into it and try to see where the time is spent, >but any ideas are welcome. > >Best regards, >Moni > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed Aug 2 07:59:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 07:59:41 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060802091555.GM9411@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 12:15:55 +0300") References: <20060802091555.GM9411@mellanox.co.il> Message-ID: > And it still doesn't. > Jack here also confirms that the problem still exists if SM clears P_Key table > and then later re-adds a P_Key. > > Could you take a look please? So is the solution to call ipoib_pkey_dev_delay_open() in ipoib_ib_dev_flush()? From rdreier at cisco.com Wed Aug 2 08:05:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 08:05:36 -0700 Subject: [openib-general] some issues related to when/why IPoIB calls netif_carrier_on() etc In-Reply-To: (Or Gerlitz's message of "Wed, 2 Aug 2006 15:32:19 +0300 (IDT)") References: Message-ID: > 1) what is the exact reason that ib0 is running here, is it because of this > "magic" configuration of the IPv6 addr that caused it to join > the IPv4 and IPv6 broadcast groups? No, ipv6 autoconf has nothing to do with it. I think it's because you did ifconfig ib0 up, which called ipoib_open(), which calls ipoib_ib_dev_up(), which joins the ipv4 broadcast group. Bringing the interface up then starts ipv6 autoconf, but that is just a side issue. You could build a kernel without ipv6 and see what happens. > 2) is it well defined what conditions should hold s.t. IPoIB will be RUNNING? Not really. > 3) just to make sure: RUNNING <--> ipoib called netif_carrier_on(), correct? > I see that the latter is called by ipoib_mcast_join_task(); is it called when > "joining everything we want to join to" holds, or can you somehow > refine the predicate? Yes, I believe that is correct. - R. 
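To make the RUNNING <--> netif_carrier_on() correspondence concrete, the shape of that logic is roughly the following. This is only an illustrative sketch using the 2.6.17-era workqueue callback signature, with invented names (example_priv, example_join_pending_groups), not the real ipoib_mcast_join_task() code:

#include <linux/netdevice.h>
#include <linux/workqueue.h>

struct example_priv {
	struct net_device  *dev;
	struct work_struct  mcast_task;		/* INIT_WORK(&mcast_task, fn, dev) */
	/* ... multicast join state ... */
};

/* Deferred task that (re)tries the broadcast/multicast joins. */
static void example_mcast_join_task(void *dev_ptr)
{
	struct net_device *dev = dev_ptr;
	struct example_priv *priv = netdev_priv(dev);

	/* Hypothetical helper: returns false while any wanted join is
	 * still outstanding (the task gets requeued in that case). */
	if (!example_join_pending_groups(priv))
		return;		/* carrier stays off, so ifconfig does not show RUNNING */

	/* Everything we wanted to join is joined: mark the link usable.
	 * This netif_carrier_on() call is what makes the interface RUNNING. */
	netif_carrier_on(dev);
}
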
From rdreier at cisco.com Wed Aug 2 08:08:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 08:08:48 -0700 Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support In-Reply-To: <20060802114242.GL4681@minantech.com> (Gleb Natapov's message of "Wed, 2 Aug 2006 14:42:43 +0300") References: <20060801131756.GF4681@minantech.com> <20060802114242.GL4681@minantech.com> Message-ID: Gleb> Before you do it check libmthca/src/buf.c file please. It Gleb> look funny at least in patch you've sent (as if somebody did Gleb> cat buf.c >> buf.c) Yes that was just a bad copy and paste into my mail. Roland> But one last call for comments: in particular, does anyone Roland> object to libibverbs being fork-unsafe by default unless Roland> ibv_fork_init is called? Gleb> I am not sure about this one. The library like MPI will have Gleb> to always call this anyway. On the other side if some Gleb> library calls fork() without application knowing it? Gleb> Suddenly programmer should care about such details. Perhaps Gleb> opt out is better the opt in and libibverbs should skip Gleb> ibv_fork_init() only if application ask this explicitly? But then what should libibverbs do on a kernel that doesn't support the required madvise(MADV_DONTFORK) call? It's fine if MPI calls ibv_fork_init() always -- at least then it has a hint about whether fork() will work. Do you know if there are libraries that call fork()? - R. From swise at opengridcomputing.com Wed Aug 2 08:09:39 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 Aug 2006 10:09:39 -0500 Subject: [openib-general] rdma cm process hang In-Reply-To: <20060801213416.GA18941@osc.edu> References: <20060801213416.GA18941@osc.edu> Message-ID: <1154531379.32560.13.camel@stevo-desktop> This hang is due to 2 things: 1) the amso card will _never_ timeout a connection that is awaiting an MP reply. That is exactly what is happening here. The fix for this (timeout mpa connection setup stalls) is a firmware fix and we don't have the firmware src. 2) the IWCM holds a reference on the QP until connection setup either succeeds or fails. So that's where we get the stall. The amso driver is waiting for the reference on the qp to go to zero, and it never will because the amso firmware will never timeout the stalled mpa connection setup. Lemme look more at the amso driver and see if this can be avoided. Perhaps the amso driver can blow away the qp and stop the stall. I thought thats what it did, but I'll look... Steve. On Tue, 2006-08-01 at 17:34 -0400, Pete Wyckoff wrote: > Using the iwarp branch of r8688, with linux-2.6.17.7 on up to date > x86_64 FC4 SMP with Ammasso cards, I can hang the client side during > RDMA CM connection setup. 
> > The scenario is: > > start server side process on some other node > start client process > have server die after RDMA_CM_EVENT_CONNECT_REQUEST arrives, > but before calling rdma_accept > hit ctrl-C on client > > The last bits of the console log (from c2 debug) are: > > c2: c2_create_qp:248 > c2: c2_query_pkey:110 > c2: c2_qp_modify:145 qp=ffff81007fe3b980, IB_QPS_RESET --> IB_QPS_INIT > c2: c2_qp_modify:243 qp=ffff81007fe3b980, cur_state=IB_QPS_INIT > c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=1 > c2: c2_connect:598 > c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=2 > > The process is in S state before the ctrl-c, here's a traceback > (waiting in rdma_get_cm_event): > > ardma-rdmacm S ffff81003d0f9e68 0 2914 2842 (NOTLB) > ffff81003d0f9e68 0000000000000000 0000000000000000 00000000000007b4 > ffff81003ef1d280 ffff81007fd06080 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 > Call Trace: {:rdma_ucm:ucma_get_event+268} > {autoremove_wake_function+0} {:rdma_ucm:ucma_write+111} > {vfs_write+189} {sys_write+83} > {system_call+126} > > Then after ctrl-C, one more console log entry: > > c2: c2_destroy_qp:290 qp=ffff81007fe3b980,qp->state=1 > > and now the process is unkillable (but the node does not oops): > > ardma-rdmacm D ffff81003d0f9bf8 0 2914 2842 (L-TLB) > ffff81003d0f9bf8 ffffc2000001ffff ffff81007eee5c80 0000000000009ee7 > ffff81003ef1d280 ffff81003f060aa0 ffffffff80232646 ffff81000100c130 > ffff81003ec3d140 ffff81003dc85800 > Call Trace: {on_each_cpu+38} {__remove_vm_area+55} > {:iw_c2:c2_free_qp+355} {autoremove_wake_function+0} > {:iw_c2:c2_destroy_qp+52} {:ib_core:ib_destroy_qp+49} > {:ib_uverbs:ib_uverbs_close+410} {__fput+178} > {filp_close+104} {put_files_struct+122} > {do_exit+596} {__dequeue_signal+495} > {do_group_exit+216} {get_signal_to_deliver+1192} > {do_signal+129} {:rdma_ucm:ucma_get_event+501} > {:rdma_ucm:ucma_write+111} {vfs_write+189} > {sysret_signal+28} {ptregscall_common+103} > > Once I figure out the bug in the server side code I will hopefully > not have this problem anymore. But thought you'd like to see it. > > -- Pete > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Wed Aug 2 08:12:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 08:12:57 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060802113126.GA15340@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 14:31:26 +0300") References: <20060802113126.GA15340@mellanox.co.il> Message-ID: I think this is going to trigger warnings with lock debugging because a locked rwsem gets freed. - R. From kliteyn at gmail.com Wed Aug 2 08:16:32 2006 From: kliteyn at gmail.com (Yevgeny Kliteynik) Date: Wed, 2 Aug 2006 18:16:32 +0300 Subject: [openib-general] [PATCH] osm: Dynamic verbosity control per file Message-ID: <842b8cdf0608020816y3fdfa145nea876171f58650d9@mail.gmail.com> Hi Hal This patch adds new verbosity functionality. 1. 
Verbosity configuration file ------------------------------- The user is able to set verbosity level per source code file by supplying verbosity configuration file using the following command line arguments: -b filename --verbosity_file filename By default, the OSM will use the following file: /etc/opensmlog.conf Verbosity configuration file should contain zero or more lines of the following pattern: filename verbosity_level where 'filename' is the name of the source code file that the 'verbosity_level' refers to, and the 'verbosity_level' itself should be specified as an integer number (decimal or hexadecimal). One reserved filename is 'all' - it represents general verbosity level, that is used for all the files that are not specified in the verbosity configuration file. If 'all' is not specified, the verbosity level set in the command line will be used instead. Note: The 'all' file verbosity level will override any other general level that was specified by the command line arguments. Sending a SIGHUP signal to the OSM will cause it to reload the verbosity configuration file. 2. Logging source code filename and line number ----------------------------------------------- If command line option -S or --log_source_info is specified, OSM will add source code filename and line number to every log message that is written to the log file. By default, the OSM will not log this additional info. Yevgeny Signed-off-by: Yevgeny Kliteynik Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 8614) +++ include/opensm/osm_subnet.h (working copy) @@ -285,6 +285,8 @@ typedef struct _osm_subn_opt osm_qos_options_t qos_sw0_options; osm_qos_options_t qos_swe_options; osm_qos_options_t qos_rtr_options; + boolean_t src_info; + char * verbosity_file; } osm_subn_opt_t; /* * FIELDS @@ -463,6 +465,27 @@ typedef struct _osm_subn_opt * qos_rtr_options * QoS options for router ports * +* src_info +* If TRUE - the source code filename and line number will be +* added to each log message. +* Default value - FALSE. +* +* verbosity_file +* OSM log configuration file - the file that describes +* verbosity level per source code file. +* The file may containg zero or more lines of the following +* pattern: +* filename verbosity_level +* where 'filename' is the name of the source code file that +* the 'verbosity_level' refers to. +* Filename "all" represents general verbosity level, that is +* used for all the files that are not specified in the +* verbosity file. +* If "all" is not specified, the general verbosity level will +* be used instead. +* Note: the "all" file verbosity level will override any other +* general level that was specified by the command line arguments. 
+* * SEE ALSO * Subnet object *********/ Index: include/opensm/osm_base.h =================================================================== --- include/opensm/osm_base.h (revision 8614) +++ include/opensm/osm_base.h (working copy) @@ -222,6 +222,22 @@ BEGIN_C_DECLS #endif /***********/ +/****d* OpenSM: Base/OSM_DEFAULT_VERBOSITY_FILE +* NAME +* OSM_DEFAULT_VERBOSITY_FILE +* +* DESCRIPTION +* Specifies the default verbosity config file name +* +* SYNOPSIS +*/ +#ifdef __WIN__ +#define OSM_DEFAULT_VERBOSITY_FILE strcat(GetOsmPath(), "opensmlog.conf") +#else +#define OSM_DEFAULT_VERBOSITY_FILE "/etc/opensmlog.conf" +#endif +/***********/ + /****d* OpenSM: Base/OSM_DEFAULT_PARTITION_CONFIG_FILE * NAME * OSM_DEFAULT_PARTITION_CONFIG_FILE Index: include/opensm/osm_log.h =================================================================== --- include/opensm/osm_log.h (revision 8652) +++ include/opensm/osm_log.h (working copy) @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -123,9 +124,45 @@ typedef struct _osm_log cl_spinlock_t lock; boolean_t flush; FILE* out_port; + boolean_t src_info; + st_table * table; } osm_log_t; /*********/ +/****f* OpenSM: Log/osm_log_read_verbosity_file +* NAME +* osm_log_read_verbosity_file +* +* DESCRIPTION +* This function reads the verbosity configuration file +* and constructs a verbosity data structure. +* +* SYNOPSIS +*/ +void +osm_log_read_verbosity_file( + IN osm_log_t* p_log, + IN const char * const verbosity_file); +/* +* PARAMETERS +* p_log +* [in] Pointer to a Log object to construct. +* +* verbosity_file +* [in] verbosity configuration file +* +* RETURN VALUE +* None +* +* NOTES +* If the verbosity configuration file is not found, default +* verbosity value is used for all files. +* If there is an error in some line of the verbosity +* configuration file, the line is ignored. +* +*********/ + + /****f* OpenSM: Log/osm_log_construct * NAME * osm_log_construct @@ -201,9 +238,13 @@ osm_log_destroy( * osm_log_init *********/ -/****f* OpenSM: Log/osm_log_init +#define osm_log_init(p_log, flush, log_flags, log_file, accum_log_file) \ + osm_log_init_ext(p_log, flush, (log_flags), log_file, \ + accum_log_file, FALSE, OSM_DEFAULT_VERBOSITY_FILE) + +/****f* OpenSM: Log/osm_log_init_ext * NAME -* osm_log_init +* osm_log_init_ext * * DESCRIPTION * The osm_log_init function initializes a @@ -211,50 +252,15 @@ osm_log_destroy( * * SYNOPSIS */ -static inline ib_api_status_t -osm_log_init( +ib_api_status_t +osm_log_init_ext( IN osm_log_t* const p_log, IN const boolean_t flush, IN const uint8_t log_flags, IN const char *log_file, - IN const boolean_t accum_log_file ) -{ - p_log->level = log_flags; - p_log->flush = flush; - - if (log_file == NULL || !strcmp(log_file, "-") || - !strcmp(log_file, "stdout")) - { - p_log->out_port = stdout; - } - else if (!strcmp(log_file, "stderr")) - { - p_log->out_port = stderr; - } - else - { - if (accum_log_file) - p_log->out_port = fopen(log_file, "a+"); - else - p_log->out_port = fopen(log_file, "w+"); - - if (!p_log->out_port) - { - if (accum_log_file) - printf("Cannot open %s for appending. Permission denied\n", log_file); - else - printf("Cannot open %s for writing. 
Permission denied\n", log_file); - - return(IB_UNKNOWN_ERROR); - } - } - openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); - - if (cl_spinlock_init( &p_log->lock ) == CL_SUCCESS) - return IB_SUCCESS; - else - return IB_ERROR; -} + IN const boolean_t accum_log_file, + IN const boolean_t src_info, + IN const char *verbosity_file); /* * PARAMETERS * p_log @@ -271,6 +277,16 @@ osm_log_init( * log_file * [in] if not NULL defines the name of the log file. Otherwise it is stdout. * +* accum_log_file +* [in] Whether the log file should be accumulated. +* +* src_info +* [in] Set to TRUE directs the log to add filename and line number +* to each log message. +* +* verbosity_file +* [in] Log configuration file location. +* * RETURN VALUES * CL_SUCCESS if the Log object was initialized * successfully. @@ -283,26 +299,32 @@ osm_log_init( * osm_log_destroy *********/ -/****f* OpenSM: Log/osm_log_get_level +#define osm_log_get_level(p_log) \ + osm_log_get_level_ext(p_log, __FILE__) + +/****f* OpenSM: Log/osm_log_get_level_ext * NAME -* osm_log_get_level +* osm_log_get_level_ext * * DESCRIPTION -* Returns the current log level. +* Returns the current log level for the file. +* If the file is not specified in the log config file, +* the general verbosity level will be returned. * * SYNOPSIS */ -static inline osm_log_level_t -osm_log_get_level( - IN const osm_log_t* const p_log ) -{ - return( p_log->level ); -} +osm_log_level_t +osm_log_get_level_ext( + IN const osm_log_t* const p_log, + IN const char* const p_filename ); /* * PARAMETERS * p_log * [in] Pointer to the log object. * +* p_filename +* [in] Source code file name. +* * RETURN VALUES * Returns the current log level. * @@ -310,7 +332,7 @@ osm_log_get_level( * * SEE ALSO * Log object, osm_log_construct, -* osm_log_destroy +* osm_log_destroy, osm_log_get_level *********/ /****f* OpenSM: Log/osm_log_set_level @@ -318,7 +340,7 @@ osm_log_get_level( * osm_log_set_level * * DESCRIPTION -* Sets the current log level. +* Sets the current general log level. * * SYNOPSIS */ @@ -338,7 +360,7 @@ osm_log_set_level( * [in] New level to set. * * RETURN VALUES -* Returns the current log level. +* None. * * NOTES * @@ -347,9 +369,12 @@ osm_log_set_level( * osm_log_destroy *********/ -/****f* OpenSM: Log/osm_log_is_active +#define osm_log_is_active(p_log, level) \ + osm_log_is_active_ext(p_log, __FILE__, level) + +/****f* OpenSM: Log/osm_log_is_active_ext * NAME -* osm_log_is_active +* osm_log_is_active_ext * * DESCRIPTION * Returns TRUE if the specified log level would be logged. @@ -357,18 +382,19 @@ osm_log_set_level( * * SYNOPSIS */ -static inline boolean_t -osm_log_is_active( +boolean_t +osm_log_is_active_ext( IN const osm_log_t* const p_log, - IN const osm_log_level_t level ) -{ - return( (p_log->level & level) != 0 ); -} + IN const char* const p_filename, + IN const osm_log_level_t level ); /* * PARAMETERS * p_log * [in] Pointer to the log object. * +* p_filename +* [in] Source code file name. +* * level * [in] Level to check. * @@ -383,17 +409,125 @@ osm_log_is_active( * osm_log_destroy *********/ + +#define osm_log(p_log, verbosity, p_str, args...) \ + osm_log_ext(p_log, verbosity, __FILE__, __LINE__, p_str , ## args) + +/****f* OpenSM: Log/osm_log_ext +* NAME +* osm_log_ext +* +* DESCRIPTION +* Logs the formatted specified message. +* +* SYNOPSIS +*/ void -osm_log( +osm_log_ext( IN osm_log_t* const p_log, IN const osm_log_level_t verbosity, + IN const char *p_filename, + IN int line, IN const char *p_str, ... 
); +/* +* PARAMETERS +* p_log +* [in] Pointer to the log object. +* +* verbosity +* [in] Current message verbosity level + + p_filename + [in] Name of the file that is logging this message + + line + [in] Line number in the file that is logging this message + + p_str + [in] Format string of the message +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* Log object, osm_log_construct, +* osm_log_destroy +*********/ +#define osm_log_raw(p_log, verbosity, p_buff) \ + osm_log_raw_ext(p_log, verbosity, __FILE__, p_buff) + +/****f* OpenSM: Log/osm_log_raw_ext +* NAME +* osm_log_ext +* +* DESCRIPTION +* Logs the specified message. +* +* SYNOPSIS +*/ void -osm_log_raw( +osm_log_raw_ext( IN osm_log_t* const p_log, IN const osm_log_level_t verbosity, + IN const char * p_filename, IN const char *p_buf ); +/* +* PARAMETERS +* p_log +* [in] Pointer to the log object. +* +* verbosity +* [in] Current message verbosity level + + p_filename + [in] Name of the file that is logging this message + + p_buf + [in] Message string +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* Log object, osm_log_construct, +* osm_log_destroy +*********/ + + +/****f* OpenSM: Log/osm_log_flush +* NAME +* osm_log_flush +* +* DESCRIPTION +* Flushes the log. +* +* SYNOPSIS +*/ +static inline void +osm_log_flush( + IN osm_log_t* const p_log) +{ + fflush(p_log->out_port); +} +/* +* PARAMETERS +* p_log +* [in] Pointer to the log object. +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* +*********/ + #define DBG_CL_LOCK 0 Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 8614) +++ opensm/osm_subnet.c (working copy) @@ -493,6 +493,8 @@ osm_subn_set_default_opt( p_opt->ucast_dump_file = NULL; p_opt->updn_guid_file = NULL; p_opt->exit_on_fatal = TRUE; + p_opt->src_info = FALSE; + p_opt->verbosity_file = OSM_DEFAULT_VERBOSITY_FILE; subn_set_default_qos_options(&p_opt->qos_options); subn_set_default_qos_options(&p_opt->qos_hca_options); subn_set_default_qos_options(&p_opt->qos_sw0_options); @@ -959,6 +961,13 @@ osm_subn_parse_conf_file( "honor_guid2lid_file", p_key, p_val, &p_opts->honor_guid2lid_file); + __osm_subn_opts_unpack_boolean( + "log_source_info", + p_key, p_val, &p_opts->src_info); + + __osm_subn_opts_unpack_charp( + "verbosity_file", p_key, p_val, &p_opts->verbosity_file); + subn_parse_qos_options("qos", p_key, p_val, &p_opts->qos_options); @@ -1182,7 +1191,11 @@ osm_subn_write_conf_file( "# No multicast routing is performed if TRUE\n" "disable_multicast %s\n\n" "# If TRUE opensm will exit on fatal initialization issues\n" - "exit_on_fatal %s\n\n", + "exit_on_fatal %s\n\n" + "# If TRUE OpenSM will log filename and line numbers\n" + "log_source_info %s\n\n" + "# Verbosity configuration file to be used\n" + "verbosity_file %s\n\n", p_opts->log_flags, p_opts->force_log_flush ? "TRUE" : "FALSE", p_opts->log_file, @@ -1190,7 +1203,9 @@ osm_subn_write_conf_file( p_opts->dump_files_dir, p_opts->no_multicast_option ? "TRUE" : "FALSE", p_opts->disable_multicast ? "TRUE" : "FALSE", - p_opts->exit_on_fatal ? "TRUE" : "FALSE" + p_opts->exit_on_fatal ? "TRUE" : "FALSE", + p_opts->src_info ? "TRUE" : "FALSE", + p_opts->verbosity_file ); fprintf( Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 8614) +++ opensm/osm_opensm.c (working copy) @@ -180,8 +180,10 @@ osm_opensm_init( /* Can't use log macros here, since we're initializing the log. 
*/ osm_opensm_construct( p_osm ); - status = osm_log_init( &p_osm->log, p_opt->force_log_flush, - p_opt->log_flags, p_opt->log_file, p_opt->accum_log_file ); + status = osm_log_init_ext( &p_osm->log, p_opt->force_log_flush, + p_opt->log_flags, p_opt->log_file, + p_opt->accum_log_file, p_opt->src_info, + p_opt->verbosity_file); if( status != IB_SUCCESS ) return ( status ); Index: opensm/libopensm.map =================================================================== --- opensm/libopensm.map (revision 8614) +++ opensm/libopensm.map (working copy) @@ -1,6 +1,11 @@ -OPENSM_1.0 { +OPENSM_2.0 { global: - osm_log; + osm_log_init_ext; + osm_log_ext; + osm_log_raw_ext; + osm_log_get_level_ext; + osm_log_is_active_ext; + osm_log_read_verbosity_file; osm_is_debug; osm_mad_pool_construct; osm_mad_pool_destroy; @@ -39,7 +44,6 @@ OPENSM_1.0 { osm_dump_dr_path; osm_dump_smp_dr_path; osm_dump_pkey_block; - osm_log_raw; osm_get_sm_state_str; osm_get_sm_signal_str; osm_get_disp_msg_str; @@ -51,5 +55,11 @@ OPENSM_1.0 { osm_get_lsa_str; osm_get_sm_mgr_signal_str; osm_get_sm_mgr_state_str; + st_init_strtable; + st_delete; + st_insert; + st_lookup; + st_foreach; + st_free_table; local: *; }; Index: opensm/osm_log.c =================================================================== --- opensm/osm_log.c (revision 8614) +++ opensm/osm_log.c (working copy) @@ -80,17 +80,365 @@ static char *month_str[] = { }; #endif /* ndef WIN32 */ + +/*************************************************************************** + ***************************************************************************/ + +#define OSM_VERBOSITY_ALL "all" + +static void +__osm_log_free_verbosity_table( + IN osm_log_t* p_log); +static void +__osm_log_print_verbosity_table( + IN osm_log_t* const p_log); + +/*************************************************************************** + ***************************************************************************/ + +osm_log_level_t +osm_log_get_level_ext( + IN const osm_log_t* const p_log, + IN const char* const p_filename ) +{ + osm_log_level_t * p_curr_file_level = NULL; + + if (!p_filename || !p_log->table) + return p_log->level; + + if ( st_lookup( p_log->table, + (st_data_t) p_filename, + (st_data_t*) &p_curr_file_level) ) + return *p_curr_file_level; + else + return p_log->level; +} + +/*************************************************************************** + ***************************************************************************/ + +ib_api_status_t +osm_log_init_ext( + IN osm_log_t* const p_log, + IN const boolean_t flush, + IN const uint8_t log_flags, + IN const char *log_file, + IN const boolean_t accum_log_file, + IN const boolean_t src_info, + IN const char *verbosity_file) +{ + p_log->level = log_flags; + p_log->flush = flush; + p_log->src_info = src_info; + p_log->table = NULL; + + if (log_file == NULL || !strcmp(log_file, "-") || + !strcmp(log_file, "stdout")) + { + p_log->out_port = stdout; + } + else if (!strcmp(log_file, "stderr")) + { + p_log->out_port = stderr; + } + else + { + if (accum_log_file) + p_log->out_port = fopen(log_file, "a+"); + else + p_log->out_port = fopen(log_file, "w+"); + + if (!p_log->out_port) + { + if (accum_log_file) + printf("Cannot open %s for appending. Permission denied\n", log_file); + else + printf("Cannot open %s for writing. 
Permission denied\n", log_file); + + return(IB_UNKNOWN_ERROR); + } + } + openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); + + if (cl_spinlock_init( &p_log->lock ) != CL_SUCCESS) + return IB_ERROR; + + osm_log_read_verbosity_file(p_log,verbosity_file); + return IB_SUCCESS; +} + +/*************************************************************************** + ***************************************************************************/ + +void +osm_log_read_verbosity_file( + IN osm_log_t* p_log, + IN const char * const verbosity_file) +{ + FILE *infile; + char line[500]; + struct stat buf; + boolean_t table_empty = TRUE; + char * tmp_str = NULL; + + if (p_log->table) + { + /* + * Free the existing table. + * Note: if the verbosity config file will not be found, this will + * effectivly reset the existing verbosity configuration and set + * all the files to the same verbosity level + */ + __osm_log_free_verbosity_table(p_log); + } + + if (!verbosity_file) + return; + + if ( stat(verbosity_file, &buf) != 0 ) + { + /* + * Verbosity configuration file doesn't exist. + */ + if (strcmp(verbosity_file,OSM_DEFAULT_VERBOSITY_FILE) == 0) + { + /* + * Verbosity configuration file wasn't explicitly specified. + * No need to issue any error message. + */ + return; + } + else + { + /* + * Verbosity configuration file was explicitly specified. + */ + osm_log(p_log, OSM_LOG_SYS, + "ERROR: Verbosity configuration file (%s) doesn't exist.\n", + verbosity_file); + osm_log(p_log, OSM_LOG_SYS, + " Using general verbosity value.\n"); + return; + } + } + + infile = fopen(verbosity_file, "r"); + if ( infile == NULL ) + { + osm_log(p_log, OSM_LOG_SYS, + "ERROR: Failed opening verbosity configuration file (%s).\n", + verbosity_file); + osm_log(p_log, OSM_LOG_SYS, + " Using general verbosity value.\n"); + return; + } + + p_log->table = st_init_strtable(); + if (p_log->table == NULL) + { + osm_log(p_log, OSM_LOG_SYS, "ERROR: Verbosity table initialization failed.\n"); + return; + } + + /* + * Read the file line by line, parse the lines, and + * add each line to p_log->table. + */ + while ( fgets(line, sizeof(line), infile) != NULL ) + { + char * str = line; + char * name = NULL; + char * value = NULL; + osm_log_level_t * p_log_level_value = NULL; + int res; + + name = strtok_r(str," \t\n",&tmp_str); + if (name == NULL || strlen(name) == 0) { + /* + * empty line - ignore it + */ + continue; + } + value = strtok_r(NULL," \t\n",&tmp_str); + if (value == NULL || strlen(value) == 0) + { + /* + * No verbosity value - wrong syntax. + * This line will be ignored. + */ + continue; + } + + /* + * If the conversion will fail, the log_level_value will get 0, + * so the only way to check that the syntax is correct is to + * scan value for any non-digit (which we're not doing here). + */ + p_log_level_value = malloc (sizeof(osm_log_level_t)); + if (!p_log_level_value) + { + osm_log(p_log, OSM_LOG_SYS, "ERROR: malloc failed.\n"); + p_log->table = NULL; + fclose(infile); + return; + } + *p_log_level_value = strtoul(value, NULL, 0); + + if (strcasecmp(name,OSM_VERBOSITY_ALL) == 0) + { + osm_log_set_level(p_log, *p_log_level_value); + free(p_log_level_value); + } + else + { + res = st_insert( p_log->table, + (st_data_t) strdup(name), + (st_data_t) p_log_level_value); + if (res != 0) + { + /* + * Something is wrong with the verbosity table. + * We won't try to free the table, because there's + * clearly something corrupted there. 
+ */ + osm_log(p_log, OSM_LOG_SYS, "ERROR: Failed adding verbosity table element.\n"); + p_log->table = NULL; + fclose(infile); + return; + } + table_empty = FALSE; + } + + } + + if (table_empty) + __osm_log_free_verbosity_table(p_log); + + fclose(infile); + + __osm_log_print_verbosity_table(p_log); +} + +/*************************************************************************** + ***************************************************************************/ + +static int +__osm_log_print_verbosity_table_element( + IN st_data_t key, + IN st_data_t val, + IN st_data_t arg) +{ + osm_log( (osm_log_t* const) arg, + OSM_LOG_INFO, + "[verbosity] File: %s, Level: 0x%x\n", + (char *) key, *((osm_log_level_t *) val)); + + return ST_CONTINUE; +} + +static void +__osm_log_print_verbosity_table( + IN osm_log_t* const p_log) +{ + osm_log( p_log, OSM_LOG_INFO, + "[verbosity] Verbosity table loaded\n" ); + osm_log( p_log, OSM_LOG_INFO, + "[verbosity] General level: 0x%x\n",osm_log_get_level_ext(p_log,NULL)); + + if (p_log->table) + { + st_foreach( p_log->table, + __osm_log_print_verbosity_table_element, + (st_data_t) p_log ); + } + osm_log_flush(p_log); +} + +/*************************************************************************** + ***************************************************************************/ + +static int +__osm_log_free_verbosity_table_element( + IN st_data_t key, + IN st_data_t val, + IN st_data_t arg) +{ + free( (char *) key ); + free( (osm_log_level_t *) val ); + return ST_DELETE; +} + +static void +__osm_log_free_verbosity_table( + IN osm_log_t* p_log) +{ + if (!p_log->table) + return; + + st_foreach( p_log->table, + __osm_log_free_verbosity_table_element, + (st_data_t) NULL); + + st_free_table(p_log->table); + p_log->table = NULL; +} + +/*************************************************************************** + ***************************************************************************/ + +static inline const char * +__osm_log_get_base_name( + IN const char * const p_filename) +{ +#ifdef WIN32 + char dir_separator = '\\'; +#else + char dir_separator = '/'; +#endif + char * tmp_ptr; + + if (!p_filename) + return NULL; + + tmp_ptr = strrchr(p_filename,dir_separator); + + if (!tmp_ptr) + return p_filename; + return tmp_ptr+1; +} + +/*************************************************************************** + ***************************************************************************/ + +boolean_t +osm_log_is_active_ext( + IN const osm_log_t* const p_log, + IN const char* const p_filename, + IN const osm_log_level_t level ) +{ + osm_log_level_t tmp_lvl; + tmp_lvl = level & + osm_log_get_level_ext(p_log,__osm_log_get_base_name(p_filename)); + return ( tmp_lvl != 0 ); +} + +/*************************************************************************** + ***************************************************************************/ + static int log_exit_count = 0; void -osm_log( +osm_log_ext( IN osm_log_t* const p_log, IN const osm_log_level_t verbosity, + IN const char *p_filename, + IN int line, IN const char *p_str, ... ) { char buffer[LOG_ENTRY_SIZE_MAX]; va_list args; int ret; + osm_log_level_t file_verbosity; #ifdef WIN32 SYSTEMTIME st; @@ -108,69 +456,89 @@ osm_log( localtime_r(&tim, &result); #endif /* WIN32 */ - /* If this is a call to syslog - always print it */ - if ( verbosity & OSM_LOG_SYS ) + /* + * Extract only the file name out of the full path + */ + p_filename = __osm_log_get_base_name(p_filename); + /* + * Get the verbosity level for this file. 
+ * If the file is not specified in the log config file, + * the general verbosity level will be returned. + */ + file_verbosity = osm_log_get_level_ext(p_log, p_filename); + + if ( ! (verbosity & OSM_LOG_SYS) && + ! (file_verbosity & verbosity) ) { - /* this is a call to the syslog */ + /* + * This is not a syslog message (which is always printed) + * and doesn't have the required verbosity level. + */ + return; + } + va_start( args, p_str ); vsprintf( buffer, p_str, args ); va_end(args); - cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); + + if ( verbosity & OSM_LOG_SYS ) + { + /* this is a call to the syslog */ + cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); /* SYSLOG should go to stdout too */ if (p_log->out_port != stdout) { - printf("%s\n", buffer); + printf("%s", buffer); fflush( stdout ); } + } + /* SYSLOG also goes to to the log file */ + + cl_spinlock_acquire( &p_log->lock ); - /* send it also to the log file */ #ifdef WIN32 GetLocalTime(&st); - fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", + if (p_log->src_info) + { + ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] [%s:%d] -> %s", st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, - pid, buffer); -#else - fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s\n", - (result.tm_mon < 12 ? month_str[result.tm_mon] : "???"), - result.tm_mday, result.tm_hour, - result.tm_min, result.tm_sec, - usecs, pid, buffer); - fflush( p_log->out_port ); -#endif + pid, p_filename, line, buffer); } - - /* SYS messages go to the log anyways */ - if (p_log->level & verbosity) + else { - - va_start( args, p_str ); - vsprintf( buffer, p_str, args ); - va_end(args); - - /* regular log to default out_port */ - cl_spinlock_acquire( &p_log->lock ); -#ifdef WIN32 - GetLocalTime(&st); ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, pid, buffer); - + } #else pid = pthread_self(); tim = time(NULL); + if (p_log->src_info) + { + ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] [%s:%d] -> %s", + ((result.tm_mon < 12) && (result.tm_mon >= 0) ? + month_str[result.tm_mon] : "???"), + result.tm_mday, result.tm_hour, + result.tm_min, result.tm_sec, + usecs, pid, p_filename, line, buffer); + } + else + { ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", ((result.tm_mon < 12) && (result.tm_mon >= 0) ? month_str[result.tm_mon] : "???"), result.tm_mday, result.tm_hour, result.tm_min, result.tm_sec, usecs, pid, buffer); -#endif /* WIN32 */ - + } +#endif /* - Flush log on errors too. + * Flush log on errors and SYSLOGs too. */ - if( p_log->flush || (verbosity & OSM_LOG_ERROR) ) + if ( p_log->flush || + (verbosity & OSM_LOG_ERROR) || + (verbosity & OSM_LOG_SYS) ) fflush( p_log->out_port ); cl_spinlock_release( &p_log->lock ); @@ -183,15 +551,30 @@ osm_log( } } } -} + +/*************************************************************************** + ***************************************************************************/ void -osm_log_raw( +osm_log_raw_ext( IN osm_log_t* const p_log, IN const osm_log_level_t verbosity, + IN const char * p_filename, IN const char *p_buf ) { - if( p_log->level & verbosity ) + osm_log_level_t file_verbosity; + /* + * Extract only the file name out of the full path + */ + p_filename = __osm_log_get_base_name(p_filename); + /* + * Get the verbosity level for this file. + * If the file is not specified in the log config file, + * the general verbosity level will be returned. 
+ */ + file_verbosity = osm_log_get_level_ext(p_log, p_filename); + + if ( file_verbosity & verbosity ) { cl_spinlock_acquire( &p_log->lock ); printf( "%s", p_buf ); @@ -205,6 +588,9 @@ osm_log_raw( } } +/*************************************************************************** + ***************************************************************************/ + boolean_t osm_is_debug(void) { @@ -214,3 +600,7 @@ osm_is_debug(void) return FALSE; #endif /* defined( _DEBUG_ ) */ } + +/*************************************************************************** + ***************************************************************************/ + Index: opensm/main.c =================================================================== --- opensm/main.c (revision 8652) +++ opensm/main.c (working copy) @@ -296,6 +296,33 @@ show_usage(void) " -d3 - Disable multicast support\n" " -d10 - Put OpenSM in testability mode\n" " Without -d, no debug options are enabled\n\n" ); + printf( "-S\n" + "--log_source_info\n" + " This option tells SM to add source code filename\n" + " and line number to every log message.\n" + " By default, the SM will not log this additional info.\n\n"); + printf( "-b\n" + "--verbosity_file \n" + " This option specifies name of the verbosity\n" + " configuration file, which describes verbosity level\n" + " per source code file. The file may contain zero or\n" + " more lines of the following pattern:\n" + " filename verbosity_level\n" + " where 'filename' is the name of the source code file\n" + " that the 'verbosity_level' refers to, and the \n" + " 'verbosity_level' itself should be specified as a\n" + " number (decimal or hexadecimal).\n" + " Filename 'all' represents general verbosity level,\n" + " that is used for all the files that are not specified\n" + " in the verbosity file.\n" + " Note: The 'all' file verbosity level will override any\n" + " other general level that was specified by the command\n" + " line arguments.\n" + " By default, the SM will use the following file:\n" + " %s\n" + " Sending a SIGHUP signal to the SM will cause it to\n" + " re-read the verbosity configuration file.\n" + "\n\n", OSM_DEFAULT_VERBOSITY_FILE); printf( "-h\n" "--help\n" " Display this usage info then exit.\n\n" ); @@ -527,7 +554,7 @@ main( boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:R:U:P:NQvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:s:t:a:R:U:P:b:SNQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -565,6 +592,8 @@ main( { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, { "honor_guid2lid", 0, NULL, 'x'}, + { "log_source_info",0,NULL, 'S'}, + { "verbosity_file",1, NULL, 'b'}, { NULL, 0, NULL, 0 } /* Required at the end of the array */ }; @@ -808,6 +837,16 @@ main( printf (" Honor guid2lid file, if possible\n"); break; + case 'S': + opt.src_info = TRUE; + printf(" Logging source code filename and line number\n"); + break; + + case 'b': + opt.verbosity_file = optarg; + printf(" Verbosity Configuration File: %s\n", optarg); + break; + case 'h': case '?': case ':': @@ -920,9 +959,13 @@ main( if (osm_hup_flag) { osm_hup_flag = 0; - /* a HUP signal should only start a new heavy sweep */ + /* + * A HUP signal should cause OSM to re-read the log + * configuration file and start a new heavy sweep + */ osm.subn.force_immediate_heavy_sweep = TRUE; osm_opensm_sweep( &osm ); + osm_log_read_verbosity_file(&osm.log,opt.verbosity_file); } } } Index: 
opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 8614) +++ opensm/Makefile.am (working copy) @@ -43,7 +43,7 @@ else libopensm_version_script = endif -libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c +libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c st.c libopensm_la_LDFLAGS = -version-info $(opensm_api_version) \ -export-dynamic $(libopensm_version_script) libopensm_la_DEPENDENCIES = $(srcdir)/libopensm.map @@ -90,7 +90,7 @@ opensm_SOURCES = main.c osm_console.c os osm_trap_rcv.c osm_trap_rcv_ctrl.c \ osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ - osm_vl_arb_rcv_ctrl.c st.c + osm_vl_arb_rcv_ctrl.c if OSMV_OPENIB opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 Index: doc/verbosity-config.txt =================================================================== --- doc/verbosity-config.txt (revision 0) +++ doc/verbosity-config.txt (revision 0) @@ -0,0 +1,43 @@ + +This patch adds new verbosity functionality. + +1. Verbosity configuration file +------------------------------- + +The user is able to set verbosity level per source code file +by supplying verbosity configuration file using the following +command line arguments: + + -b filename + --verbosity_file filename + +By default, the OSM will use the following file: /etc/opensmlog.conf +Verbosity configuration file should contain zero or more lines of +the following pattern: + + filename verbosity_level + +where 'filename' is the name of the source code file that the +'verbosity_level' refers to, and the 'verbosity_level' itself +should be specified as an integer number (decimal or hexadecimal). + +One reserved filename is 'all' - it represents general verbosity +level, that is used for all the files that are not specified in +the verbosity configuration file. +If 'all' is not specified, the verbosity level set in the +command line will be used instead. +Note: The 'all' file verbosity level will override any other +general level that was specified by the command line arguments. + +Sending a SIGHUP signal to the OSM will cause it to reload +the verbosity configuration file. + + +2. Logging source code filename and line number +----------------------------------------------- + +If command line option -S or --log_source_info is specified, +OSM will add source code filename and line number to every +log message that is written to the log file. +By default, the OSM will not log this additional info. + -------------- next part -------------- An HTML attachment was scrubbed... URL: From pw at osc.edu Wed Aug 2 08:57:21 2006 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 2 Aug 2006 11:57:21 -0400 Subject: [openib-general] rdma cm process hang In-Reply-To: <1154531379.32560.13.camel@stevo-desktop> References: <20060801213416.GA18941@osc.edu> <1154531379.32560.13.camel@stevo-desktop> Message-ID: <20060802155721.GA20429@osc.edu> swise at opengridcomputing.com wrote on Wed, 02 Aug 2006 10:09 -0500: > This hang is due to 2 things: > > 1) the amso card will _never_ timeout a connection that is awaiting an > MP reply. That is exactly what is happening here. The fix for this > (timeout mpa connection setup stalls) is a firmware fix and we don't > have the firmware src. 
> > 2) the IWCM holds a reference on the QP until connection setup either > succeeds or fails. So that's where we get the stall. The amso driver > is waiting for the reference on the qp to go to zero, and it never will > because the amso firmware will never timeout the stalled mpa connection > setup. > > Lemme look more at the amso driver and see if this can be avoided. > Perhaps the amso driver can blow away the qp and stop the stall. I > thought thats what it did, but I'll look... Thanks for looking. I'd just come to the conclusion that it was waiting on the qp refcnt, but didn't get much farther when your mail arrived. Testing on mthca would be a bit more difficult here, but hopefully that's not an issue now. Here's an easier test case using ucmatose. Just on a single machine, pick an IP that is theoretically reachable but has nothing listening on it, viz: am30$ ip a s dev iw2 5: iw2: mtu 1500 qdisc noqueue link/ether 00:0d:b2:00:04:8f brd 00:00:00:00:00:00 inet 10.100.9.30/24 brd 10.100.9.255 scope global iw2 am30$ ucmatose 10.100.9.31 cmatose: starting client cmatose: connecting Then hit ctrl-C. The full console log is (with last line appearing only after ctrl-C): c2: c2_alloc_ucontext:135 c2: c2_query_device:68 c2: c2_alloc_pd:163 c2: c2_create_qp:248 c2: c2_query_pkey:110 c2: c2_qp_modify:145 qp=ffff81007f1f3d80, IB_QPS_RESET --> IB_QPS_INIT c2: c2_qp_modify:243 qp=ffff81007f1f3d80, cur_state=IB_QPS_INIT c2: c2_reg_user_mr:442 c2: i=1, offset=2048, page_size=4096, length=100, user_base=504800, virt_base=504800, acc=00000098, c2mr=ffff81003f002f00 c2: [0] 3d24c000 c2: c2_get_qp Returning QP=ffff81007f1f3d80 for QPN=1, device=ffff81003df28800, refcount=1 c2: c2_connect:598 c2: c2_get_qp Returning QP=ffff81007f1f3d80 for QPN=1, device=ffff81003df28800, refcount=2 c2: c2_destroy_qp:290 qp=ffff81007f1f3d80,qp->state=1 -- Pete From mst at mellanox.co.il Wed Aug 2 09:01:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 19:01:36 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060802160136.GB18426@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > I think this is going to trigger warnings with lock debugging because > a locked rwsem gets freed. I'll test this. Maybe lockdep can be made not to warn about this? How? -- MST From vuhuong at mellanox.com Wed Aug 2 09:20:17 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Wed, 02 Aug 2006 09:20:17 -0700 Subject: [openib-general] [PATCH] RFC: srp filesystem data corruption problem/work-around In-Reply-To: <20060802130456.GA15769@mellanox.co.il> References: <20060802130456.GA15769@mellanox.co.il> Message-ID: <44D0D0C1.1060704@mellanox.com> Michael, > +static const u8 mellanox_oui[3] = { 0x02, 0xc9, 0x02 }; Should it be {0x00, 0x02, 0xc9}? From rdreier at cisco.com Wed Aug 2 09:39:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 09:39:37 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060802160136.GB18426@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 19:01:36 +0300") References: <20060802160136.GB18426@mellanox.co.il> Message-ID: Michael> I'll test this. Maybe lockdep can be made not to warn Michael> about this? How? It's not lockdep, it's just general lock debugging. And freeing a locked lock is bad practice anyway. - R. 
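For reference, the usual way to avoid freeing a held rwsem is to quiesce the readers first: once the object can no longer be found by new readers, taking and then releasing the write side waits out any readers still inside, and only then is the memory freed. A minimal sketch with invented names (example_obj, example_teardown), not the ipoib code under discussion:

#include <linux/rwsem.h>
#include <linux/slab.h>

struct example_obj {
	struct rw_semaphore sem;
	/* ... payload protected by sem ... */
};

static void example_teardown(struct example_obj *obj)
{
	/*
	 * Precondition: obj has already been unpublished (removed from the
	 * list or table readers use to find it), so no new down_read(&obj->sem)
	 * can begin after this point.
	 */
	down_write(&obj->sem);	/* blocks until all current readers have left */
	up_write(&obj->sem);	/* release before freeing: never free a held lock */

	kfree(obj);
}
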
From rdreier at cisco.com Wed Aug 2 09:40:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 09:40:07 -0700 Subject: [openib-general] [PATCH] RFC: srp filesystem data corruption problem/work-around In-Reply-To: <44D0D0C1.1060704@mellanox.com> (Vu Pham's message of "Wed, 02 Aug 2006 09:20:17 -0700") References: <20060802130456.GA15769@mellanox.co.il> <44D0D0C1.1060704@mellanox.com> Message-ID: >> +static const u8 mellanox_oui[3] = { 0x02, 0xc9, 0x02 }; > Should it be {0x00, 0x02, 0xc9}? I'll fix that up. - R. From mshefty at ichips.intel.com Wed Aug 2 09:43:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 02 Aug 2006 09:43:48 -0700 Subject: [openib-general] (SPAM?) Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <44D03AF1.8080300@voltaire.com> References: <44D03AF1.8080300@voltaire.com> Message-ID: <44D0D644.4000607@ichips.intel.com> Or Gerlitz wrote: > I suggest you move forward with committing the patch and once the side > discussion is done the next step (which i think is NO-OP per the IB > stack, that is let the ULP handle this) would be implemented. I think the open question is whether a call like rdma_establish() needs to be added to the API. If it is, then I'm not sure that we need to route events directly to the IB CM. Given the current choices, adding rdma_establish() may actually be easier for the ULP than event forwarding, or maybe we just do both... - Sean From mst at mellanox.co.il Wed Aug 2 09:44:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 19:44:21 +0300 Subject: [openib-general] [PATCH] RFC: srp filesystem data corruption problem/work-around In-Reply-To: <44D0D0C1.1060704@mellanox.com> References: <44D0D0C1.1060704@mellanox.com> Message-ID: <20060802164421.GB19103@mellanox.co.il> Quoting r. Vu Pham : > Subject: Re: [PATCH] RFC: srp filesystem data corruption problem/work-around > > Michael, > > > +static const u8 mellanox_oui[3] = { 0x02, 0xc9, 0x02 }; > > Should it be {0x00, 0x02, 0xc9}? Ugh. Of course it should. Like this: -- Add work-around for data corruption observed with Mellanox targets when VA != 0. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..6a8b286 100644 Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-07-31 16:52:26.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-07-31 18:37:58.000000000 +0300 @@ -77,6 +77,14 @@ MODULE_PARM_DESC(topspin_workarounds, static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; +static int mellanox_workarounds = 1; + +module_param(mellanox_workarounds, int, 0444); +MODULE_PARM_DESC(mellanox_workarounds, + "Enable workarounds for Mellanox SRP target bugs if != 0"); + +static const u8 mellanox_oui[3] = { 0x00, 0x02, 0xc9 }; + static void srp_add_one(struct ib_device *device); static void srp_remove_one(struct ib_device *device); static void srp_completion(struct ib_cq *cq, void *target_ptr); @@ -569,7 +577,7 @@ err: return ret; } -static int srp_map_fmr(struct srp_device *dev, struct scatterlist *scat, +static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat, int sg_cnt, struct srp_request *req, struct srp_direct_buf *buf) { @@ -579,10 +587,15 @@ static int srp_map_fmr(struct srp_device int page_cnt; int i, j; int ret; + struct srp_device *dev = target->srp_host->dev; if (!dev->fmr_pool) return -ENODEV; + if ((sg_dma_address(&scat[0]) & ~dev->fmr_page_mask) && + mellanox_workarounds && !memcmp(&target->ioc_guid, mellanox_oui, 3)) + return -EINVAL; + len = page_cnt = 0; for (i = 0; i < sg_cnt; ++i) { if (sg_dma_address(&scat[i]) & ~dev->fmr_page_mask) { @@ -685,7 +699,7 @@ static int srp_map_data(struct scsi_cmnd buf->va = cpu_to_be64(sg_dma_address(scat)); buf->key = cpu_to_be32(target->srp_host->dev->mr->rkey); buf->len = cpu_to_be32(sg_dma_len(scat)); - } else if (srp_map_fmr(target->srp_host->dev, scat, count, req, + } else if (srp_map_fmr(target, scat, count, req, (void *) cmd->add_data)) { /* * FMR mapping failed, and the scatterlist has more -- MST From mst at mellanox.co.il Wed Aug 2 09:47:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 19:47:52 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060802164752.GC19103@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > Michael> I'll test this. Maybe lockdep can be made not to warn > Michael> about this? How? > > It's not lockdep, it's just general lock debugging. And freeing a > locked lock is bad practice anyway. Fine, although I wander why exactly. I'll just respin with up_read then? But won't down_read up_read look weird too? Is there some other way to flush out all readers? -- MST From mshefty at ichips.intel.com Wed Aug 2 09:51:20 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 02 Aug 2006 09:51:20 -0700 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D049B5.6000505@voltaire.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> <44D049B5.6000505@voltaire.com> Message-ID: <44D0D808.1070003@ichips.intel.com> Or Gerlitz wrote: > My guess this is related to the CM not the SM. > > I think there is a chance that the CM on node B does not treat the REQ > sent by A after the reboot as "stale connection" situation and hence > just **silently** dtop it, that is not REJ is sent. I agree. 
This sounds like an issue where the CM is treating the REQ as an old REQ for the established connection, versus a REQ for a new connection. The desired behavior in this situation would be to reject the new request, and force the remote side to disconnect. You can try initializing next_id in cm_alloc_id() (cm.c) to a random value and see if that helps. I will also try to reproduce the problem here. - Sean From mst at mellanox.co.il Wed Aug 2 09:54:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 19:54:42 +0300 Subject: [openib-general] (SPAM?) Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <44D0D644.4000607@ichips.intel.com> References: <44D0D644.4000607@ichips.intel.com> Message-ID: <20060802165442.GD19103@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: (SPAM?) Re: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM > > Or Gerlitz wrote: > > I suggest you move forward with committing the patch and once the side > > discussion is done the next step (which i think is NO-OP per the IB > > stack, that is let the ULP handle this) would be implemented. > > I think the open question is whether a call like rdma_establish() needs to be > added to the API. If it is, then I'm not sure that we need to route events > directly to the IB CM. Given the current choices, adding rdma_establish() may > actually be easier for the ULP than event forwarding, or maybe we just do both... I like rdma_establish. Further, adding this will be a very small patch, safe for 2.6.18. -- MST From mst at mellanox.co.il Wed Aug 2 09:56:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 19:56:03 +0300 Subject: [openib-general] [PATCH] RFC: srp filesystem data corruption problem/work-around In-Reply-To: <18A61515E49B764AB09447A336E51F5661FB90@NAMAIL2.ad.lsil.com> References: <18A61515E49B764AB09447A336E51F5661FB90@NAMAIL2.ad.lsil.com> Message-ID: <20060802165603.GE19103@mellanox.co.il> I guess I don't unless that target is shown to have the same problem? Quoting r. Snider, Tim : Subject: RE: [PATCH] RFC: srp filesystem data corruption problem/work-around You may (or may not) want to include the LSI OUI (0x00,0xa0,0xb8) for future compatability. This change will be available in the storage firmware around Sept. 2006. It might save some future mod. Timothy Snider Storage Architect Strategic Planning, Technology and Architecture LSI Logic Corporation 3718 North Rock Road Wichita, KS 67226 (316) 636-8736 tim.snider at lsil.com -----Original Message----- From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: Wednesday, August 02, 2006 11:44 AM To: Vu Pham Cc: Roland Dreier; Snider, Tim; openib-general at openib.org Subject: Re: [PATCH] RFC: srp filesystem data corruption problem/work-around Quoting r. Vu Pham : > Subject: Re: [PATCH] RFC: srp filesystem data corruption > problem/work-around > > Michael, > > > +static const u8 mellanox_oui[3] = { 0x02, 0xc9, 0x02 }; > > Should it be {0x00, 0x02, 0xc9}? Ugh. Of course it should. Like this: -- Add work-around for data corruption observed with Mellanox targets when VA != 0. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..6a8b286 100644 Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-07-31 16:52:26.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-07-31 18:37:58.000000000 +0300 @@ -77,6 +77,14 @@ MODULE_PARM_DESC(topspin_workarounds, static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; +static int mellanox_workarounds = 1; + +module_param(mellanox_workarounds, int, 0444); +MODULE_PARM_DESC(mellanox_workarounds, + "Enable workarounds for Mellanox SRP target bugs if != 0"); + +static const u8 mellanox_oui[3] = { 0x00, 0x02, 0xc9 }; + static void srp_add_one(struct ib_device *device); static void srp_remove_one(struct ib_device *device); static void srp_completion(struct ib_cq *cq, void *target_ptr); @@ -569,7 +577,7 @@ err: return ret; } -static int srp_map_fmr(struct srp_device *dev, struct scatterlist *scat, +static int srp_map_fmr(struct srp_target_port *target, struct +scatterlist *scat, int sg_cnt, struct srp_request *req, struct srp_direct_buf *buf) { @@ -579,10 +587,15 @@ static int srp_map_fmr(struct srp_device int page_cnt; int i, j; int ret; + struct srp_device *dev = target->srp_host->dev; if (!dev->fmr_pool) return -ENODEV; + if ((sg_dma_address(&scat[0]) & ~dev->fmr_page_mask) && + mellanox_workarounds && !memcmp(&target->ioc_guid, mellanox_oui, 3)) + return -EINVAL; + len = page_cnt = 0; for (i = 0; i < sg_cnt; ++i) { if (sg_dma_address(&scat[i]) & ~dev->fmr_page_mask) { @@ -685,7 +699,7 @@ static int srp_map_data(struct scsi_cmnd buf->va = cpu_to_be64(sg_dma_address(scat)); buf->key = cpu_to_be32(target->srp_host->dev->mr->rkey); buf->len = cpu_to_be32(sg_dma_len(scat)); - } else if (srp_map_fmr(target->srp_host->dev, scat, count, req, + } else if (srp_map_fmr(target, scat, count, req, (void *) cmd->add_data)) { /* * FMR mapping failed, and the scatterlist has more -- MST -- MST From Tim.Snider at engenio.com Wed Aug 2 09:50:09 2006 From: Tim.Snider at engenio.com (Snider, Tim) Date: Wed, 2 Aug 2006 10:50:09 -0600 Subject: [openib-general] [PATCH] RFC: srp filesystem data corruption problem/work-around Message-ID: <18A61515E49B764AB09447A336E51F5661FB90@NAMAIL2.ad.lsil.com> You may (or may not) want to include the LSI OUI (0x00,0xa0,0xb8) for future compatability. This change will be available in the storage firmware around Sept. 2006. It might save some future mod. Timothy Snider Storage Architect Strategic Planning, Technology and Architecture LSI Logic Corporation 3718 North Rock Road Wichita, KS 67226 (316) 636-8736 tim.snider at lsil.com -----Original Message----- From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: Wednesday, August 02, 2006 11:44 AM To: Vu Pham Cc: Roland Dreier; Snider, Tim; openib-general at openib.org Subject: Re: [PATCH] RFC: srp filesystem data corruption problem/work-around Quoting r. Vu Pham : > Subject: Re: [PATCH] RFC: srp filesystem data corruption > problem/work-around > > Michael, > > > +static const u8 mellanox_oui[3] = { 0x02, 0xc9, 0x02 }; > > Should it be {0x00, 0x02, 0xc9}? Ugh. Of course it should. Like this: -- Add work-around for data corruption observed with Mellanox targets when VA != 0. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..6a8b286 100644 Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-07-31 16:52:26.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-07-31 18:37:58.000000000 +0300 @@ -77,6 +77,14 @@ MODULE_PARM_DESC(topspin_workarounds, static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; +static int mellanox_workarounds = 1; + +module_param(mellanox_workarounds, int, 0444); +MODULE_PARM_DESC(mellanox_workarounds, + "Enable workarounds for Mellanox SRP target bugs if != 0"); + +static const u8 mellanox_oui[3] = { 0x00, 0x02, 0xc9 }; + static void srp_add_one(struct ib_device *device); static void srp_remove_one(struct ib_device *device); static void srp_completion(struct ib_cq *cq, void *target_ptr); @@ -569,7 +577,7 @@ err: return ret; } -static int srp_map_fmr(struct srp_device *dev, struct scatterlist *scat, +static int srp_map_fmr(struct srp_target_port *target, struct +scatterlist *scat, int sg_cnt, struct srp_request *req, struct srp_direct_buf *buf) { @@ -579,10 +587,15 @@ static int srp_map_fmr(struct srp_device int page_cnt; int i, j; int ret; + struct srp_device *dev = target->srp_host->dev; if (!dev->fmr_pool) return -ENODEV; + if ((sg_dma_address(&scat[0]) & ~dev->fmr_page_mask) && + mellanox_workarounds && !memcmp(&target->ioc_guid, mellanox_oui, 3)) + return -EINVAL; + len = page_cnt = 0; for (i = 0; i < sg_cnt; ++i) { if (sg_dma_address(&scat[i]) & ~dev->fmr_page_mask) { @@ -685,7 +699,7 @@ static int srp_map_data(struct scsi_cmnd buf->va = cpu_to_be64(sg_dma_address(scat)); buf->key = cpu_to_be32(target->srp_host->dev->mr->rkey); buf->len = cpu_to_be32(sg_dma_len(scat)); - } else if (srp_map_fmr(target->srp_host->dev, scat, count, req, + } else if (srp_map_fmr(target, scat, count, req, (void *) cmd->add_data)) { /* * FMR mapping failed, and the scatterlist has more -- MST From mst at mellanox.co.il Wed Aug 2 09:56:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 19:56:41 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060802165641.GF19103@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > > And it still doesn't. > > Jack here also confirms that the problem still exits if SM clears P_Key table > > and then later readds a P_Key. > > > > Could you take a look please? > > So is the solution to call ipoib_pkey_dev_delay_open() in ipoib_ib_dev_flush()? > Probably .. patch? We'll test it here. -- MST From mshefty at ichips.intel.com Wed Aug 2 10:00:09 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 02 Aug 2006 10:00:09 -0700 Subject: [openib-general] APM: QP migration state change when failover triggered by hw In-Reply-To: <200608021043.17017.jackm@mellanox.co.il> References: <000201c6b59c$17ef1d30$29cc180a@amr.corp.intel.com> <200608021043.17017.jackm@mellanox.co.il> Message-ID: <44D0DA19.9000005@ichips.intel.com> Jack Morgenstein wrote: > This could be a bit complicated. For example, say there are two possible > paths. After migration has occurred the first time, there is no guarantee > that the original path has become available again. That's okay. I would expect the code to be able to handle this. 
> There is also a race condition here in your proposal -- the new Alt Path data > must be specified between the MIGRATED event and the > communication-established event on the migrated path (so that the LAP message > may be correctly sent to the remote node). I'm not following you here. The CM currently sends all messages along the primary path specified in the REQ, but saves alternate path information. The CM needs to know when to begin using the alternate path. The implementation to do this is missing. > However, after the first migration occurs, you need to do the following: > 1. send a LAP packet to the remote node, containing the new alt path info. > 2. load NEW alt path information (ib_modify_qp, rts->rts), including remote > LID received in LAP packet. > 3. Rearm path migration (ib_modify_qp, rts->rts) > > Are you certain that the above 3 steps have taken place? I'm not sure that step 1 can occur if the primary path has failed. The CM doesn't know to send future MADs out a different path. - Sean From mst at mellanox.co.il Wed Aug 2 10:02:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 20:02:21 +0300 Subject: [openib-general] Multicast traffic performace of OFED 1.0 ipoib In-Reply-To: <6a122cc00608020430k4a520d10xfebb256829feb752@mail.gmail.com> References: <6a122cc00608020430k4a520d10xfebb256829feb752@mail.gmail.com> Message-ID: <20060802170221.GG19103@mellanox.co.il> Quoting r. Moni Levy : > We are looking for the max PPT rate (100 byte packets size) without > losses Maybe IPoIB send side is faster than the receive side, so you are overflowing the RQ? Would not the number of packets arrived be a better bechmark? -- MST From rdreier at cisco.com Wed Aug 2 10:30:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 10:30:46 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060802164752.GC19103@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 19:47:52 +0300") References: <20060802164752.GC19103@mellanox.co.il> Message-ID: Roland> It's not lockdep, it's just general lock debugging. And Roland> freeing a locked lock is bad practice anyway. Michael> Fine, although I wander why exactly. I'll just respin Michael> with up_read then? But won't down_read up_read look weird Michael> too? Is there some other way to flush out all readers? It's bad practice because in general you don't know who else is blocked on the lock getting freed, so freeing it could lead to deadlock or use-after-free. In this case using an rwsem seems sort of awkward anyway. Wouldn't it match better with what's really going on to have a reference count, and wait for it to go to zero? - R. From rdreier at cisco.com Wed Aug 2 10:31:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 10:31:42 -0700 Subject: [openib-general] (SPAM?) Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <20060802165442.GD19103@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 19:54:42 +0300") References: <44D0D644.4000607@ichips.intel.com> <20060802165442.GD19103@mellanox.co.il> Message-ID: Michael> I like rdma_establish. Further, adding this will be a Michael> very small patch, safe for 2.6.18. Really? I don't think we want to add new API calls after -rc3, do we? 
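[Aside: for readers following the rwsem-versus-refcount exchange above, here is a minimal, self-contained sketch of what Roland is suggesting -- a reference count paired with a waitqueue, essentially a hand-rolled kref. The structure and function names are invented for illustration; this is not the ib_sa patch that appears later in the thread. Each callback is bracketed by _get()/_put(), and teardown drops its own reference and then sleeps until the count drains to zero, instead of freeing a lock that another thread may still be holding.

    #include <linux/wait.h>
    #include <asm/atomic.h>

    struct example_client {
            atomic_t          users;
            wait_queue_head_t wait;
    };

    static void example_client_register(struct example_client *c)
    {
            /* One reference is held by the registrant itself. */
            atomic_set(&c->users, 1);
            init_waitqueue_head(&c->wait);
    }

    static void example_client_get(struct example_client *c)
    {
            atomic_inc(&c->users);
    }

    static void example_client_put(struct example_client *c)
    {
            if (atomic_dec_and_test(&c->users))
                    wake_up(&c->wait);
    }

    static void example_client_unregister(struct example_client *c)
    {
            /*
             * Drop the registrant's reference, then wait for every
             * in-flight user to do its put().  Nothing gets freed
             * while someone still holds a reference, which avoids
             * the freed-while-locked problem described above.
             */
            example_client_put(c);
            wait_event(c->wait, atomic_read(&c->users) == 0);
    }

A caller wrapping a callback would do example_client_get() before invoking it and example_client_put() afterwards, so the unregister path (e.g. module unload) cannot complete while a callback is still running.]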
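[Aside: to make the three-step recovery sequence quoted in the APM thread above a bit more concrete, here is a hedged sketch of steps 2 and 3 -- loading a replacement alternate path and rearming migration, both as RTS-to-RTS transitions through ib_modify_qp(). Step 1, the LAP/APR exchange that agrees on the new path, is CM traffic and is omitted. The helper and its parameters are invented for illustration; only the attribute fields, mask bits, and ib_modify_qp() itself are the existing verbs API.

    #include <linux/string.h>
    #include <rdma/ib_verbs.h>

    /*
     * 'qp' is an RC QP in RTS that has already migrated; 'new_alt'
     * describes the alternate path negotiated via LAP/APR.
     */
    static int example_reload_and_rearm(struct ib_qp *qp,
                                        struct ib_ah_attr *new_alt,
                                        u16 alt_pkey_index,
                                        u8 alt_port_num, u8 alt_timeout)
    {
            struct ib_qp_attr attr;
            int ret;

            memset(&attr, 0, sizeof attr);

            /* Step 2: load the new alternate path (RTS -> RTS). */
            attr.qp_state       = IB_QPS_RTS;
            attr.alt_ah_attr    = *new_alt;
            attr.alt_pkey_index = alt_pkey_index;
            attr.alt_port_num   = alt_port_num;
            attr.alt_timeout    = alt_timeout;
            ret = ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_ALT_PATH);
            if (ret)
                    return ret;

            /* Step 3: rearm so the HCA can fail over again on its own. */
            attr.qp_state       = IB_QPS_RTS;
            attr.path_mig_state = IB_MIG_REARM;
            return ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_PATH_MIG_STATE);
    }

As the thread notes, the alternate path may also be reloaded while the QP is still armed; the two modify calls are kept separate here only to mirror the numbered steps being discussed.]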
From venkatesh.babu at 3leafnetworks.com Wed Aug 2 11:10:02 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Wed, 02 Aug 2006 11:10:02 -0700 Subject: [openib-general] APM: QP migration state change when failover triggered by hw In-Reply-To: <200608021043.17017.jackm@mellanox.co.il> References: <000201c6b59c$17ef1d30$29cc180a@amr.corp.intel.com> <200608021043.17017.jackm@mellanox.co.il> Message-ID: <44D0EA7A.1020503@3leafnetworks.com> >Babu, regarding the migration event that you are seeing, are you sure that it >is from the migration transition that does not occur? Possibly, the >problematic transition is the second one, which occurs after specifying a new >alternate path and rearming APM? > > I am sure that it is when the cable is disconnected for the first time, and not by the second transition. I will reload the alternate path with LAP messages only when the port's state changes to IB_PORT_ACTIVE. If the remote passive node's port has disconnected then I am expecting for the notice event saying that remote port transitioned to IB_PORT_ACTIVE. In gen1 I was using tsIbSetInServiceNoticeHandler() for this. In OFED we don't have these interfaces yet. >It seems more likely to me that the first transition does occur, since you >receive a MIG event on both sides, and since the alt path data is loaded by >you during the initial bringup of the RC QP pair(either at init->rtr, or at >rtr->rts). If you are receiving the MIGRATED event, the qp is already in the >migrated state. > >However, after the first migration occurs, you need to do the following: >1. send a LAP packet to the remote node, containing the new alt path info. >2. load NEW alt path information (ib_modify_qp, rts->rts), including remote >LID received in LAP packet. >3. Rearm path migration (ib_modify_qp, rts->rts) > >Are you certain that the above 3 steps have taken place? > > Yes I am doing all these steps only when I get the event IB_PORT_ACTIVE or InServiceNotice event is received for the remote port. >Note that 1. and 2. above are a separate phase from 3., since the IB Spec >allows changing the alternate path while the QP is still armed, not just when >it has migrated. > >- Jack > > From mst at mellanox.co.il Wed Aug 2 10:57:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 20:57:18 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060802175718.GA20034@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > Roland> It's not lockdep, it's just general lock debugging. And > Roland> freeing a locked lock is bad practice anyway. > > Michael> Fine, although I wander why exactly. I'll just respin > Michael> with up_read then? But won't down_read up_read look weird > Michael> too? Is there some other way to flush out all readers? > > It's bad practice because in general you don't know who else is > blocked on the lock getting freed, so freeing it could lead to > deadlock or use-after-free. > > In this case using an rwsem seems sort of awkward anyway. Wouldn't it > match better with what's really going on to have a reference count, > and wait for it to go to zero? So basically a kref + a waitqueue? Fine. -- MST From mst at mellanox.co.il Wed Aug 2 10:59:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 20:59:11 +0300 Subject: [openib-general] (SPAM?) 
Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: References: Message-ID: <20060802175911.GB20034@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: (SPAM?) Re: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM > > Michael> I like rdma_establish. Further, adding this will be a > Michael> very small patch, safe for 2.6.18. > > Really? I don't think we want to add new API calls after -rc3, do we? > Donnu. Without this, there's no way to handle a lost RTU in 2.6.18 CMA though. -- MST From rdreier at cisco.com Wed Aug 2 11:01:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 Aug 2006 11:01:32 -0700 Subject: [openib-general] (SPAM?) Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <20060802175911.GB20034@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 20:59:11 +0300") References: <20060802175911.GB20034@mellanox.co.il> Message-ID: Michael> Donnu. Without this, there's no way to handle a lost RTU Michael> in 2.6.18 CMA though. It doesn't seem to be a problem anyone is hitting in practice though. I would hold off and agree on the best solution, and put that in 2.6.19. - R. From sean.hefty at intel.com Wed Aug 2 11:24:32 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 2 Aug 2006 11:24:32 -0700 Subject: [openib-general] [PATCH] cm: randomize starting local CM ID In-Reply-To: <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> Message-ID: <000101c6b660$e6bbf0b0$ff0da8c0@amr.corp.intel.com> Eric, Can you try running with this patch installed? This should decrease the chances that a REQ will be seen as an old duplicate, rather than a new request. Signed-off-by: Sean Hefty --- Index: cm.c =================================================================== --- cm.c (revision 8647) +++ cm.c (working copy) @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -73,6 +74,7 @@ static struct ib_cm { struct rb_root remote_id_table; struct rb_root remote_sidr_table; struct idr local_id_table; + int next_id; struct workqueue_struct *wq; } cm; @@ -301,11 +303,11 @@ static int cm_alloc_id(struct cm_id_priv { unsigned long flags; int ret; - static int next_id; do { spin_lock_irqsave(&cm.lock, flags); - ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, next_id++, + ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, + cm.next_id++, (__force int *) &cm_id_priv->id.local_id); spin_unlock_irqrestore(&cm.lock, flags); } while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) ); @@ -3390,6 +3392,7 @@ static int __init ib_cm_init(void) cm.remote_id_table = RB_ROOT; cm.remote_qp_table = RB_ROOT; cm.remote_sidr_table = RB_ROOT; + get_random_bytes(&cm.next_id, sizeof cm.next_id); idr_init(&cm.local_id_table); idr_pre_get(&cm.local_id_table, GFP_KERNEL); From mst at mellanox.co.il Wed Aug 2 11:36:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 21:36:41 +0300 Subject: [openib-general] (SPAM?) Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: References: Message-ID: <20060802183641.GA20435@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: (SPAM?) Re: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM > > Michael> Donnu. Without this, there's no way to handle a lost RTU > Michael> in 2.6.18 CMA though. > > It doesn't seem to be a problem anyone is hitting in practice though. 
> I would hold off and agree on the best solution, and put that in 2.6.19. > Fair enough. -- MST From mst at mellanox.co.il Wed Aug 2 11:38:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Aug 2006 21:38:32 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060802183832.GB20435@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > Roland> It's not lockdep, it's just general lock debugging. And > Roland> freeing a locked lock is bad practice anyway. > > Michael> Fine, although I wander why exactly. I'll just respin > Michael> with up_read then? But won't down_read up_read look weird > Michael> too? Is there some other way to flush out all readers? > > It's bad practice because in general you don't know who else is > blocked on the lock getting freed, so freeing it could lead to > deadlock or use-after-free. > > In this case using an rwsem seems sort of awkward anyway. Wouldn't it > match better with what's really going on to have a reference count, > and wait for it to go to zero? > > - R. > Like this? -- Require registration with SA module, to prevent module text from going away while sa query callback is still running, and update all users. Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d6f99d5..bf668b3 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -60,6 +60,10 @@ static struct ib_client cma_client = { .remove = cma_remove_one }; +static struct ib_sa_client cma_sa_client = { + .name = "cma" +}; + static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); @@ -1140,7 +1144,7 @@ static int cma_query_ib_route(struct rdm path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; - id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->query_id = ib_sa_path_rec_get(&cma_sa_client, id_priv->id.device, id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, @@ -1910,6 +1914,8 @@ static int cma_init(void) ret = ib_register_client(&cma_client); if (ret) goto err; + + ib_sa_register_client(&cma_sa_client); return 0; err: @@ -1919,6 +1925,7 @@ err: static void cma_cleanup(void) { + ib_sa_unregister_client(&cma_sa_client); ib_unregister_client(&cma_client); destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index aeda484..9e94666 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -78,6 +78,7 @@ struct ib_sa_query { struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; + struct ib_sa_client *client; int id; }; @@ -532,6 +533,17 @@ retry: return ret ? 
ret : id; } +static inline void ib_sa_client_get(struct ib_sa_client *client) +{ + atomic_inc(&client->users); +} + +static inline void ib_sa_client_put(struct ib_sa_client *client) +{ + if (atomic_dec_and_test(&client->users)) + wake_up(&client->wait); +} + static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, int status, struct ib_sa_mad *mad) @@ -539,6 +551,7 @@ static void ib_sa_path_rec_callback(stru struct ib_sa_path_query *query = container_of(sa_query, struct ib_sa_path_query, sa_query); + ib_sa_client_get(sa_query->client); if (mad) { struct ib_sa_path_rec rec; @@ -547,6 +560,7 @@ static void ib_sa_path_rec_callback(stru query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query->client); } static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) @@ -556,6 +570,7 @@ static void ib_sa_path_rec_release(struc /** * ib_sa_path_rec_get - Start a Path get query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:Path Record to send in query @@ -578,7 +593,8 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. */ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -619,6 +635,7 @@ int ib_sa_path_rec_get(struct ib_device mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; query->sa_query.port = port; @@ -653,6 +670,7 @@ static void ib_sa_service_rec_callback(s struct ib_sa_service_query *query = container_of(sa_query, struct ib_sa_service_query, sa_query); + ib_sa_client_get(sa_query->client); if (mad) { struct ib_sa_service_rec rec; @@ -661,6 +679,7 @@ static void ib_sa_service_rec_callback(s query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query->client); } static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) @@ -670,6 +689,7 @@ static void ib_sa_service_rec_release(st /** * ib_sa_service_rec_query - Start Service Record operation + * @client:client object used to track the query * @device:device to send request on * @port_num: port number to send request on * @method:SA method - should be get, set, or delete @@ -694,7 +714,8 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. */ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -740,6 +761,7 @@ int ib_sa_service_rec_query(struct ib_de mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? 
ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; query->sa_query.port = port; @@ -775,6 +797,7 @@ static void ib_sa_mcmember_rec_callback( struct ib_sa_mcmember_query *query = container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + ib_sa_client_get(sa_query->client); if (mad) { struct ib_sa_mcmember_rec rec; @@ -783,6 +806,7 @@ static void ib_sa_mcmember_rec_callback( query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query->client); } static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) @@ -790,7 +814,8 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -832,6 +857,7 @@ int ib_sa_mcmember_rec_query(struct ib_d mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; query->sa_query.port = port; @@ -866,6 +892,7 @@ static void send_handler(struct ib_mad_a struct ib_sa_query *query = mad_send_wc->send_buf->context[0]; unsigned long flags; + ib_sa_client_get(query->client); if (query->callback) switch (mad_send_wc->status) { case IB_WC_SUCCESS: @@ -881,6 +908,7 @@ static void send_handler(struct ib_mad_a query->callback(query, -EIO, NULL); break; } + ib_sa_client_put(query->client); spin_lock_irqsave(&idr_lock, flags); idr_remove(&query_idr, query->id); @@ -900,6 +928,7 @@ static void recv_handler(struct ib_mad_a mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; query = mad_buf->context[0]; + ib_sa_client_get(query->client); if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, @@ -909,6 +938,7 @@ static void recv_handler(struct ib_mad_a else query->callback(query, -EIO, NULL); } + ib_sa_client_put(query->client); ib_free_recv_mad(mad_recv_wc); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 474aa21..28a9f0f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -390,4 +390,5 @@ #define IPOIB_GID_RAW_ARG(gid) ((u8 *)(g #define IPOIB_GID_ARG(gid) IPOIB_GID_RAW_ARG((gid).raw) +extern struct ib_sa_client ipoib_sa_client; #endif /* _IPOIB_H */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cf71d2a..ca10724 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -91,6 +91,10 @@ static struct ib_client ipoib_client = { .remove = ipoib_remove_one }; +struct ib_sa_client ipoib_sa_client = { + .name = "ipoib" +}; + int ipoib_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -459,7 +463,7 @@ static int path_rec_start(struct net_dev init_completion(&path->done); path->query_id = - ib_sa_path_rec_get(priv->ca, priv->port, + ib_sa_path_rec_get(&ipoib_sa_client, priv->ca, priv->port, &path->pathrec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | @@ -1185,6 +1189,8 @@ static int __init ipoib_init_module(void if (ret) goto err_wq; + ib_sa_register_client(&ipoib_sa_client); + return 0; err_wq: @@ -1198,6 +1204,7 @@ err_fs: static void __exit 
ipoib_cleanup_module(void) { + ib_sa_unregister_client(&ipoib_sa_client); ib_unregister_client(&ipoib_client); ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index b5e6a7b..f688323 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -360,7 +360,7 @@ #endif init_completion(&mcast->done); - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | @@ -484,8 +484,8 @@ static void ipoib_mcast_join(struct net_ init_completion(&mcast->done); - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, - mcast->backoff * 1000, GFP_ATOMIC, + ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, + comp_mask, mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_join_complete, mcast, &mcast->query); @@ -680,7 +680,7 @@ static int ipoib_mcast_leave(struct net_ * Just make one shot at leaving and don't wait for a reply; * if we fail, too bad. */ - ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + ret = ib_sa_mcmember_rec_delete(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..0856d78 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -88,6 +88,10 @@ static struct ib_client srp_client = { .remove = srp_remove_one }; +static struct ib_sa_client srp_sa_client = { + .name = "srp" +}; + static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) { return (struct srp_target_port *) host->hostdata; @@ -259,7 +263,8 @@ static int srp_lookup_path(struct srp_ta init_completion(&target->done); - target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev->dev, + target->path_query_id = ib_sa_path_rec_get(&srp_sa_client, + target->srp_host->dev->dev, target->srp_host->port, &target->path, IB_SA_PATH_REC_DGID | diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index c99e442..07e4b81 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -37,6 +37,8 @@ #ifndef IB_SA_H #define IB_SA_H #include +#include +#include #include #include @@ -250,11 +252,18 @@ struct ib_sa_service_rec { u64 data64[2]; }; +struct ib_sa_client { + char *name; + atomic_t users; + wait_queue_head_t wait; +}; + struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -264,7 +273,8 @@ int ib_sa_path_rec_get(struct ib_device void *context, struct ib_sa_query **query); -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -275,7 +285,8 @@ int ib_sa_mcmember_rec_query(struct ib_d void *context, struct ib_sa_query **query); -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device 
*device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, @@ -288,6 +299,7 @@ int ib_sa_service_rec_query(struct ib_de /** * ib_sa_mcmember_rec_set - Start an MCMember set query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -311,7 +323,8 @@ int ib_sa_service_rec_query(struct ib_de * cancel the query. */ static inline int -ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_set(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -321,7 +334,7 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, timeout_ms, gfp_mask, callback, @@ -330,6 +343,7 @@ ib_sa_mcmember_rec_set(struct ib_device /** * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -353,7 +367,8 @@ ib_sa_mcmember_rec_set(struct ib_device * cancel the query. */ static inline int -ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_delete(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -363,7 +378,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, timeout_ms, gfp_mask, callback, @@ -378,4 +393,23 @@ int ib_init_ah_from_path(struct ib_devic struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); +/** + * ib_sa_register_client - register SA client object + * @client:client object used to track queries + */ +static inline void ib_sa_register_client(struct ib_sa_client *client) +{ + atomic_set(&client->users, 1); + init_waitqueue_head(&client->wait); +} + +/** + * ib_sa_unregister_client - unregister SA client object + * @client:client object used to track queries + */ +static inline void ib_sa_unregister_client(struct ib_sa_client *client) +{ + wait_event(client->wait, atomic_read(&client->users) == 0); +} + #endif /* IB_SA_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h -- MST From or.gerlitz at gmail.com Wed Aug 2 13:05:26 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 2 Aug 2006 22:05:26 +0200 Subject: [openib-general] (SPAM?) Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <44D0D644.4000607@ichips.intel.com> References: <44D03AF1.8080300@voltaire.com> <44D0D644.4000607@ichips.intel.com> Message-ID: <15ddcffd0608021305o6b320d70y26b02a4c341e3ade@mail.gmail.com> On 8/2/06, Sean Hefty wrote: > I think the open question is whether a call like rdma_establish() needs to be > added to the API. If it is, then I'm not sure that we need to route events > directly to the IB CM. Given the current choices, adding rdma_establish() may > actually be easier for the ULP than event forwarding, or maybe we just do both... 
Let me see that i follow: rdma_established is to be used by app which is a CMA consumer and now polls from a CQ a completion associated with a QP whose state is RTR (ie no RTU == ESTABLISHED event) was received yet. This may solve the need of apps who do not want to wait for the synthesized ESTABLISHED event that would be delivered at some future point with the approach taken by your patch. Moreover, the solution is one of: - the patch you sent - enforcing the ULP to call rdma_establish (or cm_establish for direct CM consumers) else a repeatedly lost RTU case is not handled. what do you mean by event forwarding, is it to fwd the COMM_EST event from the app QP handler to the CM/CMA? Or. From jlentini at netapp.com Wed Aug 2 13:14:54 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 2 Aug 2006 16:14:54 -0400 (EDT) Subject: [openib-general] making sense of dapl (and dat.conf) In-Reply-To: <20060801204806.GC13356@minbar-g5.scl.ameslab.gov> References: <20060801204806.GC13356@minbar-g5.scl.ameslab.gov> Message-ID: On Tue, 1 Aug 2006, Troy Benjegerdes wrote: > So, let's suppose I build ibverbs, libecha/libmthca, and dapl from > subversion trunk.. what should my /etc/dat.conf file look like so things > actually work? > > Right now I have: You shouldn't have multiple entries with the same ia_name. > OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so > mv_dapl.1.2 "10.40.4.56 0" "" > OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so > mv_dapl.1.2 "10.40.4.54 0" "" > OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so > mv_dapl.1.2 "p5l6.ib 0" "" > OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so > mv_dapl.1.2 "p5l4.ib 0" "" > OpenIB-cma-netdev u1.2 nonthreadsafe default > /usr/local/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" > > Is there any way I can avoid having to have a different config file for > each machine? If you use a CMA provider, I believe that this should work on all nodes: OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" > Is this documented anywhere what all these fields actually mean in a > way that makes sense to those of us who haven't read the DAT > specification, or are familiar with ibverbs? In the uDAPL docs directory, there is a brief descpription in the same dat.conf. There is a longer description of how the DAT registry works in dapl_registry_design.txt. From mshefty at ichips.intel.com Wed Aug 2 13:20:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 02 Aug 2006 13:20:56 -0700 Subject: [openib-general] (SPAM?) Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <15ddcffd0608021305o6b320d70y26b02a4c341e3ade@mail.gmail.com> References: <44D03AF1.8080300@voltaire.com> <44D0D644.4000607@ichips.intel.com> <15ddcffd0608021305o6b320d70y26b02a4c341e3ade@mail.gmail.com> Message-ID: <44D10928.2080605@ichips.intel.com> Or Gerlitz wrote: > Let me see that i follow: rdma_established is to be used by app which > is a CMA consumer and now polls from a CQ a completion associated with > a QP whose state is RTR (ie no RTU == ESTABLISHED event) was received > yet. This may solve the need of apps who do not want to wait for the > synthesized ESTABLISHED event that would be delivered at some future > point with the approach taken by your patch. 
correct > Moreover, the solution is one of: > - the patch you sent > - enforcing the ULP to call rdma_establish (or cm_establish for direct > CM consumers) else a repeatedly lost RTU case is not handled. or both, or we do nothing and let the connection fail > what do you mean by event forwarding, is it to fwd the COMM_EST event > from the app QP handler to the CM/CMA? I meant forwarding the COMM_EST event directly to the IB CM, as well as the application. - Sean From swise at opengridcomputing.com Wed Aug 2 13:27:47 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 Aug 2006 15:27:47 -0500 Subject: [openib-general] [PATCH v4 0/2][RFC] iWARP Core Support Message-ID: <20060802202747.24212.10931.stgit@dell3.ogc.int> Roland, Here is the iWARP Core Support patchset merged to your latest for-2.6.19 branch. It has gone through 3 reviews on lklm and netdev a while ago, and I think its ready to be pulled in. Steve. ---- This patchset defines the modifications to the Linux infiniband subsystem to support iWARP devices. The patchset consists of 2 patches: 1 - New iWARP CM implementation. 2 - Core changes to support iWARP. Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Wed Aug 2 13:27:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 Aug 2006 15:27:52 -0500 Subject: [openib-general] [PATCH v4 2/2] iWARP Core Changes. In-Reply-To: <20060802202747.24212.10931.stgit@dell3.ogc.int> References: <20060802202747.24212.10931.stgit@dell3.ogc.int> Message-ID: <20060802202752.24212.73349.stgit@dell3.ogc.int> This patch contains modifications to the existing rdma header files, core files, drivers, and ulp files to support iWARP. V2 Review updates: V1 Review updates: - copy_addr() -> rdma_copy_addr() - dst_dev_addr param in rdma_copy_addr to const. - various spacing nits with recasting - include linux/inetdevice.h to get ip_dev_find() prototype. 
- dev_put() after successful ip_dev_find() --- drivers/infiniband/core/Makefile | 4 drivers/infiniband/core/addr.c | 19 + drivers/infiniband/core/cache.c | 8 - drivers/infiniband/core/cm.c | 3 drivers/infiniband/core/cma.c | 356 +++++++++++++++++++++++--- drivers/infiniband/core/device.c | 6 drivers/infiniband/core/mad.c | 11 + drivers/infiniband/core/sa_query.c | 5 drivers/infiniband/core/smi.c | 18 + drivers/infiniband/core/sysfs.c | 18 + drivers/infiniband/core/ucm.c | 5 drivers/infiniband/core/user_mad.c | 9 - drivers/infiniband/hw/ipath/ipath_verbs.c | 2 drivers/infiniband/hw/mthca/mthca_provider.c | 2 drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 + drivers/infiniband/ulp/srp/ib_srp.c | 2 include/rdma/ib_addr.h | 16 + include/rdma/ib_verbs.h | 39 ++- 18 files changed, 438 insertions(+), 93 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 68e73ec..163d991 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -1,7 +1,7 @@ infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o $(infiniband-y) + ib_cm.o iw_cm.o $(infiniband-y) obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_sa-y := sa_query.o ib_cm-y := cm.o +iw_cm-y := iwcm.o + rdma_cm-y := cma.o ib_addr-y := addr.o diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index d294bbc..83f84ef 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -32,6 +32,7 @@ #include #include #include #include +#include #include #include #include @@ -60,12 +61,15 @@ static LIST_HEAD(req_list); static DECLARE_WORK(work, process_req, NULL); static struct workqueue_struct *addr_wq; -static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, - unsigned char *dst_dev_addr) +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + const unsigned char *dst_dev_addr) { switch (dev->type) { case ARPHRD_INFINIBAND: - dev_addr->dev_type = IB_NODE_CA; + dev_addr->dev_type = RDMA_NODE_IB_CA; + break; + case ARPHRD_ETHER: + dev_addr->dev_type = RDMA_NODE_RNIC; break; default: return -EADDRNOTAVAIL; @@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); return 0; } +EXPORT_SYMBOL(rdma_copy_addr); int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) { @@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a if (!dev) return -EADDRNOTAVAIL; - ret = copy_addr(dev_addr, dev, NULL); + ret = rdma_copy_addr(dev_addr, dev, NULL); dev_put(dev); return ret; } @@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so /* If the device does ARP internally, return 'done' */ if (rt->idev->dev->flags & IFF_NOARP) { - copy_addr(addr, rt->idev->dev, NULL); + rdma_copy_addr(addr, rt->idev->dev, NULL); goto put; } @@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so src_in->sin_addr.s_addr = rt->rt_src; } - ret = copy_addr(addr, neigh->dev, neigh->ha); + ret = rdma_copy_addr(addr, neigh->dev, neigh->ha); release: neigh_release(neigh); put: @@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc if (ZERONET(src_ip)) { src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_ip; - ret = copy_addr(addr, dev, dev->dev_addr); + ret = rdma_copy_addr(addr, dev, dev->dev_addr); } else if (LOOPBACK(src_ip)) { ret = rdma_translate_ip((struct sockaddr 
*)dst_in, addr); if (!ret) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index e05ca2c..061858c 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -32,13 +32,12 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: cache.c 6885 2006-05-03 18:22:02Z sean.hefty $ */ #include #include #include -#include /* INIT_WORK, schedule_work(), flush_scheduled_work() */ #include @@ -62,12 +61,13 @@ struct ib_update_work { static inline int start_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : 1; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; } static inline int end_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; } int ib_get_cached_gid(struct ib_device *device, diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index f85c97f..21312fe 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3260,6 +3260,9 @@ static void cm_add_one(struct ib_device int ret; u8 i; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d6f99d5..0e9f476 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -35,6 +35,7 @@ #include #include #include #include +#include #include @@ -43,6 +44,7 @@ #include #include #include #include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -124,6 +126,7 @@ struct rdma_id_private { int query_id; union { struct ib_cm_id *ib; + struct iw_cm_id *iw; } cm_id; u32 seq_num; @@ -259,14 +262,23 @@ static void cma_detach_from_dev(struct r id_priv->cma_dev = NULL; } -static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) +static int cma_acquire_dev(struct rdma_id_private *id_priv) { + enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; struct cma_device *cma_dev; union ib_gid gid; int ret = -ENODEV; - ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid), - + switch (rdma_node_get_transport(dev_type)) { + case RDMA_TRANSPORT_IB: + ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); + break; + case RDMA_TRANSPORT_IWARP: + iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); + break; + default: + return -ENODEV; + } mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { ret = ib_find_cached_gid(cma_dev->device, &gid, @@ -280,16 +292,6 @@ static int cma_acquire_ib_dev(struct rdm return ret; } -static int cma_acquire_dev(struct rdma_id_private *id_priv) -{ - switch (id_priv->id.route.addr.dev_addr.dev_type) { - case IB_NODE_CA: - return cma_acquire_ib_dev(id_priv); - default: - return -ENODEV; - } -} - static void cma_deref_id(struct rdma_id_private *id_priv) { if (atomic_dec_and_test(&id_priv->refcount)) @@ -347,6 +349,16 @@ static int cma_init_ib_qp(struct rdma_id IB_QP_PKEY_INDEX | IB_QP_PORT); } +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); +} + int 
rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, struct ib_qp_init_attr *qp_init_attr) { @@ -362,10 +374,13 @@ int rdma_create_qp(struct rdma_cm_id *id if (IS_ERR(qp)) return PTR_ERR(qp); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_init_ib_qp(id_priv, qp); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_init_iw_qp(id_priv, qp); + break; default: ret = -ENOSYS; break; @@ -451,13 +466,17 @@ int rdma_init_qp_attr(struct rdma_cm_id int ret; id_priv = container_of(id, struct rdma_id_private, id); - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) qp_attr->rq_psn = id_priv->seq_num; break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr, + qp_attr_mask); + break; default: ret = -ENOSYS; break; @@ -590,8 +609,8 @@ static int cma_notify_user(struct rdma_i static void cma_cancel_route(struct rdma_id_private *id_priv) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->query) ib_sa_cancel_query(id_priv->query_id, id_priv->query); break; @@ -611,11 +630,15 @@ static void cma_destroy_listen(struct rd cma_exch(id_priv, CMA_DESTROYING); if (id_priv->cma_dev) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; + case RDMA_TRANSPORT_IWARP: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) + iw_destroy_cm_id(id_priv->cm_id.iw); + break; default: break; } @@ -690,11 +713,15 @@ void rdma_destroy_id(struct rdma_cm_id * cma_cancel_operation(id_priv, state); if (id_priv->cma_dev) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; + case RDMA_TRANSPORT_IWARP: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) + iw_destroy_cm_id(id_priv->cm_id.iw); + break; default: break; } @@ -869,7 +896,7 @@ static struct rdma_id_private *cma_new_i ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = IB_NODE_CA; + rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; @@ -898,7 +925,7 @@ static int cma_req_handler(struct ib_cm_ } atomic_inc(&conn_id->dev_remove); - ret = cma_acquire_ib_dev(conn_id); + ret = cma_acquire_dev(conn_id); if (ret) { ret = -ENODEV; cma_release_remove(conn_id); @@ -982,6 +1009,125 @@ static void cma_set_compare_data(enum rd } } +static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) +{ + struct rdma_id_private *id_priv = iw_id->context; + enum rdma_cm_event_type event = 0; + struct sockaddr_in *sin; + int ret = 0; + + atomic_inc(&id_priv->dev_remove); + + switch (iw_event->event) { + case IW_CM_EVENT_CLOSE: + event = RDMA_CM_EVENT_DISCONNECTED; + break; + case 
IW_CM_EVENT_CONNECT_REPLY: + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; + *sin = iw_event->remote_addr; + if (iw_event->status) + event = RDMA_CM_EVENT_REJECTED; + else + event = RDMA_CM_EVENT_ESTABLISHED; + break; + case IW_CM_EVENT_ESTABLISHED: + event = RDMA_CM_EVENT_ESTABLISHED; + break; + default: + BUG_ON(1); + } + + ret = cma_notify_user(id_priv, event, iw_event->status, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. */ + id_priv->cm_id.iw = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } + + cma_release_remove(id_priv); + return ret; +} + +static int iw_conn_req_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct rdma_cm_id *new_cm_id; + struct rdma_id_private *listen_id, *conn_id; + struct sockaddr_in *sin; + struct net_device *dev = NULL; + int ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + /* Create a new RDMA id for the new IW CM ID */ + new_cm_id = rdma_create_id(listen_id->id.event_handler, + listen_id->id.context, + RDMA_PS_TCP); + if (!new_cm_id) { + ret = -ENOMEM; + goto out; + } + conn_id = container_of(new_cm_id, struct rdma_id_private, id); + atomic_inc(&conn_id->dev_remove); + conn_id->state = CMA_CONNECT; + + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); + if (!dev) { + ret = -EADDRNOTAVAIL; + rdma_destroy_id(new_cm_id); + goto out; + } + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); + if (ret) { + rdma_destroy_id(new_cm_id); + goto out; + } + + ret = cma_acquire_dev(conn_id); + if (ret) { + rdma_destroy_id(new_cm_id); + goto out; + } + + conn_id->cm_id.iw = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_iw_handler; + + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; + *sin = iw_event->remote_addr; + + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* User wants to destroy the CM ID */ + conn_id->cm_id.iw = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } + +out: + if (!dev) + dev_put(dev); + cma_release_remove(listen_id); + return ret; +} + static int cma_ib_listen(struct rdma_id_private *id_priv) { struct ib_cm_compare_data compare_data; @@ -1011,6 +1157,30 @@ static int cma_ib_listen(struct rdma_id_ return ret; } +static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) +{ + int ret; + struct sockaddr_in *sin; + + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, + iw_conn_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.iw)) + return PTR_ERR(id_priv->cm_id.iw); + + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + id_priv->cm_id.iw->local_addr = *sin; + + ret = iw_cm_listen(id_priv->cm_id.iw, backlog); + + if (ret) { + iw_destroy_cm_id(id_priv->cm_id.iw); + id_priv->cm_id.iw = NULL; + } + + return ret; +} + static int cma_listen_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) { @@ -1087,12 +1257,17 @@ int rdma_listen(struct rdma_cm_id *id, i id_priv->backlog = backlog; if (id->device) { - switch 
(id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_ib_listen(id_priv); if (ret) goto err; break; + case RDMA_TRANSPORT_IWARP: + ret = cma_iw_listen(id_priv, backlog); + if (ret) + goto err; + break; default: ret = -ENOSYS; goto err; @@ -1231,6 +1406,23 @@ err: } EXPORT_SYMBOL(rdma_set_ib_paths); +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct cma_work *work; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + queue_work(cma_wq, &work->work); + return 0; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1241,10 +1433,13 @@ int rdma_resolve_route(struct rdma_cm_id return -EINVAL; atomic_inc(&id_priv->refcount); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_resolve_iw_route(id_priv, timeout_ms); + break; default: ret = -ENOSYS; break; @@ -1357,8 +1552,8 @@ static int cma_resolve_loopback(struct r ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, &gid); if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { - src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; - dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; + src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_in->sin_addr.s_addr; } @@ -1649,6 +1844,47 @@ out: return ret; } +static int cma_connect_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_id *cm_id; + struct sockaddr_in* sin; + int ret; + struct iw_cm_conn_param iw_param; + + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + goto out; + } + + id_priv->cm_id.iw = cm_id; + + sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr; + cm_id->local_addr = *sin; + + sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; + cm_id->remote_addr = *sin; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) { + iw_destroy_cm_id(cm_id); + return ret; + } + + iw_param.ord = conn_param->initiator_depth; + iw_param.ird = conn_param->responder_resources; + iw_param.private_data = conn_param->private_data; + iw_param.private_data_len = conn_param->private_data_len; + if (id_priv->id.qp) + iw_param.qpn = id_priv->qp_num; + else + iw_param.qpn = conn_param->qp_num; + ret = iw_cm_connect(cm_id, &iw_param); +out: + return ret; +} + int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1664,10 +1900,13 @@ int rdma_connect(struct rdma_cm_id *id, id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_connect_ib(id_priv, conn_param); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_connect_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1708,6 +1947,28 @@ static int cma_accept_ib(struct rdma_id_ 
return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } +static int cma_accept_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_conn_param iw_param; + int ret; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + return ret; + + iw_param.ord = conn_param->initiator_depth; + iw_param.ird = conn_param->responder_resources; + iw_param.private_data = conn_param->private_data; + iw_param.private_data_len = conn_param->private_data_len; + if (id_priv->id.qp) { + iw_param.qpn = id_priv->qp_num; + } else + iw_param.qpn = conn_param->qp_num; + + return iw_cm_accept(id_priv->cm_id.iw, &iw_param); +} + int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1723,13 +1984,16 @@ int rdma_accept(struct rdma_cm_id *id, s id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else ret = cma_rep_recv(id_priv); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_accept_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1756,12 +2020,16 @@ int rdma_reject(struct rdma_cm_id *id, c if (!cma_comp(id_priv, CMA_CONNECT)) return -EINVAL; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, private_data, private_data_len); break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_reject(id_priv->cm_id.iw, + private_data, private_data_len); + break; default: ret = -ENOSYS; break; @@ -1780,16 +2048,18 @@ int rdma_disconnect(struct rdma_cm_id *i !cma_comp(id_priv, CMA_DISCONNECT)) return -EINVAL; - ret = cma_modify_qp_err(id); - if (ret) - goto out; - - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: + ret = cma_modify_qp_err(id); + if (ret) + goto out; /* Initiate or respond to a disconnect. */ if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_disconnect(id_priv->cm_id.iw, 0); + break; default: break; } diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index b2f3cb9..7318fba 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: device.c 5943 2006-03-22 00:58:04Z roland $ */ #include @@ -505,7 +505,7 @@ int ib_query_port(struct ib_device *devi u8 port_num, struct ib_port_attr *port_attr) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) @@ -580,7 +580,7 @@ int ib_modify_port(struct ib_device *dev u8 port_num, int port_modify_mask, struct ib_port_modify *port_modify) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 1c3cfbb..b105e6a 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. * @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: mad.c 7294 2006-05-17 18:12:30Z roland $ */ #include #include @@ -2876,7 +2876,10 @@ static void ib_mad_init_device(struct ib { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { start = 0; end = 0; } else { @@ -2923,7 +2926,7 @@ static void ib_mad_remove_device(struct { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; cur_port = 0; } else { diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index aeda484..a7482c8 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -918,7 +918,10 @@ static void ib_sa_add_one(struct ib_devi struct ib_sa_device *sa_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c index 35852e7..b81b2b9 100644 --- a/drivers/infiniband/core/smi.c +++ b/drivers/infiniband/core/smi.c @@ -34,7 +34,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ + * $Id: smi.c 5258 2006-02-01 20:32:40Z sean.hefty $ */ #include @@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->return_path set when received */ @@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == hop_cnt) { /* smp->return_path set when received */ smp->hop_ptr++; - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->hop_ptr--; @@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == 1) { smp->hop_ptr--; /* C14-13:3 -- SMPs destined for SM shouldn't be here */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_slid == IB_LID_PERMISSIVE); } @@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->return_path[hop_ptr] = port_num; @@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp smp->return_path[hop_ptr] = port_num; /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->hop_ptr updated when sending */ @@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp return 1; } /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH); + return (node_type == RDMA_NODE_IB_SWITCH); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 21f9282..cfd2c06 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: sysfs.c 6940 2006-05-04 17:04:55Z roland $ */ #include "core_priv.h" @@ -589,10 +589,16 @@ static ssize_t show_node_type(struct cla return -ENODEV; switch (dev->node_type) { - case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); - case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); - case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); - default: return sprintf(buf, "%d: \n", dev->node_type); + case RDMA_NODE_IB_CA: + return sprintf(buf, "%d: CA\n", dev->node_type); + case RDMA_NODE_RNIC: + return sprintf(buf, "%d: RNIC\n", dev->node_type); + case RDMA_NODE_IB_SWITCH: + return sprintf(buf, "%d: switch\n", dev->node_type); + case RDMA_NODE_IB_ROUTER: + return sprintf(buf, "%d: router\n", dev->node_type); + default: + return sprintf(buf, "%d: \n", dev->node_type); } } @@ -708,7 +714,7 @@ int ib_device_register_sysfs(struct ib_d if (ret) goto err_put; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { ret = add_port(device, 0); if (ret) goto err_put; diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index c1c6fda..936afc8 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ + * $Id: ucm.c 7119 2006-05-11 16:40:38Z sean.hefty $ */ #include @@ -1247,7 +1247,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 1273f88..d6c151b 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: user_mad.c 6041 2006-03-27 21:06:00Z halr $ */ #include @@ -1032,7 +1032,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index d70a9b6..5f41441 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1093,7 +1093,7 @@ static void *ipath_register_ib_device(in (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); - dev->node_type = IB_NODE_CA; + dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_device(dd); dev->class_dev.dev = dev->dma_device; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 230ae21..2103ee8 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1292,7 +1292,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); - dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cf71d2a..8a67b87 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1107,13 +1107,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { @@ -1137,6 +1140,9 @@ static void ipoib_remove_one(struct ib_d struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..df3120a 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1880,7 +1880,7 @@ static void srp_add_one(struct ib_device if (IS_ERR(srp_dev->fmr_pool)) srp_dev->fmr_pool = NULL; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index 0ff6739..d933433 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -40,7 +40,7 @@ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; - enum ib_node_type dev_type; + enum rdma_node_type dev_type; }; /** @@ -72,6 +72,9 @@ int rdma_resolve_ip(struct sockaddr *src void rdma_addr_cancel(struct rdma_dev_addr *addr); +int 
rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + const unsigned char *dst_dev_addr); + static inline int ip_addr_size(struct sockaddr *addr) { return addr->sa_family == AF_INET6 ? @@ -113,4 +116,15 @@ static inline void ib_addr_set_dgid(stru memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); } +static inline void iw_addr_get_sgid(struct rdma_dev_addr* rda, + union ib_gid *gid) +{ + memcpy(gid, rda->src_dev_addr, sizeof *gid); +} + +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->dst_dev_addr; +} + #endif /* IB_ADDR_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index ee1f3a3..4b4c30a 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + * $Id: ib_verbs.h 6885 2006-05-03 18:22:02Z sean.hefty $ */ #if !defined(IB_VERBS_H) @@ -56,12 +56,35 @@ union ib_gid { } global; }; -enum ib_node_type { - IB_NODE_CA = 1, - IB_NODE_SWITCH, - IB_NODE_ROUTER +enum rdma_node_type { + /* IB values map to NodeInfo:NodeType. */ + RDMA_NODE_IB_CA = 1, + RDMA_NODE_IB_SWITCH, + RDMA_NODE_IB_ROUTER, + RDMA_NODE_RNIC }; +enum rdma_transport_type { + RDMA_TRANSPORT_IB, + RDMA_TRANSPORT_IWARP +}; + +static inline enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) +{ + switch (node_type) { + case RDMA_NODE_IB_CA: + case RDMA_NODE_IB_SWITCH: + case RDMA_NODE_IB_ROUTER: + return RDMA_TRANSPORT_IB; + case RDMA_NODE_RNIC: + return RDMA_TRANSPORT_IWARP; + default: + BUG(); + return 0; + } +} + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -78,6 +101,9 @@ enum ib_device_cap_flags { IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_ZERO_STAG = (1<<15), + IB_DEVICE_SEND_W_INV = (1<<16), + IB_DEVICE_MEM_WINDOW = (1<<17) }; enum ib_atomic_cap { @@ -835,6 +861,7 @@ struct ib_cache { u8 *lmc_cache; }; +struct iw_cm_verbs; struct ib_device { struct device *dma_device; @@ -851,6 +878,8 @@ struct ib_device { u32 flags; + struct iw_cm_verbs *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, From swise at opengridcomputing.com Wed Aug 2 13:27:50 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 Aug 2006 15:27:50 -0500 Subject: [openib-general] [PATCH v4 1/2] iWARP Connection Manager. In-Reply-To: <20060802202747.24212.10931.stgit@dell3.ogc.int> References: <20060802202747.24212.10931.stgit@dell3.ogc.int> Message-ID: <20060802202749.24212.49421.stgit@dell3.ogc.int> This patch provides the new files implementing the iWARP Connection Manager. This module is a logical instance of the xx_cm where xx is the transport type (ib or iw). The symbols exported are used by the transport independent rdma_cm module, and are available also for transport dependent ULPs. V2 Review Changes: - BUG_ON(1) -> BUG() - Don't typecast whan assigning between something* and void* - pre-allocate iwcm_work objects to avoid allocating them in the interrupt context. - copy private data on connect request and connect reply events. - #if !defined() -> #ifndef V1 Review Changes: - sizeof -> sizeof() - removed printks - removed TT debug code - cleaned up lock/unlock around switch statements. - waitqueue -> completion for destroy path. 
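A brief illustration of the pre-allocation scheme mentioned in the V2 changes above: work elements are allocated with GFP_KERNEL while still in process context (listen/connect time) and are only pulled from a per-cm_id free list later, so the provider's event upcall, which runs in interrupt context, never has to allocate memory. This is a simplified, self-contained sketch and not code from the patch; the names sketch_work, sketch_prealloc and sketch_get are stand-ins for the iwcm_work, alloc_work_entries() and get_work() helpers that appear in the diff that follows.

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>

struct sketch_work {
	struct list_head free_list;
	/* the queued event payload would live here */
};

/* Process context: sleeping allocations are fine here. */
static int sketch_prealloc(struct list_head *pool, int count)
{
	struct sketch_work *w;

	while (count--) {
		w = kmalloc(sizeof(*w), GFP_KERNEL);
		if (!w)
			return -ENOMEM;	/* caller drains whatever was queued */
		list_add(&w->free_list, pool);
	}
	return 0;
}

/* Interrupt context: no allocation, just take a pre-allocated element. */
static struct sketch_work *sketch_get(struct list_head *pool)
{
	struct sketch_work *w;

	if (list_empty(pool))
		return NULL;	/* backlog exhausted; the upcall reports -ENOMEM */
	w = list_entry(pool->next, struct sketch_work, free_list);
	list_del_init(&w->free_list);
	return w;
}

In the patch itself the free list lives in struct iwcm_id_private and is manipulated under the cm_id spinlock; the sketch leaves the locking out for brevity.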
--- drivers/infiniband/core/iwcm.c | 1008 ++++++++++++++++++++++++++++++++++++++++ include/rdma/iw_cm.h | 255 ++++++++++ include/rdma/iw_cm_private.h | 63 +++ 3 files changed, 1326 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c new file mode 100644 index 0000000..fe43c00 --- /dev/null +++ b/drivers/infiniband/core/iwcm.c @@ -0,0 +1,1008 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static struct workqueue_struct *iwcm_wq; +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private *cm_id; + struct list_head list; + struct iw_cm_event event; + struct list_head free_list; +}; + +/* + * The following services provide a mechanism for pre-allocating iwcm_work + * elements. The design pre-allocates them based on the cm_id type: + * LISTENING IDS: Get enough elements preallocated to handle the + * listen backlog. + * ACTIVE IDS: 4: CONNECT_REPLY, ESTABLISHED, DISCONNECT, CLOSE + * PASSIVE IDS: 3: ESTABLISHED, DISCONNECT, CLOSE + * + * Allocating them in connect and listen avoids having to deal + * with allocation failures on the event upcall from the provider (which + * is called in the interrupt context). + * + * One exception is when creating the cm_id for incoming connection requests. + * There are two cases: + * 1) in the event upcall, cm_event_handler(), for a listening cm_id. If + * the backlog is exceeded, then no more connection request events will + * be processed. cm_event_handler() returns -ENOMEM in this case. Its up + * to the provider to reject the connectino request. 
+ * 2) in the connection request workqueue handler, cm_conn_req_handler(). + * If work elements cannot be allocated for the new connect request cm_id, + * then IWCM will call the provider reject method. This is ok since + * cm_conn_req_handler() runs in the workqueue thread context. + */ + +static struct iwcm_work *get_work(struct iwcm_id_private *cm_id_priv) +{ + struct iwcm_work *work; + + if (list_empty(&cm_id_priv->work_free_list)) + return NULL; + work = list_entry(cm_id_priv->work_free_list.next, struct iwcm_work, + free_list); + list_del_init(&work->free_list); + return work; +} + +static void put_work(struct iwcm_work *work) +{ + list_add(&work->free_list, &work->cm_id->work_free_list); +} + +static void dealloc_work_entries(struct iwcm_id_private *cm_id_priv) +{ + struct list_head *e, *tmp; + + list_for_each_safe(e, tmp, &cm_id_priv->work_free_list) + kfree(list_entry(e, struct iwcm_work, free_list)); +} + +static int alloc_work_entries(struct iwcm_id_private *cm_id_priv, int count) +{ + struct iwcm_work *work; + + BUG_ON(!list_empty(&cm_id_priv->work_free_list)); + while (count--) { + work = kmalloc(sizeof(struct iwcm_work), GFP_KERNEL); + if (!work) { + dealloc_work_entries(cm_id_priv); + return -ENOMEM; + } + work->cm_id = cm_id_priv; + INIT_LIST_HEAD(&work->list); + put_work(work); + } + return 0; +} + +/* + * Save private data from incoming connection requests in the + * cm_id_priv so the low level driver doesn't have to. Adjust + * the event ptr to point to the local copy. + */ +static int copy_private_data(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *event) +{ + void *p; + + p = kmalloc(event->private_data_len, GFP_ATOMIC); + if (!p) + return -ENOMEM; + memcpy(p, event->private_data, event->private_data_len); + event->private_data = p; + return 0; +} + +/* + * Release a reference on cm_id. If the last reference is being removed + * and iw_destroy_cm_id is waiting, wake up the waiting thread. 
+ */ +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + int ret = 0; + + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (atomic_dec_and_test(&cm_id_priv->refcount)) { + BUG_ON(!list_empty(&cm_id_priv->work_list)); + if (waitqueue_active(&cm_id_priv->destroy_comp.wait)) { + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, + &cm_id_priv->flags)); + ret = 1; + } + complete(&cm_id_priv->destroy_comp); + } + + return ret; +} + +static void add_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); +} + +static void rem_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + iwcm_deref_id(cm_id_priv); +} + +static int cm_event_handler(struct iw_cm_id *cm_id, struct iw_cm_event *event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = kzalloc(sizeof(*cm_id_priv), GFP_KERNEL); + if (!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->state = IW_CM_STATE_IDLE; + cm_id_priv->id.device = device; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + cm_id_priv->id.event_handler = cm_event_handler; + cm_id_priv->id.add_ref = add_ref; + cm_id_priv->id.rem_ref = rem_ref; + spin_lock_init(&cm_id_priv->lock); + atomic_set(&cm_id_priv->refcount, 1); + init_waitqueue_head(&cm_id_priv->connect_wait); + init_completion(&cm_id_priv->destroy_comp); + INIT_LIST_HEAD(&cm_id_priv->work_list); + INIT_LIST_HEAD(&cm_id_priv->work_free_list); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(iw_create_cm_id); + + +static int iwcm_modify_qp_err(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + if (!qp) + return -EINVAL; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * This is really the RDMAC CLOSING state. It is most similar to the + * IB SQD QP state. + */ +static int iwcm_modify_qp_sqd(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + BUG_ON(qp == NULL); + qp_attr.qp_state = IB_QPS_SQD; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * CM_ID <-- CLOSING + * + * Block if a passive or active connection is currenlty being processed. 
Then + * process the event as follows: + * - If we are ESTABLISHED, move to CLOSING and modify the QP state + * based on the abrupt flag + * - If the connection is already in the CLOSING or IDLE state, the peer is + * disconnecting concurrently with us and we've already seen the + * DISCONNECT event -- ignore the request and return 0 + * - Disconnect on a listening endpoint returns -EINVAL + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + struct ib_qp *qp = NULL; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_CLOSING; + + /* QP could be for user-mode client */ + if (cm_id_priv->qp) + qp = cm_id_priv->qp; + else + ret = -EINVAL; + break; + case IW_CM_STATE_LISTEN: + ret = -EINVAL; + break; + case IW_CM_STATE_CLOSING: + /* remote peer closed first */ + case IW_CM_STATE_IDLE: + /* accept or connect returned !0 */ + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called disconnect before/without calling accept after + * connect_request event delivered. + */ + break; + case IW_CM_STATE_CONN_SENT: + /* Can only get here if wait above fails */ + default: + BUG(); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + if (qp) { + if (abrupt) + ret = iwcm_modify_qp_err(qp); + else + ret = iwcm_modify_qp_sqd(qp); + + /* + * If both sides are disconnecting the QP could + * already be in ERR or SQD states + */ + ret = 0; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +/* + * CM_ID <-- DESTROYING + * + * Clean up all resources associated with the connection and release + * the initial reference taken by iw_create_cm_id. + */ +static void destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall. A + * listening endpoint should never block here. */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_LISTEN: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* destroy the listening endpoint */ + ret = cm_id->device->iwcm->destroy_listen(cm_id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* Abrupt close of the connection */ + (void)iwcm_modify_qp_err(cm_id_priv->qp); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called destroy before/without calling accept after + * receiving connection request event notification. 
+ */ + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_DESTROYING: + default: + BUG(); + break; + } + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + (void)iwcm_deref_id(cm_id_priv); +} + +/* + * This function is only called by the application thread and cannot + * be called by the event thread. The function will wait for all + * references to be released on the cm_id and then kfree the cm_id + * object. + */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); + + destroy_cm_id(cm_id); + + wait_for_completion(&cm_id_priv->destroy_comp); + + dealloc_work_entries(cm_id_priv); + + kfree(cm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +/* + * CM_ID <-- LISTEN + * + * Start listening for connect requests. Generates one CONNECT_REQUEST + * event for each inbound connect request. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + ret = alloc_work_entries(cm_id_priv, backlog); + if (ret) + return ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + cm_id_priv->state = IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret) + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + default: + ret = -EINVAL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +/* + * CM_ID <-- IDLE + * + * Rejects an inbound connection request. No events are generated. + */ +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->reject(cm_id, private_data, + private_data_len); + + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +/* + * CM_ID <-- ESTABLISHED + * + * Accepts an inbound connection request and generates an ESTABLISHED + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block + * until the ESTABLISHED event is received from the provider. 
+ */ +int iw_cm_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + struct ib_qp *qp; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->accept(cm_id, iw_param); + if (ret) { + /* An error on accept precludes provider events */ + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +/* + * Active Side: CM_ID <-- CONN_SENT + * + * If successful, results in the generation of a CONNECT_REPLY + * event. iw_cm_disconnect and iw_cm_destroy will block until the + * CONNECT_REPLY event is received from the provider. + */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + int ret = 0; + unsigned long flags; + struct ib_qp *qp; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + ret = alloc_work_entries(cm_id_priv, 4); + if (ret) + return ret; + + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + cm_id_priv->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, iw_param); + if (ret) { + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + cm_id_priv->state = IW_CM_STATE_IDLE; + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +/* + * Passive Side: new CM_ID <-- CONN_RECV + * + * Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and + * device. 
These are copied when the device is cloned. The event + * contains the new four tuple. + * + * An error on the child should not affect the parent, so this + * function does not return a value. + */ +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + struct iw_cm_id *cm_id; + struct iwcm_id_private *cm_id_priv; + int ret; + + /* The provider should never generate a connection request + * event with a bad status. + */ + BUG_ON(iw_event->status); + + /* We could be destroying the listening id. If so, ignore this + * upcall. */ + spin_lock_irqsave(&listen_id_priv->lock, flags); + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + return; + } + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + + cm_id = iw_create_cm_id(listen_id_priv->id.device, + listen_id_priv->id.cm_handler, + listen_id_priv->id.context); + /* If the cm_id could not be created, ignore the request */ + if (IS_ERR(cm_id)) + return; + + cm_id->provider_data = iw_event->provider_data; + cm_id->local_addr = iw_event->local_addr; + cm_id->remote_addr = iw_event->remote_addr; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + cm_id_priv->state = IW_CM_STATE_CONN_RECV; + + ret = alloc_work_entries(cm_id_priv, 3); + if (ret) { + iw_cm_reject(cm_id, NULL, 0); + iw_destroy_cm_id(cm_id); + return; + } + + /* Call the client CM handler */ + ret = cm_id->cm_handler(cm_id, iw_event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(cm_id); + if (atomic_read(&cm_id_priv->refcount)==0) + kfree(cm_id); + } + + if (iw_event->private_data_len) + kfree(iw_event->private_data); +} + +/* + * Passive Side: CM_ID <-- ESTABLISHED + * + * The provider generated an ESTABLISHED event which means that + * the MPA negotion has completed successfully and we are now in MPA + * FPDU mode. + * + * This event can only be received in the CONN_RECV state. If the + * remote peer closed, the ESTABLISHED event would be received followed + * by the CLOSE event. If the app closes, it will block until we wake + * it up after processing this event. + */ +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + + /* We clear the CONNECT_WAIT bit here to allow the callback + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id + * from a callback handler is not allowed */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * Active Side: CM_ID <-- ESTABLISHED + * + * The app has called connect and is waiting for the established event to + * post it's requests to the server. This event will wake up anyone + * blocked in iw_cm_disconnect or iw_destroy_id. 
+ */ +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + /* Clear the connect wait bit so a callback function calling + * iw_cm_disconnect will not wait and deadlock this thread */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { + cm_id_priv->id.local_addr = iw_event->local_addr; + cm_id_priv->id.remote_addr = iw_event->remote_addr; + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + } else { + /* REJECTED or RESET */ + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + cm_id_priv->state = IW_CM_STATE_IDLE; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + + if (iw_event->private_data_len) + kfree(iw_event->private_data); + + /* Wake up waiters on connect complete */ + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * CM_ID <-- CLOSING + * + * If in the ESTABLISHED state, move to CLOSING. + */ +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) + cm_id_priv->state = IW_CM_STATE_CLOSING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * CM_ID <-- IDLE + * + * If in the ESTBLISHED or CLOSING states, the QP will have have been + * moved by the provider to the ERR state. Disassociate the CM_ID from + * the QP, move to IDLE, and remove the 'connected' reference. + * + * If in some other state, the cm_id was destroyed asynchronously. + * This is the last reference that will result in waking up + * the app thread blocked in iw_destroy_cm_id. + */ +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_DESTROYING: + break; + default: + BUG(); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} + +static int process_event(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + int ret = 0; + + switch (iw_event->event) { + case IW_CM_EVENT_CONNECT_REQUEST: + cm_conn_req_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CONNECT_REPLY: + ret = cm_conn_rep_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_ESTABLISHED: + ret = cm_conn_est_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_DISCONNECT: + cm_disconnect_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CLOSE: + ret = cm_close_handler(cm_id_priv, iw_event); + break; + default: + BUG(); + } + + return ret; +} + +/* + * Process events on the work_list for the cm_id. If the callback + * function requests that the cm_id be deleted, a flag is set in the + * cm_id flags to indicate that when the last reference is + * removed, the cm_id is to be destroyed. 
This is necessary to + * distinguish between an object that will be destroyed by the app + * thread asleep on the destroy_comp list vs. an object destroyed + * here synchronously when the last reference is removed. + */ +static void cm_work_handler(void *arg) +{ + struct iwcm_work *work = arg, lwork; + struct iwcm_id_private *cm_id_priv = work->cm_id; + unsigned long flags; + int empty; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + empty = list_empty(&cm_id_priv->work_list); + while (!empty) { + work = list_entry(cm_id_priv->work_list.next, + struct iwcm_work, list); + list_del_init(&work->list); + empty = list_empty(&cm_id_priv->work_list); + lwork = *work; + put_work(work); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = process_event(cm_id_priv, &work->event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(&cm_id_priv->id); + } + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (iwcm_deref_id(cm_id_priv)) + return; + + if (atomic_read(&cm_id_priv->refcount)==0 && + test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)) { + dealloc_work_entries(cm_id_priv); + kfree(cm_id_priv); + return; + } + spin_lock_irqsave(&cm_id_priv->lock, flags); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * This function is called on interrupt context. Schedule events on + * the iwcm_wq thread to allow callback functions to downcall into + * the CM and/or block. Events are queued to a per-CM_ID + * work_list. If this is the first event on the work_list, the work + * element is also queued on the iwcm_wq thread. + * + * Each event holds a reference on the cm_id. Until the last posted + * event has been delivered and processed, the cm_id cannot be + * deleted. + * + * Returns: + * 0 - the event was handled. + * -ENOMEM - the event was not handled due to lack of resources. 
+ */ +static int cm_event_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct iwcm_work *work; + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + work = get_work(cm_id_priv); + if (!work) { + ret = -ENOMEM; + goto out; + } + + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *iw_event; + + if ((work->event.event == IW_CM_EVENT_CONNECT_REQUEST || + work->event.event == IW_CM_EVENT_CONNECT_REPLY) && + work->event.private_data_len) { + ret = copy_private_data(cm_id_priv, &work->event); + if (ret) { + put_work(work); + goto out; + } + } + + atomic_inc(&cm_id_priv->refcount); + if (list_empty(&cm_id_priv->work_list)) { + list_add_tail(&work->list, &cm_id_priv->work_list); + queue_work(iwcm_wq, &work->work); + } else + list_add_tail(&work->list, &cm_id_priv->work_list); +out: + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_init_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS; + qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE| + IB_ACCESS_REMOTE_READ; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_rts_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = 0; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct iwcm_id_private *cm_id_priv; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + switch (qp_attr->qp_state) { + case IB_QPS_INIT: + case IB_QPS_RTR: + ret = iwcm_init_qp_init_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + case IB_QPS_RTS: + ret = iwcm_init_qp_rts_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} +EXPORT_SYMBOL(iw_cm_init_qp_attr); + +static int __init iw_cm_init(void) +{ + iwcm_wq = create_singlethread_workqueue("iw_cm_wq"); + if (!iwcm_wq) + return -ENOMEM; + + return 0; +} + +static void __exit iw_cm_cleanup(void) +{ + destroy_workqueue(iwcm_wq); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h new file mode 100644 index 0000000..36f44aa --- /dev/null +++ b/include/rdma/iw_cm.h @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef IW_CM_H +#define IW_CM_H + +#include +#include + +struct iw_cm_id; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, /* passive side accept successful */ + IW_CM_EVENT_DISCONNECT, /* orderly shutdown */ + IW_CM_EVENT_CLOSE /* close complete */ +}; +enum iw_cm_event_status { + IW_CM_EVENT_STATUS_OK = 0, /* request successful */ + IW_CM_EVENT_STATUS_ACCEPTED = 0, /* connect request accepted */ + IW_CM_EVENT_STATUS_REJECTED, /* connect request rejected */ + IW_CM_EVENT_STATUS_TIMEOUT, /* the operation timed out */ + IW_CM_EVENT_STATUS_RESET, /* reset from remote peer */ + IW_CM_EVENT_STATUS_EINVAL, /* asynchronous failure for bad parm */ +}; +struct iw_cm_event { + enum iw_cm_event_type event; + enum iw_cm_event_status status; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; + void* provider_data; +}; + +/** + * iw_cm_handler - Function to be called by the IW CM when delivering events + * to the client. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. + */ +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +/** + * iw_event_handler - Function called by the provider when delivering provider + * events to the IW CM. Returns either 0 indicating the event was processed + * or -errno if the event could not be processed. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. 
+ */ +typedef int (*iw_event_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* client cb context */ + struct ib_device *device; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *provider_data; /* provider private data */ + iw_event_handler event_handler; /* cb for provider + events */ + /* Used by provider to add and remove refs on IW cm_id */ + void (*add_ref)(struct iw_cm_id *); + void (*rem_ref)(struct iw_cm_id *); +}; + +struct iw_cm_conn_param { + const void *private_data; + u16 private_data_len; + u32 ord; + u32 ird; + u32 qpn; +}; + +struct iw_cm_verbs { + void (*add_ref)(struct ib_qp *qp); + + void (*rem_ref)(struct ib_qp *qp); + + struct ib_qp * (*get_qp)(struct ib_device *device, + int qpn); + + int (*connect)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*accept)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*reject)(struct iw_cm_id *cm_id, + const void *pdata, u8 pdata_len); + + int (*create_listen)(struct iw_cm_id *cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id *cm_id); +}; + +/** + * iw_create_cm_id - Create an IW CM identifier. + * + * @device: The IB device on which to create the IW CM identier. + * @event_handler: User callback invoked to report events associated with the + * returned IW CM identifier. + * @context: User specified context associated with the id. + */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, void *context); + +/** + * iw_destroy_cm_id - Destroy an IW CM identifier. + * + * @cm_id: The previously created IW CM identifier to destroy. + * + * The client can assume that no events will be delivered for the CM ID after + * this function returns. + */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id); + +/** + * iw_cm_bind_qp - Unbind the specified IW CM identifier and QP + * + * @cm_id: The IW CM idenfier to unbind from the QP. + * @qp: The QP + * + * This is called by the provider when destroying the QP to ensure + * that any references held by the IWCM are released. It may also + * be called by the IWCM when destroying a CM_ID to that any + * references held by the provider are released. + */ +void iw_cm_unbind_qp(struct iw_cm_id *cm_id, struct ib_qp *qp); + +/** + * iw_cm_get_qp - Return the ib_qp associated with a QPN + * + * @ib_device: The IB device + * @qpn: The queue pair number + */ +struct ib_qp *iw_cm_get_qp(struct ib_device *device, int qpn); + +/** + * iw_cm_listen - Listen for incoming connection requests on the + * specified IW CM id. + * + * @cm_id: The IW CM identifier. + * @backlog: The maximum number of outstanding un-accepted inbound listen + * requests to queue. + * + * The source address and port number are specified in the IW CM identifier + * structure. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); + +/** + * iw_cm_accept - Called to accept an incoming connect request. + * + * @cm_id: The IW CM identifier associated with the connection request. + * @iw_param: Pointer to a structure containing connection establishment + * parameters. + * + * The specified cm_id will have been provided in the event data for a + * CONNECT_REQUEST event. Subsequent events related to this connection will be + * delivered to the specified IW CM identifier prior and may occur prior to + * the return of this function. 
If this function returns a non-zero value, the + * client can assume that no events will be delivered to the specified IW CM + * identifier. + */ +int iw_cm_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param); + +/** + * iw_cm_reject - Reject an incoming connection request. + * + * @cm_id: Connection identifier associated with the request. + * @private_daa: Pointer to data to deliver to the remote peer as part of the + * reject message. + * @private_data_len: The number of bytes in the private_data parameter. + * + * The client can assume that no events will be delivered to the specified IW + * CM identifier following the return of this function. The private_data + * buffer is available for reuse when this function returns. + */ +int iw_cm_reject(struct iw_cm_id *cm_id, const void *private_data, + u8 private_data_len); + +/** + * iw_cm_connect - Called to request a connection to a remote peer. + * + * @cm_id: The IW CM identifier for the connection. + * @iw_param: Pointer to a structure containing connection establishment + * parameters. + * + * Events may be delivered to the specified IW CM identifier prior to the + * return of this function. If this function returns a non-zero value, the + * client can assume that no events will be delivered to the specified IW CM + * identifier. + */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param); + +/** + * iw_cm_disconnect - Close the specified connection. + * + * @cm_id: The IW CM identifier to close. + * @abrupt: If 0, the connection will be closed gracefully, otherwise, the + * connection will be reset. + * + * The IW CM identifier is still active until the IW_CM_EVENT_CLOSE event is + * delivered. + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt); + +/** + * iw_cm_init_qp_attr - Called to initialize the attributes of the QP + * associated with a IW CM identifier. + * + * @cm_id: The IW CM identifier associated with the QP + * @qp_attr: Pointer to the QP attributes structure. + * @qp_attr_mask: Pointer to a bit vector specifying which QP attributes are + * valid. + */ +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, struct ib_qp_attr *qp_attr, + int *qp_attr_mask); + +#endif /* IW_CM_H */ diff --git a/include/rdma/iw_cm_private.h b/include/rdma/iw_cm_private.h new file mode 100644 index 0000000..fc28e34 --- /dev/null +++ b/include/rdma/iw_cm_private.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef IW_CM_PRIVATE_H
+#define IW_CM_PRIVATE_H
+
+#include
+
+enum iw_cm_state {
+	IW_CM_STATE_IDLE,		/* unbound, inactive */
+	IW_CM_STATE_LISTEN,		/* listen waiting for connect */
+	IW_CM_STATE_CONN_RECV,		/* inbound waiting for user accept */
+	IW_CM_STATE_CONN_SENT,		/* outbound waiting for peer accept */
+	IW_CM_STATE_ESTABLISHED,	/* established */
+	IW_CM_STATE_CLOSING,		/* disconnect */
+	IW_CM_STATE_DESTROYING		/* object being deleted */
+};
+
+struct iwcm_id_private {
+	struct iw_cm_id id;
+	enum iw_cm_state state;
+	unsigned long flags;
+	struct ib_qp *qp;
+	struct completion destroy_comp;
+	wait_queue_head_t connect_wait;
+	struct list_head work_list;
+	spinlock_t lock;
+	atomic_t refcount;
+	struct list_head work_free_list;
+};
+#define IWCM_F_CALLBACK_DESTROY	1
+#define IWCM_F_CONNECT_WAIT	2
+
+#endif /* IW_CM_PRIVATE_H */

From bugzilla-daemon at openib.org  Wed Aug 2 19:25:28 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Wed, 2 Aug 2006 19:25:28 -0700 (PDT)
Subject: [openib-general] [Bug 184] New: System crashes on shutdown due to access of freed memory
Message-ID: <20060803022528.08C0A2283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=184

        Summary: System crashes on shutdown due to access of freed memory
        Product: OpenFabrics Windows
        Version: unspecified
       Platform: X86
     OS/Version: Other
         Status: NEW
       Severity: critical
       Priority: P2
      Component: Core
     AssignedTo: bugzilla at openib.org
     ReportedBy: jbottorff at xsigo.com

This is for release 432. On system shutdown, driver verifier detects access to
freed memory in core\al\kernel\al_ioc_pnp.c. Without verifier, this causes a
crash moments later. The problem seems to be in

static void
__process_sweep(
	IN		cl_async_proc_item_t	*p_async_item )
{
	ib_api_status_t		status;
	ioc_sweep_results_t	*p_results;

	AL_ENTER( AL_DBG_PNP );
	p_results = PARENT_STRUCT( p_async_item, ioc_sweep_results_t, async_item );

	CL_ASSERT( !p_results->p_svc->query_cnt );

	if( p_results->p_svc->obj.state == CL_DESTROYING )
	{
		__put_iou_map( gp_ioc_pnp, &p_results->iou_map );
		cl_free( p_results );
	}

	/* Walk the map of IOUs and discard any that didn't respond to IOU info. */
	__flush_duds( p_results );

	switch( p_results->state )
	{
	case SWEEP_IOU_INFO:
	...

Pretty clearly, if the code path for p_results->p_svc->obj.state ==
CL_DESTROYING is taken, and cl_free (p_results) is called, the following
statements that access p_results are going to be invalid. I believe this also
may cause an error that's reported as a double freeing of memory, if the
function frees p_results at the top, and then makes it to the bottom where it
may free p_results again.
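The minimal kind of change the analysis above points at is to stop using
p_results once it has been freed, e.g. by returning right after cl_free().
The sketch below, shown before the stack trace, is built only from the
fragment quoted above (the function and helper names are taken from that
fragment); it is not a tested patch against the 432 tree, and the real fix
may instead defer the free to the end of the function:

static void
__process_sweep(
	IN		cl_async_proc_item_t	*p_async_item )
{
	ioc_sweep_results_t	*p_results;

	AL_ENTER( AL_DBG_PNP );
	p_results = PARENT_STRUCT( p_async_item, ioc_sweep_results_t, async_item );

	CL_ASSERT( !p_results->p_svc->query_cnt );

	if( p_results->p_svc->obj.state == CL_DESTROYING )
	{
		__put_iou_map( gp_ioc_pnp, &p_results->iou_map );
		cl_free( p_results );
		/* p_results is freed: return here so that __flush_duds(),
		 * the state switch and the final free at the bottom of the
		 * function never see the stale pointer. */
		return;
	}

	/* Walk the map of IOUs and discard any that didn't respond to IOU info. */
	__flush_duds( p_results );

	switch( p_results->state )
	{
	case SWEEP_IOU_INFO:
		/* ... rest of the sweep handling, unchanged ... */
		break;
	}
}

Either way the reported double free also disappears, because the early-return
path never reaches whatever frees p_results again at the bottom of the
function.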
The stack looks like: bafc7c94 ba954be8 00000000 8fdf6ff0 00000000 nt!KiTrap0E+0xe4 bafc7d28 ba965221 8fdf6f80 00000000 00000001 ibbus!cl_fmap_head+0x38 [k:\windows-openib\src\winib-432\inc\complib\cl_fleximap.h @ 486] bafc7d50 ba964e85 8fdf6f58 859d0020 00000000 ibbus!__flush_duds+0xa1 [k:\windows-openib\src\winib-432\core\al\kernel\al_ioc_pnp.c @ 2225] bafc7d70 ba950884 8fdf6f58 00000001 8fdf6f58 ibbus!__process_sweep+0x105 [k:\windows-openib\src\winib-432\core\al\kernel\al_ioc_pnp.c @ 2295] bafc7d8c ba956b54 86af2e74 86af2e74 00000000 ibbus!__cl_async_proc_worker+0x94 [k:\windows-openib\src\winib-432\core\complib\cl_async_proc.c @ 153] bafc7da0 ba958c0c 86af2e74 bafc7ddc 80a07678 ibbus!__cl_thread_pool_routine+0x54 [k:\windows-openib\src\winib-432\core\complib\cl_threadpool.c @ 67] bafc7dac 80a07678 87208fe0 00000000 00000000 ibbus!__thread_callback+0x2c [k:\windows-openib\src\winib-432\core\complib\kernel\cl_thread.c @ 49] bafc7ddc 80781346 ba958be0 87208fe0 00000000 nt!PspSystemThreadStartup+0x2e 00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16 ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From ogerlitz at voltaire.com Wed Aug 2 23:02:11 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 03 Aug 2006 09:02:11 +0300 Subject: [openib-general] some issues related to when/why IPoIB calls netif_carrier_on() etc In-Reply-To: References: Message-ID: <44D19163.7080408@voltaire.com> Roland Dreier wrote: > > 1) what is the exact reason that ib0 is running here, is it as of this > > "magic" configuration of the IPv6 addr that caused it to join to > > the IPv4 and IPv6 broascast groups? > > No, ipv6 autoconf has nothing to do with it. I think it's because you > did ifconfig ib0 up, which called ipoib_open(), which calls > ipoib_ib_dev_up(), which joins the ipv4 broadcast group. > > Bringing the interface up the starts ipv6 autoconf but that is just a > side issue. You could build a kernel without ipv6 and see what happens. > > > 2) is it well defined what conditions should hold s.t IPoIB will be RUNNING > > Not really. > > > 3) just to make sure: RUNNING <--> ipoib called netif_carrier_on(), correct? > > i see that latter is called by ipoib_mcast_join_task(), is it when > > "joining everything we want to join to" holds or you can somehow > > refine the predicate? > > Yes, I believe that is correct. OK, thanks for all the clarifications. Or. From ogerlitz at voltaire.com Wed Aug 2 23:32:11 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 03 Aug 2006 09:32:11 +0300 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D0D808.1070003@ichips.intel.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> <44D049B5.6000505@voltaire.com> <44D0D808.1070003@ichips.intel.com> Message-ID: <44D1986B.6070302@voltaire.com> Sean Hefty wrote: > I agree. This sounds like an issue where the CM is treating the REQ as > an old REQ for the established connection, versus a REQ for a new > connection. > The desired behavior in this situation would be to reject the new > request, and force the remote side to disconnect. Sean, Is it correct that with the gen2 code, the remote **CM** will reconnect on that case? 
I see in cm.c :: cm_rej_handler() that when the state is IB_CM_REQ_SENT and the reject reason is IB_CM_REJ_STALE_CONN you just move the cm_id into timewait state, which will cause a retry on the REQ, correct? cool! Or. From krkumar2 at in.ibm.com Thu Aug 3 00:10:05 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 12:40:05 +0530 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. Message-ID: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> This patchset is a proposal to create new API's and data structures with transport neutral names. The idea is to remove the old API once all libraries/applications/examples are gradually converted to use the new API. Patch 1/6 - Changes to libibverbs configuration file to build the libibverbs with the new API. Patch 2/6 - Additions to include files in libibverbs for the new API. Patch 3/6 - Source files in libibverbs defining the new API. Patch 4/6 - Convert librdmacm examples to use the new API. Patch 5/6 - Convert librdmacm include files to use the new libibverbs API. Patch 6/6 - Convert librdmacm source files to use the new libibverbs API. Changes from previous (v2) round : ---------------------------------- 1. #defined most data structures as suggested by Sean. Created a deprecate.h which can be removed once all apps are converted to new API. 2. Changed rdma_ to rdmav_. This also enabled to retain rdma_create_qp() and rdma_destroy_qp() routines. (The last suggestion to not convert IB specific types to generic types was not done at this time, since my previous note explained that it is not clear how to retain some names while changing others. Eg, ibv_event_type or ibv_qp_attr_mask, which have enums for both IB and generic types, etc). Testing : --------- 1. Compile tested libibverbs, librdmacm, libmthca, libsdp, libibcm, libipathverbs and dapl. No warnings or failures in any of these. The only warning in libmthca was regarding multiple ibv_read_sysfs_file()'s, which is not a compile issue (and also can be removed). 2. Tested rping, ibv_devices & ibv_devinfo utils. Information notes found during the changes : -------------------------------------------- 1. Added LIBRDMAVERBS_DRIVER_PATH and also use old OPENIB_DRIVER_PATH_ENV for backwards compatibility, but have not set user_path to include OPENIB_DRIVER_PATH_ENV results. 2. Currently ibv_driver_init is implemented in all drivers. But the function returns a "struct ibv_driver *", while we expect "struct rdma_driver *". In reality this is fine as they are both pointers pointing to identical objects. Otherwise each driver has to be changed now. Once all drivers are changed to use rdma_* API's, this will not be an issue. 3. All names are changed to neutral names, even IB specific names as it is not clear how to retain some names while changing others. Eg, ibv_event_type or ibv_qp_attr_mask, etc. 4. Passing different pointer to verbs, though the end result is the same (no warnings generated though as this is a link-time trick). Eg : int rdma_query_device(struct rdma_context *context, struct rdma_device_attr *device_attr) { return context->ops.query_device(context, device_attr); } However this will not be an issue once the drivers are changed to use the new API. Eg : int mthca_query_device(struct rdma_context *context, struct rdma_device_attr *attr) 5. Makefile.am still makes libibverbs.* libraries so that other apps do not break. librdmaverbs.spec.in also does the same. 6. 
Kept ibv_driver_init call as all libraries have implemented ibv_driver_init, but this can be changed easily to new API (and then retired). 7. Prefix is kept as rdmav_ (rdv_ didn't have much takers) to be generic and consistent enough. 8. [Missing] IBV_OPCODE() macro is not done (but no use ones it currently). --- Signed-off-by: Krishna Kumar From krkumar2 at in.ibm.com Thu Aug 3 00:10:26 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 12:40:26 +0530 Subject: [openib-general] [PATCH v3 1/6] libibverbs include files changes. In-Reply-To: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803071026.6106.2187.sendpatchset@K50wks273950wss.in.ibm.com> Additions to include files in libibverbs for the new API. Signed-off-by: Krishna Kumar diff -ruNp ORG/libibverbs/include/infiniband/arch.h NEW/libibverbs/include/infiniband/arch.h --- ORG/libibverbs/include/infiniband/arch.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/arch.h 2006-08-02 18:24:49.000000000 -0700 @@ -32,8 +32,8 @@ * $Id: arch.h 8358 2006-07-04 20:38:54Z roland $ */ -#ifndef INFINIBAND_ARCH_H -#define INFINIBAND_ARCH_H +#ifndef RDMAV_ARCH_H +#define RDMAV_ARCH_H #include #include @@ -92,4 +92,4 @@ static inline uint64_t ntohll(uint64_t x #endif -#endif /* INFINIBAND_ARCH_H */ +#endif /* RDMAV_ARCH_H */ diff -ruNp ORG/libibverbs/include/infiniband/deprecate.h NEW/libibverbs/include/infiniband/deprecate.h --- ORG/libibverbs/include/infiniband/deprecate.h 1969-12-31 16:00:00.000000000 -0800 +++ NEW/libibverbs/include/infiniband/deprecate.h 2006-08-03 17:50:06.000000000 -0700 @@ -0,0 +1,387 @@ +/* + * Copyright (c) 2006 IBM. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: + */ + +#ifndef RDMAV_DEPRECATE_H +#define RDMAV_DEPRECATE_H + +/* + * This header file can be removed once all applications are ported over + * to the new API. Till then, this is kept around as a compatibility + * header. 
+ */ + +/* All exported IBV_ defines */ + +#define IBV_NODE_CA RDMAV_NODE_CA +#define IBV_NODE_SWITCH RDMAV_NODE_SWITCH +#define IBV_NODE_ROUTER RDMAV_NODE_ROUTER +#define IBV_DEVICE_RESIZE_MAX_WR RDMAV_DEVICE_RESIZE_MAX_WR +#define IBV_DEVICE_BAD_PKEY_CNTR RDMAV_DEVICE_BAD_PKEY_CNTR +#define IBV_DEVICE_BAD_QKEY_CNTR RDMAV_DEVICE_BAD_QKEY_CNTR +#define IBV_DEVICE_RAW_MULTI RDMAV_DEVICE_RAW_MULTI +#define IBV_DEVICE_AUTO_PATH_MIG RDMAV_DEVICE_AUTO_PATH_MIG +#define IBV_DEVICE_CHANGE_PHY_PORT RDMAV_DEVICE_CHANGE_PHY_PORT +#define IBV_DEVICE_UD_AV_PORT_ENFORCE RDMAV_DEVICE_UD_AV_PORT_ENFORCE +#define IBV_DEVICE_CURR_QP_STATE_MOD RDMAV_DEVICE_CURR_QP_STATE_MOD +#define IBV_DEVICE_SHUTDOWN_PORT RDMAV_DEVICE_SHUTDOWN_PORT +#define IBV_DEVICE_INIT_TYPE RDMAV_DEVICE_INIT_TYPE +#define IBV_DEVICE_PORT_ACTIVE_EVENT RDMAV_DEVICE_PORT_ACTIVE_EVENT +#define IBV_DEVICE_SYS_IMAGE_GUID RDMAV_DEVICE_SYS_IMAGE_GUID +#define IBV_DEVICE_RC_RNR_NAK_GEN RDMAV_DEVICE_RC_RNR_NAK_GEN +#define IBV_DEVICE_SRQ_RESIZE RDMAV_DEVICE_SRQ_RESIZE +#define IBV_DEVICE_N_NOTIFY_CQ RDMAV_DEVICE_N_NOTIFY_CQ +#define IBV_ATOMIC_NONE RDMAV_ATOMIC_NONE +#define IBV_ATOMIC_HCA RDMAV_ATOMIC_HCA +#define IBV_ATOMIC_GLOB RDMAV_ATOMIC_GLOB +#define IBV_MTU_256 RDMAV_MTU_256 +#define IBV_MTU_512 RDMAV_MTU_512 +#define IBV_MTU_1024 RDMAV_MTU_1024 +#define IBV_MTU_2048 RDMAV_MTU_2048 +#define IBV_MTU_4096 RDMAV_MTU_4096 +#define IBV_PORT_NOP RDMAV_PORT_NOP +#define IBV_PORT_DOWN RDMAV_PORT_DOWN +#define IBV_PORT_INIT RDMAV_PORT_INIT +#define IBV_PORT_ARMED RDMAV_PORT_ARMED +#define IBV_PORT_ACTIVE RDMAV_PORT_ACTIVE +#define IBV_PORT_ACTIVE_DEFER RDMAV_PORT_ACTIVE_DEFER +#define IBV_EVENT_CQ_ERR RDMAV_EVENT_CQ_ERR +#define IBV_EVENT_QP_FATAL RDMAV_EVENT_QP_FATAL +#define IBV_EVENT_QP_REQ_ERR RDMAV_EVENT_QP_REQ_ERR +#define IBV_EVENT_QP_ACCESS_ERR RDMAV_EVENT_QP_ACCESS_ERR +#define IBV_EVENT_COMM_EST RDMAV_EVENT_COMM_EST +#define IBV_EVENT_SQ_DRAINED RDMAV_EVENT_SQ_DRAINED +#define IBV_EVENT_PATH_MIG RDMAV_EVENT_PATH_MIG +#define IBV_EVENT_PATH_MIG_ERR RDMAV_EVENT_PATH_MIG_ERR +#define IBV_EVENT_DEVICE_FATAL RDMAV_EVENT_DEVICE_FATAL +#define IBV_EVENT_PORT_ACTIVE RDMAV_EVENT_PORT_ACTIVE +#define IBV_EVENT_PORT_ERR RDMAV_EVENT_PORT_ERR +#define IBV_EVENT_LID_CHANGE RDMAV_EVENT_LID_CHANGE +#define IBV_EVENT_PKEY_CHANGE RDMAV_EVENT_PKEY_CHANGE +#define IBV_EVENT_SM_CHANGE RDMAV_EVENT_SM_CHANGE +#define IBV_EVENT_SRQ_ERR RDMAV_EVENT_SRQ_ERR +#define IBV_EVENT_SRQ_LIMIT_REACHED RDMAV_EVENT_SRQ_LIMIT_REACHED +#define IBV_EVENT_QP_LAST_WQE_REACHED RDMAV_EVENT_QP_LAST_WQE_REACHED +#define IBV_EVENT_CLIENT_REREGISTER RDMAV_EVENT_CLIENT_REREGISTER +#define IBV_WC_SUCCESS RDMAV_WC_SUCCESS +#define IBV_WC_LOC_LEN_ERR RDMAV_WC_LOC_LEN_ERR +#define IBV_WC_LOC_QP_OP_ERR RDMAV_WC_LOC_QP_OP_ERR +#define IBV_WC_LOC_EEC_OP_ERR RDMAV_WC_LOC_EEC_OP_ERR +#define IBV_WC_LOC_PROT_ERR RDMAV_WC_LOC_PROT_ERR +#define IBV_WC_WR_FLUSH_ERR RDMAV_WC_WR_FLUSH_ERR +#define IBV_WC_MW_BIND_ERR RDMAV_WC_MW_BIND_ERR +#define IBV_WC_BAD_RESP_ERR RDMAV_WC_BAD_RESP_ERR +#define IBV_WC_LOC_ACCESS_ERR RDMAV_WC_LOC_ACCESS_ERR +#define IBV_WC_REM_INV_REQ_ERR RDMAV_WC_REM_INV_REQ_ERR +#define IBV_WC_REM_ACCESS_ERR RDMAV_WC_REM_ACCESS_ERR +#define IBV_WC_REM_OP_ERR RDMAV_WC_REM_OP_ERR +#define IBV_WC_RETRY_EXC_ERR RDMAV_WC_RETRY_EXC_ERR +#define IBV_WC_RNR_RETRY_EXC_ERR RDMAV_WC_RNR_RETRY_EXC_ERR +#define IBV_WC_LOC_RDD_VIOL_ERR RDMAV_WC_LOC_RDD_VIOL_ERR +#define IBV_WC_REM_INV_RD_REQ_ERR RDMAV_WC_REM_INV_RD_REQ_ERR +#define IBV_WC_REM_ABORT_ERR RDMAV_WC_REM_ABORT_ERR +#define 
IBV_WC_INV_EECN_ERR RDMAV_WC_INV_EECN_ERR +#define IBV_WC_INV_EEC_STATE_ERR RDMAV_WC_INV_EEC_STATE_ERR +#define IBV_WC_FATAL_ERR RDMAV_WC_FATAL_ERR +#define IBV_WC_RESP_TIMEOUT_ERR RDMAV_WC_RESP_TIMEOUT_ERR +#define IBV_WC_GENERAL_ERR RDMAV_WC_GENERAL_ERR +#define IBV_WC_SEND RDMAV_WC_SEND +#define IBV_WC_RDMA_WRITE RDMAV_WC_RDMA_WRITE +#define IBV_WC_RDMA_READ RDMAV_WC_RDMA_READ +#define IBV_WC_COMP_SWAP RDMAV_WC_COMP_SWAP +#define IBV_WC_FETCH_ADD RDMAV_WC_FETCH_ADD +#define IBV_WC_BIND_MW RDMAV_WC_BIND_MW +#define IBV_WC_RECV RDMAV_WC_RECV +#define IBV_WC_RECV_RDMA_WITH_IMM RDMAV_WC_RECV_RDMA_WITH_IMM +#define IBV_WC_GRH RDMAV_WC_GRH +#define IBV_WC_WITH_IMM RDMAV_WC_WITH_IMM +#define IBV_ACCESS_LOCAL_WRITE RDMAV_ACCESS_LOCAL_WRITE +#define IBV_ACCESS_REMOTE_WRITE RDMAV_ACCESS_REMOTE_WRITE +#define IBV_ACCESS_REMOTE_READ RDMAV_ACCESS_REMOTE_READ +#define IBV_ACCESS_REMOTE_ATOMIC RDMAV_ACCESS_REMOTE_ATOMIC +#define IBV_ACCESS_MW_BIND RDMAV_ACCESS_MW_BIND +#define IBV_RATE_MAX RDMAV_RATE_MAX +#define IBV_RATE_2_5_GBPS RDMAV_RATE_2_5_GBPS +#define IBV_RATE_5_GBPS RDMAV_RATE_5_GBPS +#define IBV_RATE_10_GBPS RDMAV_RATE_10_GBPS +#define IBV_RATE_20_GBPS RDMAV_RATE_20_GBPS +#define IBV_RATE_30_GBPS RDMAV_RATE_30_GBPS +#define IBV_RATE_40_GBPS RDMAV_RATE_40_GBPS +#define IBV_RATE_60_GBPS RDMAV_RATE_60_GBPS +#define IBV_RATE_80_GBPS RDMAV_RATE_80_GBPS +#define IBV_RATE_120_GBPS RDMAV_RATE_120_GBPS +#define IBV_SRQ_MAX_WR RDMAV_SRQ_MAX_WR +#define IBV_SRQ_LIMIT RDMAV_SRQ_LIMIT +#define IBV_QPT_RC RDMAV_QPT_RC +#define IBV_QPT_UC RDMAV_QPT_UC +#define IBV_QPT_UD RDMAV_QPT_UD +#define IBV_QP_STATE RDMAV_QP_STATE +#define IBV_QP_CUR_STATE RDMAV_QP_CUR_STATE +#define IBV_QP_EN_SQD_ASYNC_NOTIFY RDMAV_QP_EN_SQD_ASYNC_NOTIFY +#define IBV_QP_ACCESS_FLAGS RDMAV_QP_ACCESS_FLAGS +#define IBV_QP_PKEY_INDEX RDMAV_QP_PKEY_INDEX +#define IBV_QP_PORT RDMAV_QP_PORT +#define IBV_QP_QKEY RDMAV_QP_QKEY +#define IBV_QP_AV RDMAV_QP_AV +#define IBV_QP_PATH_MTU RDMAV_QP_PATH_MTU +#define IBV_QP_TIMEOUT RDMAV_QP_TIMEOUT +#define IBV_QP_RETRY_CNT RDMAV_QP_RETRY_CNT +#define IBV_QP_RNR_RETRY RDMAV_QP_RNR_RETRY +#define IBV_QP_RQ_PSN RDMAV_QP_RQ_PSN +#define IBV_QP_MAX_QP_RD_ATOMIC RDMAV_QP_MAX_QP_RD_ATOMIC +#define IBV_QP_ALT_PATH RDMAV_QP_ALT_PATH +#define IBV_QP_MIN_RNR_TIMER RDMAV_QP_MIN_RNR_TIMER +#define IBV_QP_SQ_PSN RDMAV_QP_SQ_PSN +#define IBV_QP_MAX_DEST_RD_ATOMIC RDMAV_QP_MAX_DEST_RD_ATOMIC +#define IBV_QP_PATH_MIG_STATE RDMAV_QP_PATH_MIG_STATE +#define IBV_QP_CAP RDMAV_QP_CAP +#define IBV_QP_DEST_QPN RDMAV_QP_DEST_QPN +#define IBV_QPS_RESET RDMAV_QPS_RESET +#define IBV_QPS_INIT RDMAV_QPS_INIT +#define IBV_QPS_RTR RDMAV_QPS_RTR +#define IBV_QPS_RTS RDMAV_QPS_RTS +#define IBV_QPS_SQD RDMAV_QPS_SQD +#define IBV_QPS_SQE RDMAV_QPS_SQE +#define IBV_QPS_ERR RDMAV_QPS_ERR +#define IBV_MIG_MIGRATED RDMAV_MIG_MIGRATED +#define IBV_MIG_REARM RDMAV_MIG_REARM +#define IBV_MIG_ARMED RDMAV_MIG_ARMED +#define IBV_WR_RDMA_WRITE RDMAV_WR_RDMA_WRITE +#define IBV_WR_RDMA_WRITE_WITH_IMM RDMAV_WR_RDMA_WRITE_WITH_IMM +#define IBV_WR_SEND RDMAV_WR_SEND +#define IBV_WR_SEND_WITH_IMM RDMAV_WR_SEND_WITH_IMM +#define IBV_WR_RDMA_READ RDMAV_WR_RDMA_READ +#define IBV_WR_ATOMIC_CMP_AND_SWP RDMAV_WR_ATOMIC_CMP_AND_SWP +#define IBV_WR_ATOMIC_FETCH_AND_ADD RDMAV_WR_ATOMIC_FETCH_AND_ADD +#define IBV_SEND_FENCE RDMAV_SEND_FENCE +#define IBV_SEND_SIGNALED RDMAV_SEND_SIGNALED +#define IBV_SEND_SOLICITED RDMAV_SEND_SOLICITED +#define IBV_SEND_INLINE RDMAV_SEND_INLINE +#define IBV_SYSFS_NAME_MAX RDMAV_SYSFS_NAME_MAX +#define IBV_SYSFS_PATH_MAX 
RDMAV_SYSFS_PATH_MAX + + +#define IBV_OPCODE_RC RDMAV_OPCODE_RC +#define IBV_OPCODE_UC RDMAV_OPCODE_UC +#define IBV_OPCODE_RD RDMAV_OPCODE_RD +#define IBV_OPCODE_UD RDMAV_OPCODE_UD +#define IBV_OPCODE_SEND_FIRST RDMAV_OPCODE_SEND_FIRST +#define IBV_OPCODE_SEND_MIDDLE RDMAV_OPCODE_SEND_MIDDLE +#define IBV_OPCODE_SEND_LAST RDMAV_OPCODE_SEND_LAST +#define IBV_OPCODE_SEND_LAST_WITH_IMMEDIATE RDMAV_OPCODE_SEND_LAST_WITH_IMMEDIATE +#define IBV_OPCODE_SEND_ONLY RDMAV_OPCODE_SEND_ONLY +#define IBV_OPCODE_SEND_ONLY_WITH_IMMEDIATE RDMAV_OPCODE_SEND_ONLY_WITH_IMMEDIATE +#define IBV_OPCODE_RDMA_WRITE_FIRST RDMAV_OPCODE_RDMA_WRITE_FIRST +#define IBV_OPCODE_RDMA_WRITE_MIDDLE RDMAV_OPCODE_RDMA_WRITE_MIDDLE +#define IBV_OPCODE_RDMA_WRITE_LAST RDMAV_OPCODE_RDMA_WRITE_LAST +#define IBV_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE RDMAV_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE +#define IBV_OPCODE_RDMA_WRITE_ONLY RDMAV_OPCODE_RDMA_WRITE_ONLY +#define IBV_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE RDMAV_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE +#define IBV_OPCODE_RDMA_READ_REQUEST RDMAV_OPCODE_RDMA_READ_REQUEST +#define IBV_OPCODE_RDMA_READ_RESPONSE_FIRST RDMAV_OPCODE_RDMA_READ_RESPONSE_FIRST +#define IBV_OPCODE_RDMA_READ_RESPONSE_MIDDLE RDMAV_OPCODE_RDMA_READ_RESPONSE_MIDDLE +#define IBV_OPCODE_RDMA_READ_RESPONSE_LAST RDMAV_OPCODE_RDMA_READ_RESPONSE_LAST +#define IBV_OPCODE_RDMA_READ_RESPONSE_ONLY RDMAV_OPCODE_RDMA_READ_RESPONSE_ONLY +#define IBV_OPCODE_ACKNOWLEDGE RDMAV_OPCODE_ACKNOWLEDGE +#define IBV_OPCODE_ATOMIC_ACKNOWLEDGE RDMAV_OPCODE_ATOMIC_ACKNOWLEDGE +#define IBV_OPCODE_COMPARE_SWAP RDMAV_OPCODE_COMPARE_SWAP +#define IBV_OPCODE_FETCH_ADD RDMAV_OPCODE_FETCH_ADD + +/* All exported ibv_ routines */ + +#define ibv_open_device rdmav_open_device +#define ibv_get_device_guid rdmav_get_device_guid +#define ibv_get_device_name rdmav_get_device_name +#define ibv_ack_async_event rdmav_ack_async_event +#define ibv_ack_cq_events rdmav_ack_cq_events +#define ibv_alloc_pd rdmav_alloc_pd +#define ibv_attach_mcast rdmav_attach_mcast +#define ibv_close_device rdmav_close_device +#define ibv_cmd_alloc_pd rdmav_cmd_alloc_pd +#define ibv_cmd_attach_mcast rdmav_cmd_attach_mcast +#define ibv_cmd_create_ah rdmav_cmd_create_ah +#define ibv_cmd_create_cq rdmav_cmd_create_cq +#define ibv_cmd_create_qp rdmav_cmd_create_qp +#define ibv_cmd_create_srq rdmav_cmd_create_srq +#define ibv_cmd_dealloc_pd rdmav_cmd_dealloc_pd +#define ibv_cmd_dereg_mr rdmav_cmd_dereg_mr +#define ibv_cmd_destroy_ah rdmav_cmd_destroy_ah +#define ibv_cmd_destroy_cq rdmav_cmd_destroy_cq +#define ibv_cmd_destroy_qp rdmav_cmd_destroy_qp +#define ibv_cmd_destroy_srq rdmav_cmd_destroy_srq +#define ibv_cmd_detach_mcast rdmav_cmd_detach_mcast +#define ibv_cmd_get_context rdmav_cmd_get_context +#define ibv_cmd_modify_qp rdmav_cmd_modify_qp +#define ibv_cmd_modify_srq rdmav_cmd_modify_srq +#define ibv_cmd_poll_cq rdmav_cmd_poll_cq +#define ibv_cmd_post_recv rdmav_cmd_post_recv +#define ibv_cmd_post_send rdmav_cmd_post_send +#define ibv_cmd_post_srq_recv rdmav_cmd_post_srq_recv +#define ibv_cmd_query_device rdmav_cmd_query_device +#define ibv_cmd_query_port rdmav_cmd_query_port +#define ibv_cmd_query_qp rdmav_cmd_query_qp +#define ibv_cmd_query_srq rdmav_cmd_query_srq +#define ibv_cmd_reg_mr rdmav_cmd_reg_mr +#define ibv_cmd_req_notify_cq rdmav_cmd_req_notify_cq +#define ibv_cmd_resize_cq rdmav_cmd_resize_cq +#define ibv_copy_ah_attr_from_kern rdmav_copy_ah_attr_from_kern +#define ibv_copy_path_rec_from_kern rdmav_copy_path_rec_from_kern +#define ibv_copy_path_rec_to_kern 
rdmav_copy_path_rec_to_kern +#define ibv_copy_qp_attr_from_kern rdmav_copy_qp_attr_from_kern +#define ibv_create_ah rdmav_create_ah +#define ibv_create_comp_channel rdmav_create_comp_channel +#define ibv_create_cq rdmav_create_cq +#define ibv_create_qp rdmav_create_qp +#define ibv_create_srq rdmav_create_srq +#define ibv_dealloc_pd rdmav_dealloc_pd +#define ibv_dereg_mr rdmav_dereg_mr +#define ibv_destroy_ah rdmav_destroy_ah +#define ibv_destroy_comp_channel rdmav_destroy_comp_channel +#define ibv_destroy_cq rdmav_destroy_cq +#define ibv_destroy_qp rdmav_destroy_qp +#define ibv_destroy_srq rdmav_destroy_srq +#define ibv_detach_mcast rdmav_detach_mcast +#define ibv_free_device_list rdmav_free_device_list +#define ibv_get_async_event rdmav_get_async_event +#define ibv_get_cq_event rdmav_get_cq_event +#define ibv_get_device_guid rdmav_get_device_guid +#define ibv_init_ah_from_wc rdmav_init_ah_from_wc +#define ibv_modify_qp rdmav_modify_qp +#define ibv_modify_srq rdmav_modify_srq +#define ibv_poll_cq rdmav_poll_cq +#define ibv_post_recv rdmav_post_recv +#define ibv_post_send rdmav_post_send +#define ibv_post_srq_recv rdmav_post_srq_recv +#define ibv_query_device rdmav_query_device +#define ibv_query_gid rdmav_query_gid +#define ibv_query_pkey rdmav_query_pkey +#define ibv_query_port rdmav_query_port +#define ibv_query_qp rdmav_query_qp +#define ibv_query_srq rdmav_query_srq +#define ibv_rate_to_mult rdmav_rate_to_mult +#define ibv_read_sysfs_file rdmav_read_sysfs_file +#define ibv_reg_mr rdmav_reg_mr +#define ibv_req_notify_cq rdmav_req_notify_cq +#define ibv_resize_cq rdmav_resize_cq + +/* All exported ibv_ data structures */ + +#define ibv_access_flags rdmav_access_flags +#define ibv_ah rdmav_ah +#define ibv_ah_attr rdmav_ah_attr +#define ibv_alloc_pd_resp rdmav_alloc_pd_resp +#define ibv_async_event rdmav_async_event +#define ibv_atomic_cap rdmav_atomic_cap +#define ibv_cmd_query_gid rdmav_cmd_query_gid +#define ibv_cmd_query_pkey rdmav_cmd_query_pkey +#define ibv_comp_channel rdmav_comp_channel +#define ibv_comp_event rdmav_comp_event +#define ibv_context rdmav_context +#define ibv_context_ops rdmav_context_ops +#define ibv_cq rdmav_cq +#define ibv_create_ah_resp rdmav_create_ah_resp +#define ibv_create_comp_channel_resp rdmav_create_comp_channel_resp +#define ibv_create_cq_resp rdmav_create_cq_resp +#define ibv_create_qp_resp rdmav_create_qp_resp +#define ibv_create_srq_resp rdmav_create_srq_resp +#define ibv_destroy_cq_resp rdmav_destroy_cq_resp +#define ibv_destroy_qp_resp rdmav_destroy_qp_resp +#define ibv_destroy_srq_resp rdmav_destroy_srq_resp +#define ibv_device rdmav_device +#define ibv_device_attr rdmav_device_attr +#define ibv_device_cap_flags rdmav_device_cap_flags +#define ibv_device_ops rdmav_device_ops +#define ibv_driver rdmav_driver +#define ibv_event_type rdmav_event_type +#define ibv_get_context rdmav_get_context +#define ibv_get_context_resp rdmav_get_context_resp +#define ibv_gid rdmav_gid +#define ibv_global_route rdmav_global_route +#define ibv_grh rdmav_grh +#define ibv_kern_ah_attr rdmav_kern_ah_attr +#define ibv_kern_async_event rdmav_kern_async_event +#define ibv_kern_global_route rdmav_kern_global_route +#define ibv_kern_path_rec rdmav_kern_path_rec +#define ibv_kern_qp_attr rdmav_kern_qp_attr +#define ibv_kern_recv_wr rdmav_kern_recv_wr +#define ibv_kern_send_wr rdmav_kern_send_wr +#define ibv_kern_wc rdmav_kern_wc +#define ibv_mig_state rdmav_mig_state +#define ibv_mr rdmav_mr +#define ibv_mtu rdmav_mtu +#define ibv_node_type rdmav_node_type +#define ibv_pd 
rdmav_pd +#define ibv_poll_cq_resp rdmav_poll_cq_resp +#define ibv_port_attr rdmav_port_attr +#define ibv_port_state rdmav_port_state +#define ibv_post_recv_resp rdmav_post_recv_resp +#define ibv_post_send_resp rdmav_post_send_resp +#define ibv_post_srq_recv_resp rdmav_post_srq_recv_resp +#define ibv_qp rdmav_qp +#define ibv_qp_attr rdmav_qp_attr +#define ibv_qp_attr_mask rdmav_qp_attr_mask +#define ibv_qp_cap rdmav_qp_cap +#define ibv_qp_dest rdmav_qp_dest +#define ibv_qp_init_attr rdmav_qp_init_attr +#define ibv_qp_state rdmav_qp_state +#define ibv_qp_type rdmav_qp_type +#define ibv_query_device_resp rdmav_query_device_resp +#define ibv_query_params rdmav_query_params +#define ibv_query_params_resp rdmav_query_params_resp +#define ibv_query_port_resp rdmav_query_port_resp +#define ibv_query_qp_resp rdmav_query_qp_resp +#define ibv_query_srq_resp rdmav_query_srq_resp +#define ibv_rate rdmav_rate +#define ibv_recv_wr rdmav_recv_wr +#define ibv_reg_mr_resp rdmav_reg_mr_resp +#define ibv_resize_cq_resp rdmav_resize_cq_resp +#define ibv_sa_mcmember_rec rdmav_sa_mcmember_rec +#define ibv_sa_path_rec rdmav_sa_path_rec +#define ibv_sa_service_rec rdmav_sa_service_rec +#define ibv_send_flags rdmav_send_flags +#define ibv_send_wr rdmav_send_wr +#define ibv_sge rdmav_sge +#define ibv_srq rdmav_srq +#define ibv_srq_attr rdmav_srq_attr +#define ibv_srq_attr_mask rdmav_srq_attr_mask +#define ibv_srq_init_attr rdmav_srq_init_attr +#define ibv_wc rdmav_wc +#define ibv_wc_flags rdmav_wc_flags +#define ibv_wc_opcode rdmav_wc_opcode +#define ibv_wc_status rdmav_wc_status +#define ibv_wr_opcode rdmav_wr_opcode + +/* All declarations needed for compiles */ +extern struct rdmav_device **ibv_get_device_list(int *num); + +#endif /* RDMAV_DEPRECATE_H */ diff -ruNp ORG/libibverbs/include/infiniband/driver.h NEW/libibverbs/include/infiniband/driver.h --- ORG/libibverbs/include/infiniband/driver.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/driver.h 2006-08-02 18:24:49.000000000 -0700 @@ -34,8 +34,8 @@ * $Id: driver.h 7484 2006-05-24 21:12:21Z roland $ */ -#ifndef INFINIBAND_DRIVER_H -#define INFINIBAND_DRIVER_H +#ifndef RDMAV_DRIVER_H +#define RDMAV_DRIVER_H #include #include @@ -57,90 +57,90 @@ * * libibverbs will call each driver's ibv_driver_init() function once * for each InfiniBand device. If the device is one that the driver - * can support, it should return a struct ibv_device * with the ops + * can support, it should return a struct rdmav_device * with the ops * member filled in. If the driver does not support the device, it * should return NULL from openib_driver_init(). 
*/ -typedef struct ibv_device *(*ibv_driver_init_func)(const char *, int); +typedef struct rdmav_device *(*rdmav_driver_init_func)(const char *, int); -int ibv_cmd_get_context(struct ibv_context *context, struct ibv_get_context *cmd, - size_t cmd_size, struct ibv_get_context_resp *resp, +int rdmav_cmd_get_context(struct rdmav_context *context, struct rdmav_get_context *cmd, + size_t cmd_size, struct rdmav_get_context_resp *resp, size_t resp_size); -int ibv_cmd_query_device(struct ibv_context *context, - struct ibv_device_attr *device_attr, +int rdmav_cmd_query_device(struct rdmav_context *context, + struct rdmav_device_attr *device_attr, uint64_t *raw_fw_ver, - struct ibv_query_device *cmd, size_t cmd_size); -int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr, - struct ibv_query_port *cmd, size_t cmd_size); -int ibv_cmd_query_gid(struct ibv_context *context, uint8_t port_num, - int index, union ibv_gid *gid); -int ibv_cmd_query_pkey(struct ibv_context *context, uint8_t port_num, + struct rdmav_query_device *cmd, size_t cmd_size); +int rdmav_cmd_query_port(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr, + struct rdmav_query_port *cmd, size_t cmd_size); +int rdmav_cmd_query_gid(struct rdmav_context *context, uint8_t port_num, + int index, union rdmav_gid *gid); +int rdmav_cmd_query_pkey(struct rdmav_context *context, uint8_t port_num, int index, uint16_t *pkey); -int ibv_cmd_alloc_pd(struct ibv_context *context, struct ibv_pd *pd, - struct ibv_alloc_pd *cmd, size_t cmd_size, - struct ibv_alloc_pd_resp *resp, size_t resp_size); -int ibv_cmd_dealloc_pd(struct ibv_pd *pd); -int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t length, - uint64_t hca_va, enum ibv_access_flags access, - struct ibv_mr *mr, struct ibv_reg_mr *cmd, +int rdmav_cmd_alloc_pd(struct rdmav_context *context, struct rdmav_pd *pd, + struct rdmav_alloc_pd *cmd, size_t cmd_size, + struct rdmav_alloc_pd_resp *resp, size_t resp_size); +int rdmav_cmd_dealloc_pd(struct rdmav_pd *pd); +int rdmav_cmd_reg_mr(struct rdmav_pd *pd, void *addr, size_t length, + uint64_t hca_va, enum rdmav_access_flags access, + struct rdmav_mr *mr, struct rdmav_reg_mr *cmd, size_t cmd_size); -int ibv_cmd_dereg_mr(struct ibv_mr *mr); -int ibv_cmd_create_cq(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, - int comp_vector, struct ibv_cq *cq, - struct ibv_create_cq *cmd, size_t cmd_size, - struct ibv_create_cq_resp *resp, size_t resp_size); -int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); -int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); -int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size); -int ibv_cmd_destroy_cq(struct ibv_cq *cq); - -int ibv_cmd_create_srq(struct ibv_pd *pd, - struct ibv_srq *srq, struct ibv_srq_init_attr *attr, - struct ibv_create_srq *cmd, size_t cmd_size, - struct ibv_create_srq_resp *resp, size_t resp_size); -int ibv_cmd_modify_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask, - struct ibv_modify_srq *cmd, size_t cmd_size); -int ibv_cmd_query_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - struct ibv_query_srq *cmd, size_t cmd_size); -int ibv_cmd_destroy_srq(struct ibv_srq *srq); - -int ibv_cmd_create_qp(struct ibv_pd *pd, - struct ibv_qp *qp, struct ibv_qp_init_attr *attr, - struct ibv_create_qp *cmd, size_t cmd_size, - struct ibv_create_qp_resp *resp, size_t resp_size); -int 
ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *qp_attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *qp_init_attr, - struct ibv_query_qp *cmd, size_t cmd_size); -int ibv_cmd_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_modify_qp *cmd, size_t cmd_size); -int ibv_cmd_destroy_qp(struct ibv_qp *qp); -int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, - struct ibv_send_wr **bad_wr); -int ibv_cmd_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr); -int ibv_cmd_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr); -int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah, - struct ibv_ah_attr *attr); -int ibv_cmd_destroy_ah(struct ibv_ah *ah); -int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); -int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +int rdmav_cmd_dereg_mr(struct rdmav_mr *mr); +int rdmav_cmd_create_cq(struct rdmav_context *context, int cqe, + struct rdmav_comp_channel *channel, + int comp_vector, struct rdmav_cq *cq, + struct rdmav_create_cq *cmd, size_t cmd_size, + struct rdmav_create_cq_resp *resp, size_t resp_size); +int rdmav_cmd_poll_cq(struct rdmav_cq *cq, int ne, struct rdmav_wc *wc); +int rdmav_cmd_req_notify_cq(struct rdmav_cq *cq, int solicited_only); +int rdmav_cmd_resize_cq(struct rdmav_cq *cq, int cqe, + struct rdmav_resize_cq *cmd, size_t cmd_size); +int rdmav_cmd_destroy_cq(struct rdmav_cq *cq); + +int rdmav_cmd_create_srq(struct rdmav_pd *pd, + struct rdmav_srq *srq, struct rdmav_srq_init_attr *attr, + struct rdmav_create_srq *cmd, size_t cmd_size, + struct rdmav_create_srq_resp *resp, size_t resp_size); +int rdmav_cmd_modify_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask, + struct rdmav_modify_srq *cmd, size_t cmd_size); +int rdmav_cmd_query_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + struct rdmav_query_srq *cmd, size_t cmd_size); +int rdmav_cmd_destroy_srq(struct rdmav_srq *srq); + +int rdmav_cmd_create_qp(struct rdmav_pd *pd, + struct rdmav_qp *qp, struct rdmav_qp_init_attr *attr, + struct rdmav_create_qp *cmd, size_t cmd_size, + struct rdmav_create_qp_resp *resp, size_t resp_size); +int rdmav_cmd_query_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *qp_attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *qp_init_attr, + struct rdmav_query_qp *cmd, size_t cmd_size); +int rdmav_cmd_modify_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_modify_qp *cmd, size_t cmd_size); +int rdmav_cmd_destroy_qp(struct rdmav_qp *qp); +int rdmav_cmd_post_send(struct rdmav_qp *ibqp, struct rdmav_send_wr *wr, + struct rdmav_send_wr **bad_wr); +int rdmav_cmd_post_recv(struct rdmav_qp *ibqp, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr); +int rdmav_cmd_post_srq_recv(struct rdmav_srq *srq, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr); +int rdmav_cmd_create_ah(struct rdmav_pd *pd, struct rdmav_ah *ah, + struct rdmav_ah_attr *attr); +int rdmav_cmd_destroy_ah(struct rdmav_ah *ah); +int rdmav_cmd_attach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); +int rdmav_cmd_detach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); /* * sysfs helper functions */ -const char *ibv_get_sysfs_path(void); +const char *rdmav_get_sysfs_path(void); -int ibv_read_sysfs_file(const char 
*dir, const char *file, +int rdmav_read_sysfs_file(const char *dir, const char *file, char *buf, size_t size); -#endif /* INFINIBAND_DRIVER_H */ +#endif /* RDMAV_DRIVER_H */ diff -ruNp ORG/libibverbs/include/infiniband/kern-abi.h NEW/libibverbs/include/infiniband/kern-abi.h --- ORG/libibverbs/include/infiniband/kern-abi.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/kern-abi.h 2006-08-02 18:24:49.000000000 -0700 @@ -47,47 +47,47 @@ /* * The minimum and maximum kernel ABI that we can handle. */ -#define IB_USER_VERBS_MIN_ABI_VERSION 1 -#define IB_USER_VERBS_MAX_ABI_VERSION 6 +#define RDMAV_USER_VERBS_MIN_ABI_VERSION 1 +#define RDMAV_USER_VERBS_MAX_ABI_VERSION 6 enum { - IB_USER_VERBS_CMD_GET_CONTEXT, - IB_USER_VERBS_CMD_QUERY_DEVICE, - IB_USER_VERBS_CMD_QUERY_PORT, - IB_USER_VERBS_CMD_ALLOC_PD, - IB_USER_VERBS_CMD_DEALLOC_PD, - IB_USER_VERBS_CMD_CREATE_AH, - IB_USER_VERBS_CMD_MODIFY_AH, - IB_USER_VERBS_CMD_QUERY_AH, - IB_USER_VERBS_CMD_DESTROY_AH, - IB_USER_VERBS_CMD_REG_MR, - IB_USER_VERBS_CMD_REG_SMR, - IB_USER_VERBS_CMD_REREG_MR, - IB_USER_VERBS_CMD_QUERY_MR, - IB_USER_VERBS_CMD_DEREG_MR, - IB_USER_VERBS_CMD_ALLOC_MW, - IB_USER_VERBS_CMD_BIND_MW, - IB_USER_VERBS_CMD_DEALLOC_MW, - IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL, - IB_USER_VERBS_CMD_CREATE_CQ, - IB_USER_VERBS_CMD_RESIZE_CQ, - IB_USER_VERBS_CMD_DESTROY_CQ, - IB_USER_VERBS_CMD_POLL_CQ, - IB_USER_VERBS_CMD_PEEK_CQ, - IB_USER_VERBS_CMD_REQ_NOTIFY_CQ, - IB_USER_VERBS_CMD_CREATE_QP, - IB_USER_VERBS_CMD_QUERY_QP, - IB_USER_VERBS_CMD_MODIFY_QP, - IB_USER_VERBS_CMD_DESTROY_QP, - IB_USER_VERBS_CMD_POST_SEND, - IB_USER_VERBS_CMD_POST_RECV, - IB_USER_VERBS_CMD_ATTACH_MCAST, - IB_USER_VERBS_CMD_DETACH_MCAST, - IB_USER_VERBS_CMD_CREATE_SRQ, - IB_USER_VERBS_CMD_MODIFY_SRQ, - IB_USER_VERBS_CMD_QUERY_SRQ, - IB_USER_VERBS_CMD_DESTROY_SRQ, - IB_USER_VERBS_CMD_POST_SRQ_RECV + RDMAV_USER_VERBS_CMD_GET_CONTEXT, + RDMAV_USER_VERBS_CMD_QUERY_DEVICE, + RDMAV_USER_VERBS_CMD_QUERY_PORT, + RDMAV_USER_VERBS_CMD_ALLOC_PD, + RDMAV_USER_VERBS_CMD_DEALLOC_PD, + RDMAV_USER_VERBS_CMD_CREATE_AH, + RDMAV_USER_VERBS_CMD_MODIFY_AH, + RDMAV_USER_VERBS_CMD_QUERY_AH, + RDMAV_USER_VERBS_CMD_DESTROY_AH, + RDMAV_USER_VERBS_CMD_REG_MR, + RDMAV_USER_VERBS_CMD_REG_SMR, + RDMAV_USER_VERBS_CMD_REREG_MR, + RDMAV_USER_VERBS_CMD_QUERY_MR, + RDMAV_USER_VERBS_CMD_DEREG_MR, + RDMAV_USER_VERBS_CMD_ALLOC_MW, + RDMAV_USER_VERBS_CMD_BIND_MW, + RDMAV_USER_VERBS_CMD_DEALLOC_MW, + RDMAV_USER_VERBS_CMD_CREATE_COMP_CHANNEL, + RDMAV_USER_VERBS_CMD_CREATE_CQ, + RDMAV_USER_VERBS_CMD_RESIZE_CQ, + RDMAV_USER_VERBS_CMD_DESTROY_CQ, + RDMAV_USER_VERBS_CMD_POLL_CQ, + RDMAV_USER_VERBS_CMD_PEEK_CQ, + RDMAV_USER_VERBS_CMD_REQ_NOTIFY_CQ, + RDMAV_USER_VERBS_CMD_CREATE_QP, + RDMAV_USER_VERBS_CMD_QUERY_QP, + RDMAV_USER_VERBS_CMD_MODIFY_QP, + RDMAV_USER_VERBS_CMD_DESTROY_QP, + RDMAV_USER_VERBS_CMD_POST_SEND, + RDMAV_USER_VERBS_CMD_POST_RECV, + RDMAV_USER_VERBS_CMD_ATTACH_MCAST, + RDMAV_USER_VERBS_CMD_DETACH_MCAST, + RDMAV_USER_VERBS_CMD_CREATE_SRQ, + RDMAV_USER_VERBS_CMD_MODIFY_SRQ, + RDMAV_USER_VERBS_CMD_QUERY_SRQ, + RDMAV_USER_VERBS_CMD_DESTROY_SRQ, + RDMAV_USER_VERBS_CMD_POST_SRQ_RECV }; /* @@ -101,13 +101,13 @@ enum { * different between 32-bit and 64-bit architectures. */ -struct ibv_kern_async_event { +struct rdmav_kern_async_event { __u64 element; __u32 event_type; __u32 reserved; }; -struct ibv_comp_event { +struct rdmav_comp_event { __u64 cq_handle; }; @@ -119,18 +119,18 @@ struct ibv_comp_event { * the rest of the command struct based on these value. 
*/ -struct ibv_query_params { +struct rdmav_query_params { __u32 command; __u16 in_words; __u16 out_words; __u64 response; }; -struct ibv_query_params_resp { +struct rdmav_query_params_resp { __u32 num_cq_events; }; -struct ibv_get_context { +struct rdmav_get_context { __u32 command; __u16 in_words; __u16 out_words; @@ -138,12 +138,12 @@ struct ibv_get_context { __u64 driver_data[0]; }; -struct ibv_get_context_resp { +struct rdmav_get_context_resp { __u32 async_fd; __u32 num_comp_vectors; }; -struct ibv_query_device { +struct rdmav_query_device { __u32 command; __u16 in_words; __u16 out_words; @@ -151,7 +151,7 @@ struct ibv_query_device { __u64 driver_data[0]; }; -struct ibv_query_device_resp { +struct rdmav_query_device_resp { __u64 fw_ver; __u64 node_guid; __u64 sys_image_guid; @@ -195,7 +195,7 @@ struct ibv_query_device_resp { __u8 reserved[4]; }; -struct ibv_query_port { +struct rdmav_query_port { __u32 command; __u16 in_words; __u16 out_words; @@ -205,7 +205,7 @@ struct ibv_query_port { __u64 driver_data[0]; }; -struct ibv_query_port_resp { +struct rdmav_query_port_resp { __u32 port_cap_flags; __u32 max_msg_sz; __u32 bad_pkey_cntr; @@ -228,7 +228,7 @@ struct ibv_query_port_resp { __u8 reserved[3]; }; -struct ibv_alloc_pd { +struct rdmav_alloc_pd { __u32 command; __u16 in_words; __u16 out_words; @@ -236,18 +236,18 @@ struct ibv_alloc_pd { __u64 driver_data[0]; }; -struct ibv_alloc_pd_resp { +struct rdmav_alloc_pd_resp { __u32 pd_handle; }; -struct ibv_dealloc_pd { +struct rdmav_dealloc_pd { __u32 command; __u16 in_words; __u16 out_words; __u32 pd_handle; }; -struct ibv_reg_mr { +struct rdmav_reg_mr { __u32 command; __u16 in_words; __u16 out_words; @@ -260,31 +260,31 @@ struct ibv_reg_mr { __u64 driver_data[0]; }; -struct ibv_reg_mr_resp { +struct rdmav_reg_mr_resp { __u32 mr_handle; __u32 lkey; __u32 rkey; }; -struct ibv_dereg_mr { +struct rdmav_dereg_mr { __u32 command; __u16 in_words; __u16 out_words; __u32 mr_handle; }; -struct ibv_create_comp_channel { +struct rdmav_create_comp_channel { __u32 command; __u16 in_words; __u16 out_words; __u64 response; }; -struct ibv_create_comp_channel_resp { +struct rdmav_create_comp_channel_resp { __u32 fd; }; -struct ibv_create_cq { +struct rdmav_create_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -297,12 +297,12 @@ struct ibv_create_cq { __u64 driver_data[0]; }; -struct ibv_create_cq_resp { +struct rdmav_create_cq_resp { __u32 cq_handle; __u32 cqe; }; -struct ibv_kern_wc { +struct rdmav_kern_wc { __u64 wr_id; __u32 status; __u32 opcode; @@ -320,7 +320,7 @@ struct ibv_kern_wc { __u8 reserved; }; -struct ibv_poll_cq { +struct rdmav_poll_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -329,13 +329,13 @@ struct ibv_poll_cq { __u32 ne; }; -struct ibv_poll_cq_resp { +struct rdmav_poll_cq_resp { __u32 count; __u32 reserved; - struct ibv_kern_wc wc[0]; + struct rdmav_kern_wc wc[0]; }; -struct ibv_req_notify_cq { +struct rdmav_req_notify_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -343,7 +343,7 @@ struct ibv_req_notify_cq { __u32 solicited; }; -struct ibv_resize_cq { +struct rdmav_resize_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -353,11 +353,11 @@ struct ibv_resize_cq { __u64 driver_data[0]; }; -struct ibv_resize_cq_resp { +struct rdmav_resize_cq_resp { __u32 cqe; }; -struct ibv_destroy_cq { +struct rdmav_destroy_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -366,12 +366,12 @@ struct ibv_destroy_cq { __u32 reserved; }; -struct ibv_destroy_cq_resp { +struct rdmav_destroy_cq_resp { __u32 
comp_events_reported; __u32 async_events_reported; }; -struct ibv_kern_global_route { +struct rdmav_kern_global_route { __u8 dgid[16]; __u32 flow_label; __u8 sgid_index; @@ -380,8 +380,8 @@ struct ibv_kern_global_route { __u8 reserved; }; -struct ibv_kern_ah_attr { - struct ibv_kern_global_route grh; +struct rdmav_kern_ah_attr { + struct rdmav_kern_global_route grh; __u16 dlid; __u8 sl; __u8 src_path_bits; @@ -391,7 +391,7 @@ struct ibv_kern_ah_attr { __u8 reserved; }; -struct ibv_kern_qp_attr { +struct rdmav_kern_qp_attr { __u32 qp_attr_mask; __u32 qp_state; __u32 cur_qp_state; @@ -403,8 +403,8 @@ struct ibv_kern_qp_attr { __u32 dest_qp_num; __u32 qp_access_flags; - struct ibv_kern_ah_attr ah_attr; - struct ibv_kern_ah_attr alt_ah_attr; + struct rdmav_kern_ah_attr ah_attr; + struct rdmav_kern_ah_attr alt_ah_attr; /* ib_qp_cap */ __u32 max_send_wr; @@ -429,7 +429,7 @@ struct ibv_kern_qp_attr { __u8 reserved[5]; }; -struct ibv_create_qp { +struct rdmav_create_qp { __u32 command; __u16 in_words; __u16 out_words; @@ -451,7 +451,7 @@ struct ibv_create_qp { __u64 driver_data[0]; }; -struct ibv_create_qp_resp { +struct rdmav_create_qp_resp { __u32 qp_handle; __u32 qpn; __u32 max_send_wr; @@ -462,7 +462,7 @@ struct ibv_create_qp_resp { __u32 reserved; }; -struct ibv_qp_dest { +struct rdmav_qp_dest { __u8 dgid[16]; __u32 flow_label; __u16 dlid; @@ -477,7 +477,7 @@ struct ibv_qp_dest { __u8 port_num; }; -struct ibv_query_qp { +struct rdmav_query_qp { __u32 command; __u16 in_words; __u16 out_words; @@ -487,9 +487,9 @@ struct ibv_query_qp { __u64 driver_data[0]; }; -struct ibv_query_qp_resp { - struct ibv_qp_dest dest; - struct ibv_qp_dest alt_dest; +struct rdmav_query_qp_resp { + struct rdmav_qp_dest dest; + struct rdmav_qp_dest alt_dest; __u32 max_send_wr; __u32 max_recv_wr; __u32 max_send_sge; @@ -521,12 +521,12 @@ struct ibv_query_qp_resp { __u64 driver_data[0]; }; -struct ibv_modify_qp { +struct rdmav_modify_qp { __u32 command; __u16 in_words; __u16 out_words; - struct ibv_qp_dest dest; - struct ibv_qp_dest alt_dest; + struct rdmav_qp_dest dest; + struct rdmav_qp_dest alt_dest; __u32 qp_handle; __u32 attr_mask; __u32 qkey; @@ -554,7 +554,7 @@ struct ibv_modify_qp { __u64 driver_data[0]; }; -struct ibv_destroy_qp { +struct rdmav_destroy_qp { __u32 command; __u16 in_words; __u16 out_words; @@ -563,11 +563,11 @@ struct ibv_destroy_qp { __u32 reserved; }; -struct ibv_destroy_qp_resp { +struct rdmav_destroy_qp_resp { __u32 events_reported; }; -struct ibv_kern_send_wr { +struct rdmav_kern_send_wr { __u64 wr_id; __u32 num_sge; __u32 opcode; @@ -595,7 +595,7 @@ struct ibv_kern_send_wr { } wr; }; -struct ibv_post_send { +struct rdmav_post_send { __u32 command; __u16 in_words; __u16 out_words; @@ -604,20 +604,20 @@ struct ibv_post_send { __u32 wr_count; __u32 sge_count; __u32 wqe_size; - struct ibv_kern_send_wr send_wr[0]; + struct rdmav_kern_send_wr send_wr[0]; }; -struct ibv_post_send_resp { +struct rdmav_post_send_resp { __u32 bad_wr; }; -struct ibv_kern_recv_wr { +struct rdmav_kern_recv_wr { __u64 wr_id; __u32 num_sge; __u32 reserved; }; -struct ibv_post_recv { +struct rdmav_post_recv { __u32 command; __u16 in_words; __u16 out_words; @@ -626,14 +626,14 @@ struct ibv_post_recv { __u32 wr_count; __u32 sge_count; __u32 wqe_size; - struct ibv_kern_recv_wr recv_wr[0]; + struct rdmav_kern_recv_wr recv_wr[0]; }; -struct ibv_post_recv_resp { +struct rdmav_post_recv_resp { __u32 bad_wr; }; -struct ibv_post_srq_recv { +struct rdmav_post_srq_recv { __u32 command; __u16 in_words; __u16 out_words; @@ -642,14 
+642,14 @@ struct ibv_post_srq_recv { __u32 wr_count; __u32 sge_count; __u32 wqe_size; - struct ibv_kern_recv_wr recv_wr[0]; + struct rdmav_kern_recv_wr recv_wr[0]; }; -struct ibv_post_srq_recv_resp { +struct rdmav_post_srq_recv_resp { __u32 bad_wr; }; -struct ibv_create_ah { +struct rdmav_create_ah { __u32 command; __u16 in_words; __u16 out_words; @@ -657,21 +657,21 @@ struct ibv_create_ah { __u64 user_handle; __u32 pd_handle; __u32 reserved; - struct ibv_kern_ah_attr attr; + struct rdmav_kern_ah_attr attr; }; -struct ibv_create_ah_resp { +struct rdmav_create_ah_resp { __u32 handle; }; -struct ibv_destroy_ah { +struct rdmav_destroy_ah { __u32 command; __u16 in_words; __u16 out_words; __u32 ah_handle; }; -struct ibv_attach_mcast { +struct rdmav_attach_mcast { __u32 command; __u16 in_words; __u16 out_words; @@ -682,7 +682,7 @@ struct ibv_attach_mcast { __u64 driver_data[0]; }; -struct ibv_detach_mcast { +struct rdmav_detach_mcast { __u32 command; __u16 in_words; __u16 out_words; @@ -693,7 +693,7 @@ struct ibv_detach_mcast { __u64 driver_data[0]; }; -struct ibv_create_srq { +struct rdmav_create_srq { __u32 command; __u16 in_words; __u16 out_words; @@ -706,14 +706,14 @@ struct ibv_create_srq { __u64 driver_data[0]; }; -struct ibv_create_srq_resp { +struct rdmav_create_srq_resp { __u32 srq_handle; __u32 max_wr; __u32 max_sge; __u32 reserved; }; -struct ibv_modify_srq { +struct rdmav_modify_srq { __u32 command; __u16 in_words; __u16 out_words; @@ -724,7 +724,7 @@ struct ibv_modify_srq { __u64 driver_data[0]; }; -struct ibv_query_srq { +struct rdmav_query_srq { __u32 command; __u16 in_words; __u16 out_words; @@ -734,14 +734,14 @@ struct ibv_query_srq { __u64 driver_data[0]; }; -struct ibv_query_srq_resp { +struct rdmav_query_srq_resp { __u32 max_wr; __u32 max_sge; __u32 srq_limit; __u32 reserved; }; -struct ibv_destroy_srq { +struct rdmav_destroy_srq { __u32 command; __u16 in_words; __u16 out_words; @@ -750,7 +750,7 @@ struct ibv_destroy_srq { __u32 reserved; }; -struct ibv_destroy_srq_resp { +struct rdmav_destroy_srq_resp { __u32 events_reported; }; @@ -759,74 +759,74 @@ struct ibv_destroy_srq_resp { */ enum { - IB_USER_VERBS_CMD_QUERY_PARAMS_V2, - IB_USER_VERBS_CMD_GET_CONTEXT_V2, - IB_USER_VERBS_CMD_QUERY_DEVICE_V2, - IB_USER_VERBS_CMD_QUERY_PORT_V2, - IB_USER_VERBS_CMD_QUERY_GID_V2, - IB_USER_VERBS_CMD_QUERY_PKEY_V2, - IB_USER_VERBS_CMD_ALLOC_PD_V2, - IB_USER_VERBS_CMD_DEALLOC_PD_V2, - IB_USER_VERBS_CMD_CREATE_AH_V2, - IB_USER_VERBS_CMD_MODIFY_AH_V2, - IB_USER_VERBS_CMD_QUERY_AH_V2, - IB_USER_VERBS_CMD_DESTROY_AH_V2, - IB_USER_VERBS_CMD_REG_MR_V2, - IB_USER_VERBS_CMD_REG_SMR_V2, - IB_USER_VERBS_CMD_REREG_MR_V2, - IB_USER_VERBS_CMD_QUERY_MR_V2, - IB_USER_VERBS_CMD_DEREG_MR_V2, - IB_USER_VERBS_CMD_ALLOC_MW_V2, - IB_USER_VERBS_CMD_BIND_MW_V2, - IB_USER_VERBS_CMD_DEALLOC_MW_V2, - IB_USER_VERBS_CMD_CREATE_CQ_V2, - IB_USER_VERBS_CMD_RESIZE_CQ_V2, - IB_USER_VERBS_CMD_DESTROY_CQ_V2, - IB_USER_VERBS_CMD_POLL_CQ_V2, - IB_USER_VERBS_CMD_PEEK_CQ_V2, - IB_USER_VERBS_CMD_REQ_NOTIFY_CQ_V2, - IB_USER_VERBS_CMD_CREATE_QP_V2, - IB_USER_VERBS_CMD_QUERY_QP_V2, - IB_USER_VERBS_CMD_MODIFY_QP_V2, - IB_USER_VERBS_CMD_DESTROY_QP_V2, - IB_USER_VERBS_CMD_POST_SEND_V2, - IB_USER_VERBS_CMD_POST_RECV_V2, - IB_USER_VERBS_CMD_ATTACH_MCAST_V2, - IB_USER_VERBS_CMD_DETACH_MCAST_V2, - IB_USER_VERBS_CMD_CREATE_SRQ_V2, - IB_USER_VERBS_CMD_MODIFY_SRQ_V2, - IB_USER_VERBS_CMD_QUERY_SRQ_V2, - IB_USER_VERBS_CMD_DESTROY_SRQ_V2, - IB_USER_VERBS_CMD_POST_SRQ_RECV_V2, + RDMAV_USER_VERBS_CMD_QUERY_PARAMS_V2, + 
RDMAV_USER_VERBS_CMD_GET_CONTEXT_V2, + RDMAV_USER_VERBS_CMD_QUERY_DEVICE_V2, + RDMAV_USER_VERBS_CMD_QUERY_PORT_V2, + RDMAV_USER_VERBS_CMD_QUERY_GID_V2, + RDMAV_USER_VERBS_CMD_QUERY_PKEY_V2, + RDMAV_USER_VERBS_CMD_ALLOC_PD_V2, + RDMAV_USER_VERBS_CMD_DEALLOC_PD_V2, + RDMAV_USER_VERBS_CMD_CREATE_AH_V2, + RDMAV_USER_VERBS_CMD_MODIFY_AH_V2, + RDMAV_USER_VERBS_CMD_QUERY_AH_V2, + RDMAV_USER_VERBS_CMD_DESTROY_AH_V2, + RDMAV_USER_VERBS_CMD_REG_MR_V2, + RDMAV_USER_VERBS_CMD_REG_SMR_V2, + RDMAV_USER_VERBS_CMD_REREG_MR_V2, + RDMAV_USER_VERBS_CMD_QUERY_MR_V2, + RDMAV_USER_VERBS_CMD_DEREG_MR_V2, + RDMAV_USER_VERBS_CMD_ALLOC_MW_V2, + RDMAV_USER_VERBS_CMD_BIND_MW_V2, + RDMAV_USER_VERBS_CMD_DEALLOC_MW_V2, + RDMAV_USER_VERBS_CMD_CREATE_CQ_V2, + RDMAV_USER_VERBS_CMD_RESIZE_CQ_V2, + RDMAV_USER_VERBS_CMD_DESTROY_CQ_V2, + RDMAV_USER_VERBS_CMD_POLL_CQ_V2, + RDMAV_USER_VERBS_CMD_PEEK_CQ_V2, + RDMAV_USER_VERBS_CMD_REQ_NOTIFY_CQ_V2, + RDMAV_USER_VERBS_CMD_CREATE_QP_V2, + RDMAV_USER_VERBS_CMD_QUERY_QP_V2, + RDMAV_USER_VERBS_CMD_MODIFY_QP_V2, + RDMAV_USER_VERBS_CMD_DESTROY_QP_V2, + RDMAV_USER_VERBS_CMD_POST_SEND_V2, + RDMAV_USER_VERBS_CMD_POST_RECV_V2, + RDMAV_USER_VERBS_CMD_ATTACH_MCAST_V2, + RDMAV_USER_VERBS_CMD_DETACH_MCAST_V2, + RDMAV_USER_VERBS_CMD_CREATE_SRQ_V2, + RDMAV_USER_VERBS_CMD_MODIFY_SRQ_V2, + RDMAV_USER_VERBS_CMD_QUERY_SRQ_V2, + RDMAV_USER_VERBS_CMD_DESTROY_SRQ_V2, + RDMAV_USER_VERBS_CMD_POST_SRQ_RECV_V2, /* * Set commands that didn't exist to -1 so our compile-time * trick opcodes in IBV_INIT_CMD() doesn't break. */ - IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL_V2 = -1, + RDMAV_USER_VERBS_CMD_CREATE_COMP_CHANNEL_V2 = -1, }; -struct ibv_destroy_cq_v1 { +struct rdmav_destroy_cq_v1 { __u32 command; __u16 in_words; __u16 out_words; __u32 cq_handle; }; -struct ibv_destroy_qp_v1 { +struct rdmav_destroy_qp_v1 { __u32 command; __u16 in_words; __u16 out_words; __u32 qp_handle; }; -struct ibv_destroy_srq_v1 { +struct rdmav_destroy_srq_v1 { __u32 command; __u16 in_words; __u16 out_words; __u32 srq_handle; }; -struct ibv_get_context_v2 { +struct rdmav_get_context_v2 { __u32 command; __u16 in_words; __u16 out_words; @@ -835,7 +835,7 @@ struct ibv_get_context_v2 { __u64 driver_data[0]; }; -struct ibv_create_cq_v2 { +struct rdmav_create_cq_v2 { __u32 command; __u16 in_words; __u16 out_words; @@ -846,7 +846,7 @@ struct ibv_create_cq_v2 { __u64 driver_data[0]; }; -struct ibv_modify_srq_v3 { +struct rdmav_modify_srq_v3 { __u32 command; __u16 in_words; __u16 out_words; @@ -859,12 +859,12 @@ struct ibv_modify_srq_v3 { __u64 driver_data[0]; }; -struct ibv_create_qp_resp_v3 { +struct rdmav_create_qp_resp_v3 { __u32 qp_handle; __u32 qpn; }; -struct ibv_create_qp_resp_v4 { +struct rdmav_create_qp_resp_v4 { __u32 qp_handle; __u32 qpn; __u32 max_send_wr; @@ -874,7 +874,7 @@ struct ibv_create_qp_resp_v4 { __u32 max_inline_data; }; -struct ibv_create_srq_resp_v5 { +struct rdmav_create_srq_resp_v5 { __u32 srq_handle; }; diff -ruNp ORG/libibverbs/include/infiniband/marshall.h NEW/libibverbs/include/infiniband/marshall.h --- ORG/libibverbs/include/infiniband/marshall.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/marshall.h 2006-08-02 18:24:49.000000000 -0700 @@ -30,8 +30,8 @@ * SOFTWARE. 
*/ -#ifndef INFINIBAND_MARSHALL_H -#define INFINIBAND_MARSHALL_H +#ifndef RDMAV_MARSHALL_H +#define RDMAV_MARSHALL_H #include #include @@ -48,18 +48,18 @@ BEGIN_C_DECLS -void ibv_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, - struct ibv_kern_qp_attr *src); +void rdmav_copy_qp_attr_from_kern(struct rdmav_qp_attr *dst, + struct rdmav_kern_qp_attr *src); -void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, - struct ibv_kern_ah_attr *src); +void rdmav_copy_ah_attr_from_kern(struct rdmav_ah_attr *dst, + struct rdmav_kern_ah_attr *src); -void ibv_copy_path_rec_from_kern(struct ibv_sa_path_rec *dst, - struct ibv_kern_path_rec *src); +void rdmav_copy_path_rec_from_kern(struct rdmav_sa_path_rec *dst, + struct rdmav_kern_path_rec *src); -void ibv_copy_path_rec_to_kern(struct ibv_kern_path_rec *dst, - struct ibv_sa_path_rec *src); +void rdmav_copy_path_rec_to_kern(struct rdmav_kern_path_rec *dst, + struct rdmav_sa_path_rec *src); END_C_DECLS -#endif /* INFINIBAND_MARSHALL_H */ +#endif /* RDMAV_MARSHALL_H */ diff -ruNp ORG/libibverbs/include/infiniband/opcode.h NEW/libibverbs/include/infiniband/opcode.h --- ORG/libibverbs/include/infiniband/opcode.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/opcode.h 2006-08-03 17:42:30.000000000 -0700 @@ -32,118 +32,119 @@ * $Id: opcode.h 1989 2005-03-14 20:25:13Z roland $ */ -#ifndef INFINIBAND_OPCODE_H -#define INFINIBAND_OPCODE_H +#ifndef RDMAV_OPCODE_H +#define RDMAV_OPCODE_H /* * This macro cleans up the definitions of constants for BTH opcodes. - * It is used to define constants such as IBV_OPCODE_UD_SEND_ONLY, - * which becomes IBV_OPCODE_UD + IBV_OPCODE_SEND_ONLY, and this gives + * It is used to define constants such as RDMAV_OPCODE_UD_SEND_ONLY, + * which becomes RDMAV_OPCODE_UD + RDMAV_OPCODE_SEND_ONLY, and this gives * the correct value. * * In short, user code should use the constants defined using the * macro rather than worrying about adding together other constants. 
*/ -#define IBV_OPCODE(transport, op) \ - IBV_OPCODE_ ## transport ## _ ## op = \ - IBV_OPCODE_ ## transport + IBV_OPCODE_ ## op + +#define RDMAV_OPCODE(transport, op) \ + RDMAV_OPCODE_ ## transport ## _ ## op = \ + RDMAV_OPCODE_ ## transport + RDMAV_OPCODE_ ## op enum { /* transport types -- just used to define real constants */ - IBV_OPCODE_RC = 0x00, - IBV_OPCODE_UC = 0x20, - IBV_OPCODE_RD = 0x40, - IBV_OPCODE_UD = 0x60, + RDMAV_OPCODE_RC = 0x00, + RDMAV_OPCODE_UC = 0x20, + RDMAV_OPCODE_RD = 0x40, + RDMAV_OPCODE_UD = 0x60, /* operations -- just used to define real constants */ - IBV_OPCODE_SEND_FIRST = 0x00, - IBV_OPCODE_SEND_MIDDLE = 0x01, - IBV_OPCODE_SEND_LAST = 0x02, - IBV_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, - IBV_OPCODE_SEND_ONLY = 0x04, - IBV_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, - IBV_OPCODE_RDMA_WRITE_FIRST = 0x06, - IBV_OPCODE_RDMA_WRITE_MIDDLE = 0x07, - IBV_OPCODE_RDMA_WRITE_LAST = 0x08, - IBV_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, - IBV_OPCODE_RDMA_WRITE_ONLY = 0x0a, - IBV_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, - IBV_OPCODE_RDMA_READ_REQUEST = 0x0c, - IBV_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, - IBV_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, - IBV_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, - IBV_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, - IBV_OPCODE_ACKNOWLEDGE = 0x11, - IBV_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, - IBV_OPCODE_COMPARE_SWAP = 0x13, - IBV_OPCODE_FETCH_ADD = 0x14, + RDMAV_OPCODE_SEND_FIRST = 0x00, + RDMAV_OPCODE_SEND_MIDDLE = 0x01, + RDMAV_OPCODE_SEND_LAST = 0x02, + RDMAV_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + RDMAV_OPCODE_SEND_ONLY = 0x04, + RDMAV_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + RDMAV_OPCODE_RDMA_WRITE_FIRST = 0x06, + RDMAV_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + RDMAV_OPCODE_RDMA_WRITE_LAST = 0x08, + RDMAV_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + RDMAV_OPCODE_RDMA_WRITE_ONLY = 0x0a, + RDMAV_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + RDMAV_OPCODE_RDMA_READ_REQUEST = 0x0c, + RDMAV_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + RDMAV_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + RDMAV_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + RDMAV_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + RDMAV_OPCODE_ACKNOWLEDGE = 0x11, + RDMAV_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + RDMAV_OPCODE_COMPARE_SWAP = 0x13, + RDMAV_OPCODE_FETCH_ADD = 0x14, - /* real constants follow -- see comment about above IBV_OPCODE() + /* real constants follow -- see comment about above RDMAV_OPCODE() macro for more details */ /* RC */ - IBV_OPCODE(RC, SEND_FIRST), - IBV_OPCODE(RC, SEND_MIDDLE), - IBV_OPCODE(RC, SEND_LAST), - IBV_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), - IBV_OPCODE(RC, SEND_ONLY), - IBV_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(RC, RDMA_WRITE_FIRST), - IBV_OPCODE(RC, RDMA_WRITE_MIDDLE), - IBV_OPCODE(RC, RDMA_WRITE_LAST), - IBV_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), - IBV_OPCODE(RC, RDMA_WRITE_ONLY), - IBV_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(RC, RDMA_READ_REQUEST), - IBV_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), - IBV_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), - IBV_OPCODE(RC, RDMA_READ_RESPONSE_LAST), - IBV_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), - IBV_OPCODE(RC, ACKNOWLEDGE), - IBV_OPCODE(RC, ATOMIC_ACKNOWLEDGE), - IBV_OPCODE(RC, COMPARE_SWAP), - IBV_OPCODE(RC, FETCH_ADD), + RDMAV_OPCODE(RC, SEND_FIRST), + RDMAV_OPCODE(RC, SEND_MIDDLE), + RDMAV_OPCODE(RC, SEND_LAST), + RDMAV_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(RC, SEND_ONLY), + RDMAV_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(RC, RDMA_WRITE_FIRST), + RDMAV_OPCODE(RC, 
RDMA_WRITE_MIDDLE), + RDMAV_OPCODE(RC, RDMA_WRITE_LAST), + RDMAV_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(RC, RDMA_WRITE_ONLY), + RDMAV_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(RC, RDMA_READ_REQUEST), + RDMAV_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + RDMAV_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + RDMAV_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + RDMAV_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + RDMAV_OPCODE(RC, ACKNOWLEDGE), + RDMAV_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + RDMAV_OPCODE(RC, COMPARE_SWAP), + RDMAV_OPCODE(RC, FETCH_ADD), /* UC */ - IBV_OPCODE(UC, SEND_FIRST), - IBV_OPCODE(UC, SEND_MIDDLE), - IBV_OPCODE(UC, SEND_LAST), - IBV_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), - IBV_OPCODE(UC, SEND_ONLY), - IBV_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(UC, RDMA_WRITE_FIRST), - IBV_OPCODE(UC, RDMA_WRITE_MIDDLE), - IBV_OPCODE(UC, RDMA_WRITE_LAST), - IBV_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), - IBV_OPCODE(UC, RDMA_WRITE_ONLY), - IBV_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(UC, SEND_FIRST), + RDMAV_OPCODE(UC, SEND_MIDDLE), + RDMAV_OPCODE(UC, SEND_LAST), + RDMAV_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(UC, SEND_ONLY), + RDMAV_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(UC, RDMA_WRITE_FIRST), + RDMAV_OPCODE(UC, RDMA_WRITE_MIDDLE), + RDMAV_OPCODE(UC, RDMA_WRITE_LAST), + RDMAV_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(UC, RDMA_WRITE_ONLY), + RDMAV_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), /* RD */ - IBV_OPCODE(RD, SEND_FIRST), - IBV_OPCODE(RD, SEND_MIDDLE), - IBV_OPCODE(RD, SEND_LAST), - IBV_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), - IBV_OPCODE(RD, SEND_ONLY), - IBV_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(RD, RDMA_WRITE_FIRST), - IBV_OPCODE(RD, RDMA_WRITE_MIDDLE), - IBV_OPCODE(RD, RDMA_WRITE_LAST), - IBV_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), - IBV_OPCODE(RD, RDMA_WRITE_ONLY), - IBV_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(RD, RDMA_READ_REQUEST), - IBV_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), - IBV_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), - IBV_OPCODE(RD, RDMA_READ_RESPONSE_LAST), - IBV_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), - IBV_OPCODE(RD, ACKNOWLEDGE), - IBV_OPCODE(RD, ATOMIC_ACKNOWLEDGE), - IBV_OPCODE(RD, COMPARE_SWAP), - IBV_OPCODE(RD, FETCH_ADD), + RDMAV_OPCODE(RD, SEND_FIRST), + RDMAV_OPCODE(RD, SEND_MIDDLE), + RDMAV_OPCODE(RD, SEND_LAST), + RDMAV_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(RD, SEND_ONLY), + RDMAV_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(RD, RDMA_WRITE_FIRST), + RDMAV_OPCODE(RD, RDMA_WRITE_MIDDLE), + RDMAV_OPCODE(RD, RDMA_WRITE_LAST), + RDMAV_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(RD, RDMA_WRITE_ONLY), + RDMAV_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(RD, RDMA_READ_REQUEST), + RDMAV_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + RDMAV_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + RDMAV_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + RDMAV_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + RDMAV_OPCODE(RD, ACKNOWLEDGE), + RDMAV_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + RDMAV_OPCODE(RD, COMPARE_SWAP), + RDMAV_OPCODE(RD, FETCH_ADD), /* UD */ - IBV_OPCODE(UD, SEND_ONLY), - IBV_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) + RDMAV_OPCODE(UD, SEND_ONLY), + RDMAV_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) }; -#endif /* INFINIBAND_OPCODE_H */ +#endif /* RDMAV_OPCODE_H */ diff -ruNp ORG/libibverbs/include/infiniband/sa-kern-abi.h NEW/libibverbs/include/infiniband/sa-kern-abi.h --- ORG/libibverbs/include/infiniband/sa-kern-abi.h 2006-07-30 21:18:16.000000000 -0700 +++ 
NEW/libibverbs/include/infiniband/sa-kern-abi.h 2006-08-02 18:24:49.000000000 -0700 @@ -30,8 +30,8 @@ * SOFTWARE. */ -#ifndef INFINIBAND_SA_KERN_ABI_H -#define INFINIBAND_SA_KERN_ABI_H +#ifndef RDMAV_SA_KERN_ABI_H +#define RDMAV_SA_KERN_ABI_H #include @@ -40,7 +40,7 @@ */ #define ib_kern_path_rec ibv_kern_path_rec -struct ibv_kern_path_rec { +struct rdmav_kern_path_rec { __u8 dgid[16]; __u8 sgid[16]; __u16 dlid; @@ -62,4 +62,4 @@ struct ibv_kern_path_rec { __u8 preference; }; -#endif /* INFINIBAND_SA_KERN_ABI_H */ +#endif /* RDMAV_SA_KERN_ABI_H */ diff -ruNp ORG/libibverbs/include/infiniband/sa.h NEW/libibverbs/include/infiniband/sa.h --- ORG/libibverbs/include/infiniband/sa.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/sa.h 2006-08-02 18:24:49.000000000 -0700 @@ -33,16 +33,16 @@ * $Id: sa.h 2616 2005-06-15 15:22:39Z halr $ */ -#ifndef INFINIBAND_SA_H -#define INFINIBAND_SA_H +#ifndef RDMAV_SA_H +#define RDMAV_SA_H #include -struct ibv_sa_path_rec { +struct rdmav_sa_path_rec { /* reserved */ /* reserved */ - union ibv_gid dgid; - union ibv_gid sgid; + union rdmav_gid dgid; + union rdmav_gid sgid; uint16_t dlid; uint16_t slid; int raw_traffic; @@ -64,9 +64,9 @@ struct ibv_sa_path_rec { uint8_t preference; }; -struct ibv_sa_mcmember_rec { - union ibv_gid mgid; - union ibv_gid port_gid; +struct rdmav_sa_mcmember_rec { + union rdmav_gid mgid; + union rdmav_gid port_gid; uint32_t qkey; uint16_t mlid; uint8_t mtu_selector; @@ -85,9 +85,9 @@ struct ibv_sa_mcmember_rec { int proxy_join; }; -struct ibv_sa_service_rec { +struct rdmav_sa_service_rec { uint64_t id; - union ibv_gid gid; + union rdmav_gid gid; uint16_t pkey; /* uint16_t resv; */ uint32_t lease; @@ -99,4 +99,4 @@ struct ibv_sa_service_rec { uint64_t data64[2]; }; -#endif /* INFINIBAND_SA_H */ +#endif /* RDMAV_SA_H */ diff -ruNp ORG/libibverbs/include/infiniband/verbs.h NEW/libibverbs/include/infiniband/verbs.h --- ORG/libibverbs/include/infiniband/verbs.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/verbs.h 2006-08-03 17:29:08.000000000 -0700 @@ -35,11 +35,12 @@ * $Id: verbs.h 8076 2006-06-16 18:26:34Z sean.hefty $ */ -#ifndef INFINIBAND_VERBS_H -#define INFINIBAND_VERBS_H +#ifndef RDMAV_VERBS_H +#define RDMAV_VERBS_H #include #include +#include "deprecate.h" #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -57,7 +58,7 @@ BEGIN_C_DECLS -union ibv_gid { +union rdmav_gid { uint8_t raw[16]; struct { uint64_t subnet_prefix; @@ -65,37 +66,37 @@ union ibv_gid { } global; }; -enum ibv_node_type { - IBV_NODE_CA = 1, - IBV_NODE_SWITCH, - IBV_NODE_ROUTER -}; - -enum ibv_device_cap_flags { - IBV_DEVICE_RESIZE_MAX_WR = 1, - IBV_DEVICE_BAD_PKEY_CNTR = 1 << 1, - IBV_DEVICE_BAD_QKEY_CNTR = 1 << 2, - IBV_DEVICE_RAW_MULTI = 1 << 3, - IBV_DEVICE_AUTO_PATH_MIG = 1 << 4, - IBV_DEVICE_CHANGE_PHY_PORT = 1 << 5, - IBV_DEVICE_UD_AV_PORT_ENFORCE = 1 << 6, - IBV_DEVICE_CURR_QP_STATE_MOD = 1 << 7, - IBV_DEVICE_SHUTDOWN_PORT = 1 << 8, - IBV_DEVICE_INIT_TYPE = 1 << 9, - IBV_DEVICE_PORT_ACTIVE_EVENT = 1 << 10, - IBV_DEVICE_SYS_IMAGE_GUID = 1 << 11, - IBV_DEVICE_RC_RNR_NAK_GEN = 1 << 12, - IBV_DEVICE_SRQ_RESIZE = 1 << 13, - IBV_DEVICE_N_NOTIFY_CQ = 1 << 14 -}; - -enum ibv_atomic_cap { - IBV_ATOMIC_NONE, - IBV_ATOMIC_HCA, - IBV_ATOMIC_GLOB +enum rdmav_node_type { + RDMAV_NODE_CA = 1, + RDMAV_NODE_SWITCH, + RDMAV_NODE_ROUTER +}; + +enum rdmav_device_cap_flags { + RDMAV_DEVICE_RESIZE_MAX_WR = 1, + RDMAV_DEVICE_BAD_PKEY_CNTR = 1 << 1, + RDMAV_DEVICE_BAD_QKEY_CNTR = 1 << 2, + RDMAV_DEVICE_RAW_MULTI = 1 << 
3, + RDMAV_DEVICE_AUTO_PATH_MIG = 1 << 4, + RDMAV_DEVICE_CHANGE_PHY_PORT = 1 << 5, + RDMAV_DEVICE_UD_AV_PORT_ENFORCE = 1 << 6, + RDMAV_DEVICE_CURR_QP_STATE_MOD = 1 << 7, + RDMAV_DEVICE_SHUTDOWN_PORT = 1 << 8, + RDMAV_DEVICE_INIT_TYPE = 1 << 9, + RDMAV_DEVICE_PORT_ACTIVE_EVENT = 1 << 10, + RDMAV_DEVICE_SYS_IMAGE_GUID = 1 << 11, + RDMAV_DEVICE_RC_RNR_NAK_GEN = 1 << 12, + RDMAV_DEVICE_SRQ_RESIZE = 1 << 13, + RDMAV_DEVICE_N_NOTIFY_CQ = 1 << 14 +}; + +enum rdmav_atomic_cap { + RDMAV_ATOMIC_NONE, + RDMAV_ATOMIC_HCA, + RDMAV_ATOMIC_GLOB }; -struct ibv_device_attr { +struct rdmav_device_attr { char fw_ver[64]; uint64_t node_guid; uint64_t sys_image_guid; @@ -118,7 +119,7 @@ struct ibv_device_attr { int max_res_rd_atom; int max_qp_init_rd_atom; int max_ee_init_rd_atom; - enum ibv_atomic_cap atomic_cap; + enum rdmav_atomic_cap atomic_cap; int max_ee; int max_rdd; int max_mw; @@ -138,27 +139,27 @@ struct ibv_device_attr { uint8_t phys_port_cnt; }; -enum ibv_mtu { - IBV_MTU_256 = 1, - IBV_MTU_512 = 2, - IBV_MTU_1024 = 3, - IBV_MTU_2048 = 4, - IBV_MTU_4096 = 5 -}; - -enum ibv_port_state { - IBV_PORT_NOP = 0, - IBV_PORT_DOWN = 1, - IBV_PORT_INIT = 2, - IBV_PORT_ARMED = 3, - IBV_PORT_ACTIVE = 4, - IBV_PORT_ACTIVE_DEFER = 5 -}; - -struct ibv_port_attr { - enum ibv_port_state state; - enum ibv_mtu max_mtu; - enum ibv_mtu active_mtu; +enum rdmav_mtu { + RDMAV_MTU_256 = 1, + RDMAV_MTU_512 = 2, + RDMAV_MTU_1024 = 3, + RDMAV_MTU_2048 = 4, + RDMAV_MTU_4096 = 5 +}; + +enum rdmav_port_state { + RDMAV_PORT_NOP = 0, + RDMAV_PORT_DOWN = 1, + RDMAV_PORT_INIT = 2, + RDMAV_PORT_ARMED = 3, + RDMAV_PORT_ACTIVE = 4, + RDMAV_PORT_ACTIVE_DEFER = 5 +}; + +struct rdmav_port_attr { + enum rdmav_port_state state; + enum rdmav_mtu max_mtu; + enum rdmav_mtu active_mtu; int gid_tbl_len; uint32_t port_cap_flags; uint32_t max_msg_sz; @@ -177,165 +178,165 @@ struct ibv_port_attr { uint8_t phys_state; }; -enum ibv_event_type { - IBV_EVENT_CQ_ERR, - IBV_EVENT_QP_FATAL, - IBV_EVENT_QP_REQ_ERR, - IBV_EVENT_QP_ACCESS_ERR, - IBV_EVENT_COMM_EST, - IBV_EVENT_SQ_DRAINED, - IBV_EVENT_PATH_MIG, - IBV_EVENT_PATH_MIG_ERR, - IBV_EVENT_DEVICE_FATAL, - IBV_EVENT_PORT_ACTIVE, - IBV_EVENT_PORT_ERR, - IBV_EVENT_LID_CHANGE, - IBV_EVENT_PKEY_CHANGE, - IBV_EVENT_SM_CHANGE, - IBV_EVENT_SRQ_ERR, - IBV_EVENT_SRQ_LIMIT_REACHED, - IBV_EVENT_QP_LAST_WQE_REACHED, - IBV_EVENT_CLIENT_REREGISTER +enum rdmav_event_type { + RDMAV_EVENT_CQ_ERR, + RDMAV_EVENT_QP_FATAL, + RDMAV_EVENT_QP_REQ_ERR, + RDMAV_EVENT_QP_ACCESS_ERR, + RDMAV_EVENT_COMM_EST, + RDMAV_EVENT_SQ_DRAINED, + RDMAV_EVENT_PATH_MIG, + RDMAV_EVENT_PATH_MIG_ERR, + RDMAV_EVENT_DEVICE_FATAL, + RDMAV_EVENT_PORT_ACTIVE, + RDMAV_EVENT_PORT_ERR, + RDMAV_EVENT_LID_CHANGE, + RDMAV_EVENT_PKEY_CHANGE, + RDMAV_EVENT_SM_CHANGE, + RDMAV_EVENT_SRQ_ERR, + RDMAV_EVENT_SRQ_LIMIT_REACHED, + RDMAV_EVENT_QP_LAST_WQE_REACHED, + RDMAV_EVENT_CLIENT_REREGISTER }; -struct ibv_async_event { +struct rdmav_async_event { union { - struct ibv_cq *cq; - struct ibv_qp *qp; - struct ibv_srq *srq; + struct rdmav_cq *cq; + struct rdmav_qp *qp; + struct rdmav_srq *srq; int port_num; } element; - enum ibv_event_type event_type; + enum rdmav_event_type event_type; }; -enum ibv_wc_status { - IBV_WC_SUCCESS, - IBV_WC_LOC_LEN_ERR, - IBV_WC_LOC_QP_OP_ERR, - IBV_WC_LOC_EEC_OP_ERR, - IBV_WC_LOC_PROT_ERR, - IBV_WC_WR_FLUSH_ERR, - IBV_WC_MW_BIND_ERR, - IBV_WC_BAD_RESP_ERR, - IBV_WC_LOC_ACCESS_ERR, - IBV_WC_REM_INV_REQ_ERR, - IBV_WC_REM_ACCESS_ERR, - IBV_WC_REM_OP_ERR, - IBV_WC_RETRY_EXC_ERR, - IBV_WC_RNR_RETRY_EXC_ERR, - IBV_WC_LOC_RDD_VIOL_ERR, - 
IBV_WC_REM_INV_RD_REQ_ERR, - IBV_WC_REM_ABORT_ERR, - IBV_WC_INV_EECN_ERR, - IBV_WC_INV_EEC_STATE_ERR, - IBV_WC_FATAL_ERR, - IBV_WC_RESP_TIMEOUT_ERR, - IBV_WC_GENERAL_ERR -}; - -enum ibv_wc_opcode { - IBV_WC_SEND, - IBV_WC_RDMA_WRITE, - IBV_WC_RDMA_READ, - IBV_WC_COMP_SWAP, - IBV_WC_FETCH_ADD, - IBV_WC_BIND_MW, +enum rdmav_wc_status { + RDMAV_WC_SUCCESS, + RDMAV_WC_LOC_LEN_ERR, + RDMAV_WC_LOC_QP_OP_ERR, + RDMAV_WC_LOC_EEC_OP_ERR, + RDMAV_WC_LOC_PROT_ERR, + RDMAV_WC_WR_FLUSH_ERR, + RDMAV_WC_MW_BIND_ERR, + RDMAV_WC_BAD_RESP_ERR, + RDMAV_WC_LOC_ACCESS_ERR, + RDMAV_WC_REM_INV_REQ_ERR, + RDMAV_WC_REM_ACCESS_ERR, + RDMAV_WC_REM_OP_ERR, + RDMAV_WC_RETRY_EXC_ERR, + RDMAV_WC_RNR_RETRY_EXC_ERR, + RDMAV_WC_LOC_RDD_VIOL_ERR, + RDMAV_WC_REM_INV_RD_REQ_ERR, + RDMAV_WC_REM_ABORT_ERR, + RDMAV_WC_INV_EECN_ERR, + RDMAV_WC_INV_EEC_STATE_ERR, + RDMAV_WC_FATAL_ERR, + RDMAV_WC_RESP_TIMEOUT_ERR, + RDMAV_WC_GENERAL_ERR +}; + +enum rdmav_wc_opcode { + RDMAV_WC_SEND, + RDMAV_WC_RDMA_WRITE, + RDMAV_WC_RDMA_READ, + RDMAV_WC_COMP_SWAP, + RDMAV_WC_FETCH_ADD, + RDMAV_WC_BIND_MW, /* - * Set value of IBV_WC_RECV so consumers can test if a completion is a - * receive by testing (opcode & IBV_WC_RECV). + * Set value of RDMAV_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & RDMAV_WC_RECV). */ - IBV_WC_RECV = 1 << 7, - IBV_WC_RECV_RDMA_WITH_IMM + RDMAV_WC_RECV = 1 << 7, + RDMAV_WC_RECV_RDMA_WITH_IMM }; -enum ibv_wc_flags { - IBV_WC_GRH = 1 << 0, - IBV_WC_WITH_IMM = 1 << 1 +enum rdmav_wc_flags { + RDMAV_WC_GRH = 1 << 0, + RDMAV_WC_WITH_IMM = 1 << 1 }; -struct ibv_wc { +struct rdmav_wc { uint64_t wr_id; - enum ibv_wc_status status; - enum ibv_wc_opcode opcode; + enum rdmav_wc_status status; + enum rdmav_wc_opcode opcode; uint32_t vendor_err; uint32_t byte_len; uint32_t imm_data; /* in network byte order */ uint32_t qp_num; uint32_t src_qp; - enum ibv_wc_flags wc_flags; + enum rdmav_wc_flags wc_flags; uint16_t pkey_index; uint16_t slid; uint8_t sl; uint8_t dlid_path_bits; }; -enum ibv_access_flags { - IBV_ACCESS_LOCAL_WRITE = 1, - IBV_ACCESS_REMOTE_WRITE = (1<<1), - IBV_ACCESS_REMOTE_READ = (1<<2), - IBV_ACCESS_REMOTE_ATOMIC = (1<<3), - IBV_ACCESS_MW_BIND = (1<<4) +enum rdmav_access_flags { + RDMAV_ACCESS_LOCAL_WRITE = 1, + RDMAV_ACCESS_REMOTE_WRITE = (1<<1), + RDMAV_ACCESS_REMOTE_READ = (1<<2), + RDMAV_ACCESS_REMOTE_ATOMIC = (1<<3), + RDMAV_ACCESS_MW_BIND = (1<<4) }; -struct ibv_pd { - struct ibv_context *context; +struct rdmav_pd { + struct rdmav_context *context; uint32_t handle; }; -struct ibv_mr { - struct ibv_context *context; - struct ibv_pd *pd; +struct rdmav_mr { + struct rdmav_context *context; + struct rdmav_pd *pd; uint32_t handle; uint32_t lkey; uint32_t rkey; }; -struct ibv_global_route { - union ibv_gid dgid; +struct rdmav_global_route { + union rdmav_gid dgid; uint32_t flow_label; uint8_t sgid_index; uint8_t hop_limit; uint8_t traffic_class; }; -struct ibv_grh { +struct rdmav_grh { uint32_t version_tclass_flow; uint16_t paylen; uint8_t next_hdr; uint8_t hop_limit; - union ibv_gid sgid; - union ibv_gid dgid; + union rdmav_gid sgid; + union rdmav_gid dgid; }; -enum ibv_rate { - IBV_RATE_MAX = 0, - IBV_RATE_2_5_GBPS = 2, - IBV_RATE_5_GBPS = 5, - IBV_RATE_10_GBPS = 3, - IBV_RATE_20_GBPS = 6, - IBV_RATE_30_GBPS = 4, - IBV_RATE_40_GBPS = 7, - IBV_RATE_60_GBPS = 8, - IBV_RATE_80_GBPS = 9, - IBV_RATE_120_GBPS = 10 +enum rdmav_rate { + RDMAV_RATE_MAX = 0, + RDMAV_RATE_2_5_GBPS = 2, + RDMAV_RATE_5_GBPS = 5, + RDMAV_RATE_10_GBPS = 3, + RDMAV_RATE_20_GBPS = 6, + RDMAV_RATE_30_GBPS = 4, + 
RDMAV_RATE_40_GBPS = 7, + RDMAV_RATE_60_GBPS = 8, + RDMAV_RATE_80_GBPS = 9, + RDMAV_RATE_120_GBPS = 10 }; /** - * ibv_rate_to_mult - Convert the IB rate enum to a multiple of the - * base rate of 2.5 Gbit/sec. For example, IBV_RATE_5_GBPS will be + * rdmav_rate_to_mult - Convert the IB rate enum to a multiple of the + * base rate of 2.5 Gbit/sec. For example, RDMAV_RATE_5_GBPS will be * converted to 2, since 5 Gbit/sec is 2 * 2.5 Gbit/sec. * @rate: rate to convert. */ -int ibv_rate_to_mult(enum ibv_rate rate) __attribute_const; +int rdmav_rate_to_mult(enum rdmav_rate rate) __attribute_const; /** - * mult_to_ibv_rate - Convert a multiple of 2.5 Gbit/sec to an IB rate enum. + * mult_to_rdmav_rate - Convert a multiple of 2.5 Gbit/sec to an IB rate enum. * @mult: multiple to convert. */ -enum ibv_rate mult_to_ibv_rate(int mult) __attribute_const; +enum rdmav_rate mult_to_rdmav_rate(int mult) __attribute_const; -struct ibv_ah_attr { - struct ibv_global_route grh; +struct rdmav_ah_attr { + struct rdmav_global_route grh; uint16_t dlid; uint8_t sl; uint8_t src_path_bits; @@ -344,29 +345,29 @@ struct ibv_ah_attr { uint8_t port_num; }; -enum ibv_srq_attr_mask { - IBV_SRQ_MAX_WR = 1 << 0, - IBV_SRQ_LIMIT = 1 << 1 +enum rdmav_srq_attr_mask { + RDMAV_SRQ_MAX_WR = 1 << 0, + RDMAV_SRQ_LIMIT = 1 << 1 }; -struct ibv_srq_attr { +struct rdmav_srq_attr { uint32_t max_wr; uint32_t max_sge; uint32_t srq_limit; }; -struct ibv_srq_init_attr { +struct rdmav_srq_init_attr { void *srq_context; - struct ibv_srq_attr attr; + struct rdmav_srq_attr attr; }; -enum ibv_qp_type { - IBV_QPT_RC = 2, - IBV_QPT_UC, - IBV_QPT_UD +enum rdmav_qp_type { + RDMAV_QPT_RC = 2, + RDMAV_QPT_UC, + RDMAV_QPT_UD }; -struct ibv_qp_cap { +struct rdmav_qp_cap { uint32_t max_send_wr; uint32_t max_recv_wr; uint32_t max_send_sge; @@ -374,69 +375,69 @@ struct ibv_qp_cap { uint32_t max_inline_data; }; -struct ibv_qp_init_attr { +struct rdmav_qp_init_attr { void *qp_context; - struct ibv_cq *send_cq; - struct ibv_cq *recv_cq; - struct ibv_srq *srq; - struct ibv_qp_cap cap; - enum ibv_qp_type qp_type; + struct rdmav_cq *send_cq; + struct rdmav_cq *recv_cq; + struct rdmav_srq *srq; + struct rdmav_qp_cap cap; + enum rdmav_qp_type qp_type; int sq_sig_all; }; -enum ibv_qp_attr_mask { - IBV_QP_STATE = 1 << 0, - IBV_QP_CUR_STATE = 1 << 1, - IBV_QP_EN_SQD_ASYNC_NOTIFY = 1 << 2, - IBV_QP_ACCESS_FLAGS = 1 << 3, - IBV_QP_PKEY_INDEX = 1 << 4, - IBV_QP_PORT = 1 << 5, - IBV_QP_QKEY = 1 << 6, - IBV_QP_AV = 1 << 7, - IBV_QP_PATH_MTU = 1 << 8, - IBV_QP_TIMEOUT = 1 << 9, - IBV_QP_RETRY_CNT = 1 << 10, - IBV_QP_RNR_RETRY = 1 << 11, - IBV_QP_RQ_PSN = 1 << 12, - IBV_QP_MAX_QP_RD_ATOMIC = 1 << 13, - IBV_QP_ALT_PATH = 1 << 14, - IBV_QP_MIN_RNR_TIMER = 1 << 15, - IBV_QP_SQ_PSN = 1 << 16, - IBV_QP_MAX_DEST_RD_ATOMIC = 1 << 17, - IBV_QP_PATH_MIG_STATE = 1 << 18, - IBV_QP_CAP = 1 << 19, - IBV_QP_DEST_QPN = 1 << 20 -}; - -enum ibv_qp_state { - IBV_QPS_RESET, - IBV_QPS_INIT, - IBV_QPS_RTR, - IBV_QPS_RTS, - IBV_QPS_SQD, - IBV_QPS_SQE, - IBV_QPS_ERR -}; - -enum ibv_mig_state { - IBV_MIG_MIGRATED, - IBV_MIG_REARM, - IBV_MIG_ARMED -}; - -struct ibv_qp_attr { - enum ibv_qp_state qp_state; - enum ibv_qp_state cur_qp_state; - enum ibv_mtu path_mtu; - enum ibv_mig_state path_mig_state; +enum rdmav_qp_attr_mask { + RDMAV_QP_STATE = 1 << 0, + RDMAV_QP_CUR_STATE = 1 << 1, + RDMAV_QP_EN_SQD_ASYNC_NOTIFY = 1 << 2, + RDMAV_QP_ACCESS_FLAGS = 1 << 3, + RDMAV_QP_PKEY_INDEX = 1 << 4, + RDMAV_QP_PORT = 1 << 5, + RDMAV_QP_QKEY = 1 << 6, + RDMAV_QP_AV = 1 << 7, + RDMAV_QP_PATH_MTU = 1 << 8, + 
RDMAV_QP_TIMEOUT = 1 << 9, + RDMAV_QP_RETRY_CNT = 1 << 10, + RDMAV_QP_RNR_RETRY = 1 << 11, + RDMAV_QP_RQ_PSN = 1 << 12, + RDMAV_QP_MAX_QP_RD_ATOMIC = 1 << 13, + RDMAV_QP_ALT_PATH = 1 << 14, + RDMAV_QP_MIN_RNR_TIMER = 1 << 15, + RDMAV_QP_SQ_PSN = 1 << 16, + RDMAV_QP_MAX_DEST_RD_ATOMIC = 1 << 17, + RDMAV_QP_PATH_MIG_STATE = 1 << 18, + RDMAV_QP_CAP = 1 << 19, + RDMAV_QP_DEST_QPN = 1 << 20 +}; + +enum rdmav_qp_state { + RDMAV_QPS_RESET, + RDMAV_QPS_INIT, + RDMAV_QPS_RTR, + RDMAV_QPS_RTS, + RDMAV_QPS_SQD, + RDMAV_QPS_SQE, + RDMAV_QPS_ERR +}; + +enum rdmav_mig_state { + RDMAV_MIG_MIGRATED, + RDMAV_MIG_REARM, + RDMAV_MIG_ARMED +}; + +struct rdmav_qp_attr { + enum rdmav_qp_state qp_state; + enum rdmav_qp_state cur_qp_state; + enum rdmav_mtu path_mtu; + enum rdmav_mig_state path_mig_state; uint32_t qkey; uint32_t rq_psn; uint32_t sq_psn; uint32_t dest_qp_num; int qp_access_flags; - struct ibv_qp_cap cap; - struct ibv_ah_attr ah_attr; - struct ibv_ah_attr alt_ah_attr; + struct rdmav_qp_cap cap; + struct rdmav_ah_attr ah_attr; + struct rdmav_ah_attr alt_ah_attr; uint16_t pkey_index; uint16_t alt_pkey_index; uint8_t en_sqd_async_notify; @@ -452,36 +453,36 @@ struct ibv_qp_attr { uint8_t alt_timeout; }; -enum ibv_wr_opcode { - IBV_WR_RDMA_WRITE, - IBV_WR_RDMA_WRITE_WITH_IMM, - IBV_WR_SEND, - IBV_WR_SEND_WITH_IMM, - IBV_WR_RDMA_READ, - IBV_WR_ATOMIC_CMP_AND_SWP, - IBV_WR_ATOMIC_FETCH_AND_ADD -}; - -enum ibv_send_flags { - IBV_SEND_FENCE = 1 << 0, - IBV_SEND_SIGNALED = 1 << 1, - IBV_SEND_SOLICITED = 1 << 2, - IBV_SEND_INLINE = 1 << 3 +enum rdmav_wr_opcode { + RDMAV_WR_RDMA_WRITE, + RDMAV_WR_RDMA_WRITE_WITH_IMM, + RDMAV_WR_SEND, + RDMAV_WR_SEND_WITH_IMM, + RDMAV_WR_RDMA_READ, + RDMAV_WR_ATOMIC_CMP_AND_SWP, + RDMAV_WR_ATOMIC_FETCH_AND_ADD +}; + +enum rdmav_send_flags { + RDMAV_SEND_FENCE = 1 << 0, + RDMAV_SEND_SIGNALED = 1 << 1, + RDMAV_SEND_SOLICITED = 1 << 2, + RDMAV_SEND_INLINE = 1 << 3 }; -struct ibv_sge { +struct rdmav_sge { uint64_t addr; uint32_t length; uint32_t lkey; }; -struct ibv_send_wr { - struct ibv_send_wr *next; +struct rdmav_send_wr { + struct rdmav_send_wr *next; uint64_t wr_id; - struct ibv_sge *sg_list; + struct rdmav_sge *sg_list; int num_sge; - enum ibv_wr_opcode opcode; - enum ibv_send_flags send_flags; + enum rdmav_wr_opcode opcode; + enum rdmav_send_flags send_flags; uint32_t imm_data; /* in network byte order */ union { struct { @@ -495,24 +496,24 @@ struct ibv_send_wr { uint32_t rkey; } atomic; struct { - struct ibv_ah *ah; + struct rdmav_ah *ah; uint32_t remote_qpn; uint32_t remote_qkey; } ud; } wr; }; -struct ibv_recv_wr { - struct ibv_recv_wr *next; +struct rdmav_recv_wr { + struct rdmav_recv_wr *next; uint64_t wr_id; - struct ibv_sge *sg_list; + struct rdmav_sge *sg_list; int num_sge; }; -struct ibv_srq { - struct ibv_context *context; +struct rdmav_srq { + struct rdmav_context *context; void *srq_context; - struct ibv_pd *pd; + struct rdmav_pd *pd; uint32_t handle; pthread_mutex_t mutex; @@ -520,29 +521,29 @@ struct ibv_srq { uint32_t events_completed; }; -struct ibv_qp { - struct ibv_context *context; +struct rdmav_qp { + struct rdmav_context *context; void *qp_context; - struct ibv_pd *pd; - struct ibv_cq *send_cq; - struct ibv_cq *recv_cq; - struct ibv_srq *srq; + struct rdmav_pd *pd; + struct rdmav_cq *send_cq; + struct rdmav_cq *recv_cq; + struct rdmav_srq *srq; uint32_t handle; uint32_t qp_num; - enum ibv_qp_state state; - enum ibv_qp_type qp_type; + enum rdmav_qp_state state; + enum rdmav_qp_type qp_type; pthread_mutex_t mutex; pthread_cond_t cond; uint32_t 
events_completed; }; -struct ibv_comp_channel { +struct rdmav_comp_channel { int fd; }; -struct ibv_cq { - struct ibv_context *context; +struct rdmav_cq { + struct rdmav_context *context; void *cq_context; uint32_t handle; int cqe; @@ -553,89 +554,89 @@ struct ibv_cq { uint32_t async_events_completed; }; -struct ibv_ah { - struct ibv_context *context; - struct ibv_pd *pd; +struct rdmav_ah { + struct rdmav_context *context; + struct rdmav_pd *pd; uint32_t handle; }; -struct ibv_device; -struct ibv_context; +struct rdmav_device; +struct rdmav_context; -struct ibv_device_ops { - struct ibv_context * (*alloc_context)(struct ibv_device *device, int cmd_fd); - void (*free_context)(struct ibv_context *context); +struct rdmav_device_ops { + struct rdmav_context * (*alloc_context)(struct rdmav_device *device, int cmd_fd); + void (*free_context)(struct rdmav_context *context); }; enum { - IBV_SYSFS_NAME_MAX = 64, - IBV_SYSFS_PATH_MAX = 256 + RDMAV_SYSFS_NAME_MAX = 64, + RDMAV_SYSFS_PATH_MAX = 256 }; -struct ibv_device { - struct ibv_driver *driver; - struct ibv_device_ops ops; +struct rdmav_device { + struct rdmav_driver *driver; + struct rdmav_device_ops ops; /* Name of underlying kernel IB device, eg "mthca0" */ - char name[IBV_SYSFS_NAME_MAX]; + char name[RDMAV_SYSFS_NAME_MAX]; /* Name of uverbs device, eg "uverbs0" */ - char dev_name[IBV_SYSFS_NAME_MAX]; + char dev_name[RDMAV_SYSFS_NAME_MAX]; /* Path to infiniband_verbs class device in sysfs */ - char dev_path[IBV_SYSFS_PATH_MAX]; + char dev_path[RDMAV_SYSFS_PATH_MAX]; /* Path to infiniband class device in sysfs */ - char ibdev_path[IBV_SYSFS_PATH_MAX]; + char ibdev_path[RDMAV_SYSFS_PATH_MAX]; }; -struct ibv_context_ops { - int (*query_device)(struct ibv_context *context, - struct ibv_device_attr *device_attr); - int (*query_port)(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr); - struct ibv_pd * (*alloc_pd)(struct ibv_context *context); - int (*dealloc_pd)(struct ibv_pd *pd); - struct ibv_mr * (*reg_mr)(struct ibv_pd *pd, void *addr, size_t length, - enum ibv_access_flags access); - int (*dereg_mr)(struct ibv_mr *mr); - struct ibv_cq * (*create_cq)(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, +struct rdmav_context_ops { + int (*query_device)(struct rdmav_context *context, + struct rdmav_device_attr *device_attr); + int (*query_port)(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr); + struct rdmav_pd * (*alloc_pd)(struct rdmav_context *context); + int (*dealloc_pd)(struct rdmav_pd *pd); + struct rdmav_mr * (*reg_mr)(struct rdmav_pd *pd, void *addr, size_t length, + enum rdmav_access_flags access); + int (*dereg_mr)(struct rdmav_mr *mr); + struct rdmav_cq * (*create_cq)(struct rdmav_context *context, int cqe, + struct rdmav_comp_channel *channel, int comp_vector); - int (*poll_cq)(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc); - int (*req_notify_cq)(struct ibv_cq *cq, int solicited_only); - void (*cq_event)(struct ibv_cq *cq); - int (*resize_cq)(struct ibv_cq *cq, int cqe); - int (*destroy_cq)(struct ibv_cq *cq); - struct ibv_srq * (*create_srq)(struct ibv_pd *pd, - struct ibv_srq_init_attr *srq_init_attr); - int (*modify_srq)(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask); - int (*query_srq)(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr); - int (*destroy_srq)(struct ibv_srq *srq); - int (*post_srq_recv)(struct ibv_srq *srq, - struct ibv_recv_wr *recv_wr, - struct ibv_recv_wr 
**bad_recv_wr); - struct ibv_qp * (*create_qp)(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); - int (*query_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *init_attr); - int (*modify_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask); - int (*destroy_qp)(struct ibv_qp *qp); - int (*post_send)(struct ibv_qp *qp, struct ibv_send_wr *wr, - struct ibv_send_wr **bad_wr); - int (*post_recv)(struct ibv_qp *qp, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr); - struct ibv_ah * (*create_ah)(struct ibv_pd *pd, struct ibv_ah_attr *attr); - int (*destroy_ah)(struct ibv_ah *ah); - int (*attach_mcast)(struct ibv_qp *qp, union ibv_gid *gid, + int (*poll_cq)(struct rdmav_cq *cq, int num_entries, struct rdmav_wc *wc); + int (*req_notify_cq)(struct rdmav_cq *cq, int solicited_only); + void (*cq_event)(struct rdmav_cq *cq); + int (*resize_cq)(struct rdmav_cq *cq, int cqe); + int (*destroy_cq)(struct rdmav_cq *cq); + struct rdmav_srq * (*create_srq)(struct rdmav_pd *pd, + struct rdmav_srq_init_attr *srq_init_attr); + int (*modify_srq)(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask); + int (*query_srq)(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr); + int (*destroy_srq)(struct rdmav_srq *srq); + int (*post_srq_recv)(struct rdmav_srq *srq, + struct rdmav_recv_wr *recv_wr, + struct rdmav_recv_wr **bad_recv_wr); + struct rdmav_qp * (*create_qp)(struct rdmav_pd *pd, struct rdmav_qp_init_attr *attr); + int (*query_qp)(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *init_attr); + int (*modify_qp)(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask); + int (*destroy_qp)(struct rdmav_qp *qp); + int (*post_send)(struct rdmav_qp *qp, struct rdmav_send_wr *wr, + struct rdmav_send_wr **bad_wr); + int (*post_recv)(struct rdmav_qp *qp, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr); + struct rdmav_ah * (*create_ah)(struct rdmav_pd *pd, struct rdmav_ah_attr *attr); + int (*destroy_ah)(struct rdmav_ah *ah); + int (*attach_mcast)(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); - int (*detach_mcast)(struct ibv_qp *qp, union ibv_gid *gid, + int (*detach_mcast)(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); }; -struct ibv_context { - struct ibv_device *device; - struct ibv_context_ops ops; +struct rdmav_context { + struct rdmav_device *device; + struct rdmav_context_ops ops; int cmd_fd; int async_fd; int num_comp_vectors; @@ -643,124 +644,124 @@ struct ibv_context { }; /** - * ibv_get_device_list - Get list of IB devices currently available + * rdmav_get_device_list - Get list of IB devices currently available * @num_devices: optional. if non-NULL, set to the number of devices * returned in the array. * * Return a NULL-terminated array of IB devices. The array can be - * released with ibv_free_device_list(). + * released with rdmav_free_device_list(). */ -struct ibv_device **ibv_get_device_list(int *num_devices); +struct rdmav_device **rdmav_get_device_list(int *num_devices); /** - * ibv_free_device_list - Free list from ibv_get_device_list() + * rdmav_free_device_list - Free list from rdmav_get_device_list() * - * Free an array of devices returned from ibv_get_device_list(). Once + * Free an array of devices returned from rdmav_get_device_list(). 
Once * the array is freed, pointers to devices that were not opened with - * ibv_open_device() are no longer valid. Client code must open all - * devices it intends to use before calling ibv_free_device_list(). + * rdmav_open_device() are no longer valid. Client code must open all + * devices it intends to use before calling rdmav_free_device_list(). */ -void ibv_free_device_list(struct ibv_device **list); +void rdmav_free_device_list(struct rdmav_device **list); /** - * ibv_get_device_name - Return kernel device name + * rdmav_get_device_name - Return kernel device name */ -const char *ibv_get_device_name(struct ibv_device *device); +const char *rdmav_get_device_name(struct rdmav_device *device); /** - * ibv_get_device_guid - Return device's node GUID + * rdmav_get_device_guid - Return device's node GUID */ -uint64_t ibv_get_device_guid(struct ibv_device *device); +uint64_t rdmav_get_device_guid(struct rdmav_device *device); /** - * ibv_open_device - Initialize device for use + * rdmav_open_device - Initialize device for use */ -struct ibv_context *ibv_open_device(struct ibv_device *device); +struct rdmav_context *rdmav_open_device(struct rdmav_device *device); /** - * ibv_close_device - Release device + * rdmav_close_device - Release device */ -int ibv_close_device(struct ibv_context *context); +int rdmav_close_device(struct rdmav_context *context); /** - * ibv_get_async_event - Get next async event + * rdmav_get_async_event - Get next async event * @event: Pointer to use to return async event * - * All async events returned by ibv_get_async_event() must eventually - * be acknowledged with ibv_ack_async_event(). + * All async events returned by rdmav_get_async_event() must eventually + * be acknowledged with rdmav_ack_async_event(). */ -int ibv_get_async_event(struct ibv_context *context, - struct ibv_async_event *event); +int rdmav_get_async_event(struct rdmav_context *context, + struct rdmav_async_event *event); /** - * ibv_ack_async_event - Acknowledge an async event + * rdmav_ack_async_event - Acknowledge an async event * @event: Event to be acknowledged. * - * All async events which are returned by ibv_get_async_event() must + * All async events which are returned by rdmav_get_async_event() must * be acknowledged. To avoid races, destroying an object (CQ, SRQ or * QP) will wait for all affiliated events to be acknowledged, so * there should be a one-to-one correspondence between acks and * successful gets. 
*/ -void ibv_ack_async_event(struct ibv_async_event *event); +void rdmav_ack_async_event(struct rdmav_async_event *event); /** - * ibv_query_device - Get device properties + * rdmav_query_device - Get device properties */ -int ibv_query_device(struct ibv_context *context, - struct ibv_device_attr *device_attr); +int rdmav_query_device(struct rdmav_context *context, + struct rdmav_device_attr *device_attr); /** - * ibv_query_port - Get port properties + * rdmav_query_port - Get port properties */ -int ibv_query_port(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr); +int rdmav_query_port(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr); /** - * ibv_query_gid - Get a GID table entry + * rdmav_query_gid - Get a GID table entry */ -int ibv_query_gid(struct ibv_context *context, uint8_t port_num, - int index, union ibv_gid *gid); +int rdmav_query_gid(struct rdmav_context *context, uint8_t port_num, + int index, union rdmav_gid *gid); /** - * ibv_query_pkey - Get a P_Key table entry + * rdmav_query_pkey - Get a P_Key table entry */ -int ibv_query_pkey(struct ibv_context *context, uint8_t port_num, +int rdmav_query_pkey(struct rdmav_context *context, uint8_t port_num, int index, uint16_t *pkey); /** - * ibv_alloc_pd - Allocate a protection domain + * rdmav_alloc_pd - Allocate a protection domain */ -struct ibv_pd *ibv_alloc_pd(struct ibv_context *context); +struct rdmav_pd *rdmav_alloc_pd(struct rdmav_context *context); /** - * ibv_dealloc_pd - Free a protection domain + * rdmav_dealloc_pd - Free a protection domain */ -int ibv_dealloc_pd(struct ibv_pd *pd); +int rdmav_dealloc_pd(struct rdmav_pd *pd); /** - * ibv_reg_mr - Register a memory region + * rdmav_reg_mr - Register a memory region */ -struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, - size_t length, enum ibv_access_flags access); +struct rdmav_mr *rdmav_reg_mr(struct rdmav_pd *pd, void *addr, + size_t length, enum rdmav_access_flags access); /** - * ibv_dereg_mr - Deregister a memory region + * rdmav_dereg_mr - Deregister a memory region */ -int ibv_dereg_mr(struct ibv_mr *mr); +int rdmav_dereg_mr(struct rdmav_mr *mr); /** - * ibv_create_comp_channel - Create a completion event channel + * rdmav_create_comp_channel - Create a completion event channel */ -struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context); +struct rdmav_comp_channel *rdmav_create_comp_channel(struct rdmav_context *context); /** - * ibv_destroy_comp_channel - Destroy a completion event channel + * rdmav_destroy_comp_channel - Destroy a completion event channel */ -int ibv_destroy_comp_channel(struct ibv_comp_channel *channel); +int rdmav_destroy_comp_channel(struct rdmav_comp_channel *channel); /** - * ibv_create_cq - Create a completion queue + * rdmav_create_cq - Create a completion queue * @context - Context CQ will be attached to * @cqe - Minimum number of entries required for CQ * @cq_context - Consumer-supplied context returned for completion events @@ -769,57 +770,57 @@ int ibv_destroy_comp_channel(struct ibv_ * @comp_vector - Completion vector used to signal completion events. * Must be >= 0 and < context->num_comp_vectors. */ -struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe, +struct rdmav_cq *rdmav_create_cq(struct rdmav_context *context, int cqe, void *cq_context, - struct ibv_comp_channel *channel, + struct rdmav_comp_channel *channel, int comp_vector); /** - * ibv_resize_cq - Modifies the capacity of the CQ. 
+ * rdmav_resize_cq - Modifies the capacity of the CQ. * @cq: The CQ to resize. * @cqe: The minimum size of the CQ. * * Users can examine the cq structure to determine the actual CQ size. */ -int ibv_resize_cq(struct ibv_cq *cq, int cqe); +int rdmav_resize_cq(struct rdmav_cq *cq, int cqe); /** - * ibv_destroy_cq - Destroy a completion queue + * rdmav_destroy_cq - Destroy a completion queue */ -int ibv_destroy_cq(struct ibv_cq *cq); +int rdmav_destroy_cq(struct rdmav_cq *cq); /** - * ibv_get_cq_event - Read next CQ event + * rdmav_get_cq_event - Read next CQ event * @channel: Channel to get next event from. * @cq: Used to return pointer to CQ. * @cq_context: Used to return consumer-supplied CQ context. * - * All completion events returned by ibv_get_cq_event() must - * eventually be acknowledged with ibv_ack_cq_events(). + * All completion events returned by rdmav_get_cq_event() must + * eventually be acknowledged with rdmav_ack_cq_events(). */ -int ibv_get_cq_event(struct ibv_comp_channel *channel, - struct ibv_cq **cq, void **cq_context); +int rdmav_get_cq_event(struct rdmav_comp_channel *channel, + struct rdmav_cq **cq, void **cq_context); /** - * ibv_ack_cq_events - Acknowledge CQ completion events + * rdmav_ack_cq_events - Acknowledge CQ completion events * @cq: CQ to acknowledge events for * @nevents: Number of events to acknowledge. * - * All completion events which are returned by ibv_get_cq_event() must - * be acknowledged. To avoid races, ibv_destroy_cq() will wait for + * All completion events which are returned by rdmav_get_cq_event() must + * be acknowledged. To avoid races, rdmav_destroy_cq() will wait for * all completion events to be acknowledged, so there should be a * one-to-one correspondence between acks and successful gets. An * application may accumulate multiple completion events and - * acknowledge them in a single call to ibv_ack_cq_events() by passing + * acknowledge them in a single call to rdmav_ack_cq_events() by passing * the number of events to ack in @nevents. */ -void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents); +void rdmav_ack_cq_events(struct rdmav_cq *cq, unsigned int nevents); /** - * ibv_poll_cq - Poll a CQ for work completions + * rdmav_poll_cq - Poll a CQ for work completions * @cq:the CQ being polled * @num_entries:maximum number of completions to return - * @wc:array of at least @num_entries of &struct ibv_wc where completions + * @wc:array of at least @num_entries of &struct rdmav_wc where completions * will be returned * * Poll a CQ for (possibly multiple) completions. If the return value @@ -828,13 +829,13 @@ void ibv_ack_cq_events(struct ibv_cq *cq * non-negative and strictly less than num_entries, then the CQ was * emptied. */ -static inline int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc) +static inline int rdmav_poll_cq(struct rdmav_cq *cq, int num_entries, struct rdmav_wc *wc) { return cq->context->ops.poll_cq(cq, num_entries, wc); } /** - * ibv_req_notify_cq - Request completion notification on a CQ. An + * rdmav_req_notify_cq - Request completion notification on a CQ. An * event will be added to the completion channel associated with the * CQ when an entry is added to the CQ. * @cq: The completion queue to request notification for. @@ -842,83 +843,83 @@ static inline int ibv_poll_cq(struct ibv * the next solicited CQ entry. If zero, any CQ entry, solicited or * not, will generate an event. 
*/ -static inline int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only) +static inline int rdmav_req_notify_cq(struct rdmav_cq *cq, int solicited_only) { return cq->context->ops.req_notify_cq(cq, solicited_only); } /** - * ibv_create_srq - Creates a SRQ associated with the specified protection + * rdmav_create_srq - Creates a SRQ associated with the specified protection * domain. * @pd: The protection domain associated with the SRQ. * @srq_init_attr: A list of initial attributes required to create the SRQ. * * srq_attr->max_wr and srq_attr->max_sge are read the determine the * requested size of the SRQ, and set to the actual values allocated - * on return. If ibv_create_srq() succeeds, then max_wr and max_sge + * on return. If rdmav_create_srq() succeeds, then max_wr and max_sge * will always be at least as large as the requested values. */ -struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, - struct ibv_srq_init_attr *srq_init_attr); +struct rdmav_srq *rdmav_create_srq(struct rdmav_pd *pd, + struct rdmav_srq_init_attr *srq_init_attr); /** - * ibv_modify_srq - Modifies the attributes for the specified SRQ. + * rdmav_modify_srq - Modifies the attributes for the specified SRQ. * @srq: The SRQ to modify. * @srq_attr: On input, specifies the SRQ attributes to modify. On output, * the current values of selected SRQ attributes are returned. * @srq_attr_mask: A bit-mask used to specify which attributes of the SRQ * are being modified. * - * The mask may contain IBV_SRQ_MAX_WR to resize the SRQ and/or - * IBV_SRQ_LIMIT to set the SRQ's limit and request notification when + * The mask may contain RDMAV_SRQ_MAX_WR to resize the SRQ and/or + * RDMAV_SRQ_LIMIT to set the SRQ's limit and request notification when * the number of receives queued drops below the limit. */ -int ibv_modify_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask); +int rdmav_modify_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask); /** - * ibv_query_srq - Returns the attribute list and current values for the + * rdmav_query_srq - Returns the attribute list and current values for the * specified SRQ. * @srq: The SRQ to query. * @srq_attr: The attributes of the specified SRQ. */ -int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr); +int rdmav_query_srq(struct rdmav_srq *srq, struct rdmav_srq_attr *srq_attr); /** - * ibv_destroy_srq - Destroys the specified SRQ. + * rdmav_destroy_srq - Destroys the specified SRQ. * @srq: The SRQ to destroy. */ -int ibv_destroy_srq(struct ibv_srq *srq); +int rdmav_destroy_srq(struct rdmav_srq *srq); /** - * ibv_post_srq_recv - Posts a list of work requests to the specified SRQ. + * rdmav_post_srq_recv - Posts a list of work requests to the specified SRQ. * @srq: The SRQ to post the work request on. * @recv_wr: A list of work requests to post on the receive queue. * @bad_recv_wr: On an immediate failure, this parameter will reference * the work request that failed to be posted on the QP. */ -static inline int ibv_post_srq_recv(struct ibv_srq *srq, - struct ibv_recv_wr *recv_wr, - struct ibv_recv_wr **bad_recv_wr) +static inline int rdmav_post_srq_recv(struct rdmav_srq *srq, + struct rdmav_recv_wr *recv_wr, + struct rdmav_recv_wr **bad_recv_wr) { return srq->context->ops.post_srq_recv(srq, recv_wr, bad_recv_wr); } /** - * ibv_create_qp - Create a queue pair. + * rdmav_create_qp - Create a queue pair. 
*/ -struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, - struct ibv_qp_init_attr *qp_init_attr); +struct rdmav_qp *rdmav_create_qp(struct rdmav_pd *pd, + struct rdmav_qp_init_attr *qp_init_attr); /** - * ibv_modify_qp - Modify a queue pair. + * rdmav_modify_qp - Modify a queue pair. */ -int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask); +int rdmav_modify_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask); /** - * ibv_query_qp - Returns the attribute list and current values for the + * rdmav_query_qp - Returns the attribute list and current values for the * specified QP. * @qp: The QP to query. * @attr: The attributes of the specified QP. @@ -928,40 +929,40 @@ int ibv_modify_qp(struct ibv_qp *qp, str * The qp_attr_mask may be used to limit the query to gathering only the * selected attributes. */ -int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *init_attr); +int rdmav_query_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *init_attr); /** - * ibv_destroy_qp - Destroy a queue pair. + * rdmav_destroy_qp - Destroy a queue pair. */ -int ibv_destroy_qp(struct ibv_qp *qp); +int rdmav_destroy_qp(struct rdmav_qp *qp); /** - * ibv_post_send - Post a list of work requests to a send queue. + * rdmav_post_send - Post a list of work requests to a send queue. */ -static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, - struct ibv_send_wr **bad_wr) +static inline int rdmav_post_send(struct rdmav_qp *qp, struct rdmav_send_wr *wr, + struct rdmav_send_wr **bad_wr) { return qp->context->ops.post_send(qp, wr, bad_wr); } /** - * ibv_post_recv - Post a list of work requests to a receive queue. + * rdmav_post_recv - Post a list of work requests to a receive queue. */ -static inline int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr) +static inline int rdmav_post_recv(struct rdmav_qp *qp, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr) { return qp->context->ops.post_recv(qp, wr, bad_wr); } /** - * ibv_create_ah - Create an address handle. + * rdmav_create_ah - Create an address handle. */ -struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); +struct rdmav_ah *rdmav_create_ah(struct rdmav_pd *pd, struct rdmav_ah_attr *attr); /** - * ibv_init_ah_from_wc - Initializes address handle attributes from a + * rdmav_init_ah_from_wc - Initializes address handle attributes from a * work completion. * @context: Device context on which the received message arrived. * @port_num: Port on which the received message arrived. @@ -971,12 +972,12 @@ struct ibv_ah *ibv_create_ah(struct ibv_ * @ah_attr: Returned attributes that can be used when creating an address * handle for replying to the message. */ -int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, - struct ibv_wc *wc, struct ibv_grh *grh, - struct ibv_ah_attr *ah_attr); +int rdmav_init_ah_from_wc(struct rdmav_context *context, uint8_t port_num, + struct rdmav_wc *wc, struct rdmav_grh *grh, + struct rdmav_ah_attr *ah_attr); /** - * ibv_create_ah_from_wc - Creates an address handle associated with the + * rdmav_create_ah_from_wc - Creates an address handle associated with the * sender of the specified work completion. * @pd: The protection domain associated with the address handle. 
* @wc: Work completion information associated with a received message. @@ -987,16 +988,16 @@ int ibv_init_ah_from_wc(struct ibv_conte * The address handle is used to reference a local or global destination * in all UD QP post sends. */ -struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, - struct ibv_grh *grh, uint8_t port_num); +struct rdmav_ah *rdmav_create_ah_from_wc(struct rdmav_pd *pd, struct rdmav_wc *wc, + struct rdmav_grh *grh, uint8_t port_num); /** - * ibv_destroy_ah - Destroy an address handle. + * rdmav_destroy_ah - Destroy an address handle. */ -int ibv_destroy_ah(struct ibv_ah *ah); +int rdmav_destroy_ah(struct rdmav_ah *ah); /** - * ibv_attach_mcast - Attaches the specified QP to a multicast group. + * rdmav_attach_mcast - Attaches the specified QP to a multicast group. * @qp: QP to attach to the multicast group. The QP must be a UD QP. * @gid: Multicast group GID. * @lid: Multicast group LID in host byte order. @@ -1006,18 +1007,18 @@ int ibv_destroy_ah(struct ibv_ah *ah); * the fabric appropriately. The port associated with the specified * QP must also be a member of the multicast group. */ -int ibv_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +int rdmav_attach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); /** - * ibv_detach_mcast - Detaches the specified QP from a multicast group. + * rdmav_detach_mcast - Detaches the specified QP from a multicast group. * @qp: QP to detach from the multicast group. * @gid: Multicast group GID. * @lid: Multicast group LID in host byte order. */ -int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +int rdmav_detach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); END_C_DECLS # undef __attribute_const -#endif /* INFINIBAND_VERBS_H */ +#endif /* RDMAV_VERBS_H */ From krkumar2 at in.ibm.com Thu Aug 3 00:11:21 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 12:41:21 +0530 Subject: [openib-general] [PATCH v3 4/6] librdmacm include file changes. In-Reply-To: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803071120.6106.89894.sendpatchset@K50wks273950wss.in.ibm.com> Convert librdmacm include files to use the new libibverbs API. Signed-off-by: Krishna Kumar diff -ruNp ORG/librdmacm/include/rdma/rdma_cma.h NEW/librdmacm/include/rdma/rdma_cma.h --- ORG/librdmacm/include/rdma/rdma_cma.h 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/include/rdma/rdma_cma.h 2006-08-03 17:22:26.000000000 -0700 @@ -68,8 +68,8 @@ enum { }; struct ib_addr { - union ibv_gid sgid; - union ibv_gid dgid; + union rdmav_gid sgid; + union rdmav_gid dgid; uint16_t pkey; }; @@ -83,7 +83,7 @@ struct rdma_addr { struct rdma_route { struct rdma_addr addr; - struct ibv_sa_path_rec *path_rec; + struct rdmav_sa_path_rec *path_rec; int num_paths; }; @@ -92,10 +92,10 @@ struct rdma_event_channel { }; struct rdma_cm_id { - struct ibv_context *verbs; + struct rdmav_context *verbs; struct rdma_event_channel *channel; void *context; - struct ibv_qp *qp; + struct rdmav_qp *qp; struct rdma_route route; enum rdma_port_space ps; uint8_t port_num; @@ -191,8 +191,8 @@ int rdma_resolve_route(struct rdma_cm_id * QPs allocated to an rdma_cm_id will automatically be transitioned by the CMA * through their states. 
*/ -int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, - struct ibv_qp_init_attr *qp_init_attr); +int rdma_create_qp(struct rdma_cm_id *id, struct rdmav_pd *pd, + struct rdmav_qp_init_attr *qp_init_attr); /** * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA @@ -214,7 +214,7 @@ struct rdma_conn_param { /* Fields below ignored if a QP is created on the rdma_cm_id. */ uint8_t srq; uint32_t qp_num; - enum ibv_qp_type qp_type; + enum rdmav_qp_type qp_type; }; /** @@ -341,11 +341,11 @@ static inline uint16_t rdma_get_dst_port * across multiple rdma_cm_id's. * The array must be released by calling rdma_free_devices(). */ -struct ibv_context **rdma_get_devices(int *num_devices); +struct rdmav_context **rdma_get_devices(int *num_devices); /** * rdma_free_devices - Frees the list of devices returned by rdma_get_devices(). */ -void rdma_free_devices(struct ibv_context **list); +void rdma_free_devices(struct rdmav_context **list); #endif /* RDMA_CMA_H */ diff -ruNp ORG/librdmacm/include/rdma/rdma_cma_abi.h NEW/librdmacm/include/rdma/rdma_cma_abi.h --- ORG/librdmacm/include/rdma/rdma_cma_abi.h 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/include/rdma/rdma_cma_abi.h 2006-08-03 17:22:32.000000000 -0700 @@ -123,7 +123,7 @@ struct ucma_abi_query_route { struct ucma_abi_query_route_resp { __u64 node_guid; - struct ibv_kern_path_rec ib_route[2]; + struct rdmav_kern_path_rec ib_route[2]; struct sockaddr_in6 src_addr; struct sockaddr_in6 dst_addr; __u32 num_paths; @@ -194,7 +194,7 @@ struct ucma_abi_leave_mcast { struct ucma_abi_dst_attr_resp { __u32 remote_qpn; __u32 remote_qkey; - struct ibv_kern_ah_attr ah_attr; + struct rdmav_kern_ah_attr ah_attr; }; struct ucma_abi_get_dst_attr { diff -ruNp ORG/librdmacm/include/rdma/rdma_cma_ib.h NEW/librdmacm/include/rdma/rdma_cma_ib.h --- ORG/librdmacm/include/rdma/rdma_cma_ib.h 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/include/rdma/rdma_cma_ib.h 2006-08-03 17:22:39.000000000 -0700 @@ -34,7 +34,7 @@ /* IB specific option names for get/set. */ enum { - IB_PATH_OPTIONS = 1, /* struct ibv_kern_path_rec */ + IB_PATH_OPTIONS = 1, /* struct rdmav_kern_path_rec */ IB_CM_REQ_OPTIONS = 2 /* struct ib_cm_req_opt */ }; @@ -56,7 +56,7 @@ struct ib_cm_req_opt { * Users must have called rdma_connect() to resolve the destination information. */ int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, - struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, + struct rdmav_ah_attr *ah_attr, uint32_t *remote_qpn, uint32_t *remote_qkey); #endif /* RDMA_CMA_IB_H */ From krkumar2 at in.ibm.com Thu Aug 3 00:11:32 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 12:41:32 +0530 Subject: [openib-general] [PATCH v3 5/6] librdmacm source file changes. In-Reply-To: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803071132.6106.45544.sendpatchset@K50wks273950wss.in.ibm.com> Convert librdmacm source files to use the new libibverbs API. Signed-off-by: Krishna Kumar diff -ruNp ORG/librdmacm/configure.in NEW/librdmacm/configure.in --- ORG/librdmacm/configure.in 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/configure.in 2006-08-03 00:02:57.000000000 -0700 @@ -25,8 +25,8 @@ AC_CHECK_SIZEOF(long) dnl Checks for libraries if test "$disable_libcheck" != "yes" then -AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], - AC_MSG_ERROR([ibv_get_device_list() not found. 
librdmacm requires libibverbs.])) +AC_CHECK_LIB(ibverbs, rdmav_get_device_list, [], + AC_MSG_ERROR([rdmav_get_device_list() not found. librdmacm requires libibverbs.])) fi dnl Checks for header files. diff -ruNp ORG/librdmacm/src/cma.c NEW/librdmacm/src/cma.c --- ORG/librdmacm/src/cma.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/src/cma.c 2006-08-03 17:23:08.000000000 -0700 @@ -103,7 +103,7 @@ do { } while (0) struct cma_device { - struct ibv_context *verbs; + struct rdmav_context *verbs; uint64_t guid; int port_cnt; }; @@ -130,7 +130,7 @@ static void ucma_cleanup(void) { if (cma_dev_cnt) { while (cma_dev_cnt) - ibv_close_device(cma_dev_array[--cma_dev_cnt].verbs); + rdmav_close_device(cma_dev_array[--cma_dev_cnt].verbs); free(cma_dev_array); cma_dev_cnt = 0; @@ -141,7 +141,7 @@ static int check_abi_version(void) { char value[8]; - if (ibv_read_sysfs_file(ibv_get_sysfs_path(), + if (rdmav_read_sysfs_file(rdmav_get_sysfs_path(), "class/misc/rdma_cm/abi_version", value, sizeof value) < 0) { /* @@ -167,9 +167,9 @@ static int check_abi_version(void) static int ucma_init(void) { - struct ibv_device **dev_list = NULL; + struct rdmav_device **dev_list = NULL; struct cma_device *cma_dev; - struct ibv_device_attr attr; + struct rdmav_device_attr attr; int i, ret; pthread_mutex_lock(&mut); @@ -180,7 +180,7 @@ static int ucma_init(void) if (ret) goto err; - dev_list = ibv_get_device_list(&cma_dev_cnt); + dev_list = rdmav_get_device_list(&cma_dev_cnt); if (!dev_list) { printf("CMA: unable to get RDMA device list\n"); ret = -ENODEV; @@ -196,15 +196,15 @@ static int ucma_init(void) for (i = 0; dev_list[i]; ++i) { cma_dev = &cma_dev_array[i]; - cma_dev->guid = ibv_get_device_guid(dev_list[i]); - cma_dev->verbs = ibv_open_device(dev_list[i]); + cma_dev->guid = rdmav_get_device_guid(dev_list[i]); + cma_dev->verbs = rdmav_open_device(dev_list[i]); if (!cma_dev->verbs) { printf("CMA: unable to open RDMA device\n"); ret = -ENODEV; goto err; } - ret = ibv_query_device(cma_dev->verbs, &attr); + ret = rdmav_query_device(cma_dev->verbs, &attr); if (ret) { printf("CMA: unable to query RDMA device\n"); goto err; @@ -219,13 +219,13 @@ err: ucma_cleanup(); pthread_mutex_unlock(&mut); if (dev_list) - ibv_free_device_list(dev_list); + rdmav_free_device_list(dev_list); return ret; } -struct ibv_context **rdma_get_devices(int *num_devices) +struct rdmav_context **rdma_get_devices(int *num_devices) { - struct ibv_context **devs = NULL; + struct rdmav_context **devs = NULL; int i; if (!cma_dev_cnt && ucma_init()) @@ -244,7 +244,7 @@ out: return devs; } -void rdma_free_devices(struct ibv_context **list) +void rdma_free_devices(struct rdmav_context **list) { free(list); } @@ -479,7 +479,7 @@ static int ucma_query_route(struct rdma_ id->route.num_paths = resp->num_paths; for (i = 0; i < resp->num_paths; i++) - ibv_copy_path_rec_from_kern(&id->route.path_rec[i], + rdmav_copy_path_rec_from_kern(&id->route.path_rec[i], &resp->ib_route[i]); } @@ -578,11 +578,11 @@ int rdma_resolve_route(struct rdma_cm_id return 0; } -static int rdma_init_qp_attr(struct rdma_cm_id *id, struct ibv_qp_attr *qp_attr, +static int rdma_init_qp_attr(struct rdma_cm_id *id, struct rdmav_qp_attr *qp_attr, int *qp_attr_mask) { struct ucma_abi_init_qp_attr *cmd; - struct ibv_kern_qp_attr *resp; + struct rdmav_kern_qp_attr *resp; struct cma_id_private *id_priv; void *msg; int ret, size; @@ -596,59 +596,59 @@ static int rdma_init_qp_attr(struct rdma if (ret != size) return (ret > 0) ? 
-ENODATA : ret; - ibv_copy_qp_attr_from_kern(qp_attr, resp); + rdmav_copy_qp_attr_from_kern(qp_attr, resp); *qp_attr_mask = resp->qp_attr_mask; return 0; } static int ucma_modify_qp_rtr(struct rdma_cm_id *id) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; int qp_attr_mask, ret; if (!id->qp) return -EINVAL; /* Need to update QP attributes from default values. */ - qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.qp_state = RDMAV_QPS_INIT; ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - ret = ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = rdmav_modify_qp(id->qp, &qp_attr, qp_attr_mask); if (ret) return ret; - qp_attr.qp_state = IBV_QPS_RTR; + qp_attr.qp_state = RDMAV_QPS_RTR; ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); + return rdmav_modify_qp(id->qp, &qp_attr, qp_attr_mask); } static int ucma_modify_qp_rts(struct rdma_cm_id *id) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; int qp_attr_mask, ret; - qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.qp_state = RDMAV_QPS_RTS; ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); + return rdmav_modify_qp(id->qp, &qp_attr, qp_attr_mask); } static int ucma_modify_qp_err(struct rdma_cm_id *id) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; if (!id->qp) return 0; - qp_attr.qp_state = IBV_QPS_ERR; - return ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE); + qp_attr.qp_state = RDMAV_QPS_ERR; + return rdmav_modify_qp(id->qp, &qp_attr, RDMAV_QP_STATE); } static int ucma_find_pkey(struct cma_device *cma_dev, uint8_t port_num, @@ -658,7 +658,7 @@ static int ucma_find_pkey(struct cma_dev uint16_t chk_pkey; for (i = 0, ret = 0; !ret; i++) { - ret = ibv_query_pkey(cma_dev->verbs, port_num, i, &chk_pkey); + ret = rdmav_query_pkey(cma_dev->verbs, port_num, i, &chk_pkey); if (!ret && pkey == chk_pkey) { *pkey_index = (uint16_t) i; return 0; @@ -668,9 +668,9 @@ static int ucma_find_pkey(struct cma_dev return -EINVAL; } -static int ucma_init_ib_qp(struct cma_id_private *id_priv, struct ibv_qp *qp) +static int ucma_init_ib_qp(struct cma_id_private *id_priv, struct rdmav_qp *qp) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; struct ib_addr *ibaddr; int ret; @@ -681,15 +681,15 @@ static int ucma_init_ib_qp(struct cma_id return ret; qp_attr.port_num = id_priv->id.port_num; - qp_attr.qp_state = IBV_QPS_INIT; - qp_attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE; - return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_ACCESS_FLAGS | - IBV_QP_PKEY_INDEX | IBV_QP_PORT); + qp_attr.qp_state = RDMAV_QPS_INIT; + qp_attr.qp_access_flags = RDMAV_ACCESS_LOCAL_WRITE; + return rdmav_modify_qp(qp, &qp_attr, RDMAV_QP_STATE | RDMAV_QP_ACCESS_FLAGS | + RDMAV_QP_PKEY_INDEX | RDMAV_QP_PORT); } -static int ucma_init_ud_qp(struct cma_id_private *id_priv, struct ibv_qp *qp) +static int ucma_init_ud_qp(struct cma_id_private *id_priv, struct rdmav_qp *qp) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; struct ib_addr *ibaddr; int ret; @@ -700,35 +700,35 @@ static int ucma_init_ud_qp(struct cma_id return ret; qp_attr.port_num = id_priv->id.port_num; - qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.qp_state = RDMAV_QPS_INIT; qp_attr.qkey = ntohs(rdma_get_src_port(&id_priv->id)); - ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | - IBV_QP_PORT | IBV_QP_QKEY); + ret = rdmav_modify_qp(qp, &qp_attr, RDMAV_QP_STATE | 
RDMAV_QP_PKEY_INDEX | + RDMAV_QP_PORT | RDMAV_QP_QKEY); if (ret) return ret; - qp_attr.qp_state = IBV_QPS_RTR; - ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE); + qp_attr.qp_state = RDMAV_QPS_RTR; + ret = rdmav_modify_qp(qp, &qp_attr, RDMAV_QP_STATE); if (ret) return ret; - qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.qp_state = RDMAV_QPS_RTS; qp_attr.sq_psn = 0; - return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_SQ_PSN); + return rdmav_modify_qp(qp, &qp_attr, RDMAV_QP_STATE | RDMAV_QP_SQ_PSN); } -int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, - struct ibv_qp_init_attr *qp_init_attr) +int rdma_create_qp(struct rdma_cm_id *id, struct rdmav_pd *pd, + struct rdmav_qp_init_attr *qp_init_attr) { struct cma_id_private *id_priv; - struct ibv_qp *qp; + struct rdmav_qp *qp; int ret; id_priv = container_of(id, struct cma_id_private, id); if (id->verbs != pd->context) return -EINVAL; - qp = ibv_create_qp(pd, qp_init_attr); + qp = rdmav_create_qp(pd, qp_init_attr); if (!qp) return -ENOMEM; @@ -742,19 +742,19 @@ int rdma_create_qp(struct rdma_cm_id *id id->qp = qp; return 0; err: - ibv_destroy_qp(qp); + rdmav_destroy_qp(qp); return ret; } void rdma_destroy_qp(struct rdma_cm_id *id) { - ibv_destroy_qp(id->qp); + rdmav_destroy_qp(id->qp); } static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, struct rdma_conn_param *src, uint32_t qp_num, - enum ibv_qp_type qp_type, uint8_t srq) + enum rdmav_qp_type qp_type, uint8_t srq) { dst->qp_num = qp_num; dst->qp_type = qp_type; @@ -934,7 +934,7 @@ int rdma_leave_multicast(struct rdma_cm_ struct cma_id_private *id_priv; void *msg; int ret, size, addrlen; - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; uint32_t qp_info; addrlen = ucma_addrlen(addr); @@ -951,7 +951,7 @@ int rdma_leave_multicast(struct rdma_cm_ if (ret) goto out; - ret = ibv_detach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); + ret = rdmav_detach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); if (ret) goto out; } @@ -1075,7 +1075,7 @@ static void ucma_process_mcast(struct rd { struct ucma_abi_join_mcast kmc_data; struct rdma_multicast_data *mc_data; - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; uint32_t qp_info; kmc_data = *(struct ucma_abi_join_mcast *) evt->private_data; @@ -1093,7 +1093,7 @@ static void ucma_process_mcast(struct rd if (evt->status) goto err; - evt->status = ibv_attach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); + evt->status = rdmav_attach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); if (evt->status) goto err; return; @@ -1243,7 +1243,7 @@ int rdma_set_option(struct rdma_cm_id *i } int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, - struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, + struct rdmav_ah_attr *ah_attr, uint32_t *remote_qpn, uint32_t *remote_qkey) { struct ucma_abi_dst_attr_resp *resp; @@ -1265,7 +1265,7 @@ int rdma_get_dst_attr(struct rdma_cm_id if (ret != size) return (ret > 0) ? -ENODATA : ret; - ibv_copy_ah_attr_from_kern(ah_attr, &resp->ah_attr); + rdmav_copy_ah_attr_from_kern(ah_attr, &resp->ah_attr); *remote_qpn = resp->remote_qpn; *remote_qkey = resp->remote_qkey; return 0; From krkumar2 at in.ibm.com Thu Aug 3 00:11:40 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 12:41:40 +0530 Subject: [openib-general] [PATCH v3 6/6] librdmacm examples changes. 
In-Reply-To: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803071140.6106.96989.sendpatchset@K50wks273950wss.in.ibm.com> Convert librdmacm examples to use the new API. Signed-off-by: Krishna Kumar diff -ruNp ORG/librdmacm/examples/cmatose.c NEW/librdmacm/examples/cmatose.c --- ORG/librdmacm/examples/cmatose.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/examples/cmatose.c 2006-08-03 17:32:36.000000000 -0700 @@ -62,9 +62,9 @@ struct cmatest_node { int id; struct rdma_cm_id *cma_id; int connected; - struct ibv_pd *pd; - struct ibv_cq *cq; - struct ibv_mr *mr; + struct rdmav_pd *pd; + struct rdmav_cq *cq; + struct rdmav_mr *mr; void *mem; }; @@ -100,8 +100,8 @@ static int create_message(struct cmatest printf("failed message allocation\n"); return -1; } - node->mr = ibv_reg_mr(node->pd, node->mem, message_size, - IBV_ACCESS_LOCAL_WRITE); + node->mr = rdmav_reg_mr(node->pd, node->mem, message_size, + RDMAV_ACCESS_LOCAL_WRITE); if (!node->mr) { printf("failed to reg MR\n"); goto err; @@ -114,10 +114,10 @@ err: static int init_node(struct cmatest_node *node) { - struct ibv_qp_init_attr init_qp_attr; + struct rdmav_qp_init_attr init_qp_attr; int cqe, ret; - node->pd = ibv_alloc_pd(node->cma_id->verbs); + node->pd = rdmav_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; printf("cmatose: unable to allocate PD\n"); @@ -125,7 +125,7 @@ static int init_node(struct cmatest_node } cqe = message_count ? message_count * 2 : 2; - node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + node->cq = rdmav_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; printf("cmatose: unable to create CQ\n"); @@ -139,7 +139,7 @@ static int init_node(struct cmatest_node init_qp_attr.cap.max_recv_sge = 1; init_qp_attr.qp_context = node; init_qp_attr.sq_sig_all = 1; - init_qp_attr.qp_type = IBV_QPT_RC; + init_qp_attr.qp_type = RDMAV_QPT_RC; init_qp_attr.send_cq = node->cq; init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); @@ -159,8 +159,8 @@ out: static int post_recvs(struct cmatest_node *node) { - struct ibv_recv_wr recv_wr, *recv_failure; - struct ibv_sge sge; + struct rdmav_recv_wr recv_wr, *recv_failure; + struct rdmav_sge sge; int i, ret = 0; if (!message_count) @@ -176,7 +176,7 @@ static int post_recvs(struct cmatest_nod sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++ ) { - ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + ret = rdmav_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); if (ret) { printf("failed to post receives: %d\n", ret); break; @@ -187,8 +187,8 @@ static int post_recvs(struct cmatest_nod static int post_sends(struct cmatest_node *node) { - struct ibv_send_wr send_wr, *bad_send_wr; - struct ibv_sge sge; + struct rdmav_send_wr send_wr, *bad_send_wr; + struct rdmav_sge sge; int i, ret = 0; if (!node->connected || !message_count) @@ -197,7 +197,7 @@ static int post_sends(struct cmatest_nod send_wr.next = NULL; send_wr.sg_list = &sge; send_wr.num_sge = 1; - send_wr.opcode = IBV_WR_SEND; + send_wr.opcode = RDMAV_WR_SEND; send_wr.send_flags = 0; send_wr.wr_id = (unsigned long)node; @@ -206,7 +206,7 @@ static int post_sends(struct cmatest_nod sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++) { - ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); + ret = rdmav_post_send(node->cma_id->qp, &send_wr, 
&bad_send_wr); if (ret) printf("failed to post sends: %d\n", ret); } @@ -350,15 +350,15 @@ static void destroy_node(struct cmatest_ rdma_destroy_qp(node->cma_id); if (node->cq) - ibv_destroy_cq(node->cq); + rdmav_destroy_cq(node->cq); if (node->mem) { - ibv_dereg_mr(node->mr); + rdmav_dereg_mr(node->mr); free(node->mem); } if (node->pd) - ibv_dealloc_pd(node->pd); + rdmav_dealloc_pd(node->pd); /* Destroy the RDMA ID after all device resources */ rdma_destroy_id(node->cma_id); @@ -404,7 +404,7 @@ static void destroy_nodes(void) static int poll_cqs(void) { - struct ibv_wc wc[8]; + struct rdmav_wc wc[8]; int done, i, ret; for (i = 0; i < connections; i++) { @@ -412,7 +412,7 @@ static int poll_cqs(void) continue; for (done = 0; done < message_count; done += ret) { - ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + ret = rdmav_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { printf("cmatose: failed polling CQ: %d\n", ret); return ret; diff -ruNp ORG/librdmacm/examples/mckey.c NEW/librdmacm/examples/mckey.c --- ORG/librdmacm/examples/mckey.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/examples/mckey.c 2006-08-03 17:32:41.000000000 -0700 @@ -50,10 +50,10 @@ struct cmatest_node { int id; struct rdma_cm_id *cma_id; int connected; - struct ibv_pd *pd; - struct ibv_cq *cq; - struct ibv_mr *mr; - struct ibv_ah *ah; + struct rdmav_pd *pd; + struct rdmav_cq *cq; + struct rdmav_mr *mr; + struct rdmav_ah *ah; uint32_t remote_qpn; uint32_t remote_qkey; void *mem; @@ -85,14 +85,14 @@ static int create_message(struct cmatest if (!message_count) return 0; - node->mem = malloc(message_size + sizeof(struct ibv_grh)); + node->mem = malloc(message_size + sizeof(struct rdmav_grh)); if (!node->mem) { printf("failed message allocation\n"); return -1; } - node->mr = ibv_reg_mr(node->pd, node->mem, - message_size + sizeof(struct ibv_grh), - IBV_ACCESS_LOCAL_WRITE); + node->mr = rdmav_reg_mr(node->pd, node->mem, + message_size + sizeof(struct rdmav_grh), + RDMAV_ACCESS_LOCAL_WRITE); if (!node->mr) { printf("failed to reg MR\n"); goto err; @@ -105,10 +105,10 @@ err: static int init_node(struct cmatest_node *node) { - struct ibv_qp_init_attr init_qp_attr; + struct rdmav_qp_init_attr init_qp_attr; int cqe, ret; - node->pd = ibv_alloc_pd(node->cma_id->verbs); + node->pd = rdmav_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; printf("mckey: unable to allocate PD\n"); @@ -116,7 +116,7 @@ static int init_node(struct cmatest_node } cqe = message_count ? 
message_count * 2 : 2; - node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + node->cq = rdmav_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; printf("mckey: unable to create CQ\n"); @@ -130,7 +130,7 @@ static int init_node(struct cmatest_node init_qp_attr.cap.max_recv_sge = 1; init_qp_attr.qp_context = node; init_qp_attr.sq_sig_all = 0; - init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.qp_type = RDMAV_QPT_UD; init_qp_attr.send_cq = node->cq; init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); @@ -150,8 +150,8 @@ out: static int post_recvs(struct cmatest_node *node) { - struct ibv_recv_wr recv_wr, *recv_failure; - struct ibv_sge sge; + struct rdmav_recv_wr recv_wr, *recv_failure; + struct rdmav_sge sge; int i, ret = 0; if (!message_count) @@ -162,12 +162,12 @@ static int post_recvs(struct cmatest_nod recv_wr.num_sge = 1; recv_wr.wr_id = (uintptr_t) node; - sge.length = message_size + sizeof(struct ibv_grh); + sge.length = message_size + sizeof(struct rdmav_grh); sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++ ) { - ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + ret = rdmav_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); if (ret) { printf("failed to post receives: %d\n", ret); break; @@ -178,8 +178,8 @@ static int post_recvs(struct cmatest_nod static int post_sends(struct cmatest_node *node, int signal_flag) { - struct ibv_send_wr send_wr, *bad_send_wr; - struct ibv_sge sge; + struct rdmav_send_wr send_wr, *bad_send_wr; + struct rdmav_sge sge; int i, ret = 0; if (!node->connected || !message_count) @@ -188,8 +188,8 @@ static int post_sends(struct cmatest_nod send_wr.next = NULL; send_wr.sg_list = &sge; send_wr.num_sge = 1; - send_wr.opcode = IBV_WR_SEND_WITH_IMM; - send_wr.send_flags = IBV_SEND_INLINE | signal_flag; + send_wr.opcode = RDMAV_WR_SEND_WITH_IMM; + send_wr.send_flags = RDMAV_SEND_INLINE | signal_flag; send_wr.wr_id = (unsigned long)node; send_wr.imm_data = htonl(node->cma_id->qp->qp_num); @@ -197,12 +197,12 @@ static int post_sends(struct cmatest_nod send_wr.wr.ud.remote_qpn = node->remote_qpn; send_wr.wr.ud.remote_qkey = node->remote_qkey; - sge.length = message_size - sizeof(struct ibv_grh); + sge.length = message_size - sizeof(struct rdmav_grh); sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++) { - ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); + ret = rdmav_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); if (ret) printf("failed to post sends: %d\n", ret); } @@ -241,7 +241,7 @@ err: static int join_handler(struct cmatest_node *node) { - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; int ret; ret = rdma_get_dst_attr(node->cma_id, test.dst_addr, &ah_attr, @@ -251,7 +251,7 @@ static int join_handler(struct cmatest_n goto err; } - node->ah = ibv_create_ah(node->pd, &ah_attr); + node->ah = rdmav_create_ah(node->pd, &ah_attr); if (!node->ah) { printf("mckey: failure creating address handle\n"); goto err; @@ -299,21 +299,21 @@ static void destroy_node(struct cmatest_ return; if (node->ah) - ibv_destroy_ah(node->ah); + rdmav_destroy_ah(node->ah); if (node->cma_id->qp) rdma_destroy_qp(node->cma_id); if (node->cq) - ibv_destroy_cq(node->cq); + rdmav_destroy_cq(node->cq); if (node->mem) { - ibv_dereg_mr(node->mr); + rdmav_dereg_mr(node->mr); free(node->mem); } if (node->pd) - ibv_dealloc_pd(node->pd); + rdmav_dealloc_pd(node->pd); 
/* Destroy the RDMA ID after all device resources */ rdma_destroy_id(node->cma_id); @@ -356,7 +356,7 @@ static void destroy_nodes(void) static int poll_cqs(void) { - struct ibv_wc wc[8]; + struct rdmav_wc wc[8]; int done, i, ret; for (i = 0; i < connections; i++) { @@ -364,7 +364,7 @@ static int poll_cqs(void) continue; for (done = 0; done < message_count; done += ret) { - ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + ret = rdmav_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { printf("mckey: failed polling CQ: %d\n", ret); return ret; diff -ruNp ORG/librdmacm/examples/rping.c NEW/librdmacm/examples/rping.c --- ORG/librdmacm/examples/rping.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/examples/rping.c 2006-08-03 17:32:45.000000000 -0700 @@ -111,32 +111,32 @@ struct rping_rdma_info { struct rping_cb { int server; /* 0 iff client */ pthread_t cqthread; - struct ibv_comp_channel *channel; - struct ibv_cq *cq; - struct ibv_pd *pd; - struct ibv_qp *qp; + struct rdmav_comp_channel *channel; + struct rdmav_cq *cq; + struct rdmav_pd *pd; + struct rdmav_qp *qp; - struct ibv_recv_wr rq_wr; /* recv work request record */ - struct ibv_sge recv_sgl; /* recv single SGE */ + struct rdmav_recv_wr rq_wr; /* recv work request record */ + struct rdmav_sge recv_sgl; /* recv single SGE */ struct rping_rdma_info recv_buf;/* malloc'd buffer */ - struct ibv_mr *recv_mr; /* MR associated with this buffer */ + struct rdmav_mr *recv_mr; /* MR associated with this buffer */ - struct ibv_send_wr sq_wr; /* send work requrest record */ - struct ibv_sge send_sgl; + struct rdmav_send_wr sq_wr; /* send work requrest record */ + struct rdmav_sge send_sgl; struct rping_rdma_info send_buf;/* single send buf */ - struct ibv_mr *send_mr; + struct rdmav_mr *send_mr; - struct ibv_send_wr rdma_sq_wr; /* rdma work request record */ - struct ibv_sge rdma_sgl; /* rdma single SGE */ + struct rdmav_send_wr rdma_sq_wr; /* rdma work request record */ + struct rdmav_sge rdma_sgl; /* rdma single SGE */ char *rdma_buf; /* used as rdma sink */ - struct ibv_mr *rdma_mr; + struct rdmav_mr *rdma_mr; uint32_t remote_rkey; /* remote guys RKEY */ uint64_t remote_addr; /* remote guys TO */ uint32_t remote_len; /* remote guys LEN */ char *start_buf; /* rdma read src */ - struct ibv_mr *start_mr; + struct rdmav_mr *start_mr; enum test_state state; /* used for cond/signalling */ sem_t sem; @@ -232,7 +232,7 @@ static int rping_cma_event_handler(struc return ret; } -static int server_recv(struct rping_cb *cb, struct ibv_wc *wc) +static int server_recv(struct rping_cb *cb, struct rdmav_wc *wc) { if (wc->byte_len != sizeof(cb->recv_buf)) { fprintf(stderr, "Received bogus data, size %d\n", wc->byte_len); @@ -253,7 +253,7 @@ static int server_recv(struct rping_cb * return 0; } -static int client_recv(struct rping_cb *cb, struct ibv_wc *wc) +static int client_recv(struct rping_cb *cb, struct rdmav_wc *wc) { if (wc->byte_len != sizeof(cb->recv_buf)) { fprintf(stderr, "Received bogus data, size %d\n", wc->byte_len); @@ -270,39 +270,39 @@ static int client_recv(struct rping_cb * static int rping_cq_event_handler(struct rping_cb *cb) { - struct ibv_wc wc; - struct ibv_recv_wr *bad_wr; + struct rdmav_wc wc; + struct rdmav_recv_wr *bad_wr; int ret; - while ((ret = ibv_poll_cq(cb->cq, 1, &wc)) == 1) { + while ((ret = rdmav_poll_cq(cb->cq, 1, &wc)) == 1) { ret = 0; if (wc.status) { fprintf(stderr, "cq completion failed status %d\n", wc.status); - if (wc.status != IBV_WC_WR_FLUSH_ERR) + if (wc.status != RDMAV_WC_WR_FLUSH_ERR) ret = -1; goto error; } switch 
(wc.opcode) { - case IBV_WC_SEND: + case RDMAV_WC_SEND: DEBUG_LOG("send completion\n"); break; - case IBV_WC_RDMA_WRITE: + case RDMAV_WC_RDMA_WRITE: DEBUG_LOG("rdma write completion\n"); cb->state = RDMA_WRITE_COMPLETE; sem_post(&cb->sem); break; - case IBV_WC_RDMA_READ: + case RDMAV_WC_RDMA_READ: DEBUG_LOG("rdma read completion\n"); cb->state = RDMA_READ_COMPLETE; sem_post(&cb->sem); break; - case IBV_WC_RECV: + case RDMAV_WC_RECV: DEBUG_LOG("recv completion\n"); ret = cb->server ? server_recv(cb, &wc) : client_recv(cb, &wc); @@ -311,7 +311,7 @@ static int rping_cq_event_handler(struct goto error; } - ret = ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + ret = rdmav_post_recv(cb->qp, &cb->rq_wr, &bad_wr); if (ret) { fprintf(stderr, "post recv error: %d\n", ret); goto error; @@ -374,14 +374,14 @@ static void rping_setup_wr(struct rping_ cb->send_sgl.length = sizeof cb->send_buf; cb->send_sgl.lkey = cb->send_mr->lkey; - cb->sq_wr.opcode = IBV_WR_SEND; - cb->sq_wr.send_flags = IBV_SEND_SIGNALED; + cb->sq_wr.opcode = RDMAV_WR_SEND; + cb->sq_wr.send_flags = RDMAV_SEND_SIGNALED; cb->sq_wr.sg_list = &cb->send_sgl; cb->sq_wr.num_sge = 1; cb->rdma_sgl.addr = (uint64_t) (unsigned long) cb->rdma_buf; cb->rdma_sgl.lkey = cb->rdma_mr->lkey; - cb->rdma_sq_wr.send_flags = IBV_SEND_SIGNALED; + cb->rdma_sq_wr.send_flags = RDMAV_SEND_SIGNALED; cb->rdma_sq_wr.sg_list = &cb->rdma_sgl; cb->rdma_sq_wr.num_sge = 1; } @@ -392,14 +392,14 @@ static int rping_setup_buffers(struct rp DEBUG_LOG("rping_setup_buffers called on cb %p\n", cb); - cb->recv_mr = ibv_reg_mr(cb->pd, &cb->recv_buf, sizeof cb->recv_buf, - IBV_ACCESS_LOCAL_WRITE); + cb->recv_mr = rdmav_reg_mr(cb->pd, &cb->recv_buf, sizeof cb->recv_buf, + RDMAV_ACCESS_LOCAL_WRITE); if (!cb->recv_mr) { fprintf(stderr, "recv_buf reg_mr failed\n"); return errno; } - cb->send_mr = ibv_reg_mr(cb->pd, &cb->send_buf, sizeof cb->send_buf, 0); + cb->send_mr = rdmav_reg_mr(cb->pd, &cb->send_buf, sizeof cb->send_buf, 0); if (!cb->send_mr) { fprintf(stderr, "send_buf reg_mr failed\n"); ret = errno; @@ -413,10 +413,10 @@ static int rping_setup_buffers(struct rp goto err2; } - cb->rdma_mr = ibv_reg_mr(cb->pd, cb->rdma_buf, cb->size, - IBV_ACCESS_LOCAL_WRITE | - IBV_ACCESS_REMOTE_READ | - IBV_ACCESS_REMOTE_WRITE); + cb->rdma_mr = rdmav_reg_mr(cb->pd, cb->rdma_buf, cb->size, + RDMAV_ACCESS_LOCAL_WRITE | + RDMAV_ACCESS_REMOTE_READ | + RDMAV_ACCESS_REMOTE_WRITE); if (!cb->rdma_mr) { fprintf(stderr, "rdma_buf reg_mr failed\n"); ret = errno; @@ -431,10 +431,10 @@ static int rping_setup_buffers(struct rp goto err4; } - cb->start_mr = ibv_reg_mr(cb->pd, cb->start_buf, cb->size, - IBV_ACCESS_LOCAL_WRITE | - IBV_ACCESS_REMOTE_READ | - IBV_ACCESS_REMOTE_WRITE); + cb->start_mr = rdmav_reg_mr(cb->pd, cb->start_buf, cb->size, + RDMAV_ACCESS_LOCAL_WRITE | + RDMAV_ACCESS_REMOTE_READ | + RDMAV_ACCESS_REMOTE_WRITE); if (!cb->start_mr) { fprintf(stderr, "start_buf reg_mr failed\n"); ret = errno; @@ -449,32 +449,32 @@ static int rping_setup_buffers(struct rp err5: free(cb->start_buf); err4: - ibv_dereg_mr(cb->rdma_mr); + rdmav_dereg_mr(cb->rdma_mr); err3: free(cb->rdma_buf); err2: - ibv_dereg_mr(cb->send_mr); + rdmav_dereg_mr(cb->send_mr); err1: - ibv_dereg_mr(cb->recv_mr); + rdmav_dereg_mr(cb->recv_mr); return ret; } static void rping_free_buffers(struct rping_cb *cb) { DEBUG_LOG("rping_free_buffers called on cb %p\n", cb); - ibv_dereg_mr(cb->recv_mr); - ibv_dereg_mr(cb->send_mr); - ibv_dereg_mr(cb->rdma_mr); + rdmav_dereg_mr(cb->recv_mr); + rdmav_dereg_mr(cb->send_mr); + 
rdmav_dereg_mr(cb->rdma_mr); free(cb->rdma_buf); if (!cb->server) { - ibv_dereg_mr(cb->start_mr); + rdmav_dereg_mr(cb->start_mr); free(cb->start_buf); } } static int rping_create_qp(struct rping_cb *cb) { - struct ibv_qp_init_attr init_attr; + struct rdmav_qp_init_attr init_attr; int ret; memset(&init_attr, 0, sizeof(init_attr)); @@ -482,7 +482,7 @@ static int rping_create_qp(struct rping_ init_attr.cap.max_recv_wr = 2; init_attr.cap.max_recv_sge = 1; init_attr.cap.max_send_sge = 1; - init_attr.qp_type = IBV_QPT_RC; + init_attr.qp_type = RDMAV_QPT_RC; init_attr.send_cq = cb->cq; init_attr.recv_cq = cb->cq; @@ -501,43 +501,43 @@ static int rping_create_qp(struct rping_ static void rping_free_qp(struct rping_cb *cb) { - ibv_destroy_qp(cb->qp); - ibv_destroy_cq(cb->cq); - ibv_destroy_comp_channel(cb->channel); - ibv_dealloc_pd(cb->pd); + rdmav_destroy_qp(cb->qp); + rdmav_destroy_cq(cb->cq); + rdmav_destroy_comp_channel(cb->channel); + rdmav_dealloc_pd(cb->pd); } static int rping_setup_qp(struct rping_cb *cb, struct rdma_cm_id *cm_id) { int ret; - cb->pd = ibv_alloc_pd(cm_id->verbs); + cb->pd = rdmav_alloc_pd(cm_id->verbs); if (!cb->pd) { - fprintf(stderr, "ibv_alloc_pd failed\n"); + fprintf(stderr, "rdmav_alloc_pd failed\n"); return errno; } DEBUG_LOG("created pd %p\n", cb->pd); - cb->channel = ibv_create_comp_channel(cm_id->verbs); + cb->channel = rdmav_create_comp_channel(cm_id->verbs); if (!cb->channel) { - fprintf(stderr, "ibv_create_comp_channel failed\n"); + fprintf(stderr, "rdmav_create_comp_channel failed\n"); ret = errno; goto err1; } DEBUG_LOG("created channel %p\n", cb->channel); - cb->cq = ibv_create_cq(cm_id->verbs, RPING_SQ_DEPTH * 2, cb, + cb->cq = rdmav_create_cq(cm_id->verbs, RPING_SQ_DEPTH * 2, cb, cb->channel, 0); if (!cb->cq) { - fprintf(stderr, "ibv_create_cq failed\n"); + fprintf(stderr, "rdmav_create_cq failed\n"); ret = errno; goto err2; } DEBUG_LOG("created cq %p\n", cb->cq); - ret = ibv_req_notify_cq(cb->cq, 0); + ret = rdmav_req_notify_cq(cb->cq, 0); if (ret) { - fprintf(stderr, "ibv_create_cq failed\n"); + fprintf(stderr, "rdmav_create_cq failed\n"); ret = errno; goto err3; } @@ -551,11 +551,11 @@ static int rping_setup_qp(struct rping_c return 0; err3: - ibv_destroy_cq(cb->cq); + rdmav_destroy_cq(cb->cq); err2: - ibv_destroy_comp_channel(cb->channel); + rdmav_destroy_comp_channel(cb->channel); err1: - ibv_dealloc_pd(cb->pd); + rdmav_dealloc_pd(cb->pd); return ret; } @@ -581,14 +581,14 @@ static void *cm_thread(void *arg) static void *cq_thread(void *arg) { struct rping_cb *cb = arg; - struct ibv_cq *ev_cq; + struct rdmav_cq *ev_cq; void *ev_ctx; int ret; DEBUG_LOG("cq_thread started.\n"); while (1) { - ret = ibv_get_cq_event(cb->channel, &ev_cq, &ev_ctx); + ret = rdmav_get_cq_event(cb->channel, &ev_cq, &ev_ctx); if (ret) { fprintf(stderr, "Failed to get cq event!\n"); exit(ret); @@ -597,19 +597,19 @@ static void *cq_thread(void *arg) fprintf(stderr, "Unkown CQ!\n"); exit(-1); } - ret = ibv_req_notify_cq(cb->cq, 0); + ret = rdmav_req_notify_cq(cb->cq, 0); if (ret) { fprintf(stderr, "Failed to set notify!\n"); exit(ret); } ret = rping_cq_event_handler(cb); - ibv_ack_cq_events(cb->cq, 1); + rdmav_ack_cq_events(cb->cq, 1); if (ret) exit(ret); } } -static void rping_format_send(struct rping_cb *cb, char *buf, struct ibv_mr *mr) +static void rping_format_send(struct rping_cb *cb, char *buf, struct rdmav_mr *mr) { struct rping_rdma_info *info = &cb->send_buf; @@ -623,7 +623,7 @@ static void rping_format_send(struct rpi static int rping_test_server(struct rping_cb *cb) { 
- struct ibv_send_wr *bad_wr; + struct rdmav_send_wr *bad_wr; int ret; while (1) { @@ -639,12 +639,12 @@ static int rping_test_server(struct rpin DEBUG_LOG("server received sink adv\n"); /* Issue RDMA Read. */ - cb->rdma_sq_wr.opcode = IBV_WR_RDMA_READ; + cb->rdma_sq_wr.opcode = RDMAV_WR_RDMA_READ; cb->rdma_sq_wr.wr.rdma.rkey = cb->remote_rkey; cb->rdma_sq_wr.wr.rdma.remote_addr = cb->remote_addr; cb->rdma_sq_wr.sg_list->length = cb->remote_len; - ret = ibv_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -666,7 +666,7 @@ static int rping_test_server(struct rpin printf("server ping data: %s\n", cb->rdma_buf); /* Tell client to continue */ - ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -684,7 +684,7 @@ static int rping_test_server(struct rpin DEBUG_LOG("server received sink adv\n"); /* RDMA Write echo data */ - cb->rdma_sq_wr.opcode = IBV_WR_RDMA_WRITE; + cb->rdma_sq_wr.opcode = RDMAV_WR_RDMA_WRITE; cb->rdma_sq_wr.wr.rdma.rkey = cb->remote_rkey; cb->rdma_sq_wr.wr.rdma.remote_addr = cb->remote_addr; cb->rdma_sq_wr.sg_list->length = strlen(cb->rdma_buf) + 1; @@ -693,7 +693,7 @@ static int rping_test_server(struct rpin cb->rdma_sq_wr.sg_list->addr, cb->rdma_sq_wr.sg_list->length); - ret = ibv_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -710,7 +710,7 @@ static int rping_test_server(struct rpin DEBUG_LOG("server rdma write complete \n"); /* Tell client to begin again */ - ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -757,7 +757,7 @@ static int rping_bind_server(struct rpin static int rping_run_server(struct rping_cb *cb) { - struct ibv_recv_wr *bad_wr; + struct rdmav_recv_wr *bad_wr; int ret; ret = rping_bind_server(cb); @@ -776,9 +776,9 @@ static int rping_run_server(struct rping goto err1; } - ret = ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + ret = rdmav_post_recv(cb->qp, &cb->rq_wr, &bad_wr); if (ret) { - fprintf(stderr, "ibv_post_recv failed: %d\n", ret); + fprintf(stderr, "rdmav_post_recv failed: %d\n", ret); goto err2; } @@ -804,7 +804,7 @@ err1: static int rping_test_client(struct rping_cb *cb) { int ping, start, cc, i, ret = 0; - struct ibv_send_wr *bad_wr; + struct rdmav_send_wr *bad_wr; unsigned char c; start = 65; @@ -825,7 +825,7 @@ static int rping_test_client(struct rpin cb->start_buf[cb->size - 1] = 0; rping_format_send(cb, cb->start_buf, cb->start_mr); - ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -841,7 +841,7 @@ static int rping_test_client(struct rpin } rping_format_send(cb, cb->rdma_buf, cb->rdma_mr); - ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -926,7 +926,7 @@ static int rping_bind_client(struct rpin static int rping_run_client(struct rping_cb *cb) { - struct ibv_recv_wr *bad_wr; + struct rdmav_recv_wr *bad_wr; int ret; ret = rping_bind_client(cb); @@ -945,9 +945,9 @@ static int rping_run_client(struct rping goto err1; } - ret = 
ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + ret = rdmav_post_recv(cb->qp, &cb->rq_wr, &bad_wr); if (ret) { - fprintf(stderr, "ibv_post_recv failed: %d\n", ret); + fprintf(stderr, "rdmav_post_recv failed: %d\n", ret); goto err2; } diff -ruNp ORG/librdmacm/examples/udaddy.c NEW/librdmacm/examples/udaddy.c --- ORG/librdmacm/examples/udaddy.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/examples/udaddy.c 2006-08-03 17:32:51.000000000 -0700 @@ -55,10 +55,10 @@ struct cmatest_node { int id; struct rdma_cm_id *cma_id; int connected; - struct ibv_pd *pd; - struct ibv_cq *cq; - struct ibv_mr *mr; - struct ibv_ah *ah; + struct rdmav_pd *pd; + struct rdmav_cq *cq; + struct rdmav_mr *mr; + struct rdmav_ah *ah; uint32_t remote_qpn; uint32_t remote_qkey; void *mem; @@ -90,14 +90,14 @@ static int create_message(struct cmatest if (!message_count) return 0; - node->mem = malloc(message_size + sizeof(struct ibv_grh)); + node->mem = malloc(message_size + sizeof(struct rdmav_grh)); if (!node->mem) { printf("failed message allocation\n"); return -1; } - node->mr = ibv_reg_mr(node->pd, node->mem, - message_size + sizeof(struct ibv_grh), - IBV_ACCESS_LOCAL_WRITE); + node->mr = rdmav_reg_mr(node->pd, node->mem, + message_size + sizeof(struct rdmav_grh), + RDMAV_ACCESS_LOCAL_WRITE); if (!node->mr) { printf("failed to reg MR\n"); goto err; @@ -110,10 +110,10 @@ err: static int init_node(struct cmatest_node *node) { - struct ibv_qp_init_attr init_qp_attr; + struct rdmav_qp_init_attr init_qp_attr; int cqe, ret; - node->pd = ibv_alloc_pd(node->cma_id->verbs); + node->pd = rdmav_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; printf("udaddy: unable to allocate PD\n"); @@ -121,7 +121,7 @@ static int init_node(struct cmatest_node } cqe = message_count ? message_count * 2 : 2; - node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + node->cq = rdmav_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; printf("udaddy: unable to create CQ\n"); @@ -135,7 +135,7 @@ static int init_node(struct cmatest_node init_qp_attr.cap.max_recv_sge = 1; init_qp_attr.qp_context = node; init_qp_attr.sq_sig_all = 0; - init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.qp_type = RDMAV_QPT_UD; init_qp_attr.send_cq = node->cq; init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); @@ -155,8 +155,8 @@ out: static int post_recvs(struct cmatest_node *node) { - struct ibv_recv_wr recv_wr, *recv_failure; - struct ibv_sge sge; + struct rdmav_recv_wr recv_wr, *recv_failure; + struct rdmav_sge sge; int i, ret = 0; if (!message_count) @@ -167,12 +167,12 @@ static int post_recvs(struct cmatest_nod recv_wr.num_sge = 1; recv_wr.wr_id = (uintptr_t) node; - sge.length = message_size + sizeof(struct ibv_grh); + sge.length = message_size + sizeof(struct rdmav_grh); sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++ ) { - ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + ret = rdmav_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); if (ret) { printf("failed to post receives: %d\n", ret); break; @@ -183,8 +183,8 @@ static int post_recvs(struct cmatest_nod static int post_sends(struct cmatest_node *node, int signal_flag) { - struct ibv_send_wr send_wr, *bad_send_wr; - struct ibv_sge sge; + struct rdmav_send_wr send_wr, *bad_send_wr; + struct rdmav_sge sge; int i, ret = 0; if (!node->connected || !message_count) @@ -193,8 +193,8 @@ static int post_sends(struct cmatest_nod send_wr.next = NULL; 
send_wr.sg_list = &sge; send_wr.num_sge = 1; - send_wr.opcode = IBV_WR_SEND_WITH_IMM; - send_wr.send_flags = IBV_SEND_INLINE | signal_flag; + send_wr.opcode = RDMAV_WR_SEND_WITH_IMM; + send_wr.send_flags = RDMAV_SEND_INLINE | signal_flag; send_wr.wr_id = (unsigned long)node; send_wr.imm_data = htonl(node->cma_id->qp->qp_num); @@ -202,12 +202,12 @@ static int post_sends(struct cmatest_nod send_wr.wr.ud.remote_qpn = node->remote_qpn; send_wr.wr.ud.remote_qkey = node->remote_qkey; - sge.length = message_size - sizeof(struct ibv_grh); + sge.length = message_size - sizeof(struct rdmav_grh); sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++) { - ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); + ret = rdmav_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); if (ret) printf("failed to post sends: %d\n", ret); } @@ -305,7 +305,7 @@ err1: static int resolved_handler(struct cmatest_node *node) { - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; int ret; ret = rdma_get_dst_attr(node->cma_id, test.dst_addr, &ah_attr, @@ -315,7 +315,7 @@ static int resolved_handler(struct cmate goto err; } - node->ah = ibv_create_ah(node->pd, &ah_attr); + node->ah = rdmav_create_ah(node->pd, &ah_attr); if (!node->ah) { printf("udaddy: failure creating address handle\n"); goto err; @@ -371,21 +371,21 @@ static void destroy_node(struct cmatest_ return; if (node->ah) - ibv_destroy_ah(node->ah); + rdmav_destroy_ah(node->ah); if (node->cma_id->qp) rdma_destroy_qp(node->cma_id); if (node->cq) - ibv_destroy_cq(node->cq); + rdmav_destroy_cq(node->cq); if (node->mem) { - ibv_dereg_mr(node->mr); + rdmav_dereg_mr(node->mr); free(node->mem); } if (node->pd) - ibv_dealloc_pd(node->pd); + rdmav_dealloc_pd(node->pd); /* Destroy the RDMA ID after all device resources */ rdma_destroy_id(node->cma_id); @@ -429,9 +429,9 @@ static void destroy_nodes(void) free(test.nodes); } -static void create_reply_ah(struct cmatest_node *node, struct ibv_wc *wc) +static void create_reply_ah(struct cmatest_node *node, struct rdmav_wc *wc) { - node->ah = ibv_create_ah_from_wc(node->pd, wc, node->mem, + node->ah = rdmav_create_ah_from_wc(node->pd, wc, node->mem, node->cma_id->port_num); node->remote_qpn = ntohl(wc->imm_data); node->remote_qkey = ntohs(rdma_get_dst_port(node->cma_id)); @@ -439,7 +439,7 @@ static void create_reply_ah(struct cmate static int poll_cqs(void) { - struct ibv_wc wc[8]; + struct rdmav_wc wc[8]; int done, i, ret; for (i = 0; i < connections; i++) { @@ -447,7 +447,7 @@ static int poll_cqs(void) continue; for (done = 0; done < message_count; done += ret) { - ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + ret = rdmav_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { printf("udaddy: failed polling CQ: %d\n", ret); return ret; @@ -511,7 +511,7 @@ static int run_server(void) printf("sending replies\n"); for (i = 0; i < connections; i++) { - ret = post_sends(&test.nodes[i], IBV_SEND_SIGNALED); + ret = post_sends(&test.nodes[i], RDMAV_SEND_SIGNALED); if (ret) goto out; } From krkumar2 at in.ibm.com Thu Aug 3 01:37:23 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 14:07:23 +0530 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. Message-ID: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> [REPOSTING - as 2 patches in my first set didn't seem to have got to openib mailing list though I checked after 2 hours. 
Apologies in advance, since people will get duplicates of at least 4 of the successful patches.]

This patchset is a proposal to create new APIs and data structures with transport-neutral names. The idea is to remove the old API once all libraries, applications, and examples are gradually converted to the new API.

Patch 1/6 - Changes to the libibverbs configuration file to build libibverbs with the new API.
Patch 2/6 - Additions to include files in libibverbs for the new API.
Patch 3/6 - Source files in libibverbs defining the new API.
Patch 4/6 - Convert librdmacm examples to use the new API.
Patch 5/6 - Convert librdmacm include files to use the new libibverbs API.
Patch 6/6 - Convert librdmacm source files to use the new libibverbs API.

Changes from the previous (v2) round:
-------------------------------------
1. #defined most data structures, as suggested by Sean. Created a deprecate.h
   which can be removed once all apps are converted to the new API.
2. Changed rdma_ to rdmav_. This also made it possible to retain the
   rdma_create_qp() and rdma_destroy_qp() routines.
(The last suggestion, to not convert IB-specific types to generic types, was
not done at this time, since my previous note explained that it is not clear
how to retain some names while changing others, e.g. ibv_event_type or
ibv_qp_attr_mask, which have enums for both IB and generic types.)

Testing:
--------
1. Compile-tested libibverbs, librdmacm, libmthca, libsdp, libibcm,
   libipathverbs and dapl. No warnings or failures in any of these. The only
   warning in libmthca was about multiple ibv_read_sysfs_file()s, which is
   not a compile issue (and can also be removed).
2. Tested rping and the ibv_devices and ibv_devinfo utils.

Notes found during the changes:
-------------------------------
1. Added LIBRDMAVERBS_DRIVER_PATH and also use the old OPENIB_DRIVER_PATH_ENV
   for backwards compatibility, but have not set user_path to include the
   OPENIB_DRIVER_PATH_ENV results.
2. Currently ibv_driver_init is implemented in all drivers, but the function
   returns a "struct ibv_driver *" while we expect a "struct rdma_driver *".
   In reality this is fine, as both are pointers to identical objects;
   otherwise each driver would have to be changed now. Once all drivers are
   changed to use the rdma_* APIs, this will not be an issue.
3. All names are changed to neutral names, even IB-specific ones, as it is
   not clear how to retain some names while changing others, e.g.
   ibv_event_type or ibv_qp_attr_mask.
4. A different pointer type is passed to the verbs ops, though the end result
   is the same (no warnings are generated, as this is a link-time trick).
   E.g.:
       int rdma_query_device(struct rdma_context *context,
                             struct rdma_device_attr *device_attr)
       {
               return context->ops.query_device(context, device_attr);
       }
   However, this will not be an issue once the drivers are changed to use the
   new API, e.g.:
       int mthca_query_device(struct rdma_context *context,
                              struct rdma_device_attr *attr)
5. Makefile.am still builds the libibverbs.* libraries so that other apps do
   not break. librdmaverbs.spec.in does the same.
6. Kept the ibv_driver_init call, as all libraries implement ibv_driver_init;
   this can easily be changed to the new API (and then retired).
7. The prefix is kept as rdmav_ (rdv_ didn't have many takers) to be generic
   and consistent enough.
8. [Missing] The IBV_OPCODE() macro is not done (but nothing uses it
   currently).
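For illustration only, here is a minimal sketch of the kind of compatibility
shim that note 1 above refers to, assuming the old ibv_ spellings are simply
#defined onto the new rdmav_ names; the identifiers below are examples and
are not copied from the actual deprecate.h in this patchset:

    /*
     * deprecate.h -- illustrative sketch, not the real header.
     * Maps a few old ibv_ names onto the transport-neutral rdmav_ API so
     * that existing applications keep compiling until they are converted.
     */
    #ifndef RDMAV_DEPRECATE_H
    #define RDMAV_DEPRECATE_H

    /* Old struct tags become aliases for the renamed types. */
    #define ibv_context      rdmav_context
    #define ibv_pd           rdmav_pd
    #define ibv_qp           rdmav_qp
    #define ibv_qp_init_attr rdmav_qp_init_attr

    /* Old entry points forward to the renamed functions. */
    #define ibv_alloc_pd(ctx)             rdmav_alloc_pd(ctx)
    #define ibv_create_qp(pd, attr)       rdmav_create_qp(pd, attr)
    #define ibv_modify_qp(qp, attr, mask) rdmav_modify_qp(qp, attr, mask)

    #endif /* RDMAV_DEPRECATE_H */

With something like this included from the verbs header, existing callers of
ibv_alloc_pd() and friends keep compiling unchanged while new code uses the
rdmav_ names directly; dropping the header later would complete the
deprecation.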
--- Signed-off-by: Krishna Kumar From krkumar2 at in.ibm.com Thu Aug 3 01:37:40 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 14:07:40 +0530 Subject: [openib-general] [PATCH v3 2/6] libibverbs source files changes. In-Reply-To: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803083740.6346.42894.sendpatchset@K50wks273950wss.in.ibm.com> Source files in libibverbs defining the new API. Signed-off-by: Krishna Kumar diff -ruNp ORG/libibverbs/src/cmd.c NEW/libibverbs/src/cmd.c --- ORG/libibverbs/src/cmd.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/cmd.c 2006-08-03 17:29:24.000000000 -0700 @@ -45,16 +45,16 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" -static int ibv_cmd_get_context_v2(struct ibv_context *context, - struct ibv_get_context *new_cmd, +static int rdmav_cmd_get_context_v2(struct rdmav_context *context, + struct rdmav_get_context *new_cmd, size_t new_cmd_size, - struct ibv_get_context_resp *resp, + struct rdmav_get_context_resp *resp, size_t resp_size) { - struct ibv_abi_compat_v2 *t; - struct ibv_get_context_v2 *cmd; + struct rdmav_abi_compat_v2 *t; + struct rdmav_get_context_v2 *cmd; size_t cmd_size; uint32_t cq_fd; @@ -65,9 +65,10 @@ static int ibv_cmd_get_context_v2(struct cmd_size = sizeof *cmd + new_cmd_size - sizeof *new_cmd; cmd = alloca(cmd_size); - memcpy(cmd->driver_data, new_cmd->driver_data, new_cmd_size - sizeof *new_cmd); + memcpy(cmd->driver_data, new_cmd->driver_data, + new_cmd_size - sizeof *new_cmd); - IBV_INIT_CMD_RESP(cmd, cmd_size, GET_CONTEXT, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, GET_CONTEXT, resp, resp_size); cmd->cq_fd_tab = (uintptr_t) &cq_fd; if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) @@ -81,14 +82,16 @@ static int ibv_cmd_get_context_v2(struct return 0; } -int ibv_cmd_get_context(struct ibv_context *context, struct ibv_get_context *cmd, - size_t cmd_size, struct ibv_get_context_resp *resp, +int rdmav_cmd_get_context(struct rdmav_context *context, + struct rdmav_get_context *cmd, + size_t cmd_size, struct rdmav_get_context_resp *resp, size_t resp_size) { if (abi_ver <= 2) - return ibv_cmd_get_context_v2(context, cmd, cmd_size, resp, resp_size); + return rdmav_cmd_get_context_v2(context, cmd, cmd_size, resp, + resp_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, GET_CONTEXT, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, GET_CONTEXT, resp, resp_size); if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; @@ -99,14 +102,14 @@ int ibv_cmd_get_context(struct ibv_conte return 0; } -int ibv_cmd_query_device(struct ibv_context *context, - struct ibv_device_attr *device_attr, +int rdmav_cmd_query_device(struct rdmav_context *context, + struct rdmav_device_attr *device_attr, uint64_t *raw_fw_ver, - struct ibv_query_device *cmd, size_t cmd_size) + struct rdmav_query_device *cmd, size_t cmd_size) { - struct ibv_query_device_resp resp; + struct rdmav_query_device_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_DEVICE, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, QUERY_DEVICE, &resp, sizeof resp); if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; @@ -156,13 +159,13 @@ int ibv_cmd_query_device(struct ibv_cont return 0; } -int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr, - struct ibv_query_port *cmd, size_t cmd_size) +int rdmav_cmd_query_port(struct 
rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr, + struct rdmav_query_port *cmd, size_t cmd_size) { - struct ibv_query_port_resp resp; + struct rdmav_query_port_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_PORT, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, QUERY_PORT, &resp, sizeof resp); cmd->port_num = port_num; if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) @@ -191,11 +194,11 @@ int ibv_cmd_query_port(struct ibv_contex return 0; } -int ibv_cmd_alloc_pd(struct ibv_context *context, struct ibv_pd *pd, - struct ibv_alloc_pd *cmd, size_t cmd_size, - struct ibv_alloc_pd_resp *resp, size_t resp_size) +int rdmav_cmd_alloc_pd(struct rdmav_context *context, struct rdmav_pd *pd, + struct rdmav_alloc_pd *cmd, size_t cmd_size, + struct rdmav_alloc_pd_resp *resp, size_t resp_size) { - IBV_INIT_CMD_RESP(cmd, cmd_size, ALLOC_PD, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, ALLOC_PD, resp, resp_size); if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; @@ -205,11 +208,11 @@ int ibv_cmd_alloc_pd(struct ibv_context return 0; } -int ibv_cmd_dealloc_pd(struct ibv_pd *pd) +int rdmav_cmd_dealloc_pd(struct rdmav_pd *pd) { - struct ibv_dealloc_pd cmd; + struct rdmav_dealloc_pd cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DEALLOC_PD); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DEALLOC_PD); cmd.pd_handle = pd->handle; if (write(pd->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -218,14 +221,14 @@ int ibv_cmd_dealloc_pd(struct ibv_pd *pd return 0; } -int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t length, - uint64_t hca_va, enum ibv_access_flags access, - struct ibv_mr *mr, struct ibv_reg_mr *cmd, +int rdmav_cmd_reg_mr(struct rdmav_pd *pd, void *addr, size_t length, + uint64_t hca_va, enum rdmav_access_flags access, + struct rdmav_mr *mr, struct rdmav_reg_mr *cmd, size_t cmd_size) { - struct ibv_reg_mr_resp resp; + struct rdmav_reg_mr_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, REG_MR, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, REG_MR, &resp, sizeof resp); cmd->start = (uintptr_t) addr; cmd->length = length; @@ -243,11 +246,11 @@ int ibv_cmd_reg_mr(struct ibv_pd *pd, vo return 0; } -int ibv_cmd_dereg_mr(struct ibv_mr *mr) +int rdmav_cmd_dereg_mr(struct rdmav_mr *mr) { - struct ibv_dereg_mr cmd; + struct rdmav_dereg_mr cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DEREG_MR); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DEREG_MR); cmd.mr_handle = mr->handle; if (write(mr->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -256,19 +259,22 @@ int ibv_cmd_dereg_mr(struct ibv_mr *mr) return 0; } -static int ibv_cmd_create_cq_v2(struct ibv_context *context, int cqe, - struct ibv_cq *cq, - struct ibv_create_cq *new_cmd, size_t new_cmd_size, - struct ibv_create_cq_resp *resp, size_t resp_size) +static int rdmav_cmd_create_cq_v2(struct rdmav_context *context, int cqe, + struct rdmav_cq *cq, + struct rdmav_create_cq *new_cmd, + size_t new_cmd_size, + struct rdmav_create_cq_resp *resp, + size_t resp_size) { - struct ibv_create_cq_v2 *cmd; + struct rdmav_create_cq_v2 *cmd; size_t cmd_size; cmd_size = sizeof *cmd + new_cmd_size - sizeof *new_cmd; cmd = alloca(cmd_size); - memcpy(cmd->driver_data, new_cmd->driver_data, new_cmd_size - sizeof *new_cmd); + memcpy(cmd->driver_data, new_cmd->driver_data, + new_cmd_size - sizeof *new_cmd); - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_CQ, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, CREATE_CQ, resp, resp_size); cmd->user_handle = (uintptr_t) cq; cmd->cqe = cqe; cmd->event_handler = 0; @@ 
-282,17 +288,17 @@ static int ibv_cmd_create_cq_v2(struct i return 0; } -int ibv_cmd_create_cq(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, - int comp_vector, struct ibv_cq *cq, - struct ibv_create_cq *cmd, size_t cmd_size, - struct ibv_create_cq_resp *resp, size_t resp_size) +int rdmav_cmd_create_cq(struct rdmav_context *context, int cqe, + struct rdmav_comp_channel *channel, + int comp_vector, struct rdmav_cq *cq, + struct rdmav_create_cq *cmd, size_t cmd_size, + struct rdmav_create_cq_resp *resp, size_t resp_size) { if (abi_ver <= 2) - return ibv_cmd_create_cq_v2(context, cqe, cq, + return rdmav_cmd_create_cq_v2(context, cqe, cq, cmd, cmd_size, resp, resp_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_CQ, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, CREATE_CQ, resp, resp_size); cmd->user_handle = (uintptr_t) cq; cmd->cqe = cqe; cmd->comp_vector = comp_vector; @@ -308,20 +314,20 @@ int ibv_cmd_create_cq(struct ibv_context return 0; } -int ibv_cmd_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +int rdmav_cmd_poll_cq(struct rdmav_cq *ibcq, int ne, struct rdmav_wc *wc) { - struct ibv_poll_cq cmd; - struct ibv_poll_cq_resp *resp; + struct rdmav_poll_cq cmd; + struct rdmav_poll_cq_resp *resp; int i; int rsize; int ret; - rsize = sizeof *resp + ne * sizeof(struct ibv_kern_wc); + rsize = sizeof *resp + ne * sizeof(struct rdmav_kern_wc); resp = malloc(rsize); if (!resp) return -1; - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, POLL_CQ, resp, rsize); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, POLL_CQ, resp, rsize); cmd.cq_handle = ibcq->handle; cmd.ne = ne; @@ -353,11 +359,11 @@ out: return ret; } -int ibv_cmd_req_notify_cq(struct ibv_cq *ibcq, int solicited_only) +int rdmav_cmd_req_notify_cq(struct rdmav_cq *ibcq, int solicited_only) { - struct ibv_req_notify_cq cmd; + struct rdmav_req_notify_cq cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, REQ_NOTIFY_CQ); + RDMAV_INIT_CMD(&cmd, sizeof cmd, REQ_NOTIFY_CQ); cmd.cq_handle = ibcq->handle; cmd.solicited = !!solicited_only; @@ -367,12 +373,12 @@ int ibv_cmd_req_notify_cq(struct ibv_cq return 0; } -int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size) +int rdmav_cmd_resize_cq(struct rdmav_cq *cq, int cqe, + struct rdmav_resize_cq *cmd, size_t cmd_size) { - struct ibv_resize_cq_resp resp; + struct rdmav_resize_cq_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); cmd->cq_handle = cq->handle; cmd->cqe = cqe; @@ -384,11 +390,11 @@ int ibv_cmd_resize_cq(struct ibv_cq *cq, return 0; } -static int ibv_cmd_destroy_cq_v1(struct ibv_cq *cq) +static int rdmav_cmd_destroy_cq_v1(struct rdmav_cq *cq) { - struct ibv_destroy_cq_v1 cmd; + struct rdmav_destroy_cq_v1 cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_CQ); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DESTROY_CQ); cmd.cq_handle = cq->handle; if (write(cq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -397,15 +403,15 @@ static int ibv_cmd_destroy_cq_v1(struct return 0; } -int ibv_cmd_destroy_cq(struct ibv_cq *cq) +int rdmav_cmd_destroy_cq(struct rdmav_cq *cq) { - struct ibv_destroy_cq cmd; - struct ibv_destroy_cq_resp resp; + struct rdmav_destroy_cq cmd; + struct rdmav_destroy_cq_resp resp; if (abi_ver == 1) - return ibv_cmd_destroy_cq_v1(cq); + return rdmav_cmd_destroy_cq_v1(cq); - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_CQ, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_CQ, &resp, sizeof resp); cmd.cq_handle = 
cq->handle; if (write(cq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -420,12 +426,12 @@ int ibv_cmd_destroy_cq(struct ibv_cq *cq return 0; } -int ibv_cmd_create_srq(struct ibv_pd *pd, - struct ibv_srq *srq, struct ibv_srq_init_attr *attr, - struct ibv_create_srq *cmd, size_t cmd_size, - struct ibv_create_srq_resp *resp, size_t resp_size) +int rdmav_cmd_create_srq(struct rdmav_pd *pd, + struct rdmav_srq *srq, struct rdmav_srq_init_attr *attr, + struct rdmav_create_srq *cmd, size_t cmd_size, + struct rdmav_create_srq_resp *resp, size_t resp_size) { - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_SRQ, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, CREATE_SRQ, resp, resp_size); cmd->user_handle = (uintptr_t) srq; cmd->pd_handle = pd->handle; cmd->max_wr = attr->attr.max_wr; @@ -441,8 +447,8 @@ int ibv_cmd_create_srq(struct ibv_pd *pd attr->attr.max_wr = resp->max_wr; attr->attr.max_sge = resp->max_sge; } else { - struct ibv_create_srq_resp_v5 *resp_v5 = - (struct ibv_create_srq_resp_v5 *) resp; + struct rdmav_create_srq_resp_v5 *resp_v5 = + (struct rdmav_create_srq_resp_v5 *) resp; memmove((void *) resp + sizeof *resp, (void *) resp_v5 + sizeof *resp_v5, @@ -452,20 +458,21 @@ int ibv_cmd_create_srq(struct ibv_pd *pd return 0; } -static int ibv_cmd_modify_srq_v3(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask, - struct ibv_modify_srq *new_cmd, +static int rdmav_cmd_modify_srq_v3(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask, + struct rdmav_modify_srq *new_cmd, size_t new_cmd_size) { - struct ibv_modify_srq_v3 *cmd; + struct rdmav_modify_srq_v3 *cmd; size_t cmd_size; cmd_size = sizeof *cmd + new_cmd_size - sizeof *new_cmd; cmd = alloca(cmd_size); - memcpy(cmd->driver_data, new_cmd->driver_data, new_cmd_size - sizeof *new_cmd); + memcpy(cmd->driver_data, new_cmd->driver_data, + new_cmd_size - sizeof *new_cmd); - IBV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); + RDMAV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); cmd->srq_handle = srq->handle; cmd->attr_mask = srq_attr_mask; @@ -480,16 +487,16 @@ static int ibv_cmd_modify_srq_v3(struct return 0; } -int ibv_cmd_modify_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask, - struct ibv_modify_srq *cmd, size_t cmd_size) +int rdmav_cmd_modify_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask, + struct rdmav_modify_srq *cmd, size_t cmd_size) { if (abi_ver == 3) - return ibv_cmd_modify_srq_v3(srq, srq_attr, srq_attr_mask, + return rdmav_cmd_modify_srq_v3(srq, srq_attr, srq_attr_mask, cmd, cmd_size); - IBV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); + RDMAV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); cmd->srq_handle = srq->handle; cmd->attr_mask = srq_attr_mask; @@ -502,12 +509,12 @@ int ibv_cmd_modify_srq(struct ibv_srq *s return 0; } -int ibv_cmd_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, - struct ibv_query_srq *cmd, size_t cmd_size) +int rdmav_cmd_query_srq(struct rdmav_srq *srq, struct rdmav_srq_attr *srq_attr, + struct rdmav_query_srq *cmd, size_t cmd_size) { - struct ibv_query_srq_resp resp; + struct rdmav_query_srq_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_SRQ, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, QUERY_SRQ, &resp, sizeof resp); cmd->srq_handle = srq->handle; if (write(srq->context->cmd_fd, cmd, cmd_size) != cmd_size) @@ -520,11 +527,11 @@ int ibv_cmd_query_srq(struct ibv_srq *sr return 0; } -static int 
ibv_cmd_destroy_srq_v1(struct ibv_srq *srq) +static int rdmav_cmd_destroy_srq_v1(struct rdmav_srq *srq) { - struct ibv_destroy_srq_v1 cmd; + struct rdmav_destroy_srq_v1 cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_SRQ); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DESTROY_SRQ); cmd.srq_handle = srq->handle; if (write(srq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -533,15 +540,15 @@ static int ibv_cmd_destroy_srq_v1(struct return 0; } -int ibv_cmd_destroy_srq(struct ibv_srq *srq) +int rdmav_cmd_destroy_srq(struct rdmav_srq *srq) { - struct ibv_destroy_srq cmd; - struct ibv_destroy_srq_resp resp; + struct rdmav_destroy_srq cmd; + struct rdmav_destroy_srq_resp resp; if (abi_ver == 1) - return ibv_cmd_destroy_srq_v1(srq); + return rdmav_cmd_destroy_srq_v1(srq); - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_SRQ, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_SRQ, &resp, sizeof resp); cmd.srq_handle = srq->handle; if (write(srq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -555,12 +562,12 @@ int ibv_cmd_destroy_srq(struct ibv_srq * return 0; } -int ibv_cmd_create_qp(struct ibv_pd *pd, - struct ibv_qp *qp, struct ibv_qp_init_attr *attr, - struct ibv_create_qp *cmd, size_t cmd_size, - struct ibv_create_qp_resp *resp, size_t resp_size) +int rdmav_cmd_create_qp(struct rdmav_pd *pd, + struct rdmav_qp *qp, struct rdmav_qp_init_attr *attr, + struct rdmav_create_qp *cmd, size_t cmd_size, + struct rdmav_create_qp_resp *resp, size_t resp_size) { - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, resp, resp_size); cmd->user_handle = (uintptr_t) qp; cmd->pd_handle = pd->handle; @@ -591,15 +598,15 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, } if (abi_ver == 4) { - struct ibv_create_qp_resp_v4 *resp_v4 = - (struct ibv_create_qp_resp_v4 *) resp; + struct rdmav_create_qp_resp_v4 *resp_v4 = + (struct rdmav_create_qp_resp_v4 *) resp; memmove((void *) resp + sizeof *resp, (void *) resp_v4 + sizeof *resp_v4, resp_size - sizeof *resp); } else if (abi_ver <= 3) { - struct ibv_create_qp_resp_v3 *resp_v3 = - (struct ibv_create_qp_resp_v3 *) resp; + struct rdmav_create_qp_resp_v3 *resp_v3 = + (struct rdmav_create_qp_resp_v3 *) resp; memmove((void *) resp + sizeof *resp, (void *) resp_v3 + sizeof *resp_v3, @@ -609,14 +616,14 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, return 0; } -int ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *init_attr, - struct ibv_query_qp *cmd, size_t cmd_size) +int rdmav_cmd_query_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *init_attr, + struct rdmav_query_qp *cmd, size_t cmd_size) { - struct ibv_query_qp_resp resp; + struct rdmav_query_qp_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_QP, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, QUERY_QP, &resp, sizeof resp); cmd->qp_handle = qp->handle; cmd->attr_mask = attr_mask; @@ -689,11 +696,11 @@ int ibv_cmd_query_qp(struct ibv_qp *qp, return 0; } -int ibv_cmd_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_modify_qp *cmd, size_t cmd_size) +int rdmav_cmd_modify_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_modify_qp *cmd, size_t cmd_size) { - IBV_INIT_CMD(cmd, cmd_size, MODIFY_QP); + RDMAV_INIT_CMD(cmd, cmd_size, MODIFY_QP); cmd->qp_handle = qp->handle; cmd->attr_mask = attr_mask; @@ 
-749,11 +756,11 @@ int ibv_cmd_modify_qp(struct ibv_qp *qp, return 0; } -static int ibv_cmd_destroy_qp_v1(struct ibv_qp *qp) +static int rdmav_cmd_destroy_qp_v1(struct rdmav_qp *qp) { - struct ibv_destroy_qp_v1 cmd; + struct rdmav_destroy_qp_v1 cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_QP); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DESTROY_QP); cmd.qp_handle = qp->handle; if (write(qp->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -762,14 +769,14 @@ static int ibv_cmd_destroy_qp_v1(struct return 0; } -int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, - struct ibv_send_wr **bad_wr) +int rdmav_cmd_post_send(struct rdmav_qp *ibqp, struct rdmav_send_wr *wr, + struct rdmav_send_wr **bad_wr) { - struct ibv_post_send *cmd; - struct ibv_post_send_resp resp; - struct ibv_send_wr *i; - struct ibv_kern_send_wr *n, *tmp; - struct ibv_sge *s; + struct rdmav_post_send *cmd; + struct rdmav_post_send_resp resp; + struct rdmav_send_wr *i; + struct rdmav_kern_send_wr *n, *tmp; + struct rdmav_sge *s; unsigned wr_count = 0; unsigned sge_count = 0; int cmd_size; @@ -783,14 +790,14 @@ int ibv_cmd_post_send(struct ibv_qp *ibq cmd_size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s; cmd = alloca(cmd_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, POST_SEND, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, POST_SEND, &resp, sizeof resp); cmd->qp_handle = ibqp->handle; cmd->wr_count = wr_count; cmd->sge_count = sge_count; cmd->wqe_size = sizeof *n; - n = (struct ibv_kern_send_wr *) ((void *) cmd + sizeof *cmd); - s = (struct ibv_sge *) (n + wr_count); + n = (struct rdmav_kern_send_wr *) ((void *) cmd + sizeof *cmd); + s = (struct rdmav_sge *) (n + wr_count); tmp = n; for (i = wr; i; i = i->next) { @@ -799,21 +806,21 @@ int ibv_cmd_post_send(struct ibv_qp *ibq tmp->opcode = i->opcode; tmp->send_flags = i->send_flags; tmp->imm_data = i->imm_data; - if (ibqp->qp_type == IBV_QPT_UD) { + if (ibqp->qp_type == RDMAV_QPT_UD) { tmp->wr.ud.ah = i->wr.ud.ah->handle; tmp->wr.ud.remote_qpn = i->wr.ud.remote_qpn; tmp->wr.ud.remote_qkey = i->wr.ud.remote_qkey; } else { switch(i->opcode) { - case IBV_WR_RDMA_WRITE: - case IBV_WR_RDMA_WRITE_WITH_IMM: - case IBV_WR_RDMA_READ: + case RDMAV_WR_RDMA_WRITE: + case RDMAV_WR_RDMA_WRITE_WITH_IMM: + case RDMAV_WR_RDMA_READ: tmp->wr.rdma.remote_addr = i->wr.rdma.remote_addr; tmp->wr.rdma.rkey = i->wr.rdma.rkey; break; - case IBV_WR_ATOMIC_CMP_AND_SWP: - case IBV_WR_ATOMIC_FETCH_AND_ADD: + case RDMAV_WR_ATOMIC_CMP_AND_SWP: + case RDMAV_WR_ATOMIC_FETCH_AND_ADD: tmp->wr.atomic.remote_addr = i->wr.atomic.remote_addr; tmp->wr.atomic.compare_add = @@ -849,14 +856,14 @@ int ibv_cmd_post_send(struct ibv_qp *ibq return ret; } -int ibv_cmd_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr) +int rdmav_cmd_post_recv(struct rdmav_qp *ibqp, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr) { - struct ibv_post_recv *cmd; - struct ibv_post_recv_resp resp; - struct ibv_recv_wr *i; - struct ibv_kern_recv_wr *n, *tmp; - struct ibv_sge *s; + struct rdmav_post_recv *cmd; + struct rdmav_post_recv_resp resp; + struct rdmav_recv_wr *i; + struct rdmav_kern_recv_wr *n, *tmp; + struct rdmav_sge *s; unsigned wr_count = 0; unsigned sge_count = 0; int cmd_size; @@ -870,14 +877,14 @@ int ibv_cmd_post_recv(struct ibv_qp *ibq cmd_size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s; cmd = alloca(cmd_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, POST_RECV, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, POST_RECV, &resp, 
sizeof resp); cmd->qp_handle = ibqp->handle; cmd->wr_count = wr_count; cmd->sge_count = sge_count; cmd->wqe_size = sizeof *n; - n = (struct ibv_kern_recv_wr *) ((void *) cmd + sizeof *cmd); - s = (struct ibv_sge *) (n + wr_count); + n = (struct rdmav_kern_recv_wr *) ((void *) cmd + sizeof *cmd); + s = (struct rdmav_sge *) (n + wr_count); tmp = n; for (i = wr; i; i = i->next) { @@ -907,14 +914,14 @@ int ibv_cmd_post_recv(struct ibv_qp *ibq return ret; } -int ibv_cmd_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr) +int rdmav_cmd_post_srq_recv(struct rdmav_srq *srq, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr) { - struct ibv_post_srq_recv *cmd; - struct ibv_post_srq_recv_resp resp; - struct ibv_recv_wr *i; - struct ibv_kern_recv_wr *n, *tmp; - struct ibv_sge *s; + struct rdmav_post_srq_recv *cmd; + struct rdmav_post_srq_recv_resp resp; + struct rdmav_recv_wr *i; + struct rdmav_kern_recv_wr *n, *tmp; + struct rdmav_sge *s; unsigned wr_count = 0; unsigned sge_count = 0; int cmd_size; @@ -928,14 +935,14 @@ int ibv_cmd_post_srq_recv(struct ibv_srq cmd_size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s; cmd = alloca(cmd_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, POST_SRQ_RECV, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, POST_SRQ_RECV, &resp, sizeof resp); cmd->srq_handle = srq->handle; cmd->wr_count = wr_count; cmd->sge_count = sge_count; cmd->wqe_size = sizeof *n; - n = (struct ibv_kern_recv_wr *) ((void *) cmd + sizeof *cmd); - s = (struct ibv_sge *) (n + wr_count); + n = (struct rdmav_kern_recv_wr *) ((void *) cmd + sizeof *cmd); + s = (struct rdmav_sge *) (n + wr_count); tmp = n; for (i = wr; i; i = i->next) { @@ -965,13 +972,13 @@ int ibv_cmd_post_srq_recv(struct ibv_srq return ret; } -int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah, - struct ibv_ah_attr *attr) +int rdmav_cmd_create_ah(struct rdmav_pd *pd, struct rdmav_ah *ah, + struct rdmav_ah_attr *attr) { - struct ibv_create_ah cmd; - struct ibv_create_ah_resp resp; + struct rdmav_create_ah cmd; + struct rdmav_create_ah_resp resp; - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_AH, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_AH, &resp, sizeof resp); cmd.user_handle = (uintptr_t) ah; cmd.pd_handle = pd->handle; cmd.attr.dlid = attr->dlid; @@ -994,11 +1001,11 @@ int ibv_cmd_create_ah(struct ibv_pd *pd, return 0; } -int ibv_cmd_destroy_ah(struct ibv_ah *ah) +int rdmav_cmd_destroy_ah(struct rdmav_ah *ah) { - struct ibv_destroy_ah cmd; + struct rdmav_destroy_ah cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_AH); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DESTROY_AH); cmd.ah_handle = ah->handle; if (write(ah->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -1007,15 +1014,15 @@ int ibv_cmd_destroy_ah(struct ibv_ah *ah return 0; } -int ibv_cmd_destroy_qp(struct ibv_qp *qp) +int rdmav_cmd_destroy_qp(struct rdmav_qp *qp) { - struct ibv_destroy_qp cmd; - struct ibv_destroy_qp_resp resp; + struct rdmav_destroy_qp cmd; + struct rdmav_destroy_qp_resp resp; if (abi_ver == 1) - return ibv_cmd_destroy_qp_v1(qp); + return rdmav_cmd_destroy_qp_v1(qp); - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_QP, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_QP, &resp, sizeof resp); cmd.qp_handle = qp->handle; if (write(qp->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -1029,11 +1036,11 @@ int ibv_cmd_destroy_qp(struct ibv_qp *qp return 0; } -int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +int 
rdmav_cmd_attach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid) { - struct ibv_attach_mcast cmd; + struct rdmav_attach_mcast cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, ATTACH_MCAST); + RDMAV_INIT_CMD(&cmd, sizeof cmd, ATTACH_MCAST); memcpy(cmd.gid, gid->raw, sizeof cmd.gid); cmd.qp_handle = qp->handle; cmd.mlid = lid; @@ -1044,11 +1051,11 @@ int ibv_cmd_attach_mcast(struct ibv_qp * return 0; } -int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +int rdmav_cmd_detach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid) { - struct ibv_detach_mcast cmd; + struct rdmav_detach_mcast cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DETACH_MCAST); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DETACH_MCAST); memcpy(cmd.gid, gid->raw, sizeof cmd.gid); cmd.qp_handle = qp->handle; cmd.mlid = lid; diff -ruNp ORG/libibverbs/src/device.c NEW/libibverbs/src/device.c --- ORG/libibverbs/src/device.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/device.c 2006-08-02 23:57:31.000000000 -0700 @@ -48,23 +48,23 @@ #include -#include "ibverbs.h" +#include "rdmaverbs.h" static pthread_mutex_t device_list_lock = PTHREAD_MUTEX_INITIALIZER; static int num_devices; -static struct ibv_device **device_list; +static struct rdmav_device **device_list; -struct ibv_device **ibv_get_device_list(int *num) +struct rdmav_device **rdmav_get_device_list(int *num) { - struct ibv_device **l; + struct rdmav_device **l; int i; pthread_mutex_lock(&device_list_lock); if (!num_devices) - num_devices = ibverbs_init(&device_list); + num_devices = rdmaverbs_init(&device_list); - l = calloc(num_devices + 1, sizeof (struct ibv_device *)); + l = calloc(num_devices + 1, sizeof (struct rdmav_device *)); for (i = 0; i < num_devices; ++i) l[i] = device_list[i]; @@ -76,24 +76,30 @@ struct ibv_device **ibv_get_device_list( return l; } -void ibv_free_device_list(struct ibv_device **list) +/* XXX - to be removed when all apps are converted to new API */ +struct rdmav_device **ibv_get_device_list(int *num) +{ + return rdmav_get_device_list(num); +} + +void rdmav_free_device_list(struct rdmav_device **list) { free(list); } -const char *ibv_get_device_name(struct ibv_device *device) +const char *rdmav_get_device_name(struct rdmav_device *device) { return device->name; } -uint64_t ibv_get_device_guid(struct ibv_device *device) +uint64_t rdmav_get_device_guid(struct rdmav_device *device) { char attr[24]; uint64_t guid = 0; uint16_t parts[4]; int i; - if (ibv_read_sysfs_file(device->ibdev_path, "node_guid", + if (rdmav_read_sysfs_file(device->ibdev_path, "node_guid", attr, sizeof attr) < 0) return 0; @@ -107,11 +113,11 @@ uint64_t ibv_get_device_guid(struct ibv_ return htonll(guid); } -struct ibv_context *ibv_open_device(struct ibv_device *device) +struct rdmav_context *rdmav_open_device(struct rdmav_device *device) { char *devpath; int cmd_fd; - struct ibv_context *context; + struct rdmav_context *context; asprintf(&devpath, "/dev/infiniband/%s", device->dev_name); @@ -140,14 +146,14 @@ err: return NULL; } -int ibv_close_device(struct ibv_context *context) +int rdmav_close_device(struct rdmav_context *context) { int async_fd = context->async_fd; int cmd_fd = context->cmd_fd; int cq_fd = -1; if (abi_ver <= 2) { - struct ibv_abi_compat_v2 *t = context->abi_compat; + struct rdmav_abi_compat_v2 *t = context->abi_compat; cq_fd = t->channel.fd; free(context->abi_compat); } @@ -162,10 +168,10 @@ int ibv_close_device(struct ibv_context return 0; } -int ibv_get_async_event(struct ibv_context *context, - struct ibv_async_event 
*event) +int rdmav_get_async_event(struct rdmav_context *context, + struct rdmav_async_event *event) { - struct ibv_kern_async_event ev; + struct rdmav_kern_async_event ev; if (read(context->async_fd, &ev, sizeof ev) != sizeof ev) return -1; @@ -173,23 +179,23 @@ int ibv_get_async_event(struct ibv_conte event->event_type = ev.event_type; switch (event->event_type) { - case IBV_EVENT_CQ_ERR: + case RDMAV_EVENT_CQ_ERR: event->element.cq = (void *) (uintptr_t) ev.element; break; - case IBV_EVENT_QP_FATAL: - case IBV_EVENT_QP_REQ_ERR: - case IBV_EVENT_QP_ACCESS_ERR: - case IBV_EVENT_COMM_EST: - case IBV_EVENT_SQ_DRAINED: - case IBV_EVENT_PATH_MIG: - case IBV_EVENT_PATH_MIG_ERR: - case IBV_EVENT_QP_LAST_WQE_REACHED: + case RDMAV_EVENT_QP_FATAL: + case RDMAV_EVENT_QP_REQ_ERR: + case RDMAV_EVENT_QP_ACCESS_ERR: + case RDMAV_EVENT_COMM_EST: + case RDMAV_EVENT_SQ_DRAINED: + case RDMAV_EVENT_PATH_MIG: + case RDMAV_EVENT_PATH_MIG_ERR: + case RDMAV_EVENT_QP_LAST_WQE_REACHED: event->element.qp = (void *) (uintptr_t) ev.element; break; - case IBV_EVENT_SRQ_ERR: - case IBV_EVENT_SRQ_LIMIT_REACHED: + case RDMAV_EVENT_SRQ_ERR: + case RDMAV_EVENT_SRQ_LIMIT_REACHED: event->element.srq = (void *) (uintptr_t) ev.element; break; @@ -201,12 +207,12 @@ int ibv_get_async_event(struct ibv_conte return 0; } -void ibv_ack_async_event(struct ibv_async_event *event) +void rdmav_ack_async_event(struct rdmav_async_event *event) { switch (event->event_type) { - case IBV_EVENT_CQ_ERR: + case RDMAV_EVENT_CQ_ERR: { - struct ibv_cq *cq = event->element.cq; + struct rdmav_cq *cq = event->element.cq; pthread_mutex_lock(&cq->mutex); ++cq->async_events_completed; @@ -216,16 +222,16 @@ void ibv_ack_async_event(struct ibv_asyn return; } - case IBV_EVENT_QP_FATAL: - case IBV_EVENT_QP_REQ_ERR: - case IBV_EVENT_QP_ACCESS_ERR: - case IBV_EVENT_COMM_EST: - case IBV_EVENT_SQ_DRAINED: - case IBV_EVENT_PATH_MIG: - case IBV_EVENT_PATH_MIG_ERR: - case IBV_EVENT_QP_LAST_WQE_REACHED: + case RDMAV_EVENT_QP_FATAL: + case RDMAV_EVENT_QP_REQ_ERR: + case RDMAV_EVENT_QP_ACCESS_ERR: + case RDMAV_EVENT_COMM_EST: + case RDMAV_EVENT_SQ_DRAINED: + case RDMAV_EVENT_PATH_MIG: + case RDMAV_EVENT_PATH_MIG_ERR: + case RDMAV_EVENT_QP_LAST_WQE_REACHED: { - struct ibv_qp *qp = event->element.qp; + struct rdmav_qp *qp = event->element.qp; pthread_mutex_lock(&qp->mutex); ++qp->events_completed; @@ -235,10 +241,10 @@ void ibv_ack_async_event(struct ibv_asyn return; } - case IBV_EVENT_SRQ_ERR: - case IBV_EVENT_SRQ_LIMIT_REACHED: + case RDMAV_EVENT_SRQ_ERR: + case RDMAV_EVENT_SRQ_LIMIT_REACHED: { - struct ibv_srq *srq = event->element.srq; + struct rdmav_srq *srq = event->element.srq; pthread_mutex_lock(&srq->mutex); ++srq->events_completed; diff -ruNp ORG/libibverbs/src/ibverbs.h NEW/libibverbs/src/ibverbs.h --- ORG/libibverbs/src/ibverbs.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/ibverbs.h 1969-12-31 16:00:00.000000000 -0800 @@ -1,88 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id: ibverbs.h 4466 2005-12-14 20:44:36Z roland $ - */ - -#ifndef IB_VERBS_H -#define IB_VERBS_H - -#include - -#include - -#define HIDDEN __attribute__((visibility ("hidden"))) - -#define INIT __attribute__((constructor)) -#define FINI __attribute__((destructor)) - -#define PFX "libibverbs: " - -struct ibv_driver { - ibv_driver_init_func init_func; - struct ibv_driver *next; -}; - -struct ibv_abi_compat_v2 { - struct ibv_comp_channel channel; - pthread_mutex_t in_use; -}; - -extern HIDDEN int abi_ver; - -extern HIDDEN int ibverbs_init(struct ibv_device ***list); - -extern HIDDEN int ibv_init_mem_map(void); -extern HIDDEN int ibv_lock_range(void *base, size_t size); -extern HIDDEN int ibv_unlock_range(void *base, size_t size); - -#define IBV_INIT_CMD(cmd, size, opcode) \ - do { \ - if (abi_ver > 2) \ - (cmd)->command = IB_USER_VERBS_CMD_##opcode; \ - else \ - (cmd)->command = IB_USER_VERBS_CMD_##opcode##_V2; \ - (cmd)->in_words = (size) / 4; \ - (cmd)->out_words = 0; \ - } while (0) - -#define IBV_INIT_CMD_RESP(cmd, size, opcode, out, outsize) \ - do { \ - if (abi_ver > 2) \ - (cmd)->command = IB_USER_VERBS_CMD_##opcode; \ - else \ - (cmd)->command = IB_USER_VERBS_CMD_##opcode##_V2; \ - (cmd)->in_words = (size) / 4; \ - (cmd)->out_words = (outsize) / 4; \ - (cmd)->response = (uintptr_t) (out); \ - } while (0) - -#endif /* IB_VERBS_H */ diff -ruNp ORG/libibverbs/src/init.c NEW/libibverbs/src/init.c --- ORG/libibverbs/src/init.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/init.c 2006-08-02 18:24:49.000000000 -0700 @@ -46,24 +46,28 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" #ifndef OPENIB_DRIVER_PATH_ENV # define OPENIB_DRIVER_PATH_ENV "OPENIB_DRIVER_PATH" #endif +#ifndef LIBRDMAVERBS_DRIVER_PATH_ENV +# define LIBRDMAVERBS_DRIVER_PATH_ENV "LIBRDMAVERBS_DRIVER_PATH" +#endif + HIDDEN int abi_ver; static char default_path[] = DRIVER_PATH; static const char *user_path; -static struct ibv_driver *driver_list; +static struct rdmav_driver *driver_list; static void load_driver(char *so_path) { void *dlhandle; - ibv_driver_init_func init_func; - struct ibv_driver *driver; + rdmav_driver_init_func init_func; + struct rdmav_driver *driver; dlhandle = dlopen(so_path, RTLD_NOW); if (!dlhandle) { @@ -81,7 +85,8 @@ static void 
load_driver(char *so_path) driver = malloc(sizeof *driver); if (!driver) { - fprintf(stderr, PFX "Fatal: couldn't allocate driver for %s\n", so_path); + fprintf(stderr, PFX "Fatal: couldn't allocate driver for %s\n", + so_path); dlclose(dlhandle); return; } @@ -122,23 +127,25 @@ static void find_drivers(char *dir) globfree(&so_glob); } -static struct ibv_device *init_drivers(const char *class_path, +static struct rdmav_device *init_drivers(const char *class_path, const char *dev_name) { - struct ibv_driver *driver; - struct ibv_device *dev; + struct rdmav_driver *driver; + struct rdmav_device *dev; int abi_ver = 0; - char sys_path[IBV_SYSFS_PATH_MAX]; - char ibdev_name[IBV_SYSFS_NAME_MAX]; + char sys_path[RDMAV_SYSFS_PATH_MAX]; + char ibdev_name[RDMAV_SYSFS_NAME_MAX]; char value[8]; snprintf(sys_path, sizeof sys_path, "%s/%s", class_path, dev_name); - if (ibv_read_sysfs_file(sys_path, "abi_version", value, sizeof value) > 0) + if (rdmav_read_sysfs_file(sys_path, "abi_version", value, + sizeof value) > 0) abi_ver = strtol(value, NULL, 10); - if (ibv_read_sysfs_file(sys_path, "ibdev", ibdev_name, sizeof ibdev_name) < 0) { + if (rdmav_read_sysfs_file(sys_path, "ibdev", ibdev_name, + sizeof ibdev_name) < 0) { fprintf(stderr, PFX "Warning: no ibdev class attr for %s\n", sys_path); return NULL; @@ -151,8 +158,9 @@ static struct ibv_device *init_drivers(c dev->driver = driver; strcpy(dev->dev_path, sys_path); - snprintf(dev->ibdev_path, IBV_SYSFS_PATH_MAX, "%s/class/infiniband/%s", - ibv_get_sysfs_path(), ibdev_name); + snprintf(dev->ibdev_path, RDMAV_SYSFS_PATH_MAX, + "%s/class/infiniband/%s", + rdmav_get_sysfs_path(), ibdev_name); strcpy(dev->dev_name, dev_name); strcpy(dev->name, ibdev_name); @@ -172,7 +180,7 @@ static int check_abi_version(const char { char value[8]; - if (ibv_read_sysfs_file(path, "class/infiniband_verbs/abi_version", + if (rdmav_read_sysfs_file(path, "class/infiniband_verbs/abi_version", value, sizeof value) < 0) { fprintf(stderr, PFX "Fatal: couldn't read uverbs ABI version.\n"); return -1; @@ -180,32 +188,32 @@ static int check_abi_version(const char abi_ver = strtol(value, NULL, 10); - if (abi_ver < IB_USER_VERBS_MIN_ABI_VERSION || - abi_ver > IB_USER_VERBS_MAX_ABI_VERSION) { + if (abi_ver < RDMAV_USER_VERBS_MIN_ABI_VERSION || + abi_ver > RDMAV_USER_VERBS_MAX_ABI_VERSION) { fprintf(stderr, PFX "Fatal: kernel ABI version %d " "doesn't match library version %d.\n", - abi_ver, IB_USER_VERBS_MAX_ABI_VERSION); + abi_ver, RDMAV_USER_VERBS_MAX_ABI_VERSION); return -1; } return 0; } -HIDDEN int ibverbs_init(struct ibv_device ***list) +HIDDEN int rdmaverbs_init(struct rdmav_device ***list) { const char *sysfs_path; char *wr_path, *dir; - char class_path[IBV_SYSFS_PATH_MAX]; + char class_path[RDMAV_SYSFS_PATH_MAX]; DIR *class_dir; struct dirent *dent; - struct ibv_device *device; - struct ibv_device **new_list; + struct rdmav_device *device; + struct rdmav_device **new_list; int num_devices = 0; int list_size = 0; *list = NULL; - if (ibv_init_mem_map()) + if (rdmav_init_mem_map()) return 0; find_drivers(default_path); @@ -215,12 +223,22 @@ HIDDEN int ibverbs_init(struct ibv_devic * environment if we're not running SUID. 
*/ if (getuid() == geteuid()) { - user_path = getenv(OPENIB_DRIVER_PATH_ENV); + const char *user_path_extra; + + user_path = getenv(LIBRDMAVERBS_DRIVER_PATH_ENV); if (user_path) { wr_path = strdupa(user_path); while ((dir = strsep(&wr_path, ";:"))) find_drivers(dir); } + + /* for backwards compatibility */ + user_path_extra = getenv(OPENIB_DRIVER_PATH_ENV); + if (user_path_extra) { + wr_path = strdupa(user_path_extra); + while ((dir = strsep(&wr_path, ";:"))) + find_drivers(dir); + } } /* @@ -230,7 +248,7 @@ HIDDEN int ibverbs_init(struct ibv_devic */ load_driver(NULL); - sysfs_path = ibv_get_sysfs_path(); + sysfs_path = rdmav_get_sysfs_path(); if (!sysfs_path) { fprintf(stderr, PFX "Fatal: couldn't find sysfs mount.\n"); return 0; @@ -258,7 +276,7 @@ HIDDEN int ibverbs_init(struct ibv_devic if (list_size <= num_devices) { list_size = list_size ? list_size * 2 : 1; - new_list = realloc(*list, list_size * sizeof (struct ibv_device *)); + new_list = realloc(*list, list_size * sizeof (struct rdmav_device *)); if (!new_list) goto out; *list = new_list; diff -ruNp ORG/libibverbs/src/libibverbs.map NEW/libibverbs/src/libibverbs.map --- ORG/libibverbs/src/libibverbs.map 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/libibverbs.map 1969-12-31 16:00:00.000000000 -0800 @@ -1,79 +0,0 @@ -IBVERBS_1.0 { - global: - ibv_get_device_list; - ibv_free_device_list; - ibv_get_device_name; - ibv_get_device_guid; - ibv_open_device; - ibv_close_device; - ibv_get_async_event; - ibv_ack_async_event; - ibv_query_device; - ibv_query_port; - ibv_query_gid; - ibv_query_pkey; - ibv_alloc_pd; - ibv_dealloc_pd; - ibv_reg_mr; - ibv_dereg_mr; - ibv_create_comp_channel; - ibv_destroy_comp_channel; - ibv_create_cq; - ibv_resize_cq; - ibv_destroy_cq; - ibv_get_cq_event; - ibv_ack_cq_events; - ibv_create_srq; - ibv_modify_srq; - ibv_query_srq; - ibv_destroy_srq; - ibv_create_qp; - ibv_query_qp; - ibv_modify_qp; - ibv_destroy_qp; - ibv_create_ah; - ibv_init_ah_from_wc; - ibv_create_ah_from_wc; - ibv_destroy_ah; - ibv_attach_mcast; - ibv_detach_mcast; - ibv_cmd_get_context; - ibv_cmd_query_device; - ibv_cmd_query_port; - ibv_cmd_query_gid; - ibv_cmd_query_pkey; - ibv_cmd_alloc_pd; - ibv_cmd_dealloc_pd; - ibv_cmd_reg_mr; - ibv_cmd_dereg_mr; - ibv_cmd_create_cq; - ibv_cmd_poll_cq; - ibv_cmd_req_notify_cq; - ibv_cmd_resize_cq; - ibv_cmd_destroy_cq; - ibv_cmd_create_srq; - ibv_cmd_modify_srq; - ibv_cmd_query_srq; - ibv_cmd_destroy_srq; - ibv_cmd_create_qp; - ibv_cmd_query_qp; - ibv_cmd_modify_qp; - ibv_cmd_destroy_qp; - ibv_cmd_post_send; - ibv_cmd_post_recv; - ibv_cmd_post_srq_recv; - ibv_cmd_create_ah; - ibv_cmd_destroy_ah; - ibv_cmd_attach_mcast; - ibv_cmd_detach_mcast; - ibv_copy_qp_attr_from_kern; - ibv_copy_ah_attr_from_kern; - ibv_copy_path_rec_from_kern; - ibv_copy_path_rec_to_kern; - ibv_rate_to_mult; - mult_to_ibv_rate; - ibv_get_sysfs_path; - ibv_read_sysfs_file; - - local: *; -}; diff -ruNp ORG/libibverbs/src/librdmaverbs.map NEW/libibverbs/src/librdmaverbs.map --- ORG/libibverbs/src/librdmaverbs.map 1969-12-31 16:00:00.000000000 -0800 +++ NEW/libibverbs/src/librdmaverbs.map 2006-08-02 23:50:50.000000000 -0700 @@ -0,0 +1,80 @@ +RDMAVERBS_1.0 { + global: + ibv_get_device_list; + rdmav_get_device_list; + rdmav_free_device_list; + rdmav_get_device_name; + rdmav_get_device_guid; + rdmav_open_device; + rdmav_close_device; + rdmav_get_async_event; + rdmav_ack_async_event; + rdmav_query_device; + rdmav_query_port; + rdmav_query_gid; + rdmav_query_pkey; + rdmav_alloc_pd; + rdmav_dealloc_pd; + rdmav_reg_mr; + 
rdmav_dereg_mr; + rdmav_create_comp_channel; + rdmav_destroy_comp_channel; + rdmav_create_cq; + rdmav_resize_cq; + rdmav_destroy_cq; + rdmav_get_cq_event; + rdmav_ack_cq_events; + rdmav_create_srq; + rdmav_modify_srq; + rdmav_query_srq; + rdmav_destroy_srq; + rdmav_create_qp; + rdmav_query_qp; + rdmav_modify_qp; + rdmav_destroy_qp; + rdmav_create_ah; + rdmav_init_ah_from_wc; + rdmav_create_ah_from_wc; + rdmav_destroy_ah; + rdmav_attach_mcast; + rdmav_detach_mcast; + rdmav_cmd_get_context; + rdmav_cmd_query_device; + rdmav_cmd_query_port; + rdmav_cmd_query_gid; + rdmav_cmd_query_pkey; + rdmav_cmd_alloc_pd; + rdmav_cmd_dealloc_pd; + rdmav_cmd_reg_mr; + rdmav_cmd_dereg_mr; + rdmav_cmd_create_cq; + rdmav_cmd_poll_cq; + rdmav_cmd_req_notify_cq; + rdmav_cmd_resize_cq; + rdmav_cmd_destroy_cq; + rdmav_cmd_create_srq; + rdmav_cmd_modify_srq; + rdmav_cmd_query_srq; + rdmav_cmd_destroy_srq; + rdmav_cmd_create_qp; + rdmav_cmd_query_qp; + rdmav_cmd_modify_qp; + rdmav_cmd_destroy_qp; + rdmav_cmd_post_send; + rdmav_cmd_post_recv; + rdmav_cmd_post_srq_recv; + rdmav_cmd_create_ah; + rdmav_cmd_destroy_ah; + rdmav_cmd_attach_mcast; + rdmav_cmd_detach_mcast; + rdmav_copy_qp_attr_from_kern; + rdmav_copy_ah_attr_from_kern; + rdmav_copy_path_rec_from_kern; + rdmav_copy_path_rec_to_kern; + rdmav_rate_to_mult; + mult_to_rdmav_rate; + rdmav_get_sysfs_path; + rdmav_read_sysfs_file; + + local: *; +}; diff -ruNp ORG/libibverbs/src/marshall.c NEW/libibverbs/src/marshall.c --- ORG/libibverbs/src/marshall.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/marshall.c 2006-08-02 18:24:49.000000000 -0700 @@ -38,8 +38,8 @@ #include -void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, - struct ibv_kern_ah_attr *src) +void rdmav_copy_ah_attr_from_kern(struct rdmav_ah_attr *dst, + struct rdmav_kern_ah_attr *src) { memcpy(dst->grh.dgid.raw, src->grh.dgid, sizeof dst->grh.dgid); dst->grh.flow_label = src->grh.flow_label; @@ -55,8 +55,8 @@ void ibv_copy_ah_attr_from_kern(struct i dst->port_num = src->port_num; } -void ibv_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, - struct ibv_kern_qp_attr *src) +void rdmav_copy_qp_attr_from_kern(struct rdmav_qp_attr *dst, + struct rdmav_kern_qp_attr *src) { dst->cur_qp_state = src->cur_qp_state; dst->path_mtu = src->path_mtu; @@ -73,8 +73,8 @@ void ibv_copy_qp_attr_from_kern(struct i dst->cap.max_recv_sge = src->max_recv_sge; dst->cap.max_inline_data = src->max_inline_data; - ibv_copy_ah_attr_from_kern(&dst->ah_attr, &src->ah_attr); - ibv_copy_ah_attr_from_kern(&dst->alt_ah_attr, &src->alt_ah_attr); + rdmav_copy_ah_attr_from_kern(&dst->ah_attr, &src->ah_attr); + rdmav_copy_ah_attr_from_kern(&dst->alt_ah_attr, &src->alt_ah_attr); dst->pkey_index = src->pkey_index; dst->alt_pkey_index = src->alt_pkey_index; @@ -91,8 +91,8 @@ void ibv_copy_qp_attr_from_kern(struct i dst->alt_timeout = src->alt_timeout; } -void ibv_copy_path_rec_from_kern(struct ibv_sa_path_rec *dst, - struct ibv_kern_path_rec *src) +void rdmav_copy_path_rec_from_kern(struct rdmav_sa_path_rec *dst, + struct rdmav_kern_path_rec *src) { memcpy(dst->dgid.raw, src->dgid, sizeof dst->dgid); memcpy(dst->sgid.raw, src->sgid, sizeof dst->sgid); @@ -116,8 +116,8 @@ void ibv_copy_path_rec_from_kern(struct dst->packet_life_time_selector = src->packet_life_time_selector; } -void ibv_copy_path_rec_to_kern(struct ibv_kern_path_rec *dst, - struct ibv_sa_path_rec *src) +void rdmav_copy_path_rec_to_kern(struct rdmav_kern_path_rec *dst, + struct rdmav_sa_path_rec *src) { memcpy(dst->dgid, src->dgid.raw, sizeof src->dgid); 
memcpy(dst->sgid, src->sgid.raw, sizeof src->sgid); diff -ruNp ORG/libibverbs/src/memory.c NEW/libibverbs/src/memory.c --- ORG/libibverbs/src/memory.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/memory.c 2006-08-02 18:24:49.000000000 -0700 @@ -41,7 +41,7 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" /* * We keep a linked list of page ranges that have been locked along with a @@ -51,21 +51,21 @@ * to avoid the O(n) cost of registering/unregistering memory. */ -struct ibv_mem_node { - struct ibv_mem_node *prev, *next; +struct rdmav_mem_node { + struct rdmav_mem_node *prev, *next; uintptr_t start, end; int refcnt; }; static struct { - struct ibv_mem_node *first; + struct rdmav_mem_node *first; pthread_mutex_t mutex; uintptr_t page_size; } mem_map; -int ibv_init_mem_map(void) +int rdmav_init_mem_map(void) { - struct ibv_mem_node *node = NULL; + struct rdmav_mem_node *node = NULL; node = malloc(sizeof *node); if (!node) @@ -94,9 +94,9 @@ fail: return -1; } -static struct ibv_mem_node *__mm_find_first(uintptr_t start, uintptr_t end) +static struct rdmav_mem_node *__mm_find_first(uintptr_t start, uintptr_t end) { - struct ibv_mem_node *node = mem_map.first; + struct rdmav_mem_node *node = mem_map.first; while (node) { if ((node->start <= start && node->end >= start) || @@ -108,18 +108,18 @@ static struct ibv_mem_node *__mm_find_fi return node; } -static struct ibv_mem_node *__mm_prev(struct ibv_mem_node *node) +static struct rdmav_mem_node *__mm_prev(struct rdmav_mem_node *node) { return node->prev; } -static struct ibv_mem_node *__mm_next(struct ibv_mem_node *node) +static struct rdmav_mem_node *__mm_next(struct rdmav_mem_node *node) { return node->next; } -static void __mm_add(struct ibv_mem_node *node, - struct ibv_mem_node *new) +static void __mm_add(struct rdmav_mem_node *node, + struct rdmav_mem_node *new) { new->prev = node; new->next = node->next; @@ -128,7 +128,7 @@ static void __mm_add(struct ibv_mem_node new->next->prev = new; } -static void __mm_remove(struct ibv_mem_node *node) +static void __mm_remove(struct rdmav_mem_node *node) { /* Never have to remove the first node, so we can use prev */ node->prev->next = node->next; @@ -136,10 +136,10 @@ static void __mm_remove(struct ibv_mem_n node->next->prev = node->prev; } -int ibv_lock_range(void *base, size_t size) +int rdmav_lock_range(void *base, size_t size) { uintptr_t start, end; - struct ibv_mem_node *node, *tmp; + struct rdmav_mem_node *node, *tmp; int ret = 0; if (!size) @@ -202,10 +202,10 @@ out: return ret; } -int ibv_unlock_range(void *base, size_t size) +int rdmav_unlock_range(void *base, size_t size) { uintptr_t start, end; - struct ibv_mem_node *node, *tmp; + struct rdmav_mem_node *node, *tmp; int ret = 0; if (!size) diff -ruNp ORG/libibverbs/src/rdmaverbs.h NEW/libibverbs/src/rdmaverbs.h --- ORG/libibverbs/src/rdmaverbs.h 1969-12-31 16:00:00.000000000 -0800 +++ NEW/libibverbs/src/rdmaverbs.h 2006-08-03 17:29:42.000000000 -0700 @@ -0,0 +1,91 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: + */ + +#ifndef SRC_RDMA_VERBS_H +#define SRC_RDMA_VERBS_H + +#include + +#include +#include + +#define HIDDEN __attribute__((visibility ("hidden"))) + +#define INIT __attribute__((constructor)) +#define FINI __attribute__((destructor)) + +#ifndef PFX +#define PFX "librdmaverbs: " +#endif + +struct rdmav_driver { + rdmav_driver_init_func init_func; + struct rdmav_driver *next; +}; + +struct rdmav_abi_compat_v2 { + struct rdmav_comp_channel channel; + pthread_mutex_t in_use; +}; + +extern HIDDEN int abi_ver; + +extern HIDDEN int rdmaverbs_init(struct rdmav_device ***list); + +extern HIDDEN int rdmav_init_mem_map(void); +extern HIDDEN int rdmav_lock_range(void *base, size_t size); +extern HIDDEN int rdmav_unlock_range(void *base, size_t size); + +#define RDMAV_INIT_CMD(cmd, size, opcode) \ + do { \ + if (abi_ver > 2) \ + (cmd)->command = RDMAV_USER_VERBS_CMD_##opcode; \ + else \ + (cmd)->command = RDMAV_USER_VERBS_CMD_##opcode##_V2; \ + (cmd)->in_words = (size) / 4; \ + (cmd)->out_words = 0; \ + } while (0) + +#define RDMAV_INIT_CMD_RESP(cmd, size, opcode, out, outsize) \ + do { \ + if (abi_ver > 2) \ + (cmd)->command = RDMAV_USER_VERBS_CMD_##opcode; \ + else \ + (cmd)->command = RDMAV_USER_VERBS_CMD_##opcode##_V2; \ + (cmd)->in_words = (size) / 4; \ + (cmd)->out_words = (outsize) / 4; \ + (cmd)->response = (uintptr_t) (out); \ + } while (0) + +#endif /* SRC_RDMA_VERBS_H */ diff -ruNp ORG/libibverbs/src/sysfs.c NEW/libibverbs/src/sysfs.c --- ORG/libibverbs/src/sysfs.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/sysfs.c 2006-08-02 18:24:49.000000000 -0700 @@ -44,11 +44,11 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" static char *sysfs_path; -const char *ibv_get_sysfs_path(void) +const char *rdmav_get_sysfs_path(void) { char *env = NULL; @@ -65,7 +65,7 @@ const char *ibv_get_sysfs_path(void) if (env) { int len; - sysfs_path = strndup(env, IBV_SYSFS_PATH_MAX); + sysfs_path = strndup(env, RDMAV_SYSFS_PATH_MAX); len = strlen(sysfs_path); while (len > 0 && sysfs_path[len - 1] == '/') { --len; @@ -77,7 +77,7 @@ const char *ibv_get_sysfs_path(void) return sysfs_path; } -int ibv_read_sysfs_file(const char *dir, const char *file, +int rdmav_read_sysfs_file(const char *dir, const char *file, char *buf, size_t size) { char *path; 
diff -ruNp ORG/libibverbs/src/verbs.c NEW/libibverbs/src/verbs.c --- ORG/libibverbs/src/verbs.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/verbs.c 2006-08-02 18:24:49.000000000 -0700 @@ -44,54 +44,54 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" -int ibv_rate_to_mult(enum ibv_rate rate) +int rdmav_rate_to_mult(enum rdmav_rate rate) { switch (rate) { - case IBV_RATE_2_5_GBPS: return 1; - case IBV_RATE_5_GBPS: return 2; - case IBV_RATE_10_GBPS: return 4; - case IBV_RATE_20_GBPS: return 8; - case IBV_RATE_30_GBPS: return 12; - case IBV_RATE_40_GBPS: return 16; - case IBV_RATE_60_GBPS: return 24; - case IBV_RATE_80_GBPS: return 32; - case IBV_RATE_120_GBPS: return 48; + case RDMAV_RATE_2_5_GBPS: return 1; + case RDMAV_RATE_5_GBPS: return 2; + case RDMAV_RATE_10_GBPS: return 4; + case RDMAV_RATE_20_GBPS: return 8; + case RDMAV_RATE_30_GBPS: return 12; + case RDMAV_RATE_40_GBPS: return 16; + case RDMAV_RATE_60_GBPS: return 24; + case RDMAV_RATE_80_GBPS: return 32; + case RDMAV_RATE_120_GBPS: return 48; default: return -1; } } -enum ibv_rate mult_to_ibv_rate(int mult) +enum rdmav_rate mult_to_rdmav_rate(int mult) { switch (mult) { - case 1: return IBV_RATE_2_5_GBPS; - case 2: return IBV_RATE_5_GBPS; - case 4: return IBV_RATE_10_GBPS; - case 8: return IBV_RATE_20_GBPS; - case 12: return IBV_RATE_30_GBPS; - case 16: return IBV_RATE_40_GBPS; - case 24: return IBV_RATE_60_GBPS; - case 32: return IBV_RATE_80_GBPS; - case 48: return IBV_RATE_120_GBPS; - default: return IBV_RATE_MAX; + case 1: return RDMAV_RATE_2_5_GBPS; + case 2: return RDMAV_RATE_5_GBPS; + case 4: return RDMAV_RATE_10_GBPS; + case 8: return RDMAV_RATE_20_GBPS; + case 12: return RDMAV_RATE_30_GBPS; + case 16: return RDMAV_RATE_40_GBPS; + case 24: return RDMAV_RATE_60_GBPS; + case 32: return RDMAV_RATE_80_GBPS; + case 48: return RDMAV_RATE_120_GBPS; + default: return RDMAV_RATE_MAX; } } -int ibv_query_device(struct ibv_context *context, - struct ibv_device_attr *device_attr) +int rdmav_query_device(struct rdmav_context *context, + struct rdmav_device_attr *device_attr) { return context->ops.query_device(context, device_attr); } -int ibv_query_port(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr) +int rdmav_query_port(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr) { return context->ops.query_port(context, port_num, port_attr); } -int ibv_query_gid(struct ibv_context *context, uint8_t port_num, - int index, union ibv_gid *gid) +int rdmav_query_gid(struct rdmav_context *context, uint8_t port_num, + int index, union rdmav_gid *gid) { char name[24]; char attr[41]; @@ -100,7 +100,7 @@ int ibv_query_gid(struct ibv_context *co snprintf(name, sizeof name, "ports/%d/gids/%d", port_num, index); - if (ibv_read_sysfs_file(context->device->ibdev_path, name, + if (rdmav_read_sysfs_file(context->device->ibdev_path, name, attr, sizeof attr) < 0) return -1; @@ -114,7 +114,7 @@ int ibv_query_gid(struct ibv_context *co return 0; } -int ibv_query_pkey(struct ibv_context *context, uint8_t port_num, +int rdmav_query_pkey(struct rdmav_context *context, uint8_t port_num, int index, uint16_t *pkey) { char name[24]; @@ -123,7 +123,7 @@ int ibv_query_pkey(struct ibv_context *c snprintf(name, sizeof name, "ports/%d/pkeys/%d", port_num, index); - if (ibv_read_sysfs_file(context->device->ibdev_path, name, + if (rdmav_read_sysfs_file(context->device->ibdev_path, name, attr, sizeof attr) < 0) return -1; @@ -134,9 +134,9 @@ int ibv_query_pkey(struct ibv_context 
*c return 0; } -struct ibv_pd *ibv_alloc_pd(struct ibv_context *context) +struct rdmav_pd *rdmav_alloc_pd(struct rdmav_context *context) { - struct ibv_pd *pd; + struct rdmav_pd *pd; pd = context->ops.alloc_pd(context); if (pd) @@ -145,15 +145,15 @@ struct ibv_pd *ibv_alloc_pd(struct ibv_c return pd; } -int ibv_dealloc_pd(struct ibv_pd *pd) +int rdmav_dealloc_pd(struct rdmav_pd *pd) { return pd->context->ops.dealloc_pd(pd); } -struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, - size_t length, enum ibv_access_flags access) +struct rdmav_mr *rdmav_reg_mr(struct rdmav_pd *pd, void *addr, + size_t length, enum rdmav_access_flags access) { - struct ibv_mr *mr; + struct rdmav_mr *mr; mr = pd->context->ops.reg_mr(pd, addr, length, access); if (mr) { @@ -164,14 +164,14 @@ struct ibv_mr *ibv_reg_mr(struct ibv_pd return mr; } -int ibv_dereg_mr(struct ibv_mr *mr) +int rdmav_dereg_mr(struct rdmav_mr *mr) { return mr->context->ops.dereg_mr(mr); } -static struct ibv_comp_channel *ibv_create_comp_channel_v2(struct ibv_context *context) +static struct rdmav_comp_channel *rdmav_create_comp_channel_v2(struct rdmav_context *context) { - struct ibv_abi_compat_v2 *t = context->abi_compat; + struct rdmav_abi_compat_v2 *t = context->abi_compat; static int warned; if (!pthread_mutex_trylock(&t->in_use)) @@ -187,20 +187,20 @@ static struct ibv_comp_channel *ibv_crea return NULL; } -struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context) +struct rdmav_comp_channel *rdmav_create_comp_channel(struct rdmav_context *context) { - struct ibv_comp_channel *channel; - struct ibv_create_comp_channel cmd; - struct ibv_create_comp_channel_resp resp; + struct rdmav_comp_channel *channel; + struct rdmav_create_comp_channel cmd; + struct rdmav_create_comp_channel_resp resp; if (abi_ver <= 2) - return ibv_create_comp_channel_v2(context); + return rdmav_create_comp_channel_v2(context); channel = malloc(sizeof *channel); if (!channel) return NULL; - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_COMP_CHANNEL, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_COMP_CHANNEL, &resp, sizeof resp); if (write(context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) { free(channel); return NULL; @@ -211,17 +211,17 @@ struct ibv_comp_channel *ibv_create_comp return channel; } -static int ibv_destroy_comp_channel_v2(struct ibv_comp_channel *channel) +static int rdmav_destroy_comp_channel_v2(struct rdmav_comp_channel *channel) { - struct ibv_abi_compat_v2 *t = (struct ibv_abi_compat_v2 *) channel; + struct rdmav_abi_compat_v2 *t = (struct rdmav_abi_compat_v2 *) channel; pthread_mutex_unlock(&t->in_use); return 0; } -int ibv_destroy_comp_channel(struct ibv_comp_channel *channel) +int rdmav_destroy_comp_channel(struct rdmav_comp_channel *channel) { if (abi_ver <= 2) - return ibv_destroy_comp_channel_v2(channel); + return rdmav_destroy_comp_channel_v2(channel); close(channel->fd); free(channel); @@ -229,10 +229,12 @@ int ibv_destroy_comp_channel(struct ibv_ return 0; } -struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe, void *cq_context, - struct ibv_comp_channel *channel, int comp_vector) +struct rdmav_cq *rdmav_create_cq(struct rdmav_context *context, int cqe, + void *cq_context, + struct rdmav_comp_channel *channel, + int comp_vector) { - struct ibv_cq *cq = context->ops.create_cq(context, cqe, channel, + struct rdmav_cq *cq = context->ops.create_cq(context, cqe, channel, comp_vector); if (cq) { @@ -247,7 +249,7 @@ struct ibv_cq *ibv_create_cq(struct ibv_ return cq; } -int 
ibv_resize_cq(struct ibv_cq *cq, int cqe) +int rdmav_resize_cq(struct rdmav_cq *cq, int cqe) { if (!cq->context->ops.resize_cq) return ENOSYS; @@ -255,21 +257,20 @@ int ibv_resize_cq(struct ibv_cq *cq, int return cq->context->ops.resize_cq(cq, cqe); } -int ibv_destroy_cq(struct ibv_cq *cq) +int rdmav_destroy_cq(struct rdmav_cq *cq) { return cq->context->ops.destroy_cq(cq); } - -int ibv_get_cq_event(struct ibv_comp_channel *channel, - struct ibv_cq **cq, void **cq_context) +int rdmav_get_cq_event(struct rdmav_comp_channel *channel, + struct rdmav_cq **cq, void **cq_context) { - struct ibv_comp_event ev; + struct rdmav_comp_event ev; if (read(channel->fd, &ev, sizeof ev) != sizeof ev) return -1; - *cq = (struct ibv_cq *) (uintptr_t) ev.cq_handle; + *cq = (struct rdmav_cq *) (uintptr_t) ev.cq_handle; *cq_context = (*cq)->cq_context; if ((*cq)->context->ops.cq_event) @@ -278,7 +279,7 @@ int ibv_get_cq_event(struct ibv_comp_cha return 0; } -void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents) +void rdmav_ack_cq_events(struct rdmav_cq *cq, unsigned int nevents) { pthread_mutex_lock(&cq->mutex); cq->comp_events_completed += nevents; @@ -286,10 +287,10 @@ void ibv_ack_cq_events(struct ibv_cq *cq pthread_mutex_unlock(&cq->mutex); } -struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, - struct ibv_srq_init_attr *srq_init_attr) +struct rdmav_srq *rdmav_create_srq(struct rdmav_pd *pd, + struct rdmav_srq_init_attr *srq_init_attr) { - struct ibv_srq *srq; + struct rdmav_srq *srq; if (!pd->context->ops.create_srq) return NULL; @@ -307,27 +308,27 @@ struct ibv_srq *ibv_create_srq(struct ib return srq; } -int ibv_modify_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask) +int rdmav_modify_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask) { return srq->context->ops.modify_srq(srq, srq_attr, srq_attr_mask); } -int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr) +int rdmav_query_srq(struct rdmav_srq *srq, struct rdmav_srq_attr *srq_attr) { return srq->context->ops.query_srq(srq, srq_attr); } -int ibv_destroy_srq(struct ibv_srq *srq) +int rdmav_destroy_srq(struct rdmav_srq *srq) { return srq->context->ops.destroy_srq(srq); } -struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, - struct ibv_qp_init_attr *qp_init_attr) +struct rdmav_qp *rdmav_create_qp(struct rdmav_pd *pd, + struct rdmav_qp_init_attr *qp_init_attr) { - struct ibv_qp *qp = pd->context->ops.create_qp(pd, qp_init_attr); + struct rdmav_qp *qp = pd->context->ops.create_qp(pd, qp_init_attr); if (qp) { qp->context = pd->context; @@ -345,9 +346,9 @@ struct ibv_qp *ibv_create_qp(struct ibv_ return qp; } -int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *init_attr) +int rdmav_query_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *init_attr) { int ret; @@ -355,14 +356,14 @@ int ibv_query_qp(struct ibv_qp *qp, stru if (ret) return ret; - if (attr_mask & IBV_QP_STATE) + if (attr_mask & RDMAV_QP_STATE) qp->state = attr->qp_state; return 0; } -int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask) +int rdmav_modify_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask) { int ret; @@ -370,20 +371,20 @@ int ibv_modify_qp(struct ibv_qp *qp, str if (ret) return ret; - if (attr_mask & IBV_QP_STATE) + if (attr_mask & 
RDMAV_QP_STATE) qp->state = attr->qp_state; return 0; } -int ibv_destroy_qp(struct ibv_qp *qp) +int rdmav_destroy_qp(struct rdmav_qp *qp) { return qp->context->ops.destroy_qp(qp); } -struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) +struct rdmav_ah *rdmav_create_ah(struct rdmav_pd *pd, struct rdmav_ah_attr *attr) { - struct ibv_ah *ah = pd->context->ops.create_ah(pd, attr); + struct rdmav_ah *ah = pd->context->ops.create_ah(pd, attr); if (ah) { ah->context = pd->context; @@ -393,22 +394,22 @@ struct ibv_ah *ibv_create_ah(struct ibv_ return ah; } -static int ibv_find_gid_index(struct ibv_context *context, uint8_t port_num, - union ibv_gid *gid) +static int rdmav_find_gid_index(struct rdmav_context *context, uint8_t port_num, + union rdmav_gid *gid) { - union ibv_gid sgid; + union rdmav_gid sgid; int i = 0, ret; do { - ret = ibv_query_gid(context, port_num, i++, &sgid); + ret = rdmav_query_gid(context, port_num, i++, &sgid); } while (!ret && memcmp(&sgid, gid, sizeof *gid)); return ret ? ret : i - 1; } -int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, - struct ibv_wc *wc, struct ibv_grh *grh, - struct ibv_ah_attr *ah_attr) +int rdmav_init_ah_from_wc(struct rdmav_context *context, uint8_t port_num, + struct rdmav_wc *wc, struct rdmav_grh *grh, + struct rdmav_ah_attr *ah_attr) { uint32_t flow_class; int ret; @@ -419,11 +420,11 @@ int ibv_init_ah_from_wc(struct ibv_conte ah_attr->src_path_bits = wc->dlid_path_bits; ah_attr->port_num = port_num; - if (wc->wc_flags & IBV_WC_GRH) { + if (wc->wc_flags & RDMAV_WC_GRH) { ah_attr->is_global = 1; ah_attr->grh.dgid = grh->sgid; - ret = ibv_find_gid_index(context, port_num, &grh->dgid); + ret = rdmav_find_gid_index(context, port_num, &grh->dgid); if (ret < 0) return ret; @@ -436,30 +437,30 @@ int ibv_init_ah_from_wc(struct ibv_conte return 0; } -struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, - struct ibv_grh *grh, uint8_t port_num) +struct rdmav_ah *rdmav_create_ah_from_wc(struct rdmav_pd *pd, struct rdmav_wc *wc, + struct rdmav_grh *grh, uint8_t port_num) { - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; int ret; - ret = ibv_init_ah_from_wc(pd->context, port_num, wc, grh, &ah_attr); + ret = rdmav_init_ah_from_wc(pd->context, port_num, wc, grh, &ah_attr); if (ret) return NULL; - return ibv_create_ah(pd, &ah_attr); + return rdmav_create_ah(pd, &ah_attr); } -int ibv_destroy_ah(struct ibv_ah *ah) +int rdmav_destroy_ah(struct rdmav_ah *ah) { return ah->context->ops.destroy_ah(ah); } -int ibv_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +int rdmav_attach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid) { return qp->context->ops.attach_mcast(qp, gid, lid); } -int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +int rdmav_detach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid) { return qp->context->ops.detach_mcast(qp, gid, lid); } From krkumar2 at in.ibm.com Thu Aug 3 01:37:49 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 14:07:49 +0530 Subject: [openib-general] [PATCH v3 3/6] libibverbs configuration files changes. In-Reply-To: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803083749.6346.81211.sendpatchset@K50wks273950wss.in.ibm.com> Configuration/Makefiles to build libibverbs with the new API. 
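For illustration only (not part of this patch): a minimal sketch of an application written against the renamed API, assuming <infiniband/verbs.h> declares the rdmav_* entry points as in patch 1/6. It uses only symbols this series exports via librdmaverbs.map (rdmav_get_device_list, rdmav_get_device_name, rdmav_get_device_guid, rdmav_free_device_list); existing binaries keep resolving ibv_get_device_list through the compatibility wrapper retained in the map.

#include <stdio.h>
#include <infiniband/arch.h>   /* ntohll() */
#include <infiniband/verbs.h>

int main(void)
{
	struct rdmav_device **list;
	int num, i;

	/* Renamed equivalent of ibv_get_device_list() */
	list = rdmav_get_device_list(&num);
	if (!list)
		return 1;

	for (i = 0; i < num; ++i)
		printf("%-16s 0x%016llx\n",
		       rdmav_get_device_name(list[i]),
		       (unsigned long long) ntohll(rdmav_get_device_guid(list[i])));

	rdmav_free_device_list(list);
	return 0;
}
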
Signed-off-by: Krishna Kumar diff -ruNp ORG/libibverbs/Makefile.am NEW/libibverbs/Makefile.am --- ORG/libibverbs/Makefile.am 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/Makefile.am 2006-08-03 17:15:33.000000000 -0700 @@ -9,7 +9,7 @@ AM_CFLAGS = -g -Wall -D_GNU_SOURCE src_libibverbs_la_CFLAGS = -g -Wall -D_GNU_SOURCE -DDRIVER_PATH=\"$(libdir)/infiniband\" if HAVE_LD_VERSION_SCRIPT - libibverbs_version_script = -Wl,--version-script=$(srcdir)/src/libibverbs.map + libibverbs_version_script = -Wl,--version-script=$(srcdir)/src/librdmaverbs.map else libibverbs_version_script = endif @@ -18,7 +18,7 @@ src_libibverbs_la_SOURCES = src/cmd.c sr src/memory.c src/sysfs.c src/verbs.c src_libibverbs_la_LDFLAGS = -version-info 2 -export-dynamic \ $(libibverbs_version_script) -src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map +src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/librdmaverbs.map bin_PROGRAMS = examples/ibv_devices examples/ibv_devinfo \ examples/ibv_asyncwatch examples/ibv_rc_pingpong examples/ibv_uc_pingpong \ @@ -42,7 +42,8 @@ libibverbsincludedir = $(includedir)/inf libibverbsinclude_HEADERS = include/infiniband/arch.h include/infiniband/driver.h \ include/infiniband/kern-abi.h include/infiniband/opcode.h include/infiniband/verbs.h \ - include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h + include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h \ + include/infiniband/deprecate.h man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1 \ man/ibv_rc_pingpong.1 man/ibv_uc_pingpong.1 man/ibv_ud_pingpong.1 \ @@ -56,8 +57,9 @@ DEBIAN = debian/changelog debian/compat EXTRA_DIST = include/infiniband/driver.h include/infiniband/kern-abi.h \ include/infiniband/opcode.h include/infiniband/verbs.h include/infiniband/marshall.h \ include/infiniband/sa-kern-abi.h include/infiniband/sa.h \ - src/ibverbs.h examples/pingpong.h \ - src/libibverbs.map libibverbs.spec.in $(man_MANS) + include/infiniband/deprecate.h \ + src/rdmaverbs.h examples/pingpong.h \ + src/librdmaverbs.map libibverbs.spec.in $(man_MANS) dist-hook: libibverbs.spec cp libibverbs.spec $(distdir) diff -ruNp ORG/libibverbs/configure.in NEW/libibverbs/configure.in --- ORG/libibverbs/configure.in 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/configure.in 2006-08-02 18:24:49.000000000 -0700 @@ -2,7 +2,7 @@ dnl Process this file with autoconf to p AC_PREREQ(2.57) AC_INIT(libibverbs, 1.1-pre1, openib-general at openib.org) -AC_CONFIG_SRCDIR([src/ibverbs.h]) +AC_CONFIG_SRCDIR([src/rdmaverbs.h]) AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(libibverbs, 1.1-pre1) @@ -33,5 +33,5 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") -AC_CONFIG_FILES([Makefile libibverbs.spec]) +AC_CONFIG_FILES([Makefile librdmaverbs.spec]) AC_OUTPUT diff -ruNp ORG/libibverbs/libibverbs.spec.in NEW/libibverbs/libibverbs.spec.in --- ORG/libibverbs/libibverbs.spec.in 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/libibverbs.spec.in 1969-12-31 16:00:00.000000000 -0800 @@ -1,106 +0,0 @@ -# $Id: libibverbs.spec.in 7484 2006-05-24 21:12:21Z roland $ - -%define ver @VERSION@ - -Name: libibverbs -Version: 1.1 -Release: 0.1.pre1%{?dist} -Summary: A library for direct userspace use of InfiniBand - -Group: System Environment/Libraries -License: GPL/BSD -Url: http://openib.org/ -Source: http://openib.org/downloads/libibverbs-1.1-pre1.tar.gz -BuildRoot: 
%{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) - -%description -libibverbs is a library that allows userspace processes to use -InfiniBand "verbs" as described in the InfiniBand Architecture -Specification. This includes direct hardware access for fast path -operations. - -For this library to be useful, a device-specific plug-in module should -also be installed. - -%package devel -Summary: Development files for the libibverbs library -Group: System Environment/Libraries - -%description devel -Static libraries and header files for the libibverbs verbs library. - -%package utils -Summary: Examples for the libibverbs library -Group: System Environment/Libraries -Requires: %{name} = %{version}-%{release} - -%description utils -Useful libibverbs1 example programs such as ibv_devinfo, which -displays information about InfiniBand devices. - -%prep -%setup -q -n %{name}-%{ver} - -%build -%configure -make %{?_smp_mflags} - -%install -rm -rf $RPM_BUILD_ROOT -%makeinstall -# remove unpackaged files from the buildroot -rm -f $RPM_BUILD_ROOT%{_libdir}/*.la - -%clean -rm -rf $RPM_BUILD_ROOT - -%post -p /sbin/ldconfig -%postun -p /sbin/ldconfig - -%files -%defattr(-,root,root,-) -%{_libdir}/libibverbs*.so.* -%doc AUTHORS COPYING ChangeLog README - -%files devel -%defattr(-,root,root,-) -%{_libdir}/lib*.so -%{_libdir}/*.a -%{_includedir}/* - -%files utils -%defattr(-,root,root,-) -%{_bindir}/* -%{_mandir}/man1/* - -%changelog -* Mon May 22 2006 Roland Dreier - 1.1-0.1.pre1 -- New upstream release -- Remove dependency on libsysfs, since it is no longer used - -* Thu May 4 2006 Roland Dreier - 1.0.4-1 -- New upstream release - -* Mon Mar 14 2006 Roland Dreier - 1.0.3-1 -- New upstream release - -* Mon Mar 13 2006 Roland Dreier - 1.0.1-1 -- New upstream release - -* Thu Feb 16 2006 Roland Dreier - 1.0-1 -- New upstream release - -* Wed Feb 15 2006 Roland Dreier - 1.0-0.5.rc7 -- New upstream release - -* Sun Jan 22 2006 Roland Dreier - 1.0-0.4.rc6 -- New upstream release - -* Tue Oct 25 2005 Roland Dreier - 1.0-0.3.rc5 -- New upstream release - -* Wed Oct 5 2005 Roland Dreier - 1.0-0.2.rc4 -- Update to upstream 1.0-rc4 release - -* Mon Sep 26 2005 Roland Dreier - 1.0-0.1.rc3 -- Initial attempt at Fedora Extras-compliant spec file diff -ruNp ORG/libibverbs/librdmaverbs.spec.in NEW/libibverbs/librdmaverbs.spec.in --- ORG/libibverbs/librdmaverbs.spec.in 1969-12-31 16:00:00.000000000 -0800 +++ NEW/libibverbs/librdmaverbs.spec.in 2006-08-02 18:24:49.000000000 -0700 @@ -0,0 +1,106 @@ +# $Id: + +%define ver @VERSION@ + +Name: libibverbs +Version: 1.1 +Release: 0.1.pre1%{?dist} +Summary: A library for direct userspace use of InfiniBand + +Group: System Environment/Libraries +License: GPL/BSD +Url: http://openib.org/ +Source: http://openib.org/downloads/libibverbs-1.1-pre1.tar.gz +BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) + +%description +libibverbs is a library that allows userspace processes to use +InfiniBand and iWARP "verbs" as described in the InfiniBand Architecture +Specification and the iWARP documents. This includes direct hardware access +for fast path operations. + +For this library to be useful, a device-specific plug-in module should +also be installed. + +%package devel +Summary: Development files for the libibverbs library +Group: System Environment/Libraries + +%description devel +Static libraries and header files for the libibverbs verbs library. 
+ +%package utils +Summary: Examples for the libibverbs library +Group: System Environment/Libraries +Requires: %{name} = %{version}-%{release} + +%description utils +Useful libibverbs example programs such as ibv_devinfo, which +displays information about InfiniBand devices. + +%prep +%setup -q -n %{name}-%{ver} + +%build +%configure +make %{?_smp_mflags} + +%install +rm -rf $RPM_BUILD_ROOT +%makeinstall +# remove unpackaged files from the buildroot +rm -f $RPM_BUILD_ROOT%{_libdir}/*.la + +%clean +rm -rf $RPM_BUILD_ROOT + +%post -p /sbin/ldconfig +%postun -p /sbin/ldconfig + +%files +%defattr(-,root,root,-) +%{_libdir}/libibverbs*.so.* +%doc AUTHORS COPYING ChangeLog README + +%files devel +%defattr(-,root,root,-) +%{_libdir}/lib*.so +%{_libdir}/*.a +%{_includedir}/* + +%files utils +%defattr(-,root,root,-) +%{_bindir}/* +%{_mandir}/man1/* + +%changelog +* Mon May 22 2006 Roland Dreier - 1.1-0.1.pre1 +- New upstream release +- Remove dependency on libsysfs, since it is no longer used + +* Thu May 4 2006 Roland Dreier - 1.0.4-1 +- New upstream release + +* Mon Mar 14 2006 Roland Dreier - 1.0.3-1 +- New upstream release + +* Mon Mar 13 2006 Roland Dreier - 1.0.1-1 +- New upstream release + +* Thu Feb 16 2006 Roland Dreier - 1.0-1 +- New upstream release + +* Wed Feb 15 2006 Roland Dreier - 1.0-0.5.rc7 +- New upstream release + +* Sun Jan 22 2006 Roland Dreier - 1.0-0.4.rc6 +- New upstream release + +* Tue Oct 25 2005 Roland Dreier - 1.0-0.3.rc5 +- New upstream release + +* Wed Oct 5 2005 Roland Dreier - 1.0-0.2.rc4 +- Update to upstream 1.0-rc4 release + +* Mon Sep 26 2005 Roland Dreier - 1.0-0.1.rc3 +- Initial attempt at Fedora Extras-compliant spec file From krkumar2 at in.ibm.com Thu Aug 3 01:37:31 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 14:07:31 +0530 Subject: [openib-general] [PATCH v3 1/6] libibverbs include files changes. In-Reply-To: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803083731.6346.93423.sendpatchset@K50wks273950wss.in.ibm.com> Additions to include files in libibverbs for the new API. Signed-off-by: Krishna Kumar diff -ruNp ORG/libibverbs/include/infiniband/arch.h NEW/libibverbs/include/infiniband/arch.h --- ORG/libibverbs/include/infiniband/arch.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/arch.h 2006-08-02 18:24:49.000000000 -0700 @@ -32,8 +32,8 @@ * $Id: arch.h 8358 2006-07-04 20:38:54Z roland $ */ -#ifndef INFINIBAND_ARCH_H -#define INFINIBAND_ARCH_H +#ifndef RDMAV_ARCH_H +#define RDMAV_ARCH_H #include #include @@ -92,4 +92,4 @@ static inline uint64_t ntohll(uint64_t x #endif -#endif /* INFINIBAND_ARCH_H */ +#endif /* RDMAV_ARCH_H */ diff -ruNp ORG/libibverbs/include/infiniband/deprecate.h NEW/libibverbs/include/infiniband/deprecate.h --- ORG/libibverbs/include/infiniband/deprecate.h 1969-12-31 16:00:00.000000000 -0800 +++ NEW/libibverbs/include/infiniband/deprecate.h 2006-08-03 17:50:06.000000000 -0700 @@ -0,0 +1,387 @@ +/* + * Copyright (c) 2006 IBM. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: + */ + +#ifndef RDMAV_DEPRECATE_H +#define RDMAV_DEPRECATE_H + +/* + * This header file can be removed once all applications are ported over + * to the new API. Till then, this is kept around as a compatibility + * header. + */ + +/* All exported IBV_ defines */ + +#define IBV_NODE_CA RDMAV_NODE_CA +#define IBV_NODE_SWITCH RDMAV_NODE_SWITCH +#define IBV_NODE_ROUTER RDMAV_NODE_ROUTER +#define IBV_DEVICE_RESIZE_MAX_WR RDMAV_DEVICE_RESIZE_MAX_WR +#define IBV_DEVICE_BAD_PKEY_CNTR RDMAV_DEVICE_BAD_PKEY_CNTR +#define IBV_DEVICE_BAD_QKEY_CNTR RDMAV_DEVICE_BAD_QKEY_CNTR +#define IBV_DEVICE_RAW_MULTI RDMAV_DEVICE_RAW_MULTI +#define IBV_DEVICE_AUTO_PATH_MIG RDMAV_DEVICE_AUTO_PATH_MIG +#define IBV_DEVICE_CHANGE_PHY_PORT RDMAV_DEVICE_CHANGE_PHY_PORT +#define IBV_DEVICE_UD_AV_PORT_ENFORCE RDMAV_DEVICE_UD_AV_PORT_ENFORCE +#define IBV_DEVICE_CURR_QP_STATE_MOD RDMAV_DEVICE_CURR_QP_STATE_MOD +#define IBV_DEVICE_SHUTDOWN_PORT RDMAV_DEVICE_SHUTDOWN_PORT +#define IBV_DEVICE_INIT_TYPE RDMAV_DEVICE_INIT_TYPE +#define IBV_DEVICE_PORT_ACTIVE_EVENT RDMAV_DEVICE_PORT_ACTIVE_EVENT +#define IBV_DEVICE_SYS_IMAGE_GUID RDMAV_DEVICE_SYS_IMAGE_GUID +#define IBV_DEVICE_RC_RNR_NAK_GEN RDMAV_DEVICE_RC_RNR_NAK_GEN +#define IBV_DEVICE_SRQ_RESIZE RDMAV_DEVICE_SRQ_RESIZE +#define IBV_DEVICE_N_NOTIFY_CQ RDMAV_DEVICE_N_NOTIFY_CQ +#define IBV_ATOMIC_NONE RDMAV_ATOMIC_NONE +#define IBV_ATOMIC_HCA RDMAV_ATOMIC_HCA +#define IBV_ATOMIC_GLOB RDMAV_ATOMIC_GLOB +#define IBV_MTU_256 RDMAV_MTU_256 +#define IBV_MTU_512 RDMAV_MTU_512 +#define IBV_MTU_1024 RDMAV_MTU_1024 +#define IBV_MTU_2048 RDMAV_MTU_2048 +#define IBV_MTU_4096 RDMAV_MTU_4096 +#define IBV_PORT_NOP RDMAV_PORT_NOP +#define IBV_PORT_DOWN RDMAV_PORT_DOWN +#define IBV_PORT_INIT RDMAV_PORT_INIT +#define IBV_PORT_ARMED RDMAV_PORT_ARMED +#define IBV_PORT_ACTIVE RDMAV_PORT_ACTIVE +#define IBV_PORT_ACTIVE_DEFER RDMAV_PORT_ACTIVE_DEFER +#define IBV_EVENT_CQ_ERR RDMAV_EVENT_CQ_ERR +#define IBV_EVENT_QP_FATAL RDMAV_EVENT_QP_FATAL +#define IBV_EVENT_QP_REQ_ERR RDMAV_EVENT_QP_REQ_ERR +#define IBV_EVENT_QP_ACCESS_ERR RDMAV_EVENT_QP_ACCESS_ERR +#define IBV_EVENT_COMM_EST RDMAV_EVENT_COMM_EST +#define IBV_EVENT_SQ_DRAINED RDMAV_EVENT_SQ_DRAINED +#define IBV_EVENT_PATH_MIG RDMAV_EVENT_PATH_MIG +#define IBV_EVENT_PATH_MIG_ERR RDMAV_EVENT_PATH_MIG_ERR +#define 
IBV_EVENT_DEVICE_FATAL RDMAV_EVENT_DEVICE_FATAL +#define IBV_EVENT_PORT_ACTIVE RDMAV_EVENT_PORT_ACTIVE +#define IBV_EVENT_PORT_ERR RDMAV_EVENT_PORT_ERR +#define IBV_EVENT_LID_CHANGE RDMAV_EVENT_LID_CHANGE +#define IBV_EVENT_PKEY_CHANGE RDMAV_EVENT_PKEY_CHANGE +#define IBV_EVENT_SM_CHANGE RDMAV_EVENT_SM_CHANGE +#define IBV_EVENT_SRQ_ERR RDMAV_EVENT_SRQ_ERR +#define IBV_EVENT_SRQ_LIMIT_REACHED RDMAV_EVENT_SRQ_LIMIT_REACHED +#define IBV_EVENT_QP_LAST_WQE_REACHED RDMAV_EVENT_QP_LAST_WQE_REACHED +#define IBV_EVENT_CLIENT_REREGISTER RDMAV_EVENT_CLIENT_REREGISTER +#define IBV_WC_SUCCESS RDMAV_WC_SUCCESS +#define IBV_WC_LOC_LEN_ERR RDMAV_WC_LOC_LEN_ERR +#define IBV_WC_LOC_QP_OP_ERR RDMAV_WC_LOC_QP_OP_ERR +#define IBV_WC_LOC_EEC_OP_ERR RDMAV_WC_LOC_EEC_OP_ERR +#define IBV_WC_LOC_PROT_ERR RDMAV_WC_LOC_PROT_ERR +#define IBV_WC_WR_FLUSH_ERR RDMAV_WC_WR_FLUSH_ERR +#define IBV_WC_MW_BIND_ERR RDMAV_WC_MW_BIND_ERR +#define IBV_WC_BAD_RESP_ERR RDMAV_WC_BAD_RESP_ERR +#define IBV_WC_LOC_ACCESS_ERR RDMAV_WC_LOC_ACCESS_ERR +#define IBV_WC_REM_INV_REQ_ERR RDMAV_WC_REM_INV_REQ_ERR +#define IBV_WC_REM_ACCESS_ERR RDMAV_WC_REM_ACCESS_ERR +#define IBV_WC_REM_OP_ERR RDMAV_WC_REM_OP_ERR +#define IBV_WC_RETRY_EXC_ERR RDMAV_WC_RETRY_EXC_ERR +#define IBV_WC_RNR_RETRY_EXC_ERR RDMAV_WC_RNR_RETRY_EXC_ERR +#define IBV_WC_LOC_RDD_VIOL_ERR RDMAV_WC_LOC_RDD_VIOL_ERR +#define IBV_WC_REM_INV_RD_REQ_ERR RDMAV_WC_REM_INV_RD_REQ_ERR +#define IBV_WC_REM_ABORT_ERR RDMAV_WC_REM_ABORT_ERR +#define IBV_WC_INV_EECN_ERR RDMAV_WC_INV_EECN_ERR +#define IBV_WC_INV_EEC_STATE_ERR RDMAV_WC_INV_EEC_STATE_ERR +#define IBV_WC_FATAL_ERR RDMAV_WC_FATAL_ERR +#define IBV_WC_RESP_TIMEOUT_ERR RDMAV_WC_RESP_TIMEOUT_ERR +#define IBV_WC_GENERAL_ERR RDMAV_WC_GENERAL_ERR +#define IBV_WC_SEND RDMAV_WC_SEND +#define IBV_WC_RDMA_WRITE RDMAV_WC_RDMA_WRITE +#define IBV_WC_RDMA_READ RDMAV_WC_RDMA_READ +#define IBV_WC_COMP_SWAP RDMAV_WC_COMP_SWAP +#define IBV_WC_FETCH_ADD RDMAV_WC_FETCH_ADD +#define IBV_WC_BIND_MW RDMAV_WC_BIND_MW +#define IBV_WC_RECV RDMAV_WC_RECV +#define IBV_WC_RECV_RDMA_WITH_IMM RDMAV_WC_RECV_RDMA_WITH_IMM +#define IBV_WC_GRH RDMAV_WC_GRH +#define IBV_WC_WITH_IMM RDMAV_WC_WITH_IMM +#define IBV_ACCESS_LOCAL_WRITE RDMAV_ACCESS_LOCAL_WRITE +#define IBV_ACCESS_REMOTE_WRITE RDMAV_ACCESS_REMOTE_WRITE +#define IBV_ACCESS_REMOTE_READ RDMAV_ACCESS_REMOTE_READ +#define IBV_ACCESS_REMOTE_ATOMIC RDMAV_ACCESS_REMOTE_ATOMIC +#define IBV_ACCESS_MW_BIND RDMAV_ACCESS_MW_BIND +#define IBV_RATE_MAX RDMAV_RATE_MAX +#define IBV_RATE_2_5_GBPS RDMAV_RATE_2_5_GBPS +#define IBV_RATE_5_GBPS RDMAV_RATE_5_GBPS +#define IBV_RATE_10_GBPS RDMAV_RATE_10_GBPS +#define IBV_RATE_20_GBPS RDMAV_RATE_20_GBPS +#define IBV_RATE_30_GBPS RDMAV_RATE_30_GBPS +#define IBV_RATE_40_GBPS RDMAV_RATE_40_GBPS +#define IBV_RATE_60_GBPS RDMAV_RATE_60_GBPS +#define IBV_RATE_80_GBPS RDMAV_RATE_80_GBPS +#define IBV_RATE_120_GBPS RDMAV_RATE_120_GBPS +#define IBV_SRQ_MAX_WR RDMAV_SRQ_MAX_WR +#define IBV_SRQ_LIMIT RDMAV_SRQ_LIMIT +#define IBV_QPT_RC RDMAV_QPT_RC +#define IBV_QPT_UC RDMAV_QPT_UC +#define IBV_QPT_UD RDMAV_QPT_UD +#define IBV_QP_STATE RDMAV_QP_STATE +#define IBV_QP_CUR_STATE RDMAV_QP_CUR_STATE +#define IBV_QP_EN_SQD_ASYNC_NOTIFY RDMAV_QP_EN_SQD_ASYNC_NOTIFY +#define IBV_QP_ACCESS_FLAGS RDMAV_QP_ACCESS_FLAGS +#define IBV_QP_PKEY_INDEX RDMAV_QP_PKEY_INDEX +#define IBV_QP_PORT RDMAV_QP_PORT +#define IBV_QP_QKEY RDMAV_QP_QKEY +#define IBV_QP_AV RDMAV_QP_AV +#define IBV_QP_PATH_MTU RDMAV_QP_PATH_MTU +#define IBV_QP_TIMEOUT RDMAV_QP_TIMEOUT +#define IBV_QP_RETRY_CNT RDMAV_QP_RETRY_CNT +#define 
IBV_QP_RNR_RETRY RDMAV_QP_RNR_RETRY +#define IBV_QP_RQ_PSN RDMAV_QP_RQ_PSN +#define IBV_QP_MAX_QP_RD_ATOMIC RDMAV_QP_MAX_QP_RD_ATOMIC +#define IBV_QP_ALT_PATH RDMAV_QP_ALT_PATH +#define IBV_QP_MIN_RNR_TIMER RDMAV_QP_MIN_RNR_TIMER +#define IBV_QP_SQ_PSN RDMAV_QP_SQ_PSN +#define IBV_QP_MAX_DEST_RD_ATOMIC RDMAV_QP_MAX_DEST_RD_ATOMIC +#define IBV_QP_PATH_MIG_STATE RDMAV_QP_PATH_MIG_STATE +#define IBV_QP_CAP RDMAV_QP_CAP +#define IBV_QP_DEST_QPN RDMAV_QP_DEST_QPN +#define IBV_QPS_RESET RDMAV_QPS_RESET +#define IBV_QPS_INIT RDMAV_QPS_INIT +#define IBV_QPS_RTR RDMAV_QPS_RTR +#define IBV_QPS_RTS RDMAV_QPS_RTS +#define IBV_QPS_SQD RDMAV_QPS_SQD +#define IBV_QPS_SQE RDMAV_QPS_SQE +#define IBV_QPS_ERR RDMAV_QPS_ERR +#define IBV_MIG_MIGRATED RDMAV_MIG_MIGRATED +#define IBV_MIG_REARM RDMAV_MIG_REARM +#define IBV_MIG_ARMED RDMAV_MIG_ARMED +#define IBV_WR_RDMA_WRITE RDMAV_WR_RDMA_WRITE +#define IBV_WR_RDMA_WRITE_WITH_IMM RDMAV_WR_RDMA_WRITE_WITH_IMM +#define IBV_WR_SEND RDMAV_WR_SEND +#define IBV_WR_SEND_WITH_IMM RDMAV_WR_SEND_WITH_IMM +#define IBV_WR_RDMA_READ RDMAV_WR_RDMA_READ +#define IBV_WR_ATOMIC_CMP_AND_SWP RDMAV_WR_ATOMIC_CMP_AND_SWP +#define IBV_WR_ATOMIC_FETCH_AND_ADD RDMAV_WR_ATOMIC_FETCH_AND_ADD +#define IBV_SEND_FENCE RDMAV_SEND_FENCE +#define IBV_SEND_SIGNALED RDMAV_SEND_SIGNALED +#define IBV_SEND_SOLICITED RDMAV_SEND_SOLICITED +#define IBV_SEND_INLINE RDMAV_SEND_INLINE +#define IBV_SYSFS_NAME_MAX RDMAV_SYSFS_NAME_MAX +#define IBV_SYSFS_PATH_MAX RDMAV_SYSFS_PATH_MAX + + +#define IBV_OPCODE_RC RDMAV_OPCODE_RC +#define IBV_OPCODE_UC RDMAV_OPCODE_UC +#define IBV_OPCODE_RD RDMAV_OPCODE_RD +#define IBV_OPCODE_UD RDMAV_OPCODE_UD +#define IBV_OPCODE_SEND_FIRST RDMAV_OPCODE_SEND_FIRST +#define IBV_OPCODE_SEND_MIDDLE RDMAV_OPCODE_SEND_MIDDLE +#define IBV_OPCODE_SEND_LAST RDMAV_OPCODE_SEND_LAST +#define IBV_OPCODE_SEND_LAST_WITH_IMMEDIATE RDMAV_OPCODE_SEND_LAST_WITH_IMMEDIATE +#define IBV_OPCODE_SEND_ONLY RDMAV_OPCODE_SEND_ONLY +#define IBV_OPCODE_SEND_ONLY_WITH_IMMEDIATE RDMAV_OPCODE_SEND_ONLY_WITH_IMMEDIATE +#define IBV_OPCODE_RDMA_WRITE_FIRST RDMAV_OPCODE_RDMA_WRITE_FIRST +#define IBV_OPCODE_RDMA_WRITE_MIDDLE RDMAV_OPCODE_RDMA_WRITE_MIDDLE +#define IBV_OPCODE_RDMA_WRITE_LAST RDMAV_OPCODE_RDMA_WRITE_LAST +#define IBV_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE RDMAV_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE +#define IBV_OPCODE_RDMA_WRITE_ONLY RDMAV_OPCODE_RDMA_WRITE_ONLY +#define IBV_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE RDMAV_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE +#define IBV_OPCODE_RDMA_READ_REQUEST RDMAV_OPCODE_RDMA_READ_REQUEST +#define IBV_OPCODE_RDMA_READ_RESPONSE_FIRST RDMAV_OPCODE_RDMA_READ_RESPONSE_FIRST +#define IBV_OPCODE_RDMA_READ_RESPONSE_MIDDLE RDMAV_OPCODE_RDMA_READ_RESPONSE_MIDDLE +#define IBV_OPCODE_RDMA_READ_RESPONSE_LAST RDMAV_OPCODE_RDMA_READ_RESPONSE_LAST +#define IBV_OPCODE_RDMA_READ_RESPONSE_ONLY RDMAV_OPCODE_RDMA_READ_RESPONSE_ONLY +#define IBV_OPCODE_ACKNOWLEDGE RDMAV_OPCODE_ACKNOWLEDGE +#define IBV_OPCODE_ATOMIC_ACKNOWLEDGE RDMAV_OPCODE_ATOMIC_ACKNOWLEDGE +#define IBV_OPCODE_COMPARE_SWAP RDMAV_OPCODE_COMPARE_SWAP +#define IBV_OPCODE_FETCH_ADD RDMAV_OPCODE_FETCH_ADD + +/* All exported ibv_ routines */ + +#define ibv_open_device rdmav_open_device +#define ibv_get_device_guid rdmav_get_device_guid +#define ibv_get_device_name rdmav_get_device_name +#define ibv_ack_async_event rdmav_ack_async_event +#define ibv_ack_cq_events rdmav_ack_cq_events +#define ibv_alloc_pd rdmav_alloc_pd +#define ibv_attach_mcast rdmav_attach_mcast +#define ibv_close_device rdmav_close_device +#define 
ibv_cmd_alloc_pd rdmav_cmd_alloc_pd +#define ibv_cmd_attach_mcast rdmav_cmd_attach_mcast +#define ibv_cmd_create_ah rdmav_cmd_create_ah +#define ibv_cmd_create_cq rdmav_cmd_create_cq +#define ibv_cmd_create_qp rdmav_cmd_create_qp +#define ibv_cmd_create_srq rdmav_cmd_create_srq +#define ibv_cmd_dealloc_pd rdmav_cmd_dealloc_pd +#define ibv_cmd_dereg_mr rdmav_cmd_dereg_mr +#define ibv_cmd_destroy_ah rdmav_cmd_destroy_ah +#define ibv_cmd_destroy_cq rdmav_cmd_destroy_cq +#define ibv_cmd_destroy_qp rdmav_cmd_destroy_qp +#define ibv_cmd_destroy_srq rdmav_cmd_destroy_srq +#define ibv_cmd_detach_mcast rdmav_cmd_detach_mcast +#define ibv_cmd_get_context rdmav_cmd_get_context +#define ibv_cmd_modify_qp rdmav_cmd_modify_qp +#define ibv_cmd_modify_srq rdmav_cmd_modify_srq +#define ibv_cmd_poll_cq rdmav_cmd_poll_cq +#define ibv_cmd_post_recv rdmav_cmd_post_recv +#define ibv_cmd_post_send rdmav_cmd_post_send +#define ibv_cmd_post_srq_recv rdmav_cmd_post_srq_recv +#define ibv_cmd_query_device rdmav_cmd_query_device +#define ibv_cmd_query_port rdmav_cmd_query_port +#define ibv_cmd_query_qp rdmav_cmd_query_qp +#define ibv_cmd_query_srq rdmav_cmd_query_srq +#define ibv_cmd_reg_mr rdmav_cmd_reg_mr +#define ibv_cmd_req_notify_cq rdmav_cmd_req_notify_cq +#define ibv_cmd_resize_cq rdmav_cmd_resize_cq +#define ibv_copy_ah_attr_from_kern rdmav_copy_ah_attr_from_kern +#define ibv_copy_path_rec_from_kern rdmav_copy_path_rec_from_kern +#define ibv_copy_path_rec_to_kern rdmav_copy_path_rec_to_kern +#define ibv_copy_qp_attr_from_kern rdmav_copy_qp_attr_from_kern +#define ibv_create_ah rdmav_create_ah +#define ibv_create_comp_channel rdmav_create_comp_channel +#define ibv_create_cq rdmav_create_cq +#define ibv_create_qp rdmav_create_qp +#define ibv_create_srq rdmav_create_srq +#define ibv_dealloc_pd rdmav_dealloc_pd +#define ibv_dereg_mr rdmav_dereg_mr +#define ibv_destroy_ah rdmav_destroy_ah +#define ibv_destroy_comp_channel rdmav_destroy_comp_channel +#define ibv_destroy_cq rdmav_destroy_cq +#define ibv_destroy_qp rdmav_destroy_qp +#define ibv_destroy_srq rdmav_destroy_srq +#define ibv_detach_mcast rdmav_detach_mcast +#define ibv_free_device_list rdmav_free_device_list +#define ibv_get_async_event rdmav_get_async_event +#define ibv_get_cq_event rdmav_get_cq_event +#define ibv_get_device_guid rdmav_get_device_guid +#define ibv_init_ah_from_wc rdmav_init_ah_from_wc +#define ibv_modify_qp rdmav_modify_qp +#define ibv_modify_srq rdmav_modify_srq +#define ibv_poll_cq rdmav_poll_cq +#define ibv_post_recv rdmav_post_recv +#define ibv_post_send rdmav_post_send +#define ibv_post_srq_recv rdmav_post_srq_recv +#define ibv_query_device rdmav_query_device +#define ibv_query_gid rdmav_query_gid +#define ibv_query_pkey rdmav_query_pkey +#define ibv_query_port rdmav_query_port +#define ibv_query_qp rdmav_query_qp +#define ibv_query_srq rdmav_query_srq +#define ibv_rate_to_mult rdmav_rate_to_mult +#define ibv_read_sysfs_file rdmav_read_sysfs_file +#define ibv_reg_mr rdmav_reg_mr +#define ibv_req_notify_cq rdmav_req_notify_cq +#define ibv_resize_cq rdmav_resize_cq + +/* All exported ibv_ data structures */ + +#define ibv_access_flags rdmav_access_flags +#define ibv_ah rdmav_ah +#define ibv_ah_attr rdmav_ah_attr +#define ibv_alloc_pd_resp rdmav_alloc_pd_resp +#define ibv_async_event rdmav_async_event +#define ibv_atomic_cap rdmav_atomic_cap +#define ibv_cmd_query_gid rdmav_cmd_query_gid +#define ibv_cmd_query_pkey rdmav_cmd_query_pkey +#define ibv_comp_channel rdmav_comp_channel +#define ibv_comp_event rdmav_comp_event +#define 
ibv_context rdmav_context +#define ibv_context_ops rdmav_context_ops +#define ibv_cq rdmav_cq +#define ibv_create_ah_resp rdmav_create_ah_resp +#define ibv_create_comp_channel_resp rdmav_create_comp_channel_resp +#define ibv_create_cq_resp rdmav_create_cq_resp +#define ibv_create_qp_resp rdmav_create_qp_resp +#define ibv_create_srq_resp rdmav_create_srq_resp +#define ibv_destroy_cq_resp rdmav_destroy_cq_resp +#define ibv_destroy_qp_resp rdmav_destroy_qp_resp +#define ibv_destroy_srq_resp rdmav_destroy_srq_resp +#define ibv_device rdmav_device +#define ibv_device_attr rdmav_device_attr +#define ibv_device_cap_flags rdmav_device_cap_flags +#define ibv_device_ops rdmav_device_ops +#define ibv_driver rdmav_driver +#define ibv_event_type rdmav_event_type +#define ibv_get_context rdmav_get_context +#define ibv_get_context_resp rdmav_get_context_resp +#define ibv_gid rdmav_gid +#define ibv_global_route rdmav_global_route +#define ibv_grh rdmav_grh +#define ibv_kern_ah_attr rdmav_kern_ah_attr +#define ibv_kern_async_event rdmav_kern_async_event +#define ibv_kern_global_route rdmav_kern_global_route +#define ibv_kern_path_rec rdmav_kern_path_rec +#define ibv_kern_qp_attr rdmav_kern_qp_attr +#define ibv_kern_recv_wr rdmav_kern_recv_wr +#define ibv_kern_send_wr rdmav_kern_send_wr +#define ibv_kern_wc rdmav_kern_wc +#define ibv_mig_state rdmav_mig_state +#define ibv_mr rdmav_mr +#define ibv_mtu rdmav_mtu +#define ibv_node_type rdmav_node_type +#define ibv_pd rdmav_pd +#define ibv_poll_cq_resp rdmav_poll_cq_resp +#define ibv_port_attr rdmav_port_attr +#define ibv_port_state rdmav_port_state +#define ibv_post_recv_resp rdmav_post_recv_resp +#define ibv_post_send_resp rdmav_post_send_resp +#define ibv_post_srq_recv_resp rdmav_post_srq_recv_resp +#define ibv_qp rdmav_qp +#define ibv_qp_attr rdmav_qp_attr +#define ibv_qp_attr_mask rdmav_qp_attr_mask +#define ibv_qp_cap rdmav_qp_cap +#define ibv_qp_dest rdmav_qp_dest +#define ibv_qp_init_attr rdmav_qp_init_attr +#define ibv_qp_state rdmav_qp_state +#define ibv_qp_type rdmav_qp_type +#define ibv_query_device_resp rdmav_query_device_resp +#define ibv_query_params rdmav_query_params +#define ibv_query_params_resp rdmav_query_params_resp +#define ibv_query_port_resp rdmav_query_port_resp +#define ibv_query_qp_resp rdmav_query_qp_resp +#define ibv_query_srq_resp rdmav_query_srq_resp +#define ibv_rate rdmav_rate +#define ibv_recv_wr rdmav_recv_wr +#define ibv_reg_mr_resp rdmav_reg_mr_resp +#define ibv_resize_cq_resp rdmav_resize_cq_resp +#define ibv_sa_mcmember_rec rdmav_sa_mcmember_rec +#define ibv_sa_path_rec rdmav_sa_path_rec +#define ibv_sa_service_rec rdmav_sa_service_rec +#define ibv_send_flags rdmav_send_flags +#define ibv_send_wr rdmav_send_wr +#define ibv_sge rdmav_sge +#define ibv_srq rdmav_srq +#define ibv_srq_attr rdmav_srq_attr +#define ibv_srq_attr_mask rdmav_srq_attr_mask +#define ibv_srq_init_attr rdmav_srq_init_attr +#define ibv_wc rdmav_wc +#define ibv_wc_flags rdmav_wc_flags +#define ibv_wc_opcode rdmav_wc_opcode +#define ibv_wc_status rdmav_wc_status +#define ibv_wr_opcode rdmav_wr_opcode + +/* All declarations needed for compiles */ +extern struct rdmav_device **ibv_get_device_list(int *num); + +#endif /* RDMAV_DEPRECATE_H */ diff -ruNp ORG/libibverbs/include/infiniband/driver.h NEW/libibverbs/include/infiniband/driver.h --- ORG/libibverbs/include/infiniband/driver.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/driver.h 2006-08-02 18:24:49.000000000 -0700 @@ -34,8 +34,8 @@ * $Id: driver.h 7484 2006-05-24 21:12:21Z 
roland $ */ -#ifndef INFINIBAND_DRIVER_H -#define INFINIBAND_DRIVER_H +#ifndef RDMAV_DRIVER_H +#define RDMAV_DRIVER_H #include #include @@ -57,90 +57,90 @@ * * libibverbs will call each driver's ibv_driver_init() function once * for each InfiniBand device. If the device is one that the driver - * can support, it should return a struct ibv_device * with the ops + * can support, it should return a struct rdmav_device * with the ops * member filled in. If the driver does not support the device, it * should return NULL from openib_driver_init(). */ -typedef struct ibv_device *(*ibv_driver_init_func)(const char *, int); +typedef struct rdmav_device *(*rdmav_driver_init_func)(const char *, int); -int ibv_cmd_get_context(struct ibv_context *context, struct ibv_get_context *cmd, - size_t cmd_size, struct ibv_get_context_resp *resp, +int rdmav_cmd_get_context(struct rdmav_context *context, struct rdmav_get_context *cmd, + size_t cmd_size, struct rdmav_get_context_resp *resp, size_t resp_size); -int ibv_cmd_query_device(struct ibv_context *context, - struct ibv_device_attr *device_attr, +int rdmav_cmd_query_device(struct rdmav_context *context, + struct rdmav_device_attr *device_attr, uint64_t *raw_fw_ver, - struct ibv_query_device *cmd, size_t cmd_size); -int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr, - struct ibv_query_port *cmd, size_t cmd_size); -int ibv_cmd_query_gid(struct ibv_context *context, uint8_t port_num, - int index, union ibv_gid *gid); -int ibv_cmd_query_pkey(struct ibv_context *context, uint8_t port_num, + struct rdmav_query_device *cmd, size_t cmd_size); +int rdmav_cmd_query_port(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr, + struct rdmav_query_port *cmd, size_t cmd_size); +int rdmav_cmd_query_gid(struct rdmav_context *context, uint8_t port_num, + int index, union rdmav_gid *gid); +int rdmav_cmd_query_pkey(struct rdmav_context *context, uint8_t port_num, int index, uint16_t *pkey); -int ibv_cmd_alloc_pd(struct ibv_context *context, struct ibv_pd *pd, - struct ibv_alloc_pd *cmd, size_t cmd_size, - struct ibv_alloc_pd_resp *resp, size_t resp_size); -int ibv_cmd_dealloc_pd(struct ibv_pd *pd); -int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t length, - uint64_t hca_va, enum ibv_access_flags access, - struct ibv_mr *mr, struct ibv_reg_mr *cmd, +int rdmav_cmd_alloc_pd(struct rdmav_context *context, struct rdmav_pd *pd, + struct rdmav_alloc_pd *cmd, size_t cmd_size, + struct rdmav_alloc_pd_resp *resp, size_t resp_size); +int rdmav_cmd_dealloc_pd(struct rdmav_pd *pd); +int rdmav_cmd_reg_mr(struct rdmav_pd *pd, void *addr, size_t length, + uint64_t hca_va, enum rdmav_access_flags access, + struct rdmav_mr *mr, struct rdmav_reg_mr *cmd, size_t cmd_size); -int ibv_cmd_dereg_mr(struct ibv_mr *mr); -int ibv_cmd_create_cq(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, - int comp_vector, struct ibv_cq *cq, - struct ibv_create_cq *cmd, size_t cmd_size, - struct ibv_create_cq_resp *resp, size_t resp_size); -int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); -int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); -int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size); -int ibv_cmd_destroy_cq(struct ibv_cq *cq); - -int ibv_cmd_create_srq(struct ibv_pd *pd, - struct ibv_srq *srq, struct ibv_srq_init_attr *attr, - struct ibv_create_srq *cmd, size_t cmd_size, - struct ibv_create_srq_resp *resp, size_t 
resp_size); -int ibv_cmd_modify_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask, - struct ibv_modify_srq *cmd, size_t cmd_size); -int ibv_cmd_query_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - struct ibv_query_srq *cmd, size_t cmd_size); -int ibv_cmd_destroy_srq(struct ibv_srq *srq); - -int ibv_cmd_create_qp(struct ibv_pd *pd, - struct ibv_qp *qp, struct ibv_qp_init_attr *attr, - struct ibv_create_qp *cmd, size_t cmd_size, - struct ibv_create_qp_resp *resp, size_t resp_size); -int ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *qp_attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *qp_init_attr, - struct ibv_query_qp *cmd, size_t cmd_size); -int ibv_cmd_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_modify_qp *cmd, size_t cmd_size); -int ibv_cmd_destroy_qp(struct ibv_qp *qp); -int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, - struct ibv_send_wr **bad_wr); -int ibv_cmd_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr); -int ibv_cmd_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr); -int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah, - struct ibv_ah_attr *attr); -int ibv_cmd_destroy_ah(struct ibv_ah *ah); -int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); -int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +int rdmav_cmd_dereg_mr(struct rdmav_mr *mr); +int rdmav_cmd_create_cq(struct rdmav_context *context, int cqe, + struct rdmav_comp_channel *channel, + int comp_vector, struct rdmav_cq *cq, + struct rdmav_create_cq *cmd, size_t cmd_size, + struct rdmav_create_cq_resp *resp, size_t resp_size); +int rdmav_cmd_poll_cq(struct rdmav_cq *cq, int ne, struct rdmav_wc *wc); +int rdmav_cmd_req_notify_cq(struct rdmav_cq *cq, int solicited_only); +int rdmav_cmd_resize_cq(struct rdmav_cq *cq, int cqe, + struct rdmav_resize_cq *cmd, size_t cmd_size); +int rdmav_cmd_destroy_cq(struct rdmav_cq *cq); + +int rdmav_cmd_create_srq(struct rdmav_pd *pd, + struct rdmav_srq *srq, struct rdmav_srq_init_attr *attr, + struct rdmav_create_srq *cmd, size_t cmd_size, + struct rdmav_create_srq_resp *resp, size_t resp_size); +int rdmav_cmd_modify_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask, + struct rdmav_modify_srq *cmd, size_t cmd_size); +int rdmav_cmd_query_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + struct rdmav_query_srq *cmd, size_t cmd_size); +int rdmav_cmd_destroy_srq(struct rdmav_srq *srq); + +int rdmav_cmd_create_qp(struct rdmav_pd *pd, + struct rdmav_qp *qp, struct rdmav_qp_init_attr *attr, + struct rdmav_create_qp *cmd, size_t cmd_size, + struct rdmav_create_qp_resp *resp, size_t resp_size); +int rdmav_cmd_query_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *qp_attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *qp_init_attr, + struct rdmav_query_qp *cmd, size_t cmd_size); +int rdmav_cmd_modify_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_modify_qp *cmd, size_t cmd_size); +int rdmav_cmd_destroy_qp(struct rdmav_qp *qp); +int rdmav_cmd_post_send(struct rdmav_qp *ibqp, struct rdmav_send_wr *wr, + struct rdmav_send_wr **bad_wr); +int rdmav_cmd_post_recv(struct rdmav_qp *ibqp, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr); +int 
rdmav_cmd_post_srq_recv(struct rdmav_srq *srq, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr); +int rdmav_cmd_create_ah(struct rdmav_pd *pd, struct rdmav_ah *ah, + struct rdmav_ah_attr *attr); +int rdmav_cmd_destroy_ah(struct rdmav_ah *ah); +int rdmav_cmd_attach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); +int rdmav_cmd_detach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); /* * sysfs helper functions */ -const char *ibv_get_sysfs_path(void); +const char *rdmav_get_sysfs_path(void); -int ibv_read_sysfs_file(const char *dir, const char *file, +int rdmav_read_sysfs_file(const char *dir, const char *file, char *buf, size_t size); -#endif /* INFINIBAND_DRIVER_H */ +#endif /* RDMAV_DRIVER_H */ diff -ruNp ORG/libibverbs/include/infiniband/kern-abi.h NEW/libibverbs/include/infiniband/kern-abi.h --- ORG/libibverbs/include/infiniband/kern-abi.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/kern-abi.h 2006-08-02 18:24:49.000000000 -0700 @@ -47,47 +47,47 @@ /* * The minimum and maximum kernel ABI that we can handle. */ -#define IB_USER_VERBS_MIN_ABI_VERSION 1 -#define IB_USER_VERBS_MAX_ABI_VERSION 6 +#define RDMAV_USER_VERBS_MIN_ABI_VERSION 1 +#define RDMAV_USER_VERBS_MAX_ABI_VERSION 6 enum { - IB_USER_VERBS_CMD_GET_CONTEXT, - IB_USER_VERBS_CMD_QUERY_DEVICE, - IB_USER_VERBS_CMD_QUERY_PORT, - IB_USER_VERBS_CMD_ALLOC_PD, - IB_USER_VERBS_CMD_DEALLOC_PD, - IB_USER_VERBS_CMD_CREATE_AH, - IB_USER_VERBS_CMD_MODIFY_AH, - IB_USER_VERBS_CMD_QUERY_AH, - IB_USER_VERBS_CMD_DESTROY_AH, - IB_USER_VERBS_CMD_REG_MR, - IB_USER_VERBS_CMD_REG_SMR, - IB_USER_VERBS_CMD_REREG_MR, - IB_USER_VERBS_CMD_QUERY_MR, - IB_USER_VERBS_CMD_DEREG_MR, - IB_USER_VERBS_CMD_ALLOC_MW, - IB_USER_VERBS_CMD_BIND_MW, - IB_USER_VERBS_CMD_DEALLOC_MW, - IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL, - IB_USER_VERBS_CMD_CREATE_CQ, - IB_USER_VERBS_CMD_RESIZE_CQ, - IB_USER_VERBS_CMD_DESTROY_CQ, - IB_USER_VERBS_CMD_POLL_CQ, - IB_USER_VERBS_CMD_PEEK_CQ, - IB_USER_VERBS_CMD_REQ_NOTIFY_CQ, - IB_USER_VERBS_CMD_CREATE_QP, - IB_USER_VERBS_CMD_QUERY_QP, - IB_USER_VERBS_CMD_MODIFY_QP, - IB_USER_VERBS_CMD_DESTROY_QP, - IB_USER_VERBS_CMD_POST_SEND, - IB_USER_VERBS_CMD_POST_RECV, - IB_USER_VERBS_CMD_ATTACH_MCAST, - IB_USER_VERBS_CMD_DETACH_MCAST, - IB_USER_VERBS_CMD_CREATE_SRQ, - IB_USER_VERBS_CMD_MODIFY_SRQ, - IB_USER_VERBS_CMD_QUERY_SRQ, - IB_USER_VERBS_CMD_DESTROY_SRQ, - IB_USER_VERBS_CMD_POST_SRQ_RECV + RDMAV_USER_VERBS_CMD_GET_CONTEXT, + RDMAV_USER_VERBS_CMD_QUERY_DEVICE, + RDMAV_USER_VERBS_CMD_QUERY_PORT, + RDMAV_USER_VERBS_CMD_ALLOC_PD, + RDMAV_USER_VERBS_CMD_DEALLOC_PD, + RDMAV_USER_VERBS_CMD_CREATE_AH, + RDMAV_USER_VERBS_CMD_MODIFY_AH, + RDMAV_USER_VERBS_CMD_QUERY_AH, + RDMAV_USER_VERBS_CMD_DESTROY_AH, + RDMAV_USER_VERBS_CMD_REG_MR, + RDMAV_USER_VERBS_CMD_REG_SMR, + RDMAV_USER_VERBS_CMD_REREG_MR, + RDMAV_USER_VERBS_CMD_QUERY_MR, + RDMAV_USER_VERBS_CMD_DEREG_MR, + RDMAV_USER_VERBS_CMD_ALLOC_MW, + RDMAV_USER_VERBS_CMD_BIND_MW, + RDMAV_USER_VERBS_CMD_DEALLOC_MW, + RDMAV_USER_VERBS_CMD_CREATE_COMP_CHANNEL, + RDMAV_USER_VERBS_CMD_CREATE_CQ, + RDMAV_USER_VERBS_CMD_RESIZE_CQ, + RDMAV_USER_VERBS_CMD_DESTROY_CQ, + RDMAV_USER_VERBS_CMD_POLL_CQ, + RDMAV_USER_VERBS_CMD_PEEK_CQ, + RDMAV_USER_VERBS_CMD_REQ_NOTIFY_CQ, + RDMAV_USER_VERBS_CMD_CREATE_QP, + RDMAV_USER_VERBS_CMD_QUERY_QP, + RDMAV_USER_VERBS_CMD_MODIFY_QP, + RDMAV_USER_VERBS_CMD_DESTROY_QP, + RDMAV_USER_VERBS_CMD_POST_SEND, + RDMAV_USER_VERBS_CMD_POST_RECV, + RDMAV_USER_VERBS_CMD_ATTACH_MCAST, + RDMAV_USER_VERBS_CMD_DETACH_MCAST, + 
RDMAV_USER_VERBS_CMD_CREATE_SRQ, + RDMAV_USER_VERBS_CMD_MODIFY_SRQ, + RDMAV_USER_VERBS_CMD_QUERY_SRQ, + RDMAV_USER_VERBS_CMD_DESTROY_SRQ, + RDMAV_USER_VERBS_CMD_POST_SRQ_RECV }; /* @@ -101,13 +101,13 @@ enum { * different between 32-bit and 64-bit architectures. */ -struct ibv_kern_async_event { +struct rdmav_kern_async_event { __u64 element; __u32 event_type; __u32 reserved; }; -struct ibv_comp_event { +struct rdmav_comp_event { __u64 cq_handle; }; @@ -119,18 +119,18 @@ struct ibv_comp_event { * the rest of the command struct based on these value. */ -struct ibv_query_params { +struct rdmav_query_params { __u32 command; __u16 in_words; __u16 out_words; __u64 response; }; -struct ibv_query_params_resp { +struct rdmav_query_params_resp { __u32 num_cq_events; }; -struct ibv_get_context { +struct rdmav_get_context { __u32 command; __u16 in_words; __u16 out_words; @@ -138,12 +138,12 @@ struct ibv_get_context { __u64 driver_data[0]; }; -struct ibv_get_context_resp { +struct rdmav_get_context_resp { __u32 async_fd; __u32 num_comp_vectors; }; -struct ibv_query_device { +struct rdmav_query_device { __u32 command; __u16 in_words; __u16 out_words; @@ -151,7 +151,7 @@ struct ibv_query_device { __u64 driver_data[0]; }; -struct ibv_query_device_resp { +struct rdmav_query_device_resp { __u64 fw_ver; __u64 node_guid; __u64 sys_image_guid; @@ -195,7 +195,7 @@ struct ibv_query_device_resp { __u8 reserved[4]; }; -struct ibv_query_port { +struct rdmav_query_port { __u32 command; __u16 in_words; __u16 out_words; @@ -205,7 +205,7 @@ struct ibv_query_port { __u64 driver_data[0]; }; -struct ibv_query_port_resp { +struct rdmav_query_port_resp { __u32 port_cap_flags; __u32 max_msg_sz; __u32 bad_pkey_cntr; @@ -228,7 +228,7 @@ struct ibv_query_port_resp { __u8 reserved[3]; }; -struct ibv_alloc_pd { +struct rdmav_alloc_pd { __u32 command; __u16 in_words; __u16 out_words; @@ -236,18 +236,18 @@ struct ibv_alloc_pd { __u64 driver_data[0]; }; -struct ibv_alloc_pd_resp { +struct rdmav_alloc_pd_resp { __u32 pd_handle; }; -struct ibv_dealloc_pd { +struct rdmav_dealloc_pd { __u32 command; __u16 in_words; __u16 out_words; __u32 pd_handle; }; -struct ibv_reg_mr { +struct rdmav_reg_mr { __u32 command; __u16 in_words; __u16 out_words; @@ -260,31 +260,31 @@ struct ibv_reg_mr { __u64 driver_data[0]; }; -struct ibv_reg_mr_resp { +struct rdmav_reg_mr_resp { __u32 mr_handle; __u32 lkey; __u32 rkey; }; -struct ibv_dereg_mr { +struct rdmav_dereg_mr { __u32 command; __u16 in_words; __u16 out_words; __u32 mr_handle; }; -struct ibv_create_comp_channel { +struct rdmav_create_comp_channel { __u32 command; __u16 in_words; __u16 out_words; __u64 response; }; -struct ibv_create_comp_channel_resp { +struct rdmav_create_comp_channel_resp { __u32 fd; }; -struct ibv_create_cq { +struct rdmav_create_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -297,12 +297,12 @@ struct ibv_create_cq { __u64 driver_data[0]; }; -struct ibv_create_cq_resp { +struct rdmav_create_cq_resp { __u32 cq_handle; __u32 cqe; }; -struct ibv_kern_wc { +struct rdmav_kern_wc { __u64 wr_id; __u32 status; __u32 opcode; @@ -320,7 +320,7 @@ struct ibv_kern_wc { __u8 reserved; }; -struct ibv_poll_cq { +struct rdmav_poll_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -329,13 +329,13 @@ struct ibv_poll_cq { __u32 ne; }; -struct ibv_poll_cq_resp { +struct rdmav_poll_cq_resp { __u32 count; __u32 reserved; - struct ibv_kern_wc wc[0]; + struct rdmav_kern_wc wc[0]; }; -struct ibv_req_notify_cq { +struct rdmav_req_notify_cq { __u32 command; __u16 in_words; __u16 out_words; 
@@ -343,7 +343,7 @@ struct ibv_req_notify_cq { __u32 solicited; }; -struct ibv_resize_cq { +struct rdmav_resize_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -353,11 +353,11 @@ struct ibv_resize_cq { __u64 driver_data[0]; }; -struct ibv_resize_cq_resp { +struct rdmav_resize_cq_resp { __u32 cqe; }; -struct ibv_destroy_cq { +struct rdmav_destroy_cq { __u32 command; __u16 in_words; __u16 out_words; @@ -366,12 +366,12 @@ struct ibv_destroy_cq { __u32 reserved; }; -struct ibv_destroy_cq_resp { +struct rdmav_destroy_cq_resp { __u32 comp_events_reported; __u32 async_events_reported; }; -struct ibv_kern_global_route { +struct rdmav_kern_global_route { __u8 dgid[16]; __u32 flow_label; __u8 sgid_index; @@ -380,8 +380,8 @@ struct ibv_kern_global_route { __u8 reserved; }; -struct ibv_kern_ah_attr { - struct ibv_kern_global_route grh; +struct rdmav_kern_ah_attr { + struct rdmav_kern_global_route grh; __u16 dlid; __u8 sl; __u8 src_path_bits; @@ -391,7 +391,7 @@ struct ibv_kern_ah_attr { __u8 reserved; }; -struct ibv_kern_qp_attr { +struct rdmav_kern_qp_attr { __u32 qp_attr_mask; __u32 qp_state; __u32 cur_qp_state; @@ -403,8 +403,8 @@ struct ibv_kern_qp_attr { __u32 dest_qp_num; __u32 qp_access_flags; - struct ibv_kern_ah_attr ah_attr; - struct ibv_kern_ah_attr alt_ah_attr; + struct rdmav_kern_ah_attr ah_attr; + struct rdmav_kern_ah_attr alt_ah_attr; /* ib_qp_cap */ __u32 max_send_wr; @@ -429,7 +429,7 @@ struct ibv_kern_qp_attr { __u8 reserved[5]; }; -struct ibv_create_qp { +struct rdmav_create_qp { __u32 command; __u16 in_words; __u16 out_words; @@ -451,7 +451,7 @@ struct ibv_create_qp { __u64 driver_data[0]; }; -struct ibv_create_qp_resp { +struct rdmav_create_qp_resp { __u32 qp_handle; __u32 qpn; __u32 max_send_wr; @@ -462,7 +462,7 @@ struct ibv_create_qp_resp { __u32 reserved; }; -struct ibv_qp_dest { +struct rdmav_qp_dest { __u8 dgid[16]; __u32 flow_label; __u16 dlid; @@ -477,7 +477,7 @@ struct ibv_qp_dest { __u8 port_num; }; -struct ibv_query_qp { +struct rdmav_query_qp { __u32 command; __u16 in_words; __u16 out_words; @@ -487,9 +487,9 @@ struct ibv_query_qp { __u64 driver_data[0]; }; -struct ibv_query_qp_resp { - struct ibv_qp_dest dest; - struct ibv_qp_dest alt_dest; +struct rdmav_query_qp_resp { + struct rdmav_qp_dest dest; + struct rdmav_qp_dest alt_dest; __u32 max_send_wr; __u32 max_recv_wr; __u32 max_send_sge; @@ -521,12 +521,12 @@ struct ibv_query_qp_resp { __u64 driver_data[0]; }; -struct ibv_modify_qp { +struct rdmav_modify_qp { __u32 command; __u16 in_words; __u16 out_words; - struct ibv_qp_dest dest; - struct ibv_qp_dest alt_dest; + struct rdmav_qp_dest dest; + struct rdmav_qp_dest alt_dest; __u32 qp_handle; __u32 attr_mask; __u32 qkey; @@ -554,7 +554,7 @@ struct ibv_modify_qp { __u64 driver_data[0]; }; -struct ibv_destroy_qp { +struct rdmav_destroy_qp { __u32 command; __u16 in_words; __u16 out_words; @@ -563,11 +563,11 @@ struct ibv_destroy_qp { __u32 reserved; }; -struct ibv_destroy_qp_resp { +struct rdmav_destroy_qp_resp { __u32 events_reported; }; -struct ibv_kern_send_wr { +struct rdmav_kern_send_wr { __u64 wr_id; __u32 num_sge; __u32 opcode; @@ -595,7 +595,7 @@ struct ibv_kern_send_wr { } wr; }; -struct ibv_post_send { +struct rdmav_post_send { __u32 command; __u16 in_words; __u16 out_words; @@ -604,20 +604,20 @@ struct ibv_post_send { __u32 wr_count; __u32 sge_count; __u32 wqe_size; - struct ibv_kern_send_wr send_wr[0]; + struct rdmav_kern_send_wr send_wr[0]; }; -struct ibv_post_send_resp { +struct rdmav_post_send_resp { __u32 bad_wr; }; -struct ibv_kern_recv_wr { 
+struct rdmav_kern_recv_wr { __u64 wr_id; __u32 num_sge; __u32 reserved; }; -struct ibv_post_recv { +struct rdmav_post_recv { __u32 command; __u16 in_words; __u16 out_words; @@ -626,14 +626,14 @@ struct ibv_post_recv { __u32 wr_count; __u32 sge_count; __u32 wqe_size; - struct ibv_kern_recv_wr recv_wr[0]; + struct rdmav_kern_recv_wr recv_wr[0]; }; -struct ibv_post_recv_resp { +struct rdmav_post_recv_resp { __u32 bad_wr; }; -struct ibv_post_srq_recv { +struct rdmav_post_srq_recv { __u32 command; __u16 in_words; __u16 out_words; @@ -642,14 +642,14 @@ struct ibv_post_srq_recv { __u32 wr_count; __u32 sge_count; __u32 wqe_size; - struct ibv_kern_recv_wr recv_wr[0]; + struct rdmav_kern_recv_wr recv_wr[0]; }; -struct ibv_post_srq_recv_resp { +struct rdmav_post_srq_recv_resp { __u32 bad_wr; }; -struct ibv_create_ah { +struct rdmav_create_ah { __u32 command; __u16 in_words; __u16 out_words; @@ -657,21 +657,21 @@ struct ibv_create_ah { __u64 user_handle; __u32 pd_handle; __u32 reserved; - struct ibv_kern_ah_attr attr; + struct rdmav_kern_ah_attr attr; }; -struct ibv_create_ah_resp { +struct rdmav_create_ah_resp { __u32 handle; }; -struct ibv_destroy_ah { +struct rdmav_destroy_ah { __u32 command; __u16 in_words; __u16 out_words; __u32 ah_handle; }; -struct ibv_attach_mcast { +struct rdmav_attach_mcast { __u32 command; __u16 in_words; __u16 out_words; @@ -682,7 +682,7 @@ struct ibv_attach_mcast { __u64 driver_data[0]; }; -struct ibv_detach_mcast { +struct rdmav_detach_mcast { __u32 command; __u16 in_words; __u16 out_words; @@ -693,7 +693,7 @@ struct ibv_detach_mcast { __u64 driver_data[0]; }; -struct ibv_create_srq { +struct rdmav_create_srq { __u32 command; __u16 in_words; __u16 out_words; @@ -706,14 +706,14 @@ struct ibv_create_srq { __u64 driver_data[0]; }; -struct ibv_create_srq_resp { +struct rdmav_create_srq_resp { __u32 srq_handle; __u32 max_wr; __u32 max_sge; __u32 reserved; }; -struct ibv_modify_srq { +struct rdmav_modify_srq { __u32 command; __u16 in_words; __u16 out_words; @@ -724,7 +724,7 @@ struct ibv_modify_srq { __u64 driver_data[0]; }; -struct ibv_query_srq { +struct rdmav_query_srq { __u32 command; __u16 in_words; __u16 out_words; @@ -734,14 +734,14 @@ struct ibv_query_srq { __u64 driver_data[0]; }; -struct ibv_query_srq_resp { +struct rdmav_query_srq_resp { __u32 max_wr; __u32 max_sge; __u32 srq_limit; __u32 reserved; }; -struct ibv_destroy_srq { +struct rdmav_destroy_srq { __u32 command; __u16 in_words; __u16 out_words; @@ -750,7 +750,7 @@ struct ibv_destroy_srq { __u32 reserved; }; -struct ibv_destroy_srq_resp { +struct rdmav_destroy_srq_resp { __u32 events_reported; }; @@ -759,74 +759,74 @@ struct ibv_destroy_srq_resp { */ enum { - IB_USER_VERBS_CMD_QUERY_PARAMS_V2, - IB_USER_VERBS_CMD_GET_CONTEXT_V2, - IB_USER_VERBS_CMD_QUERY_DEVICE_V2, - IB_USER_VERBS_CMD_QUERY_PORT_V2, - IB_USER_VERBS_CMD_QUERY_GID_V2, - IB_USER_VERBS_CMD_QUERY_PKEY_V2, - IB_USER_VERBS_CMD_ALLOC_PD_V2, - IB_USER_VERBS_CMD_DEALLOC_PD_V2, - IB_USER_VERBS_CMD_CREATE_AH_V2, - IB_USER_VERBS_CMD_MODIFY_AH_V2, - IB_USER_VERBS_CMD_QUERY_AH_V2, - IB_USER_VERBS_CMD_DESTROY_AH_V2, - IB_USER_VERBS_CMD_REG_MR_V2, - IB_USER_VERBS_CMD_REG_SMR_V2, - IB_USER_VERBS_CMD_REREG_MR_V2, - IB_USER_VERBS_CMD_QUERY_MR_V2, - IB_USER_VERBS_CMD_DEREG_MR_V2, - IB_USER_VERBS_CMD_ALLOC_MW_V2, - IB_USER_VERBS_CMD_BIND_MW_V2, - IB_USER_VERBS_CMD_DEALLOC_MW_V2, - IB_USER_VERBS_CMD_CREATE_CQ_V2, - IB_USER_VERBS_CMD_RESIZE_CQ_V2, - IB_USER_VERBS_CMD_DESTROY_CQ_V2, - IB_USER_VERBS_CMD_POLL_CQ_V2, - IB_USER_VERBS_CMD_PEEK_CQ_V2, - 
IB_USER_VERBS_CMD_REQ_NOTIFY_CQ_V2, - IB_USER_VERBS_CMD_CREATE_QP_V2, - IB_USER_VERBS_CMD_QUERY_QP_V2, - IB_USER_VERBS_CMD_MODIFY_QP_V2, - IB_USER_VERBS_CMD_DESTROY_QP_V2, - IB_USER_VERBS_CMD_POST_SEND_V2, - IB_USER_VERBS_CMD_POST_RECV_V2, - IB_USER_VERBS_CMD_ATTACH_MCAST_V2, - IB_USER_VERBS_CMD_DETACH_MCAST_V2, - IB_USER_VERBS_CMD_CREATE_SRQ_V2, - IB_USER_VERBS_CMD_MODIFY_SRQ_V2, - IB_USER_VERBS_CMD_QUERY_SRQ_V2, - IB_USER_VERBS_CMD_DESTROY_SRQ_V2, - IB_USER_VERBS_CMD_POST_SRQ_RECV_V2, + RDMAV_USER_VERBS_CMD_QUERY_PARAMS_V2, + RDMAV_USER_VERBS_CMD_GET_CONTEXT_V2, + RDMAV_USER_VERBS_CMD_QUERY_DEVICE_V2, + RDMAV_USER_VERBS_CMD_QUERY_PORT_V2, + RDMAV_USER_VERBS_CMD_QUERY_GID_V2, + RDMAV_USER_VERBS_CMD_QUERY_PKEY_V2, + RDMAV_USER_VERBS_CMD_ALLOC_PD_V2, + RDMAV_USER_VERBS_CMD_DEALLOC_PD_V2, + RDMAV_USER_VERBS_CMD_CREATE_AH_V2, + RDMAV_USER_VERBS_CMD_MODIFY_AH_V2, + RDMAV_USER_VERBS_CMD_QUERY_AH_V2, + RDMAV_USER_VERBS_CMD_DESTROY_AH_V2, + RDMAV_USER_VERBS_CMD_REG_MR_V2, + RDMAV_USER_VERBS_CMD_REG_SMR_V2, + RDMAV_USER_VERBS_CMD_REREG_MR_V2, + RDMAV_USER_VERBS_CMD_QUERY_MR_V2, + RDMAV_USER_VERBS_CMD_DEREG_MR_V2, + RDMAV_USER_VERBS_CMD_ALLOC_MW_V2, + RDMAV_USER_VERBS_CMD_BIND_MW_V2, + RDMAV_USER_VERBS_CMD_DEALLOC_MW_V2, + RDMAV_USER_VERBS_CMD_CREATE_CQ_V2, + RDMAV_USER_VERBS_CMD_RESIZE_CQ_V2, + RDMAV_USER_VERBS_CMD_DESTROY_CQ_V2, + RDMAV_USER_VERBS_CMD_POLL_CQ_V2, + RDMAV_USER_VERBS_CMD_PEEK_CQ_V2, + RDMAV_USER_VERBS_CMD_REQ_NOTIFY_CQ_V2, + RDMAV_USER_VERBS_CMD_CREATE_QP_V2, + RDMAV_USER_VERBS_CMD_QUERY_QP_V2, + RDMAV_USER_VERBS_CMD_MODIFY_QP_V2, + RDMAV_USER_VERBS_CMD_DESTROY_QP_V2, + RDMAV_USER_VERBS_CMD_POST_SEND_V2, + RDMAV_USER_VERBS_CMD_POST_RECV_V2, + RDMAV_USER_VERBS_CMD_ATTACH_MCAST_V2, + RDMAV_USER_VERBS_CMD_DETACH_MCAST_V2, + RDMAV_USER_VERBS_CMD_CREATE_SRQ_V2, + RDMAV_USER_VERBS_CMD_MODIFY_SRQ_V2, + RDMAV_USER_VERBS_CMD_QUERY_SRQ_V2, + RDMAV_USER_VERBS_CMD_DESTROY_SRQ_V2, + RDMAV_USER_VERBS_CMD_POST_SRQ_RECV_V2, /* * Set commands that didn't exist to -1 so our compile-time * trick opcodes in IBV_INIT_CMD() doesn't break. 
*/ - IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL_V2 = -1, + RDMAV_USER_VERBS_CMD_CREATE_COMP_CHANNEL_V2 = -1, }; -struct ibv_destroy_cq_v1 { +struct rdmav_destroy_cq_v1 { __u32 command; __u16 in_words; __u16 out_words; __u32 cq_handle; }; -struct ibv_destroy_qp_v1 { +struct rdmav_destroy_qp_v1 { __u32 command; __u16 in_words; __u16 out_words; __u32 qp_handle; }; -struct ibv_destroy_srq_v1 { +struct rdmav_destroy_srq_v1 { __u32 command; __u16 in_words; __u16 out_words; __u32 srq_handle; }; -struct ibv_get_context_v2 { +struct rdmav_get_context_v2 { __u32 command; __u16 in_words; __u16 out_words; @@ -835,7 +835,7 @@ struct ibv_get_context_v2 { __u64 driver_data[0]; }; -struct ibv_create_cq_v2 { +struct rdmav_create_cq_v2 { __u32 command; __u16 in_words; __u16 out_words; @@ -846,7 +846,7 @@ struct ibv_create_cq_v2 { __u64 driver_data[0]; }; -struct ibv_modify_srq_v3 { +struct rdmav_modify_srq_v3 { __u32 command; __u16 in_words; __u16 out_words; @@ -859,12 +859,12 @@ struct ibv_modify_srq_v3 { __u64 driver_data[0]; }; -struct ibv_create_qp_resp_v3 { +struct rdmav_create_qp_resp_v3 { __u32 qp_handle; __u32 qpn; }; -struct ibv_create_qp_resp_v4 { +struct rdmav_create_qp_resp_v4 { __u32 qp_handle; __u32 qpn; __u32 max_send_wr; @@ -874,7 +874,7 @@ struct ibv_create_qp_resp_v4 { __u32 max_inline_data; }; -struct ibv_create_srq_resp_v5 { +struct rdmav_create_srq_resp_v5 { __u32 srq_handle; }; diff -ruNp ORG/libibverbs/include/infiniband/marshall.h NEW/libibverbs/include/infiniband/marshall.h --- ORG/libibverbs/include/infiniband/marshall.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/marshall.h 2006-08-02 18:24:49.000000000 -0700 @@ -30,8 +30,8 @@ * SOFTWARE. */ -#ifndef INFINIBAND_MARSHALL_H -#define INFINIBAND_MARSHALL_H +#ifndef RDMAV_MARSHALL_H +#define RDMAV_MARSHALL_H #include #include @@ -48,18 +48,18 @@ BEGIN_C_DECLS -void ibv_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, - struct ibv_kern_qp_attr *src); +void rdmav_copy_qp_attr_from_kern(struct rdmav_qp_attr *dst, + struct rdmav_kern_qp_attr *src); -void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, - struct ibv_kern_ah_attr *src); +void rdmav_copy_ah_attr_from_kern(struct rdmav_ah_attr *dst, + struct rdmav_kern_ah_attr *src); -void ibv_copy_path_rec_from_kern(struct ibv_sa_path_rec *dst, - struct ibv_kern_path_rec *src); +void rdmav_copy_path_rec_from_kern(struct rdmav_sa_path_rec *dst, + struct rdmav_kern_path_rec *src); -void ibv_copy_path_rec_to_kern(struct ibv_kern_path_rec *dst, - struct ibv_sa_path_rec *src); +void rdmav_copy_path_rec_to_kern(struct rdmav_kern_path_rec *dst, + struct rdmav_sa_path_rec *src); END_C_DECLS -#endif /* INFINIBAND_MARSHALL_H */ +#endif /* RDMAV_MARSHALL_H */ diff -ruNp ORG/libibverbs/include/infiniband/opcode.h NEW/libibverbs/include/infiniband/opcode.h --- ORG/libibverbs/include/infiniband/opcode.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/opcode.h 2006-08-03 17:42:30.000000000 -0700 @@ -32,118 +32,119 @@ * $Id: opcode.h 1989 2005-03-14 20:25:13Z roland $ */ -#ifndef INFINIBAND_OPCODE_H -#define INFINIBAND_OPCODE_H +#ifndef RDMAV_OPCODE_H +#define RDMAV_OPCODE_H /* * This macro cleans up the definitions of constants for BTH opcodes. - * It is used to define constants such as IBV_OPCODE_UD_SEND_ONLY, - * which becomes IBV_OPCODE_UD + IBV_OPCODE_SEND_ONLY, and this gives + * It is used to define constants such as RDMAV_OPCODE_UD_SEND_ONLY, + * which becomes RDMAV_OPCODE_UD + RDMAV_OPCODE_SEND_ONLY, and this gives * the correct value. 
* * In short, user code should use the constants defined using the * macro rather than worrying about adding together other constants. */ -#define IBV_OPCODE(transport, op) \ - IBV_OPCODE_ ## transport ## _ ## op = \ - IBV_OPCODE_ ## transport + IBV_OPCODE_ ## op + +#define RDMAV_OPCODE(transport, op) \ + RDMAV_OPCODE_ ## transport ## _ ## op = \ + RDMAV_OPCODE_ ## transport + RDMAV_OPCODE_ ## op enum { /* transport types -- just used to define real constants */ - IBV_OPCODE_RC = 0x00, - IBV_OPCODE_UC = 0x20, - IBV_OPCODE_RD = 0x40, - IBV_OPCODE_UD = 0x60, + RDMAV_OPCODE_RC = 0x00, + RDMAV_OPCODE_UC = 0x20, + RDMAV_OPCODE_RD = 0x40, + RDMAV_OPCODE_UD = 0x60, /* operations -- just used to define real constants */ - IBV_OPCODE_SEND_FIRST = 0x00, - IBV_OPCODE_SEND_MIDDLE = 0x01, - IBV_OPCODE_SEND_LAST = 0x02, - IBV_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, - IBV_OPCODE_SEND_ONLY = 0x04, - IBV_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, - IBV_OPCODE_RDMA_WRITE_FIRST = 0x06, - IBV_OPCODE_RDMA_WRITE_MIDDLE = 0x07, - IBV_OPCODE_RDMA_WRITE_LAST = 0x08, - IBV_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, - IBV_OPCODE_RDMA_WRITE_ONLY = 0x0a, - IBV_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, - IBV_OPCODE_RDMA_READ_REQUEST = 0x0c, - IBV_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, - IBV_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, - IBV_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, - IBV_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, - IBV_OPCODE_ACKNOWLEDGE = 0x11, - IBV_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, - IBV_OPCODE_COMPARE_SWAP = 0x13, - IBV_OPCODE_FETCH_ADD = 0x14, + RDMAV_OPCODE_SEND_FIRST = 0x00, + RDMAV_OPCODE_SEND_MIDDLE = 0x01, + RDMAV_OPCODE_SEND_LAST = 0x02, + RDMAV_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + RDMAV_OPCODE_SEND_ONLY = 0x04, + RDMAV_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + RDMAV_OPCODE_RDMA_WRITE_FIRST = 0x06, + RDMAV_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + RDMAV_OPCODE_RDMA_WRITE_LAST = 0x08, + RDMAV_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + RDMAV_OPCODE_RDMA_WRITE_ONLY = 0x0a, + RDMAV_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + RDMAV_OPCODE_RDMA_READ_REQUEST = 0x0c, + RDMAV_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + RDMAV_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + RDMAV_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + RDMAV_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + RDMAV_OPCODE_ACKNOWLEDGE = 0x11, + RDMAV_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + RDMAV_OPCODE_COMPARE_SWAP = 0x13, + RDMAV_OPCODE_FETCH_ADD = 0x14, - /* real constants follow -- see comment about above IBV_OPCODE() + /* real constants follow -- see comment about above RDMAV_OPCODE() macro for more details */ /* RC */ - IBV_OPCODE(RC, SEND_FIRST), - IBV_OPCODE(RC, SEND_MIDDLE), - IBV_OPCODE(RC, SEND_LAST), - IBV_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), - IBV_OPCODE(RC, SEND_ONLY), - IBV_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(RC, RDMA_WRITE_FIRST), - IBV_OPCODE(RC, RDMA_WRITE_MIDDLE), - IBV_OPCODE(RC, RDMA_WRITE_LAST), - IBV_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), - IBV_OPCODE(RC, RDMA_WRITE_ONLY), - IBV_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(RC, RDMA_READ_REQUEST), - IBV_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), - IBV_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), - IBV_OPCODE(RC, RDMA_READ_RESPONSE_LAST), - IBV_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), - IBV_OPCODE(RC, ACKNOWLEDGE), - IBV_OPCODE(RC, ATOMIC_ACKNOWLEDGE), - IBV_OPCODE(RC, COMPARE_SWAP), - IBV_OPCODE(RC, FETCH_ADD), + RDMAV_OPCODE(RC, SEND_FIRST), + RDMAV_OPCODE(RC, SEND_MIDDLE), + RDMAV_OPCODE(RC, SEND_LAST), + RDMAV_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + 
RDMAV_OPCODE(RC, SEND_ONLY), + RDMAV_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(RC, RDMA_WRITE_FIRST), + RDMAV_OPCODE(RC, RDMA_WRITE_MIDDLE), + RDMAV_OPCODE(RC, RDMA_WRITE_LAST), + RDMAV_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(RC, RDMA_WRITE_ONLY), + RDMAV_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(RC, RDMA_READ_REQUEST), + RDMAV_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + RDMAV_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + RDMAV_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + RDMAV_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + RDMAV_OPCODE(RC, ACKNOWLEDGE), + RDMAV_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + RDMAV_OPCODE(RC, COMPARE_SWAP), + RDMAV_OPCODE(RC, FETCH_ADD), /* UC */ - IBV_OPCODE(UC, SEND_FIRST), - IBV_OPCODE(UC, SEND_MIDDLE), - IBV_OPCODE(UC, SEND_LAST), - IBV_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), - IBV_OPCODE(UC, SEND_ONLY), - IBV_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(UC, RDMA_WRITE_FIRST), - IBV_OPCODE(UC, RDMA_WRITE_MIDDLE), - IBV_OPCODE(UC, RDMA_WRITE_LAST), - IBV_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), - IBV_OPCODE(UC, RDMA_WRITE_ONLY), - IBV_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(UC, SEND_FIRST), + RDMAV_OPCODE(UC, SEND_MIDDLE), + RDMAV_OPCODE(UC, SEND_LAST), + RDMAV_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(UC, SEND_ONLY), + RDMAV_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(UC, RDMA_WRITE_FIRST), + RDMAV_OPCODE(UC, RDMA_WRITE_MIDDLE), + RDMAV_OPCODE(UC, RDMA_WRITE_LAST), + RDMAV_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(UC, RDMA_WRITE_ONLY), + RDMAV_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), /* RD */ - IBV_OPCODE(RD, SEND_FIRST), - IBV_OPCODE(RD, SEND_MIDDLE), - IBV_OPCODE(RD, SEND_LAST), - IBV_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), - IBV_OPCODE(RD, SEND_ONLY), - IBV_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(RD, RDMA_WRITE_FIRST), - IBV_OPCODE(RD, RDMA_WRITE_MIDDLE), - IBV_OPCODE(RD, RDMA_WRITE_LAST), - IBV_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), - IBV_OPCODE(RD, RDMA_WRITE_ONLY), - IBV_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), - IBV_OPCODE(RD, RDMA_READ_REQUEST), - IBV_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), - IBV_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), - IBV_OPCODE(RD, RDMA_READ_RESPONSE_LAST), - IBV_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), - IBV_OPCODE(RD, ACKNOWLEDGE), - IBV_OPCODE(RD, ATOMIC_ACKNOWLEDGE), - IBV_OPCODE(RD, COMPARE_SWAP), - IBV_OPCODE(RD, FETCH_ADD), + RDMAV_OPCODE(RD, SEND_FIRST), + RDMAV_OPCODE(RD, SEND_MIDDLE), + RDMAV_OPCODE(RD, SEND_LAST), + RDMAV_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(RD, SEND_ONLY), + RDMAV_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(RD, RDMA_WRITE_FIRST), + RDMAV_OPCODE(RD, RDMA_WRITE_MIDDLE), + RDMAV_OPCODE(RD, RDMA_WRITE_LAST), + RDMAV_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + RDMAV_OPCODE(RD, RDMA_WRITE_ONLY), + RDMAV_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + RDMAV_OPCODE(RD, RDMA_READ_REQUEST), + RDMAV_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + RDMAV_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + RDMAV_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + RDMAV_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + RDMAV_OPCODE(RD, ACKNOWLEDGE), + RDMAV_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + RDMAV_OPCODE(RD, COMPARE_SWAP), + RDMAV_OPCODE(RD, FETCH_ADD), /* UD */ - IBV_OPCODE(UD, SEND_ONLY), - IBV_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) + RDMAV_OPCODE(UD, SEND_ONLY), + RDMAV_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) }; -#endif /* INFINIBAND_OPCODE_H */ +#endif /* RDMAV_OPCODE_H */ diff -ruNp ORG/libibverbs/include/infiniband/sa-kern-abi.h 
NEW/libibverbs/include/infiniband/sa-kern-abi.h --- ORG/libibverbs/include/infiniband/sa-kern-abi.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/sa-kern-abi.h 2006-08-02 18:24:49.000000000 -0700 @@ -30,8 +30,8 @@ * SOFTWARE. */ -#ifndef INFINIBAND_SA_KERN_ABI_H -#define INFINIBAND_SA_KERN_ABI_H +#ifndef RDMAV_SA_KERN_ABI_H +#define RDMAV_SA_KERN_ABI_H #include @@ -40,7 +40,7 @@ */ #define ib_kern_path_rec ibv_kern_path_rec -struct ibv_kern_path_rec { +struct rdmav_kern_path_rec { __u8 dgid[16]; __u8 sgid[16]; __u16 dlid; @@ -62,4 +62,4 @@ struct ibv_kern_path_rec { __u8 preference; }; -#endif /* INFINIBAND_SA_KERN_ABI_H */ +#endif /* RDMAV_SA_KERN_ABI_H */ diff -ruNp ORG/libibverbs/include/infiniband/sa.h NEW/libibverbs/include/infiniband/sa.h --- ORG/libibverbs/include/infiniband/sa.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/sa.h 2006-08-02 18:24:49.000000000 -0700 @@ -33,16 +33,16 @@ * $Id: sa.h 2616 2005-06-15 15:22:39Z halr $ */ -#ifndef INFINIBAND_SA_H -#define INFINIBAND_SA_H +#ifndef RDMAV_SA_H +#define RDMAV_SA_H #include -struct ibv_sa_path_rec { +struct rdmav_sa_path_rec { /* reserved */ /* reserved */ - union ibv_gid dgid; - union ibv_gid sgid; + union rdmav_gid dgid; + union rdmav_gid sgid; uint16_t dlid; uint16_t slid; int raw_traffic; @@ -64,9 +64,9 @@ struct ibv_sa_path_rec { uint8_t preference; }; -struct ibv_sa_mcmember_rec { - union ibv_gid mgid; - union ibv_gid port_gid; +struct rdmav_sa_mcmember_rec { + union rdmav_gid mgid; + union rdmav_gid port_gid; uint32_t qkey; uint16_t mlid; uint8_t mtu_selector; @@ -85,9 +85,9 @@ struct ibv_sa_mcmember_rec { int proxy_join; }; -struct ibv_sa_service_rec { +struct rdmav_sa_service_rec { uint64_t id; - union ibv_gid gid; + union rdmav_gid gid; uint16_t pkey; /* uint16_t resv; */ uint32_t lease; @@ -99,4 +99,4 @@ struct ibv_sa_service_rec { uint64_t data64[2]; }; -#endif /* INFINIBAND_SA_H */ +#endif /* RDMAV_SA_H */ diff -ruNp ORG/libibverbs/include/infiniband/verbs.h NEW/libibverbs/include/infiniband/verbs.h --- ORG/libibverbs/include/infiniband/verbs.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/include/infiniband/verbs.h 2006-08-03 17:29:08.000000000 -0700 @@ -35,11 +35,12 @@ * $Id: verbs.h 8076 2006-06-16 18:26:34Z sean.hefty $ */ -#ifndef INFINIBAND_VERBS_H -#define INFINIBAND_VERBS_H +#ifndef RDMAV_VERBS_H +#define RDMAV_VERBS_H #include #include +#include "deprecate.h" #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -57,7 +58,7 @@ BEGIN_C_DECLS -union ibv_gid { +union rdmav_gid { uint8_t raw[16]; struct { uint64_t subnet_prefix; @@ -65,37 +66,37 @@ union ibv_gid { } global; }; -enum ibv_node_type { - IBV_NODE_CA = 1, - IBV_NODE_SWITCH, - IBV_NODE_ROUTER -}; - -enum ibv_device_cap_flags { - IBV_DEVICE_RESIZE_MAX_WR = 1, - IBV_DEVICE_BAD_PKEY_CNTR = 1 << 1, - IBV_DEVICE_BAD_QKEY_CNTR = 1 << 2, - IBV_DEVICE_RAW_MULTI = 1 << 3, - IBV_DEVICE_AUTO_PATH_MIG = 1 << 4, - IBV_DEVICE_CHANGE_PHY_PORT = 1 << 5, - IBV_DEVICE_UD_AV_PORT_ENFORCE = 1 << 6, - IBV_DEVICE_CURR_QP_STATE_MOD = 1 << 7, - IBV_DEVICE_SHUTDOWN_PORT = 1 << 8, - IBV_DEVICE_INIT_TYPE = 1 << 9, - IBV_DEVICE_PORT_ACTIVE_EVENT = 1 << 10, - IBV_DEVICE_SYS_IMAGE_GUID = 1 << 11, - IBV_DEVICE_RC_RNR_NAK_GEN = 1 << 12, - IBV_DEVICE_SRQ_RESIZE = 1 << 13, - IBV_DEVICE_N_NOTIFY_CQ = 1 << 14 -}; - -enum ibv_atomic_cap { - IBV_ATOMIC_NONE, - IBV_ATOMIC_HCA, - IBV_ATOMIC_GLOB +enum rdmav_node_type { + RDMAV_NODE_CA = 1, + RDMAV_NODE_SWITCH, + RDMAV_NODE_ROUTER +}; + +enum rdmav_device_cap_flags { + 
RDMAV_DEVICE_RESIZE_MAX_WR = 1, + RDMAV_DEVICE_BAD_PKEY_CNTR = 1 << 1, + RDMAV_DEVICE_BAD_QKEY_CNTR = 1 << 2, + RDMAV_DEVICE_RAW_MULTI = 1 << 3, + RDMAV_DEVICE_AUTO_PATH_MIG = 1 << 4, + RDMAV_DEVICE_CHANGE_PHY_PORT = 1 << 5, + RDMAV_DEVICE_UD_AV_PORT_ENFORCE = 1 << 6, + RDMAV_DEVICE_CURR_QP_STATE_MOD = 1 << 7, + RDMAV_DEVICE_SHUTDOWN_PORT = 1 << 8, + RDMAV_DEVICE_INIT_TYPE = 1 << 9, + RDMAV_DEVICE_PORT_ACTIVE_EVENT = 1 << 10, + RDMAV_DEVICE_SYS_IMAGE_GUID = 1 << 11, + RDMAV_DEVICE_RC_RNR_NAK_GEN = 1 << 12, + RDMAV_DEVICE_SRQ_RESIZE = 1 << 13, + RDMAV_DEVICE_N_NOTIFY_CQ = 1 << 14 +}; + +enum rdmav_atomic_cap { + RDMAV_ATOMIC_NONE, + RDMAV_ATOMIC_HCA, + RDMAV_ATOMIC_GLOB }; -struct ibv_device_attr { +struct rdmav_device_attr { char fw_ver[64]; uint64_t node_guid; uint64_t sys_image_guid; @@ -118,7 +119,7 @@ struct ibv_device_attr { int max_res_rd_atom; int max_qp_init_rd_atom; int max_ee_init_rd_atom; - enum ibv_atomic_cap atomic_cap; + enum rdmav_atomic_cap atomic_cap; int max_ee; int max_rdd; int max_mw; @@ -138,27 +139,27 @@ struct ibv_device_attr { uint8_t phys_port_cnt; }; -enum ibv_mtu { - IBV_MTU_256 = 1, - IBV_MTU_512 = 2, - IBV_MTU_1024 = 3, - IBV_MTU_2048 = 4, - IBV_MTU_4096 = 5 -}; - -enum ibv_port_state { - IBV_PORT_NOP = 0, - IBV_PORT_DOWN = 1, - IBV_PORT_INIT = 2, - IBV_PORT_ARMED = 3, - IBV_PORT_ACTIVE = 4, - IBV_PORT_ACTIVE_DEFER = 5 -}; - -struct ibv_port_attr { - enum ibv_port_state state; - enum ibv_mtu max_mtu; - enum ibv_mtu active_mtu; +enum rdmav_mtu { + RDMAV_MTU_256 = 1, + RDMAV_MTU_512 = 2, + RDMAV_MTU_1024 = 3, + RDMAV_MTU_2048 = 4, + RDMAV_MTU_4096 = 5 +}; + +enum rdmav_port_state { + RDMAV_PORT_NOP = 0, + RDMAV_PORT_DOWN = 1, + RDMAV_PORT_INIT = 2, + RDMAV_PORT_ARMED = 3, + RDMAV_PORT_ACTIVE = 4, + RDMAV_PORT_ACTIVE_DEFER = 5 +}; + +struct rdmav_port_attr { + enum rdmav_port_state state; + enum rdmav_mtu max_mtu; + enum rdmav_mtu active_mtu; int gid_tbl_len; uint32_t port_cap_flags; uint32_t max_msg_sz; @@ -177,165 +178,165 @@ struct ibv_port_attr { uint8_t phys_state; }; -enum ibv_event_type { - IBV_EVENT_CQ_ERR, - IBV_EVENT_QP_FATAL, - IBV_EVENT_QP_REQ_ERR, - IBV_EVENT_QP_ACCESS_ERR, - IBV_EVENT_COMM_EST, - IBV_EVENT_SQ_DRAINED, - IBV_EVENT_PATH_MIG, - IBV_EVENT_PATH_MIG_ERR, - IBV_EVENT_DEVICE_FATAL, - IBV_EVENT_PORT_ACTIVE, - IBV_EVENT_PORT_ERR, - IBV_EVENT_LID_CHANGE, - IBV_EVENT_PKEY_CHANGE, - IBV_EVENT_SM_CHANGE, - IBV_EVENT_SRQ_ERR, - IBV_EVENT_SRQ_LIMIT_REACHED, - IBV_EVENT_QP_LAST_WQE_REACHED, - IBV_EVENT_CLIENT_REREGISTER +enum rdmav_event_type { + RDMAV_EVENT_CQ_ERR, + RDMAV_EVENT_QP_FATAL, + RDMAV_EVENT_QP_REQ_ERR, + RDMAV_EVENT_QP_ACCESS_ERR, + RDMAV_EVENT_COMM_EST, + RDMAV_EVENT_SQ_DRAINED, + RDMAV_EVENT_PATH_MIG, + RDMAV_EVENT_PATH_MIG_ERR, + RDMAV_EVENT_DEVICE_FATAL, + RDMAV_EVENT_PORT_ACTIVE, + RDMAV_EVENT_PORT_ERR, + RDMAV_EVENT_LID_CHANGE, + RDMAV_EVENT_PKEY_CHANGE, + RDMAV_EVENT_SM_CHANGE, + RDMAV_EVENT_SRQ_ERR, + RDMAV_EVENT_SRQ_LIMIT_REACHED, + RDMAV_EVENT_QP_LAST_WQE_REACHED, + RDMAV_EVENT_CLIENT_REREGISTER }; -struct ibv_async_event { +struct rdmav_async_event { union { - struct ibv_cq *cq; - struct ibv_qp *qp; - struct ibv_srq *srq; + struct rdmav_cq *cq; + struct rdmav_qp *qp; + struct rdmav_srq *srq; int port_num; } element; - enum ibv_event_type event_type; + enum rdmav_event_type event_type; }; -enum ibv_wc_status { - IBV_WC_SUCCESS, - IBV_WC_LOC_LEN_ERR, - IBV_WC_LOC_QP_OP_ERR, - IBV_WC_LOC_EEC_OP_ERR, - IBV_WC_LOC_PROT_ERR, - IBV_WC_WR_FLUSH_ERR, - IBV_WC_MW_BIND_ERR, - IBV_WC_BAD_RESP_ERR, - IBV_WC_LOC_ACCESS_ERR, - 
IBV_WC_REM_INV_REQ_ERR, - IBV_WC_REM_ACCESS_ERR, - IBV_WC_REM_OP_ERR, - IBV_WC_RETRY_EXC_ERR, - IBV_WC_RNR_RETRY_EXC_ERR, - IBV_WC_LOC_RDD_VIOL_ERR, - IBV_WC_REM_INV_RD_REQ_ERR, - IBV_WC_REM_ABORT_ERR, - IBV_WC_INV_EECN_ERR, - IBV_WC_INV_EEC_STATE_ERR, - IBV_WC_FATAL_ERR, - IBV_WC_RESP_TIMEOUT_ERR, - IBV_WC_GENERAL_ERR -}; - -enum ibv_wc_opcode { - IBV_WC_SEND, - IBV_WC_RDMA_WRITE, - IBV_WC_RDMA_READ, - IBV_WC_COMP_SWAP, - IBV_WC_FETCH_ADD, - IBV_WC_BIND_MW, +enum rdmav_wc_status { + RDMAV_WC_SUCCESS, + RDMAV_WC_LOC_LEN_ERR, + RDMAV_WC_LOC_QP_OP_ERR, + RDMAV_WC_LOC_EEC_OP_ERR, + RDMAV_WC_LOC_PROT_ERR, + RDMAV_WC_WR_FLUSH_ERR, + RDMAV_WC_MW_BIND_ERR, + RDMAV_WC_BAD_RESP_ERR, + RDMAV_WC_LOC_ACCESS_ERR, + RDMAV_WC_REM_INV_REQ_ERR, + RDMAV_WC_REM_ACCESS_ERR, + RDMAV_WC_REM_OP_ERR, + RDMAV_WC_RETRY_EXC_ERR, + RDMAV_WC_RNR_RETRY_EXC_ERR, + RDMAV_WC_LOC_RDD_VIOL_ERR, + RDMAV_WC_REM_INV_RD_REQ_ERR, + RDMAV_WC_REM_ABORT_ERR, + RDMAV_WC_INV_EECN_ERR, + RDMAV_WC_INV_EEC_STATE_ERR, + RDMAV_WC_FATAL_ERR, + RDMAV_WC_RESP_TIMEOUT_ERR, + RDMAV_WC_GENERAL_ERR +}; + +enum rdmav_wc_opcode { + RDMAV_WC_SEND, + RDMAV_WC_RDMA_WRITE, + RDMAV_WC_RDMA_READ, + RDMAV_WC_COMP_SWAP, + RDMAV_WC_FETCH_ADD, + RDMAV_WC_BIND_MW, /* - * Set value of IBV_WC_RECV so consumers can test if a completion is a - * receive by testing (opcode & IBV_WC_RECV). + * Set value of RDMAV_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & RDMAV_WC_RECV). */ - IBV_WC_RECV = 1 << 7, - IBV_WC_RECV_RDMA_WITH_IMM + RDMAV_WC_RECV = 1 << 7, + RDMAV_WC_RECV_RDMA_WITH_IMM }; -enum ibv_wc_flags { - IBV_WC_GRH = 1 << 0, - IBV_WC_WITH_IMM = 1 << 1 +enum rdmav_wc_flags { + RDMAV_WC_GRH = 1 << 0, + RDMAV_WC_WITH_IMM = 1 << 1 }; -struct ibv_wc { +struct rdmav_wc { uint64_t wr_id; - enum ibv_wc_status status; - enum ibv_wc_opcode opcode; + enum rdmav_wc_status status; + enum rdmav_wc_opcode opcode; uint32_t vendor_err; uint32_t byte_len; uint32_t imm_data; /* in network byte order */ uint32_t qp_num; uint32_t src_qp; - enum ibv_wc_flags wc_flags; + enum rdmav_wc_flags wc_flags; uint16_t pkey_index; uint16_t slid; uint8_t sl; uint8_t dlid_path_bits; }; -enum ibv_access_flags { - IBV_ACCESS_LOCAL_WRITE = 1, - IBV_ACCESS_REMOTE_WRITE = (1<<1), - IBV_ACCESS_REMOTE_READ = (1<<2), - IBV_ACCESS_REMOTE_ATOMIC = (1<<3), - IBV_ACCESS_MW_BIND = (1<<4) +enum rdmav_access_flags { + RDMAV_ACCESS_LOCAL_WRITE = 1, + RDMAV_ACCESS_REMOTE_WRITE = (1<<1), + RDMAV_ACCESS_REMOTE_READ = (1<<2), + RDMAV_ACCESS_REMOTE_ATOMIC = (1<<3), + RDMAV_ACCESS_MW_BIND = (1<<4) }; -struct ibv_pd { - struct ibv_context *context; +struct rdmav_pd { + struct rdmav_context *context; uint32_t handle; }; -struct ibv_mr { - struct ibv_context *context; - struct ibv_pd *pd; +struct rdmav_mr { + struct rdmav_context *context; + struct rdmav_pd *pd; uint32_t handle; uint32_t lkey; uint32_t rkey; }; -struct ibv_global_route { - union ibv_gid dgid; +struct rdmav_global_route { + union rdmav_gid dgid; uint32_t flow_label; uint8_t sgid_index; uint8_t hop_limit; uint8_t traffic_class; }; -struct ibv_grh { +struct rdmav_grh { uint32_t version_tclass_flow; uint16_t paylen; uint8_t next_hdr; uint8_t hop_limit; - union ibv_gid sgid; - union ibv_gid dgid; + union rdmav_gid sgid; + union rdmav_gid dgid; }; -enum ibv_rate { - IBV_RATE_MAX = 0, - IBV_RATE_2_5_GBPS = 2, - IBV_RATE_5_GBPS = 5, - IBV_RATE_10_GBPS = 3, - IBV_RATE_20_GBPS = 6, - IBV_RATE_30_GBPS = 4, - IBV_RATE_40_GBPS = 7, - IBV_RATE_60_GBPS = 8, - IBV_RATE_80_GBPS = 9, - IBV_RATE_120_GBPS = 10 +enum rdmav_rate { + 
RDMAV_RATE_MAX = 0, + RDMAV_RATE_2_5_GBPS = 2, + RDMAV_RATE_5_GBPS = 5, + RDMAV_RATE_10_GBPS = 3, + RDMAV_RATE_20_GBPS = 6, + RDMAV_RATE_30_GBPS = 4, + RDMAV_RATE_40_GBPS = 7, + RDMAV_RATE_60_GBPS = 8, + RDMAV_RATE_80_GBPS = 9, + RDMAV_RATE_120_GBPS = 10 }; /** - * ibv_rate_to_mult - Convert the IB rate enum to a multiple of the - * base rate of 2.5 Gbit/sec. For example, IBV_RATE_5_GBPS will be + * rdmav_rate_to_mult - Convert the IB rate enum to a multiple of the + * base rate of 2.5 Gbit/sec. For example, RDMAV_RATE_5_GBPS will be * converted to 2, since 5 Gbit/sec is 2 * 2.5 Gbit/sec. * @rate: rate to convert. */ -int ibv_rate_to_mult(enum ibv_rate rate) __attribute_const; +int rdmav_rate_to_mult(enum rdmav_rate rate) __attribute_const; /** - * mult_to_ibv_rate - Convert a multiple of 2.5 Gbit/sec to an IB rate enum. + * mult_to_rdmav_rate - Convert a multiple of 2.5 Gbit/sec to an IB rate enum. * @mult: multiple to convert. */ -enum ibv_rate mult_to_ibv_rate(int mult) __attribute_const; +enum rdmav_rate mult_to_rdmav_rate(int mult) __attribute_const; -struct ibv_ah_attr { - struct ibv_global_route grh; +struct rdmav_ah_attr { + struct rdmav_global_route grh; uint16_t dlid; uint8_t sl; uint8_t src_path_bits; @@ -344,29 +345,29 @@ struct ibv_ah_attr { uint8_t port_num; }; -enum ibv_srq_attr_mask { - IBV_SRQ_MAX_WR = 1 << 0, - IBV_SRQ_LIMIT = 1 << 1 +enum rdmav_srq_attr_mask { + RDMAV_SRQ_MAX_WR = 1 << 0, + RDMAV_SRQ_LIMIT = 1 << 1 }; -struct ibv_srq_attr { +struct rdmav_srq_attr { uint32_t max_wr; uint32_t max_sge; uint32_t srq_limit; }; -struct ibv_srq_init_attr { +struct rdmav_srq_init_attr { void *srq_context; - struct ibv_srq_attr attr; + struct rdmav_srq_attr attr; }; -enum ibv_qp_type { - IBV_QPT_RC = 2, - IBV_QPT_UC, - IBV_QPT_UD +enum rdmav_qp_type { + RDMAV_QPT_RC = 2, + RDMAV_QPT_UC, + RDMAV_QPT_UD }; -struct ibv_qp_cap { +struct rdmav_qp_cap { uint32_t max_send_wr; uint32_t max_recv_wr; uint32_t max_send_sge; @@ -374,69 +375,69 @@ struct ibv_qp_cap { uint32_t max_inline_data; }; -struct ibv_qp_init_attr { +struct rdmav_qp_init_attr { void *qp_context; - struct ibv_cq *send_cq; - struct ibv_cq *recv_cq; - struct ibv_srq *srq; - struct ibv_qp_cap cap; - enum ibv_qp_type qp_type; + struct rdmav_cq *send_cq; + struct rdmav_cq *recv_cq; + struct rdmav_srq *srq; + struct rdmav_qp_cap cap; + enum rdmav_qp_type qp_type; int sq_sig_all; }; -enum ibv_qp_attr_mask { - IBV_QP_STATE = 1 << 0, - IBV_QP_CUR_STATE = 1 << 1, - IBV_QP_EN_SQD_ASYNC_NOTIFY = 1 << 2, - IBV_QP_ACCESS_FLAGS = 1 << 3, - IBV_QP_PKEY_INDEX = 1 << 4, - IBV_QP_PORT = 1 << 5, - IBV_QP_QKEY = 1 << 6, - IBV_QP_AV = 1 << 7, - IBV_QP_PATH_MTU = 1 << 8, - IBV_QP_TIMEOUT = 1 << 9, - IBV_QP_RETRY_CNT = 1 << 10, - IBV_QP_RNR_RETRY = 1 << 11, - IBV_QP_RQ_PSN = 1 << 12, - IBV_QP_MAX_QP_RD_ATOMIC = 1 << 13, - IBV_QP_ALT_PATH = 1 << 14, - IBV_QP_MIN_RNR_TIMER = 1 << 15, - IBV_QP_SQ_PSN = 1 << 16, - IBV_QP_MAX_DEST_RD_ATOMIC = 1 << 17, - IBV_QP_PATH_MIG_STATE = 1 << 18, - IBV_QP_CAP = 1 << 19, - IBV_QP_DEST_QPN = 1 << 20 -}; - -enum ibv_qp_state { - IBV_QPS_RESET, - IBV_QPS_INIT, - IBV_QPS_RTR, - IBV_QPS_RTS, - IBV_QPS_SQD, - IBV_QPS_SQE, - IBV_QPS_ERR -}; - -enum ibv_mig_state { - IBV_MIG_MIGRATED, - IBV_MIG_REARM, - IBV_MIG_ARMED -}; - -struct ibv_qp_attr { - enum ibv_qp_state qp_state; - enum ibv_qp_state cur_qp_state; - enum ibv_mtu path_mtu; - enum ibv_mig_state path_mig_state; +enum rdmav_qp_attr_mask { + RDMAV_QP_STATE = 1 << 0, + RDMAV_QP_CUR_STATE = 1 << 1, + RDMAV_QP_EN_SQD_ASYNC_NOTIFY = 1 << 2, + RDMAV_QP_ACCESS_FLAGS = 1 
<< 3, + RDMAV_QP_PKEY_INDEX = 1 << 4, + RDMAV_QP_PORT = 1 << 5, + RDMAV_QP_QKEY = 1 << 6, + RDMAV_QP_AV = 1 << 7, + RDMAV_QP_PATH_MTU = 1 << 8, + RDMAV_QP_TIMEOUT = 1 << 9, + RDMAV_QP_RETRY_CNT = 1 << 10, + RDMAV_QP_RNR_RETRY = 1 << 11, + RDMAV_QP_RQ_PSN = 1 << 12, + RDMAV_QP_MAX_QP_RD_ATOMIC = 1 << 13, + RDMAV_QP_ALT_PATH = 1 << 14, + RDMAV_QP_MIN_RNR_TIMER = 1 << 15, + RDMAV_QP_SQ_PSN = 1 << 16, + RDMAV_QP_MAX_DEST_RD_ATOMIC = 1 << 17, + RDMAV_QP_PATH_MIG_STATE = 1 << 18, + RDMAV_QP_CAP = 1 << 19, + RDMAV_QP_DEST_QPN = 1 << 20 +}; + +enum rdmav_qp_state { + RDMAV_QPS_RESET, + RDMAV_QPS_INIT, + RDMAV_QPS_RTR, + RDMAV_QPS_RTS, + RDMAV_QPS_SQD, + RDMAV_QPS_SQE, + RDMAV_QPS_ERR +}; + +enum rdmav_mig_state { + RDMAV_MIG_MIGRATED, + RDMAV_MIG_REARM, + RDMAV_MIG_ARMED +}; + +struct rdmav_qp_attr { + enum rdmav_qp_state qp_state; + enum rdmav_qp_state cur_qp_state; + enum rdmav_mtu path_mtu; + enum rdmav_mig_state path_mig_state; uint32_t qkey; uint32_t rq_psn; uint32_t sq_psn; uint32_t dest_qp_num; int qp_access_flags; - struct ibv_qp_cap cap; - struct ibv_ah_attr ah_attr; - struct ibv_ah_attr alt_ah_attr; + struct rdmav_qp_cap cap; + struct rdmav_ah_attr ah_attr; + struct rdmav_ah_attr alt_ah_attr; uint16_t pkey_index; uint16_t alt_pkey_index; uint8_t en_sqd_async_notify; @@ -452,36 +453,36 @@ struct ibv_qp_attr { uint8_t alt_timeout; }; -enum ibv_wr_opcode { - IBV_WR_RDMA_WRITE, - IBV_WR_RDMA_WRITE_WITH_IMM, - IBV_WR_SEND, - IBV_WR_SEND_WITH_IMM, - IBV_WR_RDMA_READ, - IBV_WR_ATOMIC_CMP_AND_SWP, - IBV_WR_ATOMIC_FETCH_AND_ADD -}; - -enum ibv_send_flags { - IBV_SEND_FENCE = 1 << 0, - IBV_SEND_SIGNALED = 1 << 1, - IBV_SEND_SOLICITED = 1 << 2, - IBV_SEND_INLINE = 1 << 3 +enum rdmav_wr_opcode { + RDMAV_WR_RDMA_WRITE, + RDMAV_WR_RDMA_WRITE_WITH_IMM, + RDMAV_WR_SEND, + RDMAV_WR_SEND_WITH_IMM, + RDMAV_WR_RDMA_READ, + RDMAV_WR_ATOMIC_CMP_AND_SWP, + RDMAV_WR_ATOMIC_FETCH_AND_ADD +}; + +enum rdmav_send_flags { + RDMAV_SEND_FENCE = 1 << 0, + RDMAV_SEND_SIGNALED = 1 << 1, + RDMAV_SEND_SOLICITED = 1 << 2, + RDMAV_SEND_INLINE = 1 << 3 }; -struct ibv_sge { +struct rdmav_sge { uint64_t addr; uint32_t length; uint32_t lkey; }; -struct ibv_send_wr { - struct ibv_send_wr *next; +struct rdmav_send_wr { + struct rdmav_send_wr *next; uint64_t wr_id; - struct ibv_sge *sg_list; + struct rdmav_sge *sg_list; int num_sge; - enum ibv_wr_opcode opcode; - enum ibv_send_flags send_flags; + enum rdmav_wr_opcode opcode; + enum rdmav_send_flags send_flags; uint32_t imm_data; /* in network byte order */ union { struct { @@ -495,24 +496,24 @@ struct ibv_send_wr { uint32_t rkey; } atomic; struct { - struct ibv_ah *ah; + struct rdmav_ah *ah; uint32_t remote_qpn; uint32_t remote_qkey; } ud; } wr; }; -struct ibv_recv_wr { - struct ibv_recv_wr *next; +struct rdmav_recv_wr { + struct rdmav_recv_wr *next; uint64_t wr_id; - struct ibv_sge *sg_list; + struct rdmav_sge *sg_list; int num_sge; }; -struct ibv_srq { - struct ibv_context *context; +struct rdmav_srq { + struct rdmav_context *context; void *srq_context; - struct ibv_pd *pd; + struct rdmav_pd *pd; uint32_t handle; pthread_mutex_t mutex; @@ -520,29 +521,29 @@ struct ibv_srq { uint32_t events_completed; }; -struct ibv_qp { - struct ibv_context *context; +struct rdmav_qp { + struct rdmav_context *context; void *qp_context; - struct ibv_pd *pd; - struct ibv_cq *send_cq; - struct ibv_cq *recv_cq; - struct ibv_srq *srq; + struct rdmav_pd *pd; + struct rdmav_cq *send_cq; + struct rdmav_cq *recv_cq; + struct rdmav_srq *srq; uint32_t handle; uint32_t qp_num; - enum ibv_qp_state state; - 
enum ibv_qp_type qp_type; + enum rdmav_qp_state state; + enum rdmav_qp_type qp_type; pthread_mutex_t mutex; pthread_cond_t cond; uint32_t events_completed; }; -struct ibv_comp_channel { +struct rdmav_comp_channel { int fd; }; -struct ibv_cq { - struct ibv_context *context; +struct rdmav_cq { + struct rdmav_context *context; void *cq_context; uint32_t handle; int cqe; @@ -553,89 +554,89 @@ struct ibv_cq { uint32_t async_events_completed; }; -struct ibv_ah { - struct ibv_context *context; - struct ibv_pd *pd; +struct rdmav_ah { + struct rdmav_context *context; + struct rdmav_pd *pd; uint32_t handle; }; -struct ibv_device; -struct ibv_context; +struct rdmav_device; +struct rdmav_context; -struct ibv_device_ops { - struct ibv_context * (*alloc_context)(struct ibv_device *device, int cmd_fd); - void (*free_context)(struct ibv_context *context); +struct rdmav_device_ops { + struct rdmav_context * (*alloc_context)(struct rdmav_device *device, int cmd_fd); + void (*free_context)(struct rdmav_context *context); }; enum { - IBV_SYSFS_NAME_MAX = 64, - IBV_SYSFS_PATH_MAX = 256 + RDMAV_SYSFS_NAME_MAX = 64, + RDMAV_SYSFS_PATH_MAX = 256 }; -struct ibv_device { - struct ibv_driver *driver; - struct ibv_device_ops ops; +struct rdmav_device { + struct rdmav_driver *driver; + struct rdmav_device_ops ops; /* Name of underlying kernel IB device, eg "mthca0" */ - char name[IBV_SYSFS_NAME_MAX]; + char name[RDMAV_SYSFS_NAME_MAX]; /* Name of uverbs device, eg "uverbs0" */ - char dev_name[IBV_SYSFS_NAME_MAX]; + char dev_name[RDMAV_SYSFS_NAME_MAX]; /* Path to infiniband_verbs class device in sysfs */ - char dev_path[IBV_SYSFS_PATH_MAX]; + char dev_path[RDMAV_SYSFS_PATH_MAX]; /* Path to infiniband class device in sysfs */ - char ibdev_path[IBV_SYSFS_PATH_MAX]; + char ibdev_path[RDMAV_SYSFS_PATH_MAX]; }; -struct ibv_context_ops { - int (*query_device)(struct ibv_context *context, - struct ibv_device_attr *device_attr); - int (*query_port)(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr); - struct ibv_pd * (*alloc_pd)(struct ibv_context *context); - int (*dealloc_pd)(struct ibv_pd *pd); - struct ibv_mr * (*reg_mr)(struct ibv_pd *pd, void *addr, size_t length, - enum ibv_access_flags access); - int (*dereg_mr)(struct ibv_mr *mr); - struct ibv_cq * (*create_cq)(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, +struct rdmav_context_ops { + int (*query_device)(struct rdmav_context *context, + struct rdmav_device_attr *device_attr); + int (*query_port)(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr); + struct rdmav_pd * (*alloc_pd)(struct rdmav_context *context); + int (*dealloc_pd)(struct rdmav_pd *pd); + struct rdmav_mr * (*reg_mr)(struct rdmav_pd *pd, void *addr, size_t length, + enum rdmav_access_flags access); + int (*dereg_mr)(struct rdmav_mr *mr); + struct rdmav_cq * (*create_cq)(struct rdmav_context *context, int cqe, + struct rdmav_comp_channel *channel, int comp_vector); - int (*poll_cq)(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc); - int (*req_notify_cq)(struct ibv_cq *cq, int solicited_only); - void (*cq_event)(struct ibv_cq *cq); - int (*resize_cq)(struct ibv_cq *cq, int cqe); - int (*destroy_cq)(struct ibv_cq *cq); - struct ibv_srq * (*create_srq)(struct ibv_pd *pd, - struct ibv_srq_init_attr *srq_init_attr); - int (*modify_srq)(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask); - int (*query_srq)(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr); - int 
(*destroy_srq)(struct ibv_srq *srq); - int (*post_srq_recv)(struct ibv_srq *srq, - struct ibv_recv_wr *recv_wr, - struct ibv_recv_wr **bad_recv_wr); - struct ibv_qp * (*create_qp)(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); - int (*query_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *init_attr); - int (*modify_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask); - int (*destroy_qp)(struct ibv_qp *qp); - int (*post_send)(struct ibv_qp *qp, struct ibv_send_wr *wr, - struct ibv_send_wr **bad_wr); - int (*post_recv)(struct ibv_qp *qp, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr); - struct ibv_ah * (*create_ah)(struct ibv_pd *pd, struct ibv_ah_attr *attr); - int (*destroy_ah)(struct ibv_ah *ah); - int (*attach_mcast)(struct ibv_qp *qp, union ibv_gid *gid, + int (*poll_cq)(struct rdmav_cq *cq, int num_entries, struct rdmav_wc *wc); + int (*req_notify_cq)(struct rdmav_cq *cq, int solicited_only); + void (*cq_event)(struct rdmav_cq *cq); + int (*resize_cq)(struct rdmav_cq *cq, int cqe); + int (*destroy_cq)(struct rdmav_cq *cq); + struct rdmav_srq * (*create_srq)(struct rdmav_pd *pd, + struct rdmav_srq_init_attr *srq_init_attr); + int (*modify_srq)(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask); + int (*query_srq)(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr); + int (*destroy_srq)(struct rdmav_srq *srq); + int (*post_srq_recv)(struct rdmav_srq *srq, + struct rdmav_recv_wr *recv_wr, + struct rdmav_recv_wr **bad_recv_wr); + struct rdmav_qp * (*create_qp)(struct rdmav_pd *pd, struct rdmav_qp_init_attr *attr); + int (*query_qp)(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *init_attr); + int (*modify_qp)(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask); + int (*destroy_qp)(struct rdmav_qp *qp); + int (*post_send)(struct rdmav_qp *qp, struct rdmav_send_wr *wr, + struct rdmav_send_wr **bad_wr); + int (*post_recv)(struct rdmav_qp *qp, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr); + struct rdmav_ah * (*create_ah)(struct rdmav_pd *pd, struct rdmav_ah_attr *attr); + int (*destroy_ah)(struct rdmav_ah *ah); + int (*attach_mcast)(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); - int (*detach_mcast)(struct ibv_qp *qp, union ibv_gid *gid, + int (*detach_mcast)(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); }; -struct ibv_context { - struct ibv_device *device; - struct ibv_context_ops ops; +struct rdmav_context { + struct rdmav_device *device; + struct rdmav_context_ops ops; int cmd_fd; int async_fd; int num_comp_vectors; @@ -643,124 +644,124 @@ struct ibv_context { }; /** - * ibv_get_device_list - Get list of IB devices currently available + * rdmav_get_device_list - Get list of IB devices currently available * @num_devices: optional. if non-NULL, set to the number of devices * returned in the array. * * Return a NULL-terminated array of IB devices. The array can be - * released with ibv_free_device_list(). + * released with rdmav_free_device_list(). */ -struct ibv_device **ibv_get_device_list(int *num_devices); +struct rdmav_device **rdmav_get_device_list(int *num_devices); /** - * ibv_free_device_list - Free list from ibv_get_device_list() + * rdmav_free_device_list - Free list from rdmav_get_device_list() * - * Free an array of devices returned from ibv_get_device_list(). 
Once + * Free an array of devices returned from rdmav_get_device_list(). Once * the array is freed, pointers to devices that were not opened with - * ibv_open_device() are no longer valid. Client code must open all - * devices it intends to use before calling ibv_free_device_list(). + * rdmav_open_device() are no longer valid. Client code must open all + * devices it intends to use before calling rdmav_free_device_list(). */ -void ibv_free_device_list(struct ibv_device **list); +void rdmav_free_device_list(struct rdmav_device **list); /** - * ibv_get_device_name - Return kernel device name + * rdmav_get_device_name - Return kernel device name */ -const char *ibv_get_device_name(struct ibv_device *device); +const char *rdmav_get_device_name(struct rdmav_device *device); /** - * ibv_get_device_guid - Return device's node GUID + * rdmav_get_device_guid - Return device's node GUID */ -uint64_t ibv_get_device_guid(struct ibv_device *device); +uint64_t rdmav_get_device_guid(struct rdmav_device *device); /** - * ibv_open_device - Initialize device for use + * rdmav_open_device - Initialize device for use */ -struct ibv_context *ibv_open_device(struct ibv_device *device); +struct rdmav_context *rdmav_open_device(struct rdmav_device *device); /** - * ibv_close_device - Release device + * rdmav_close_device - Release device */ -int ibv_close_device(struct ibv_context *context); +int rdmav_close_device(struct rdmav_context *context); /** - * ibv_get_async_event - Get next async event + * rdmav_get_async_event - Get next async event * @event: Pointer to use to return async event * - * All async events returned by ibv_get_async_event() must eventually - * be acknowledged with ibv_ack_async_event(). + * All async events returned by rdmav_get_async_event() must eventually + * be acknowledged with rdmav_ack_async_event(). */ -int ibv_get_async_event(struct ibv_context *context, - struct ibv_async_event *event); +int rdmav_get_async_event(struct rdmav_context *context, + struct rdmav_async_event *event); /** - * ibv_ack_async_event - Acknowledge an async event + * rdmav_ack_async_event - Acknowledge an async event * @event: Event to be acknowledged. * - * All async events which are returned by ibv_get_async_event() must + * All async events which are returned by rdmav_get_async_event() must * be acknowledged. To avoid races, destroying an object (CQ, SRQ or * QP) will wait for all affiliated events to be acknowledged, so * there should be a one-to-one correspondence between acks and * successful gets. 
*/ -void ibv_ack_async_event(struct ibv_async_event *event); +void rdmav_ack_async_event(struct rdmav_async_event *event); /** - * ibv_query_device - Get device properties + * rdmav_query_device - Get device properties */ -int ibv_query_device(struct ibv_context *context, - struct ibv_device_attr *device_attr); +int rdmav_query_device(struct rdmav_context *context, + struct rdmav_device_attr *device_attr); /** - * ibv_query_port - Get port properties + * rdmav_query_port - Get port properties */ -int ibv_query_port(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr); +int rdmav_query_port(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr); /** - * ibv_query_gid - Get a GID table entry + * rdmav_query_gid - Get a GID table entry */ -int ibv_query_gid(struct ibv_context *context, uint8_t port_num, - int index, union ibv_gid *gid); +int rdmav_query_gid(struct rdmav_context *context, uint8_t port_num, + int index, union rdmav_gid *gid); /** - * ibv_query_pkey - Get a P_Key table entry + * rdmav_query_pkey - Get a P_Key table entry */ -int ibv_query_pkey(struct ibv_context *context, uint8_t port_num, +int rdmav_query_pkey(struct rdmav_context *context, uint8_t port_num, int index, uint16_t *pkey); /** - * ibv_alloc_pd - Allocate a protection domain + * rdmav_alloc_pd - Allocate a protection domain */ -struct ibv_pd *ibv_alloc_pd(struct ibv_context *context); +struct rdmav_pd *rdmav_alloc_pd(struct rdmav_context *context); /** - * ibv_dealloc_pd - Free a protection domain + * rdmav_dealloc_pd - Free a protection domain */ -int ibv_dealloc_pd(struct ibv_pd *pd); +int rdmav_dealloc_pd(struct rdmav_pd *pd); /** - * ibv_reg_mr - Register a memory region + * rdmav_reg_mr - Register a memory region */ -struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, - size_t length, enum ibv_access_flags access); +struct rdmav_mr *rdmav_reg_mr(struct rdmav_pd *pd, void *addr, + size_t length, enum rdmav_access_flags access); /** - * ibv_dereg_mr - Deregister a memory region + * rdmav_dereg_mr - Deregister a memory region */ -int ibv_dereg_mr(struct ibv_mr *mr); +int rdmav_dereg_mr(struct rdmav_mr *mr); /** - * ibv_create_comp_channel - Create a completion event channel + * rdmav_create_comp_channel - Create a completion event channel */ -struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context); +struct rdmav_comp_channel *rdmav_create_comp_channel(struct rdmav_context *context); /** - * ibv_destroy_comp_channel - Destroy a completion event channel + * rdmav_destroy_comp_channel - Destroy a completion event channel */ -int ibv_destroy_comp_channel(struct ibv_comp_channel *channel); +int rdmav_destroy_comp_channel(struct rdmav_comp_channel *channel); /** - * ibv_create_cq - Create a completion queue + * rdmav_create_cq - Create a completion queue * @context - Context CQ will be attached to * @cqe - Minimum number of entries required for CQ * @cq_context - Consumer-supplied context returned for completion events @@ -769,57 +770,57 @@ int ibv_destroy_comp_channel(struct ibv_ * @comp_vector - Completion vector used to signal completion events. * Must be >= 0 and < context->num_comp_vectors. */ -struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe, +struct rdmav_cq *rdmav_create_cq(struct rdmav_context *context, int cqe, void *cq_context, - struct ibv_comp_channel *channel, + struct rdmav_comp_channel *channel, int comp_vector); /** - * ibv_resize_cq - Modifies the capacity of the CQ. 
+ * rdmav_resize_cq - Modifies the capacity of the CQ. * @cq: The CQ to resize. * @cqe: The minimum size of the CQ. * * Users can examine the cq structure to determine the actual CQ size. */ -int ibv_resize_cq(struct ibv_cq *cq, int cqe); +int rdmav_resize_cq(struct rdmav_cq *cq, int cqe); /** - * ibv_destroy_cq - Destroy a completion queue + * rdmav_destroy_cq - Destroy a completion queue */ -int ibv_destroy_cq(struct ibv_cq *cq); +int rdmav_destroy_cq(struct rdmav_cq *cq); /** - * ibv_get_cq_event - Read next CQ event + * rdmav_get_cq_event - Read next CQ event * @channel: Channel to get next event from. * @cq: Used to return pointer to CQ. * @cq_context: Used to return consumer-supplied CQ context. * - * All completion events returned by ibv_get_cq_event() must - * eventually be acknowledged with ibv_ack_cq_events(). + * All completion events returned by rdmav_get_cq_event() must + * eventually be acknowledged with rdmav_ack_cq_events(). */ -int ibv_get_cq_event(struct ibv_comp_channel *channel, - struct ibv_cq **cq, void **cq_context); +int rdmav_get_cq_event(struct rdmav_comp_channel *channel, + struct rdmav_cq **cq, void **cq_context); /** - * ibv_ack_cq_events - Acknowledge CQ completion events + * rdmav_ack_cq_events - Acknowledge CQ completion events * @cq: CQ to acknowledge events for * @nevents: Number of events to acknowledge. * - * All completion events which are returned by ibv_get_cq_event() must - * be acknowledged. To avoid races, ibv_destroy_cq() will wait for + * All completion events which are returned by rdmav_get_cq_event() must + * be acknowledged. To avoid races, rdmav_destroy_cq() will wait for * all completion events to be acknowledged, so there should be a * one-to-one correspondence between acks and successful gets. An * application may accumulate multiple completion events and - * acknowledge them in a single call to ibv_ack_cq_events() by passing + * acknowledge them in a single call to rdmav_ack_cq_events() by passing * the number of events to ack in @nevents. */ -void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents); +void rdmav_ack_cq_events(struct rdmav_cq *cq, unsigned int nevents); /** - * ibv_poll_cq - Poll a CQ for work completions + * rdmav_poll_cq - Poll a CQ for work completions * @cq:the CQ being polled * @num_entries:maximum number of completions to return - * @wc:array of at least @num_entries of &struct ibv_wc where completions + * @wc:array of at least @num_entries of &struct rdmav_wc where completions * will be returned * * Poll a CQ for (possibly multiple) completions. If the return value @@ -828,13 +829,13 @@ void ibv_ack_cq_events(struct ibv_cq *cq * non-negative and strictly less than num_entries, then the CQ was * emptied. */ -static inline int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc) +static inline int rdmav_poll_cq(struct rdmav_cq *cq, int num_entries, struct rdmav_wc *wc) { return cq->context->ops.poll_cq(cq, num_entries, wc); } /** - * ibv_req_notify_cq - Request completion notification on a CQ. An + * rdmav_req_notify_cq - Request completion notification on a CQ. An * event will be added to the completion channel associated with the * CQ when an entry is added to the CQ. * @cq: The completion queue to request notification for. @@ -842,83 +843,83 @@ static inline int ibv_poll_cq(struct ibv * the next solicited CQ entry. If zero, any CQ entry, solicited or * not, will generate an event. 
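A matching sketch of the notify/get/ack/poll cycle described in the CQ comments above, again not part of the posted patch. It assumes ctx is an already opened rdmav_context and that the series is fully applied; cleanup on the early error paths is omitted for brevity.

static int drain_one_cq_event(struct rdmav_context *ctx)
{
	struct rdmav_comp_channel *chan;
	struct rdmav_cq *cq, *ev_cq;
	struct rdmav_wc wc[8];
	void *ev_ctx;
	int n;

	chan = rdmav_create_comp_channel(ctx);
	if (!chan)
		return -1;
	cq = rdmav_create_cq(ctx, 16, NULL, chan, 0);
	if (!cq)
		return -1;

	rdmav_req_notify_cq(cq, 0);			/* arm for any completion */

	if (rdmav_get_cq_event(chan, &ev_cq, &ev_ctx))	/* blocks until notified */
		return -1;
	rdmav_ack_cq_events(ev_cq, 1);			/* one ack per successful get */
	rdmav_req_notify_cq(ev_cq, 0);			/* re-arm before polling */

	while ((n = rdmav_poll_cq(ev_cq, 8, wc)) > 0)
		;					/* consume completions */

	rdmav_destroy_cq(cq);
	rdmav_destroy_comp_channel(chan);
	return n < 0 ? -1 : 0;
}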
*/ -static inline int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only) +static inline int rdmav_req_notify_cq(struct rdmav_cq *cq, int solicited_only) { return cq->context->ops.req_notify_cq(cq, solicited_only); } /** - * ibv_create_srq - Creates a SRQ associated with the specified protection + * rdmav_create_srq - Creates a SRQ associated with the specified protection * domain. * @pd: The protection domain associated with the SRQ. * @srq_init_attr: A list of initial attributes required to create the SRQ. * * srq_attr->max_wr and srq_attr->max_sge are read the determine the * requested size of the SRQ, and set to the actual values allocated - * on return. If ibv_create_srq() succeeds, then max_wr and max_sge + * on return. If rdmav_create_srq() succeeds, then max_wr and max_sge * will always be at least as large as the requested values. */ -struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, - struct ibv_srq_init_attr *srq_init_attr); +struct rdmav_srq *rdmav_create_srq(struct rdmav_pd *pd, + struct rdmav_srq_init_attr *srq_init_attr); /** - * ibv_modify_srq - Modifies the attributes for the specified SRQ. + * rdmav_modify_srq - Modifies the attributes for the specified SRQ. * @srq: The SRQ to modify. * @srq_attr: On input, specifies the SRQ attributes to modify. On output, * the current values of selected SRQ attributes are returned. * @srq_attr_mask: A bit-mask used to specify which attributes of the SRQ * are being modified. * - * The mask may contain IBV_SRQ_MAX_WR to resize the SRQ and/or - * IBV_SRQ_LIMIT to set the SRQ's limit and request notification when + * The mask may contain RDMAV_SRQ_MAX_WR to resize the SRQ and/or + * RDMAV_SRQ_LIMIT to set the SRQ's limit and request notification when * the number of receives queued drops below the limit. */ -int ibv_modify_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask); +int rdmav_modify_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask); /** - * ibv_query_srq - Returns the attribute list and current values for the + * rdmav_query_srq - Returns the attribute list and current values for the * specified SRQ. * @srq: The SRQ to query. * @srq_attr: The attributes of the specified SRQ. */ -int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr); +int rdmav_query_srq(struct rdmav_srq *srq, struct rdmav_srq_attr *srq_attr); /** - * ibv_destroy_srq - Destroys the specified SRQ. + * rdmav_destroy_srq - Destroys the specified SRQ. * @srq: The SRQ to destroy. */ -int ibv_destroy_srq(struct ibv_srq *srq); +int rdmav_destroy_srq(struct rdmav_srq *srq); /** - * ibv_post_srq_recv - Posts a list of work requests to the specified SRQ. + * rdmav_post_srq_recv - Posts a list of work requests to the specified SRQ. * @srq: The SRQ to post the work request on. * @recv_wr: A list of work requests to post on the receive queue. * @bad_recv_wr: On an immediate failure, this parameter will reference * the work request that failed to be posted on the QP. */ -static inline int ibv_post_srq_recv(struct ibv_srq *srq, - struct ibv_recv_wr *recv_wr, - struct ibv_recv_wr **bad_recv_wr) +static inline int rdmav_post_srq_recv(struct rdmav_srq *srq, + struct rdmav_recv_wr *recv_wr, + struct rdmav_recv_wr **bad_recv_wr) { return srq->context->ops.post_srq_recv(srq, recv_wr, bad_recv_wr); } /** - * ibv_create_qp - Create a queue pair. + * rdmav_create_qp - Create a queue pair. 
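Since the RESET to INIT to RTR to RTS sequence only shows up indirectly here, a small sketch of bringing a UD QP to RTS with the renamed verbs; the attribute/mask choices mirror ucma_init_ud_qp() in the cma.c changes later in this series. Not part of the posted patch; pkey_index, port_num and qkey are placeholders, and <string.h> is assumed for memset().

static int ud_qp_to_rts(struct rdmav_pd *pd, struct rdmav_cq *cq)
{
	struct rdmav_qp_init_attr init_attr;
	struct rdmav_qp_attr attr;
	struct rdmav_qp *qp;
	int ret;

	memset(&init_attr, 0, sizeof init_attr);
	init_attr.send_cq = cq;
	init_attr.recv_cq = cq;
	init_attr.cap.max_send_wr = init_attr.cap.max_recv_wr = 4;
	init_attr.cap.max_send_sge = init_attr.cap.max_recv_sge = 1;
	init_attr.qp_type = RDMAV_QPT_UD;

	qp = rdmav_create_qp(pd, &init_attr);
	if (!qp)
		return -1;

	memset(&attr, 0, sizeof attr);
	attr.qp_state = RDMAV_QPS_INIT;
	attr.pkey_index = 0;			/* placeholder */
	attr.port_num = 1;			/* placeholder */
	attr.qkey = 0x11111111;			/* placeholder */
	ret = rdmav_modify_qp(qp, &attr, RDMAV_QP_STATE | RDMAV_QP_PKEY_INDEX |
			      RDMAV_QP_PORT | RDMAV_QP_QKEY);

	attr.qp_state = RDMAV_QPS_RTR;
	if (!ret)
		ret = rdmav_modify_qp(qp, &attr, RDMAV_QP_STATE);

	attr.qp_state = RDMAV_QPS_RTS;
	attr.sq_psn = 0;
	if (!ret)
		ret = rdmav_modify_qp(qp, &attr, RDMAV_QP_STATE | RDMAV_QP_SQ_PSN);

	if (ret)
		rdmav_destroy_qp(qp);
	return ret ? -1 : 0;
}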
*/ -struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, - struct ibv_qp_init_attr *qp_init_attr); +struct rdmav_qp *rdmav_create_qp(struct rdmav_pd *pd, + struct rdmav_qp_init_attr *qp_init_attr); /** - * ibv_modify_qp - Modify a queue pair. + * rdmav_modify_qp - Modify a queue pair. */ -int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask); +int rdmav_modify_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask); /** - * ibv_query_qp - Returns the attribute list and current values for the + * rdmav_query_qp - Returns the attribute list and current values for the * specified QP. * @qp: The QP to query. * @attr: The attributes of the specified QP. @@ -928,40 +929,40 @@ int ibv_modify_qp(struct ibv_qp *qp, str * The qp_attr_mask may be used to limit the query to gathering only the * selected attributes. */ -int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *init_attr); +int rdmav_query_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *init_attr); /** - * ibv_destroy_qp - Destroy a queue pair. + * rdmav_destroy_qp - Destroy a queue pair. */ -int ibv_destroy_qp(struct ibv_qp *qp); +int rdmav_destroy_qp(struct rdmav_qp *qp); /** - * ibv_post_send - Post a list of work requests to a send queue. + * rdmav_post_send - Post a list of work requests to a send queue. */ -static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, - struct ibv_send_wr **bad_wr) +static inline int rdmav_post_send(struct rdmav_qp *qp, struct rdmav_send_wr *wr, + struct rdmav_send_wr **bad_wr) { return qp->context->ops.post_send(qp, wr, bad_wr); } /** - * ibv_post_recv - Post a list of work requests to a receive queue. + * rdmav_post_recv - Post a list of work requests to a receive queue. */ -static inline int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr) +static inline int rdmav_post_recv(struct rdmav_qp *qp, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr) { return qp->context->ops.post_recv(qp, wr, bad_wr); } /** - * ibv_create_ah - Create an address handle. + * rdmav_create_ah - Create an address handle. */ -struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); +struct rdmav_ah *rdmav_create_ah(struct rdmav_pd *pd, struct rdmav_ah_attr *attr); /** - * ibv_init_ah_from_wc - Initializes address handle attributes from a + * rdmav_init_ah_from_wc - Initializes address handle attributes from a * work completion. * @context: Device context on which the received message arrived. * @port_num: Port on which the received message arrived. @@ -971,12 +972,12 @@ struct ibv_ah *ibv_create_ah(struct ibv_ * @ah_attr: Returned attributes that can be used when creating an address * handle for replying to the message. */ -int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, - struct ibv_wc *wc, struct ibv_grh *grh, - struct ibv_ah_attr *ah_attr); +int rdmav_init_ah_from_wc(struct rdmav_context *context, uint8_t port_num, + struct rdmav_wc *wc, struct rdmav_grh *grh, + struct rdmav_ah_attr *ah_attr); /** - * ibv_create_ah_from_wc - Creates an address handle associated with the + * rdmav_create_ah_from_wc - Creates an address handle associated with the * sender of the specified work completion. * @pd: The protection domain associated with the address handle. 
* @wc: Work completion information associated with a received message. @@ -987,16 +988,16 @@ int ibv_init_ah_from_wc(struct ibv_conte * The address handle is used to reference a local or global destination * in all UD QP post sends. */ -struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, - struct ibv_grh *grh, uint8_t port_num); +struct rdmav_ah *rdmav_create_ah_from_wc(struct rdmav_pd *pd, struct rdmav_wc *wc, + struct rdmav_grh *grh, uint8_t port_num); /** - * ibv_destroy_ah - Destroy an address handle. + * rdmav_destroy_ah - Destroy an address handle. */ -int ibv_destroy_ah(struct ibv_ah *ah); +int rdmav_destroy_ah(struct rdmav_ah *ah); /** - * ibv_attach_mcast - Attaches the specified QP to a multicast group. + * rdmav_attach_mcast - Attaches the specified QP to a multicast group. * @qp: QP to attach to the multicast group. The QP must be a UD QP. * @gid: Multicast group GID. * @lid: Multicast group LID in host byte order. @@ -1006,18 +1007,18 @@ int ibv_destroy_ah(struct ibv_ah *ah); * the fabric appropriately. The port associated with the specified * QP must also be a member of the multicast group. */ -int ibv_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +int rdmav_attach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); /** - * ibv_detach_mcast - Detaches the specified QP from a multicast group. + * rdmav_detach_mcast - Detaches the specified QP from a multicast group. * @qp: QP to detach from the multicast group. * @gid: Multicast group GID. * @lid: Multicast group LID in host byte order. */ -int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +int rdmav_detach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid); END_C_DECLS # undef __attribute_const -#endif /* INFINIBAND_VERBS_H */ +#endif /* RDMAV_VERBS_H */ From krkumar2 at in.ibm.com Thu Aug 3 01:37:57 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 14:07:57 +0530 Subject: [openib-general] [PATCH v3 4/6] librdmacm include file changes. In-Reply-To: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803083757.6346.74417.sendpatchset@K50wks273950wss.in.ibm.com> Convert librdmacm include files to use the new libibverbs API. Signed-off-by: Krishna Kumar diff -ruNp ORG/librdmacm/include/rdma/rdma_cma.h NEW/librdmacm/include/rdma/rdma_cma.h --- ORG/librdmacm/include/rdma/rdma_cma.h 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/include/rdma/rdma_cma.h 2006-08-03 17:22:26.000000000 -0700 @@ -68,8 +68,8 @@ enum { }; struct ib_addr { - union ibv_gid sgid; - union ibv_gid dgid; + union rdmav_gid sgid; + union rdmav_gid dgid; uint16_t pkey; }; @@ -83,7 +83,7 @@ struct rdma_addr { struct rdma_route { struct rdma_addr addr; - struct ibv_sa_path_rec *path_rec; + struct rdmav_sa_path_rec *path_rec; int num_paths; }; @@ -92,10 +92,10 @@ struct rdma_event_channel { }; struct rdma_cm_id { - struct ibv_context *verbs; + struct rdmav_context *verbs; struct rdma_event_channel *channel; void *context; - struct ibv_qp *qp; + struct rdmav_qp *qp; struct rdma_route route; enum rdma_port_space ps; uint8_t port_num; @@ -191,8 +191,8 @@ int rdma_resolve_route(struct rdma_cm_id * QPs allocated to an rdma_cm_id will automatically be transitioned by the CMA * through their states. 
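For the librdmacm side, a short sketch of how a consumer would set up a QP on an rdma_cm_id with these renamed types; it follows the same steps as init_node() in the cmatose.c changes further down. Not part of the posted patch; queue sizes are placeholders, <errno.h> and <string.h> are assumed, and error unwinding is omitted.

static int setup_qp(struct rdma_cm_id *id)
{
	struct rdmav_qp_init_attr attr;
	struct rdmav_pd *pd;
	struct rdmav_cq *cq;

	pd = rdmav_alloc_pd(id->verbs);
	cq = rdmav_create_cq(id->verbs, 8, NULL, NULL, 0);
	if (!pd || !cq)
		return -ENOMEM;

	memset(&attr, 0, sizeof attr);
	attr.send_cq = attr.recv_cq = cq;
	attr.cap.max_send_wr = attr.cap.max_recv_wr = 4;
	attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
	attr.qp_type = RDMAV_QPT_RC;
	attr.sq_sig_all = 1;

	return rdma_create_qp(id, pd, &attr);	/* id->qp is set on success */
}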
*/ -int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, - struct ibv_qp_init_attr *qp_init_attr); +int rdma_create_qp(struct rdma_cm_id *id, struct rdmav_pd *pd, + struct rdmav_qp_init_attr *qp_init_attr); /** * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA @@ -214,7 +214,7 @@ struct rdma_conn_param { /* Fields below ignored if a QP is created on the rdma_cm_id. */ uint8_t srq; uint32_t qp_num; - enum ibv_qp_type qp_type; + enum rdmav_qp_type qp_type; }; /** @@ -341,11 +341,11 @@ static inline uint16_t rdma_get_dst_port * across multiple rdma_cm_id's. * The array must be released by calling rdma_free_devices(). */ -struct ibv_context **rdma_get_devices(int *num_devices); +struct rdmav_context **rdma_get_devices(int *num_devices); /** * rdma_free_devices - Frees the list of devices returned by rdma_get_devices(). */ -void rdma_free_devices(struct ibv_context **list); +void rdma_free_devices(struct rdmav_context **list); #endif /* RDMA_CMA_H */ diff -ruNp ORG/librdmacm/include/rdma/rdma_cma_abi.h NEW/librdmacm/include/rdma/rdma_cma_abi.h --- ORG/librdmacm/include/rdma/rdma_cma_abi.h 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/include/rdma/rdma_cma_abi.h 2006-08-03 17:22:32.000000000 -0700 @@ -123,7 +123,7 @@ struct ucma_abi_query_route { struct ucma_abi_query_route_resp { __u64 node_guid; - struct ibv_kern_path_rec ib_route[2]; + struct rdmav_kern_path_rec ib_route[2]; struct sockaddr_in6 src_addr; struct sockaddr_in6 dst_addr; __u32 num_paths; @@ -194,7 +194,7 @@ struct ucma_abi_leave_mcast { struct ucma_abi_dst_attr_resp { __u32 remote_qpn; __u32 remote_qkey; - struct ibv_kern_ah_attr ah_attr; + struct rdmav_kern_ah_attr ah_attr; }; struct ucma_abi_get_dst_attr { diff -ruNp ORG/librdmacm/include/rdma/rdma_cma_ib.h NEW/librdmacm/include/rdma/rdma_cma_ib.h --- ORG/librdmacm/include/rdma/rdma_cma_ib.h 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/include/rdma/rdma_cma_ib.h 2006-08-03 17:22:39.000000000 -0700 @@ -34,7 +34,7 @@ /* IB specific option names for get/set. */ enum { - IB_PATH_OPTIONS = 1, /* struct ibv_kern_path_rec */ + IB_PATH_OPTIONS = 1, /* struct rdmav_kern_path_rec */ IB_CM_REQ_OPTIONS = 2 /* struct ib_cm_req_opt */ }; @@ -56,7 +56,7 @@ struct ib_cm_req_opt { * Users must have called rdma_connect() to resolve the destination information. */ int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, - struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, + struct rdmav_ah_attr *ah_attr, uint32_t *remote_qpn, uint32_t *remote_qkey); #endif /* RDMA_CMA_IB_H */ From krkumar2 at in.ibm.com Thu Aug 3 01:38:05 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 14:08:05 +0530 Subject: [openib-general] [PATCH v3 5/6] librdmacm source file changes. In-Reply-To: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803083805.6346.23205.sendpatchset@K50wks273950wss.in.ibm.com> Convert librdmacm source files to use the new libibverbs API. Signed-off-by: Krishna Kumar diff -ruNp ORG/librdmacm/configure.in NEW/librdmacm/configure.in --- ORG/librdmacm/configure.in 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/configure.in 2006-08-03 00:02:57.000000000 -0700 @@ -25,8 +25,8 @@ AC_CHECK_SIZEOF(long) dnl Checks for libraries if test "$disable_libcheck" != "yes" then -AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], - AC_MSG_ERROR([ibv_get_device_list() not found. 
librdmacm requires libibverbs.])) +AC_CHECK_LIB(ibverbs, rdmav_get_device_list, [], + AC_MSG_ERROR([rdmav_get_device_list() not found. librdmacm requires libibverbs.])) fi dnl Checks for header files. diff -ruNp ORG/librdmacm/src/cma.c NEW/librdmacm/src/cma.c --- ORG/librdmacm/src/cma.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/src/cma.c 2006-08-03 17:23:08.000000000 -0700 @@ -103,7 +103,7 @@ do { } while (0) struct cma_device { - struct ibv_context *verbs; + struct rdmav_context *verbs; uint64_t guid; int port_cnt; }; @@ -130,7 +130,7 @@ static void ucma_cleanup(void) { if (cma_dev_cnt) { while (cma_dev_cnt) - ibv_close_device(cma_dev_array[--cma_dev_cnt].verbs); + rdmav_close_device(cma_dev_array[--cma_dev_cnt].verbs); free(cma_dev_array); cma_dev_cnt = 0; @@ -141,7 +141,7 @@ static int check_abi_version(void) { char value[8]; - if (ibv_read_sysfs_file(ibv_get_sysfs_path(), + if (rdmav_read_sysfs_file(rdmav_get_sysfs_path(), "class/misc/rdma_cm/abi_version", value, sizeof value) < 0) { /* @@ -167,9 +167,9 @@ static int check_abi_version(void) static int ucma_init(void) { - struct ibv_device **dev_list = NULL; + struct rdmav_device **dev_list = NULL; struct cma_device *cma_dev; - struct ibv_device_attr attr; + struct rdmav_device_attr attr; int i, ret; pthread_mutex_lock(&mut); @@ -180,7 +180,7 @@ static int ucma_init(void) if (ret) goto err; - dev_list = ibv_get_device_list(&cma_dev_cnt); + dev_list = rdmav_get_device_list(&cma_dev_cnt); if (!dev_list) { printf("CMA: unable to get RDMA device list\n"); ret = -ENODEV; @@ -196,15 +196,15 @@ static int ucma_init(void) for (i = 0; dev_list[i]; ++i) { cma_dev = &cma_dev_array[i]; - cma_dev->guid = ibv_get_device_guid(dev_list[i]); - cma_dev->verbs = ibv_open_device(dev_list[i]); + cma_dev->guid = rdmav_get_device_guid(dev_list[i]); + cma_dev->verbs = rdmav_open_device(dev_list[i]); if (!cma_dev->verbs) { printf("CMA: unable to open RDMA device\n"); ret = -ENODEV; goto err; } - ret = ibv_query_device(cma_dev->verbs, &attr); + ret = rdmav_query_device(cma_dev->verbs, &attr); if (ret) { printf("CMA: unable to query RDMA device\n"); goto err; @@ -219,13 +219,13 @@ err: ucma_cleanup(); pthread_mutex_unlock(&mut); if (dev_list) - ibv_free_device_list(dev_list); + rdmav_free_device_list(dev_list); return ret; } -struct ibv_context **rdma_get_devices(int *num_devices) +struct rdmav_context **rdma_get_devices(int *num_devices) { - struct ibv_context **devs = NULL; + struct rdmav_context **devs = NULL; int i; if (!cma_dev_cnt && ucma_init()) @@ -244,7 +244,7 @@ out: return devs; } -void rdma_free_devices(struct ibv_context **list) +void rdma_free_devices(struct rdmav_context **list) { free(list); } @@ -479,7 +479,7 @@ static int ucma_query_route(struct rdma_ id->route.num_paths = resp->num_paths; for (i = 0; i < resp->num_paths; i++) - ibv_copy_path_rec_from_kern(&id->route.path_rec[i], + rdmav_copy_path_rec_from_kern(&id->route.path_rec[i], &resp->ib_route[i]); } @@ -578,11 +578,11 @@ int rdma_resolve_route(struct rdma_cm_id return 0; } -static int rdma_init_qp_attr(struct rdma_cm_id *id, struct ibv_qp_attr *qp_attr, +static int rdma_init_qp_attr(struct rdma_cm_id *id, struct rdmav_qp_attr *qp_attr, int *qp_attr_mask) { struct ucma_abi_init_qp_attr *cmd; - struct ibv_kern_qp_attr *resp; + struct rdmav_kern_qp_attr *resp; struct cma_id_private *id_priv; void *msg; int ret, size; @@ -596,59 +596,59 @@ static int rdma_init_qp_attr(struct rdma if (ret != size) return (ret > 0) ? 
-ENODATA : ret; - ibv_copy_qp_attr_from_kern(qp_attr, resp); + rdmav_copy_qp_attr_from_kern(qp_attr, resp); *qp_attr_mask = resp->qp_attr_mask; return 0; } static int ucma_modify_qp_rtr(struct rdma_cm_id *id) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; int qp_attr_mask, ret; if (!id->qp) return -EINVAL; /* Need to update QP attributes from default values. */ - qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.qp_state = RDMAV_QPS_INIT; ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - ret = ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = rdmav_modify_qp(id->qp, &qp_attr, qp_attr_mask); if (ret) return ret; - qp_attr.qp_state = IBV_QPS_RTR; + qp_attr.qp_state = RDMAV_QPS_RTR; ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); + return rdmav_modify_qp(id->qp, &qp_attr, qp_attr_mask); } static int ucma_modify_qp_rts(struct rdma_cm_id *id) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; int qp_attr_mask, ret; - qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.qp_state = RDMAV_QPS_RTS; ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); + return rdmav_modify_qp(id->qp, &qp_attr, qp_attr_mask); } static int ucma_modify_qp_err(struct rdma_cm_id *id) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; if (!id->qp) return 0; - qp_attr.qp_state = IBV_QPS_ERR; - return ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE); + qp_attr.qp_state = RDMAV_QPS_ERR; + return rdmav_modify_qp(id->qp, &qp_attr, RDMAV_QP_STATE); } static int ucma_find_pkey(struct cma_device *cma_dev, uint8_t port_num, @@ -658,7 +658,7 @@ static int ucma_find_pkey(struct cma_dev uint16_t chk_pkey; for (i = 0, ret = 0; !ret; i++) { - ret = ibv_query_pkey(cma_dev->verbs, port_num, i, &chk_pkey); + ret = rdmav_query_pkey(cma_dev->verbs, port_num, i, &chk_pkey); if (!ret && pkey == chk_pkey) { *pkey_index = (uint16_t) i; return 0; @@ -668,9 +668,9 @@ static int ucma_find_pkey(struct cma_dev return -EINVAL; } -static int ucma_init_ib_qp(struct cma_id_private *id_priv, struct ibv_qp *qp) +static int ucma_init_ib_qp(struct cma_id_private *id_priv, struct rdmav_qp *qp) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; struct ib_addr *ibaddr; int ret; @@ -681,15 +681,15 @@ static int ucma_init_ib_qp(struct cma_id return ret; qp_attr.port_num = id_priv->id.port_num; - qp_attr.qp_state = IBV_QPS_INIT; - qp_attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE; - return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_ACCESS_FLAGS | - IBV_QP_PKEY_INDEX | IBV_QP_PORT); + qp_attr.qp_state = RDMAV_QPS_INIT; + qp_attr.qp_access_flags = RDMAV_ACCESS_LOCAL_WRITE; + return rdmav_modify_qp(qp, &qp_attr, RDMAV_QP_STATE | RDMAV_QP_ACCESS_FLAGS | + RDMAV_QP_PKEY_INDEX | RDMAV_QP_PORT); } -static int ucma_init_ud_qp(struct cma_id_private *id_priv, struct ibv_qp *qp) +static int ucma_init_ud_qp(struct cma_id_private *id_priv, struct rdmav_qp *qp) { - struct ibv_qp_attr qp_attr; + struct rdmav_qp_attr qp_attr; struct ib_addr *ibaddr; int ret; @@ -700,35 +700,35 @@ static int ucma_init_ud_qp(struct cma_id return ret; qp_attr.port_num = id_priv->id.port_num; - qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.qp_state = RDMAV_QPS_INIT; qp_attr.qkey = ntohs(rdma_get_src_port(&id_priv->id)); - ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | - IBV_QP_PORT | IBV_QP_QKEY); + ret = rdmav_modify_qp(qp, &qp_attr, RDMAV_QP_STATE | 
RDMAV_QP_PKEY_INDEX | + RDMAV_QP_PORT | RDMAV_QP_QKEY); if (ret) return ret; - qp_attr.qp_state = IBV_QPS_RTR; - ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE); + qp_attr.qp_state = RDMAV_QPS_RTR; + ret = rdmav_modify_qp(qp, &qp_attr, RDMAV_QP_STATE); if (ret) return ret; - qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.qp_state = RDMAV_QPS_RTS; qp_attr.sq_psn = 0; - return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_SQ_PSN); + return rdmav_modify_qp(qp, &qp_attr, RDMAV_QP_STATE | RDMAV_QP_SQ_PSN); } -int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, - struct ibv_qp_init_attr *qp_init_attr) +int rdma_create_qp(struct rdma_cm_id *id, struct rdmav_pd *pd, + struct rdmav_qp_init_attr *qp_init_attr) { struct cma_id_private *id_priv; - struct ibv_qp *qp; + struct rdmav_qp *qp; int ret; id_priv = container_of(id, struct cma_id_private, id); if (id->verbs != pd->context) return -EINVAL; - qp = ibv_create_qp(pd, qp_init_attr); + qp = rdmav_create_qp(pd, qp_init_attr); if (!qp) return -ENOMEM; @@ -742,19 +742,19 @@ int rdma_create_qp(struct rdma_cm_id *id id->qp = qp; return 0; err: - ibv_destroy_qp(qp); + rdmav_destroy_qp(qp); return ret; } void rdma_destroy_qp(struct rdma_cm_id *id) { - ibv_destroy_qp(id->qp); + rdmav_destroy_qp(id->qp); } static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, struct rdma_conn_param *src, uint32_t qp_num, - enum ibv_qp_type qp_type, uint8_t srq) + enum rdmav_qp_type qp_type, uint8_t srq) { dst->qp_num = qp_num; dst->qp_type = qp_type; @@ -934,7 +934,7 @@ int rdma_leave_multicast(struct rdma_cm_ struct cma_id_private *id_priv; void *msg; int ret, size, addrlen; - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; uint32_t qp_info; addrlen = ucma_addrlen(addr); @@ -951,7 +951,7 @@ int rdma_leave_multicast(struct rdma_cm_ if (ret) goto out; - ret = ibv_detach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); + ret = rdmav_detach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); if (ret) goto out; } @@ -1075,7 +1075,7 @@ static void ucma_process_mcast(struct rd { struct ucma_abi_join_mcast kmc_data; struct rdma_multicast_data *mc_data; - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; uint32_t qp_info; kmc_data = *(struct ucma_abi_join_mcast *) evt->private_data; @@ -1093,7 +1093,7 @@ static void ucma_process_mcast(struct rd if (evt->status) goto err; - evt->status = ibv_attach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); + evt->status = rdmav_attach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); if (evt->status) goto err; return; @@ -1243,7 +1243,7 @@ int rdma_set_option(struct rdma_cm_id *i } int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, - struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, + struct rdmav_ah_attr *ah_attr, uint32_t *remote_qpn, uint32_t *remote_qkey) { struct ucma_abi_dst_attr_resp *resp; @@ -1265,7 +1265,7 @@ int rdma_get_dst_attr(struct rdma_cm_id if (ret != size) return (ret > 0) ? -ENODATA : ret; - ibv_copy_ah_attr_from_kern(ah_attr, &resp->ah_attr); + rdmav_copy_ah_attr_from_kern(ah_attr, &resp->ah_attr); *remote_qpn = resp->remote_qpn; *remote_qkey = resp->remote_qkey; return 0; From krkumar2 at in.ibm.com Thu Aug 3 01:38:12 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 14:08:12 +0530 Subject: [openib-general] [PATCH v3 6/6] librdmacm examples changes. 
In-Reply-To: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803083812.6346.57762.sendpatchset@K50wks273950wss.in.ibm.com> Convert librdmacm examples to use the new API. Signed-off-by: Krishna Kumar diff -ruNp ORG/librdmacm/examples/cmatose.c NEW/librdmacm/examples/cmatose.c --- ORG/librdmacm/examples/cmatose.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/examples/cmatose.c 2006-08-03 17:32:36.000000000 -0700 @@ -62,9 +62,9 @@ struct cmatest_node { int id; struct rdma_cm_id *cma_id; int connected; - struct ibv_pd *pd; - struct ibv_cq *cq; - struct ibv_mr *mr; + struct rdmav_pd *pd; + struct rdmav_cq *cq; + struct rdmav_mr *mr; void *mem; }; @@ -100,8 +100,8 @@ static int create_message(struct cmatest printf("failed message allocation\n"); return -1; } - node->mr = ibv_reg_mr(node->pd, node->mem, message_size, - IBV_ACCESS_LOCAL_WRITE); + node->mr = rdmav_reg_mr(node->pd, node->mem, message_size, + RDMAV_ACCESS_LOCAL_WRITE); if (!node->mr) { printf("failed to reg MR\n"); goto err; @@ -114,10 +114,10 @@ err: static int init_node(struct cmatest_node *node) { - struct ibv_qp_init_attr init_qp_attr; + struct rdmav_qp_init_attr init_qp_attr; int cqe, ret; - node->pd = ibv_alloc_pd(node->cma_id->verbs); + node->pd = rdmav_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; printf("cmatose: unable to allocate PD\n"); @@ -125,7 +125,7 @@ static int init_node(struct cmatest_node } cqe = message_count ? message_count * 2 : 2; - node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + node->cq = rdmav_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; printf("cmatose: unable to create CQ\n"); @@ -139,7 +139,7 @@ static int init_node(struct cmatest_node init_qp_attr.cap.max_recv_sge = 1; init_qp_attr.qp_context = node; init_qp_attr.sq_sig_all = 1; - init_qp_attr.qp_type = IBV_QPT_RC; + init_qp_attr.qp_type = RDMAV_QPT_RC; init_qp_attr.send_cq = node->cq; init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); @@ -159,8 +159,8 @@ out: static int post_recvs(struct cmatest_node *node) { - struct ibv_recv_wr recv_wr, *recv_failure; - struct ibv_sge sge; + struct rdmav_recv_wr recv_wr, *recv_failure; + struct rdmav_sge sge; int i, ret = 0; if (!message_count) @@ -176,7 +176,7 @@ static int post_recvs(struct cmatest_nod sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++ ) { - ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + ret = rdmav_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); if (ret) { printf("failed to post receives: %d\n", ret); break; @@ -187,8 +187,8 @@ static int post_recvs(struct cmatest_nod static int post_sends(struct cmatest_node *node) { - struct ibv_send_wr send_wr, *bad_send_wr; - struct ibv_sge sge; + struct rdmav_send_wr send_wr, *bad_send_wr; + struct rdmav_sge sge; int i, ret = 0; if (!node->connected || !message_count) @@ -197,7 +197,7 @@ static int post_sends(struct cmatest_nod send_wr.next = NULL; send_wr.sg_list = &sge; send_wr.num_sge = 1; - send_wr.opcode = IBV_WR_SEND; + send_wr.opcode = RDMAV_WR_SEND; send_wr.send_flags = 0; send_wr.wr_id = (unsigned long)node; @@ -206,7 +206,7 @@ static int post_sends(struct cmatest_nod sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++) { - ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); + ret = rdmav_post_send(node->cma_id->qp, &send_wr, 
&bad_send_wr); if (ret) printf("failed to post sends: %d\n", ret); } @@ -350,15 +350,15 @@ static void destroy_node(struct cmatest_ rdma_destroy_qp(node->cma_id); if (node->cq) - ibv_destroy_cq(node->cq); + rdmav_destroy_cq(node->cq); if (node->mem) { - ibv_dereg_mr(node->mr); + rdmav_dereg_mr(node->mr); free(node->mem); } if (node->pd) - ibv_dealloc_pd(node->pd); + rdmav_dealloc_pd(node->pd); /* Destroy the RDMA ID after all device resources */ rdma_destroy_id(node->cma_id); @@ -404,7 +404,7 @@ static void destroy_nodes(void) static int poll_cqs(void) { - struct ibv_wc wc[8]; + struct rdmav_wc wc[8]; int done, i, ret; for (i = 0; i < connections; i++) { @@ -412,7 +412,7 @@ static int poll_cqs(void) continue; for (done = 0; done < message_count; done += ret) { - ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + ret = rdmav_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { printf("cmatose: failed polling CQ: %d\n", ret); return ret; diff -ruNp ORG/librdmacm/examples/mckey.c NEW/librdmacm/examples/mckey.c --- ORG/librdmacm/examples/mckey.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/examples/mckey.c 2006-08-03 17:32:41.000000000 -0700 @@ -50,10 +50,10 @@ struct cmatest_node { int id; struct rdma_cm_id *cma_id; int connected; - struct ibv_pd *pd; - struct ibv_cq *cq; - struct ibv_mr *mr; - struct ibv_ah *ah; + struct rdmav_pd *pd; + struct rdmav_cq *cq; + struct rdmav_mr *mr; + struct rdmav_ah *ah; uint32_t remote_qpn; uint32_t remote_qkey; void *mem; @@ -85,14 +85,14 @@ static int create_message(struct cmatest if (!message_count) return 0; - node->mem = malloc(message_size + sizeof(struct ibv_grh)); + node->mem = malloc(message_size + sizeof(struct rdmav_grh)); if (!node->mem) { printf("failed message allocation\n"); return -1; } - node->mr = ibv_reg_mr(node->pd, node->mem, - message_size + sizeof(struct ibv_grh), - IBV_ACCESS_LOCAL_WRITE); + node->mr = rdmav_reg_mr(node->pd, node->mem, + message_size + sizeof(struct rdmav_grh), + RDMAV_ACCESS_LOCAL_WRITE); if (!node->mr) { printf("failed to reg MR\n"); goto err; @@ -105,10 +105,10 @@ err: static int init_node(struct cmatest_node *node) { - struct ibv_qp_init_attr init_qp_attr; + struct rdmav_qp_init_attr init_qp_attr; int cqe, ret; - node->pd = ibv_alloc_pd(node->cma_id->verbs); + node->pd = rdmav_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; printf("mckey: unable to allocate PD\n"); @@ -116,7 +116,7 @@ static int init_node(struct cmatest_node } cqe = message_count ? 
message_count * 2 : 2; - node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + node->cq = rdmav_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; printf("mckey: unable to create CQ\n"); @@ -130,7 +130,7 @@ static int init_node(struct cmatest_node init_qp_attr.cap.max_recv_sge = 1; init_qp_attr.qp_context = node; init_qp_attr.sq_sig_all = 0; - init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.qp_type = RDMAV_QPT_UD; init_qp_attr.send_cq = node->cq; init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); @@ -150,8 +150,8 @@ out: static int post_recvs(struct cmatest_node *node) { - struct ibv_recv_wr recv_wr, *recv_failure; - struct ibv_sge sge; + struct rdmav_recv_wr recv_wr, *recv_failure; + struct rdmav_sge sge; int i, ret = 0; if (!message_count) @@ -162,12 +162,12 @@ static int post_recvs(struct cmatest_nod recv_wr.num_sge = 1; recv_wr.wr_id = (uintptr_t) node; - sge.length = message_size + sizeof(struct ibv_grh); + sge.length = message_size + sizeof(struct rdmav_grh); sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++ ) { - ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + ret = rdmav_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); if (ret) { printf("failed to post receives: %d\n", ret); break; @@ -178,8 +178,8 @@ static int post_recvs(struct cmatest_nod static int post_sends(struct cmatest_node *node, int signal_flag) { - struct ibv_send_wr send_wr, *bad_send_wr; - struct ibv_sge sge; + struct rdmav_send_wr send_wr, *bad_send_wr; + struct rdmav_sge sge; int i, ret = 0; if (!node->connected || !message_count) @@ -188,8 +188,8 @@ static int post_sends(struct cmatest_nod send_wr.next = NULL; send_wr.sg_list = &sge; send_wr.num_sge = 1; - send_wr.opcode = IBV_WR_SEND_WITH_IMM; - send_wr.send_flags = IBV_SEND_INLINE | signal_flag; + send_wr.opcode = RDMAV_WR_SEND_WITH_IMM; + send_wr.send_flags = RDMAV_SEND_INLINE | signal_flag; send_wr.wr_id = (unsigned long)node; send_wr.imm_data = htonl(node->cma_id->qp->qp_num); @@ -197,12 +197,12 @@ static int post_sends(struct cmatest_nod send_wr.wr.ud.remote_qpn = node->remote_qpn; send_wr.wr.ud.remote_qkey = node->remote_qkey; - sge.length = message_size - sizeof(struct ibv_grh); + sge.length = message_size - sizeof(struct rdmav_grh); sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++) { - ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); + ret = rdmav_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); if (ret) printf("failed to post sends: %d\n", ret); } @@ -241,7 +241,7 @@ err: static int join_handler(struct cmatest_node *node) { - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; int ret; ret = rdma_get_dst_attr(node->cma_id, test.dst_addr, &ah_attr, @@ -251,7 +251,7 @@ static int join_handler(struct cmatest_n goto err; } - node->ah = ibv_create_ah(node->pd, &ah_attr); + node->ah = rdmav_create_ah(node->pd, &ah_attr); if (!node->ah) { printf("mckey: failure creating address handle\n"); goto err; @@ -299,21 +299,21 @@ static void destroy_node(struct cmatest_ return; if (node->ah) - ibv_destroy_ah(node->ah); + rdmav_destroy_ah(node->ah); if (node->cma_id->qp) rdma_destroy_qp(node->cma_id); if (node->cq) - ibv_destroy_cq(node->cq); + rdmav_destroy_cq(node->cq); if (node->mem) { - ibv_dereg_mr(node->mr); + rdmav_dereg_mr(node->mr); free(node->mem); } if (node->pd) - ibv_dealloc_pd(node->pd); + rdmav_dealloc_pd(node->pd); 
/* Destroy the RDMA ID after all device resources */ rdma_destroy_id(node->cma_id); @@ -356,7 +356,7 @@ static void destroy_nodes(void) static int poll_cqs(void) { - struct ibv_wc wc[8]; + struct rdmav_wc wc[8]; int done, i, ret; for (i = 0; i < connections; i++) { @@ -364,7 +364,7 @@ static int poll_cqs(void) continue; for (done = 0; done < message_count; done += ret) { - ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + ret = rdmav_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { printf("mckey: failed polling CQ: %d\n", ret); return ret; diff -ruNp ORG/librdmacm/examples/rping.c NEW/librdmacm/examples/rping.c --- ORG/librdmacm/examples/rping.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/examples/rping.c 2006-08-03 17:32:45.000000000 -0700 @@ -111,32 +111,32 @@ struct rping_rdma_info { struct rping_cb { int server; /* 0 iff client */ pthread_t cqthread; - struct ibv_comp_channel *channel; - struct ibv_cq *cq; - struct ibv_pd *pd; - struct ibv_qp *qp; + struct rdmav_comp_channel *channel; + struct rdmav_cq *cq; + struct rdmav_pd *pd; + struct rdmav_qp *qp; - struct ibv_recv_wr rq_wr; /* recv work request record */ - struct ibv_sge recv_sgl; /* recv single SGE */ + struct rdmav_recv_wr rq_wr; /* recv work request record */ + struct rdmav_sge recv_sgl; /* recv single SGE */ struct rping_rdma_info recv_buf;/* malloc'd buffer */ - struct ibv_mr *recv_mr; /* MR associated with this buffer */ + struct rdmav_mr *recv_mr; /* MR associated with this buffer */ - struct ibv_send_wr sq_wr; /* send work requrest record */ - struct ibv_sge send_sgl; + struct rdmav_send_wr sq_wr; /* send work requrest record */ + struct rdmav_sge send_sgl; struct rping_rdma_info send_buf;/* single send buf */ - struct ibv_mr *send_mr; + struct rdmav_mr *send_mr; - struct ibv_send_wr rdma_sq_wr; /* rdma work request record */ - struct ibv_sge rdma_sgl; /* rdma single SGE */ + struct rdmav_send_wr rdma_sq_wr; /* rdma work request record */ + struct rdmav_sge rdma_sgl; /* rdma single SGE */ char *rdma_buf; /* used as rdma sink */ - struct ibv_mr *rdma_mr; + struct rdmav_mr *rdma_mr; uint32_t remote_rkey; /* remote guys RKEY */ uint64_t remote_addr; /* remote guys TO */ uint32_t remote_len; /* remote guys LEN */ char *start_buf; /* rdma read src */ - struct ibv_mr *start_mr; + struct rdmav_mr *start_mr; enum test_state state; /* used for cond/signalling */ sem_t sem; @@ -232,7 +232,7 @@ static int rping_cma_event_handler(struc return ret; } -static int server_recv(struct rping_cb *cb, struct ibv_wc *wc) +static int server_recv(struct rping_cb *cb, struct rdmav_wc *wc) { if (wc->byte_len != sizeof(cb->recv_buf)) { fprintf(stderr, "Received bogus data, size %d\n", wc->byte_len); @@ -253,7 +253,7 @@ static int server_recv(struct rping_cb * return 0; } -static int client_recv(struct rping_cb *cb, struct ibv_wc *wc) +static int client_recv(struct rping_cb *cb, struct rdmav_wc *wc) { if (wc->byte_len != sizeof(cb->recv_buf)) { fprintf(stderr, "Received bogus data, size %d\n", wc->byte_len); @@ -270,39 +270,39 @@ static int client_recv(struct rping_cb * static int rping_cq_event_handler(struct rping_cb *cb) { - struct ibv_wc wc; - struct ibv_recv_wr *bad_wr; + struct rdmav_wc wc; + struct rdmav_recv_wr *bad_wr; int ret; - while ((ret = ibv_poll_cq(cb->cq, 1, &wc)) == 1) { + while ((ret = rdmav_poll_cq(cb->cq, 1, &wc)) == 1) { ret = 0; if (wc.status) { fprintf(stderr, "cq completion failed status %d\n", wc.status); - if (wc.status != IBV_WC_WR_FLUSH_ERR) + if (wc.status != RDMAV_WC_WR_FLUSH_ERR) ret = -1; goto error; } switch 
(wc.opcode) { - case IBV_WC_SEND: + case RDMAV_WC_SEND: DEBUG_LOG("send completion\n"); break; - case IBV_WC_RDMA_WRITE: + case RDMAV_WC_RDMA_WRITE: DEBUG_LOG("rdma write completion\n"); cb->state = RDMA_WRITE_COMPLETE; sem_post(&cb->sem); break; - case IBV_WC_RDMA_READ: + case RDMAV_WC_RDMA_READ: DEBUG_LOG("rdma read completion\n"); cb->state = RDMA_READ_COMPLETE; sem_post(&cb->sem); break; - case IBV_WC_RECV: + case RDMAV_WC_RECV: DEBUG_LOG("recv completion\n"); ret = cb->server ? server_recv(cb, &wc) : client_recv(cb, &wc); @@ -311,7 +311,7 @@ static int rping_cq_event_handler(struct goto error; } - ret = ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + ret = rdmav_post_recv(cb->qp, &cb->rq_wr, &bad_wr); if (ret) { fprintf(stderr, "post recv error: %d\n", ret); goto error; @@ -374,14 +374,14 @@ static void rping_setup_wr(struct rping_ cb->send_sgl.length = sizeof cb->send_buf; cb->send_sgl.lkey = cb->send_mr->lkey; - cb->sq_wr.opcode = IBV_WR_SEND; - cb->sq_wr.send_flags = IBV_SEND_SIGNALED; + cb->sq_wr.opcode = RDMAV_WR_SEND; + cb->sq_wr.send_flags = RDMAV_SEND_SIGNALED; cb->sq_wr.sg_list = &cb->send_sgl; cb->sq_wr.num_sge = 1; cb->rdma_sgl.addr = (uint64_t) (unsigned long) cb->rdma_buf; cb->rdma_sgl.lkey = cb->rdma_mr->lkey; - cb->rdma_sq_wr.send_flags = IBV_SEND_SIGNALED; + cb->rdma_sq_wr.send_flags = RDMAV_SEND_SIGNALED; cb->rdma_sq_wr.sg_list = &cb->rdma_sgl; cb->rdma_sq_wr.num_sge = 1; } @@ -392,14 +392,14 @@ static int rping_setup_buffers(struct rp DEBUG_LOG("rping_setup_buffers called on cb %p\n", cb); - cb->recv_mr = ibv_reg_mr(cb->pd, &cb->recv_buf, sizeof cb->recv_buf, - IBV_ACCESS_LOCAL_WRITE); + cb->recv_mr = rdmav_reg_mr(cb->pd, &cb->recv_buf, sizeof cb->recv_buf, + RDMAV_ACCESS_LOCAL_WRITE); if (!cb->recv_mr) { fprintf(stderr, "recv_buf reg_mr failed\n"); return errno; } - cb->send_mr = ibv_reg_mr(cb->pd, &cb->send_buf, sizeof cb->send_buf, 0); + cb->send_mr = rdmav_reg_mr(cb->pd, &cb->send_buf, sizeof cb->send_buf, 0); if (!cb->send_mr) { fprintf(stderr, "send_buf reg_mr failed\n"); ret = errno; @@ -413,10 +413,10 @@ static int rping_setup_buffers(struct rp goto err2; } - cb->rdma_mr = ibv_reg_mr(cb->pd, cb->rdma_buf, cb->size, - IBV_ACCESS_LOCAL_WRITE | - IBV_ACCESS_REMOTE_READ | - IBV_ACCESS_REMOTE_WRITE); + cb->rdma_mr = rdmav_reg_mr(cb->pd, cb->rdma_buf, cb->size, + RDMAV_ACCESS_LOCAL_WRITE | + RDMAV_ACCESS_REMOTE_READ | + RDMAV_ACCESS_REMOTE_WRITE); if (!cb->rdma_mr) { fprintf(stderr, "rdma_buf reg_mr failed\n"); ret = errno; @@ -431,10 +431,10 @@ static int rping_setup_buffers(struct rp goto err4; } - cb->start_mr = ibv_reg_mr(cb->pd, cb->start_buf, cb->size, - IBV_ACCESS_LOCAL_WRITE | - IBV_ACCESS_REMOTE_READ | - IBV_ACCESS_REMOTE_WRITE); + cb->start_mr = rdmav_reg_mr(cb->pd, cb->start_buf, cb->size, + RDMAV_ACCESS_LOCAL_WRITE | + RDMAV_ACCESS_REMOTE_READ | + RDMAV_ACCESS_REMOTE_WRITE); if (!cb->start_mr) { fprintf(stderr, "start_buf reg_mr failed\n"); ret = errno; @@ -449,32 +449,32 @@ static int rping_setup_buffers(struct rp err5: free(cb->start_buf); err4: - ibv_dereg_mr(cb->rdma_mr); + rdmav_dereg_mr(cb->rdma_mr); err3: free(cb->rdma_buf); err2: - ibv_dereg_mr(cb->send_mr); + rdmav_dereg_mr(cb->send_mr); err1: - ibv_dereg_mr(cb->recv_mr); + rdmav_dereg_mr(cb->recv_mr); return ret; } static void rping_free_buffers(struct rping_cb *cb) { DEBUG_LOG("rping_free_buffers called on cb %p\n", cb); - ibv_dereg_mr(cb->recv_mr); - ibv_dereg_mr(cb->send_mr); - ibv_dereg_mr(cb->rdma_mr); + rdmav_dereg_mr(cb->recv_mr); + rdmav_dereg_mr(cb->send_mr); + 
rdmav_dereg_mr(cb->rdma_mr); free(cb->rdma_buf); if (!cb->server) { - ibv_dereg_mr(cb->start_mr); + rdmav_dereg_mr(cb->start_mr); free(cb->start_buf); } } static int rping_create_qp(struct rping_cb *cb) { - struct ibv_qp_init_attr init_attr; + struct rdmav_qp_init_attr init_attr; int ret; memset(&init_attr, 0, sizeof(init_attr)); @@ -482,7 +482,7 @@ static int rping_create_qp(struct rping_ init_attr.cap.max_recv_wr = 2; init_attr.cap.max_recv_sge = 1; init_attr.cap.max_send_sge = 1; - init_attr.qp_type = IBV_QPT_RC; + init_attr.qp_type = RDMAV_QPT_RC; init_attr.send_cq = cb->cq; init_attr.recv_cq = cb->cq; @@ -501,43 +501,43 @@ static int rping_create_qp(struct rping_ static void rping_free_qp(struct rping_cb *cb) { - ibv_destroy_qp(cb->qp); - ibv_destroy_cq(cb->cq); - ibv_destroy_comp_channel(cb->channel); - ibv_dealloc_pd(cb->pd); + rdmav_destroy_qp(cb->qp); + rdmav_destroy_cq(cb->cq); + rdmav_destroy_comp_channel(cb->channel); + rdmav_dealloc_pd(cb->pd); } static int rping_setup_qp(struct rping_cb *cb, struct rdma_cm_id *cm_id) { int ret; - cb->pd = ibv_alloc_pd(cm_id->verbs); + cb->pd = rdmav_alloc_pd(cm_id->verbs); if (!cb->pd) { - fprintf(stderr, "ibv_alloc_pd failed\n"); + fprintf(stderr, "rdmav_alloc_pd failed\n"); return errno; } DEBUG_LOG("created pd %p\n", cb->pd); - cb->channel = ibv_create_comp_channel(cm_id->verbs); + cb->channel = rdmav_create_comp_channel(cm_id->verbs); if (!cb->channel) { - fprintf(stderr, "ibv_create_comp_channel failed\n"); + fprintf(stderr, "rdmav_create_comp_channel failed\n"); ret = errno; goto err1; } DEBUG_LOG("created channel %p\n", cb->channel); - cb->cq = ibv_create_cq(cm_id->verbs, RPING_SQ_DEPTH * 2, cb, + cb->cq = rdmav_create_cq(cm_id->verbs, RPING_SQ_DEPTH * 2, cb, cb->channel, 0); if (!cb->cq) { - fprintf(stderr, "ibv_create_cq failed\n"); + fprintf(stderr, "rdmav_create_cq failed\n"); ret = errno; goto err2; } DEBUG_LOG("created cq %p\n", cb->cq); - ret = ibv_req_notify_cq(cb->cq, 0); + ret = rdmav_req_notify_cq(cb->cq, 0); if (ret) { - fprintf(stderr, "ibv_create_cq failed\n"); + fprintf(stderr, "rdmav_create_cq failed\n"); ret = errno; goto err3; } @@ -551,11 +551,11 @@ static int rping_setup_qp(struct rping_c return 0; err3: - ibv_destroy_cq(cb->cq); + rdmav_destroy_cq(cb->cq); err2: - ibv_destroy_comp_channel(cb->channel); + rdmav_destroy_comp_channel(cb->channel); err1: - ibv_dealloc_pd(cb->pd); + rdmav_dealloc_pd(cb->pd); return ret; } @@ -581,14 +581,14 @@ static void *cm_thread(void *arg) static void *cq_thread(void *arg) { struct rping_cb *cb = arg; - struct ibv_cq *ev_cq; + struct rdmav_cq *ev_cq; void *ev_ctx; int ret; DEBUG_LOG("cq_thread started.\n"); while (1) { - ret = ibv_get_cq_event(cb->channel, &ev_cq, &ev_ctx); + ret = rdmav_get_cq_event(cb->channel, &ev_cq, &ev_ctx); if (ret) { fprintf(stderr, "Failed to get cq event!\n"); exit(ret); @@ -597,19 +597,19 @@ static void *cq_thread(void *arg) fprintf(stderr, "Unkown CQ!\n"); exit(-1); } - ret = ibv_req_notify_cq(cb->cq, 0); + ret = rdmav_req_notify_cq(cb->cq, 0); if (ret) { fprintf(stderr, "Failed to set notify!\n"); exit(ret); } ret = rping_cq_event_handler(cb); - ibv_ack_cq_events(cb->cq, 1); + rdmav_ack_cq_events(cb->cq, 1); if (ret) exit(ret); } } -static void rping_format_send(struct rping_cb *cb, char *buf, struct ibv_mr *mr) +static void rping_format_send(struct rping_cb *cb, char *buf, struct rdmav_mr *mr) { struct rping_rdma_info *info = &cb->send_buf; @@ -623,7 +623,7 @@ static void rping_format_send(struct rpi static int rping_test_server(struct rping_cb *cb) { 
- struct ibv_send_wr *bad_wr; + struct rdmav_send_wr *bad_wr; int ret; while (1) { @@ -639,12 +639,12 @@ static int rping_test_server(struct rpin DEBUG_LOG("server received sink adv\n"); /* Issue RDMA Read. */ - cb->rdma_sq_wr.opcode = IBV_WR_RDMA_READ; + cb->rdma_sq_wr.opcode = RDMAV_WR_RDMA_READ; cb->rdma_sq_wr.wr.rdma.rkey = cb->remote_rkey; cb->rdma_sq_wr.wr.rdma.remote_addr = cb->remote_addr; cb->rdma_sq_wr.sg_list->length = cb->remote_len; - ret = ibv_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -666,7 +666,7 @@ static int rping_test_server(struct rpin printf("server ping data: %s\n", cb->rdma_buf); /* Tell client to continue */ - ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -684,7 +684,7 @@ static int rping_test_server(struct rpin DEBUG_LOG("server received sink adv\n"); /* RDMA Write echo data */ - cb->rdma_sq_wr.opcode = IBV_WR_RDMA_WRITE; + cb->rdma_sq_wr.opcode = RDMAV_WR_RDMA_WRITE; cb->rdma_sq_wr.wr.rdma.rkey = cb->remote_rkey; cb->rdma_sq_wr.wr.rdma.remote_addr = cb->remote_addr; cb->rdma_sq_wr.sg_list->length = strlen(cb->rdma_buf) + 1; @@ -693,7 +693,7 @@ static int rping_test_server(struct rpin cb->rdma_sq_wr.sg_list->addr, cb->rdma_sq_wr.sg_list->length); - ret = ibv_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -710,7 +710,7 @@ static int rping_test_server(struct rpin DEBUG_LOG("server rdma write complete \n"); /* Tell client to begin again */ - ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -757,7 +757,7 @@ static int rping_bind_server(struct rpin static int rping_run_server(struct rping_cb *cb) { - struct ibv_recv_wr *bad_wr; + struct rdmav_recv_wr *bad_wr; int ret; ret = rping_bind_server(cb); @@ -776,9 +776,9 @@ static int rping_run_server(struct rping goto err1; } - ret = ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + ret = rdmav_post_recv(cb->qp, &cb->rq_wr, &bad_wr); if (ret) { - fprintf(stderr, "ibv_post_recv failed: %d\n", ret); + fprintf(stderr, "rdmav_post_recv failed: %d\n", ret); goto err2; } @@ -804,7 +804,7 @@ err1: static int rping_test_client(struct rping_cb *cb) { int ping, start, cc, i, ret = 0; - struct ibv_send_wr *bad_wr; + struct rdmav_send_wr *bad_wr; unsigned char c; start = 65; @@ -825,7 +825,7 @@ static int rping_test_client(struct rpin cb->start_buf[cb->size - 1] = 0; rping_format_send(cb, cb->start_buf, cb->start_mr); - ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -841,7 +841,7 @@ static int rping_test_client(struct rpin } rping_format_send(cb, cb->rdma_buf, cb->rdma_mr); - ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + ret = rdmav_post_send(cb->qp, &cb->sq_wr, &bad_wr); if (ret) { fprintf(stderr, "post send error %d\n", ret); break; @@ -926,7 +926,7 @@ static int rping_bind_client(struct rpin static int rping_run_client(struct rping_cb *cb) { - struct ibv_recv_wr *bad_wr; + struct rdmav_recv_wr *bad_wr; int ret; ret = rping_bind_client(cb); @@ -945,9 +945,9 @@ static int rping_run_client(struct rping goto err1; } - ret = 
ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + ret = rdmav_post_recv(cb->qp, &cb->rq_wr, &bad_wr); if (ret) { - fprintf(stderr, "ibv_post_recv failed: %d\n", ret); + fprintf(stderr, "rdmav_post_recv failed: %d\n", ret); goto err2; } diff -ruNp ORG/librdmacm/examples/udaddy.c NEW/librdmacm/examples/udaddy.c --- ORG/librdmacm/examples/udaddy.c 2006-07-30 21:18:17.000000000 -0700 +++ NEW/librdmacm/examples/udaddy.c 2006-08-03 17:32:51.000000000 -0700 @@ -55,10 +55,10 @@ struct cmatest_node { int id; struct rdma_cm_id *cma_id; int connected; - struct ibv_pd *pd; - struct ibv_cq *cq; - struct ibv_mr *mr; - struct ibv_ah *ah; + struct rdmav_pd *pd; + struct rdmav_cq *cq; + struct rdmav_mr *mr; + struct rdmav_ah *ah; uint32_t remote_qpn; uint32_t remote_qkey; void *mem; @@ -90,14 +90,14 @@ static int create_message(struct cmatest if (!message_count) return 0; - node->mem = malloc(message_size + sizeof(struct ibv_grh)); + node->mem = malloc(message_size + sizeof(struct rdmav_grh)); if (!node->mem) { printf("failed message allocation\n"); return -1; } - node->mr = ibv_reg_mr(node->pd, node->mem, - message_size + sizeof(struct ibv_grh), - IBV_ACCESS_LOCAL_WRITE); + node->mr = rdmav_reg_mr(node->pd, node->mem, + message_size + sizeof(struct rdmav_grh), + RDMAV_ACCESS_LOCAL_WRITE); if (!node->mr) { printf("failed to reg MR\n"); goto err; @@ -110,10 +110,10 @@ err: static int init_node(struct cmatest_node *node) { - struct ibv_qp_init_attr init_qp_attr; + struct rdmav_qp_init_attr init_qp_attr; int cqe, ret; - node->pd = ibv_alloc_pd(node->cma_id->verbs); + node->pd = rdmav_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; printf("udaddy: unable to allocate PD\n"); @@ -121,7 +121,7 @@ static int init_node(struct cmatest_node } cqe = message_count ? message_count * 2 : 2; - node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + node->cq = rdmav_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; printf("udaddy: unable to create CQ\n"); @@ -135,7 +135,7 @@ static int init_node(struct cmatest_node init_qp_attr.cap.max_recv_sge = 1; init_qp_attr.qp_context = node; init_qp_attr.sq_sig_all = 0; - init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.qp_type = RDMAV_QPT_UD; init_qp_attr.send_cq = node->cq; init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); @@ -155,8 +155,8 @@ out: static int post_recvs(struct cmatest_node *node) { - struct ibv_recv_wr recv_wr, *recv_failure; - struct ibv_sge sge; + struct rdmav_recv_wr recv_wr, *recv_failure; + struct rdmav_sge sge; int i, ret = 0; if (!message_count) @@ -167,12 +167,12 @@ static int post_recvs(struct cmatest_nod recv_wr.num_sge = 1; recv_wr.wr_id = (uintptr_t) node; - sge.length = message_size + sizeof(struct ibv_grh); + sge.length = message_size + sizeof(struct rdmav_grh); sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++ ) { - ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + ret = rdmav_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); if (ret) { printf("failed to post receives: %d\n", ret); break; @@ -183,8 +183,8 @@ static int post_recvs(struct cmatest_nod static int post_sends(struct cmatest_node *node, int signal_flag) { - struct ibv_send_wr send_wr, *bad_send_wr; - struct ibv_sge sge; + struct rdmav_send_wr send_wr, *bad_send_wr; + struct rdmav_sge sge; int i, ret = 0; if (!node->connected || !message_count) @@ -193,8 +193,8 @@ static int post_sends(struct cmatest_nod send_wr.next = NULL; 
send_wr.sg_list = &sge; send_wr.num_sge = 1; - send_wr.opcode = IBV_WR_SEND_WITH_IMM; - send_wr.send_flags = IBV_SEND_INLINE | signal_flag; + send_wr.opcode = RDMAV_WR_SEND_WITH_IMM; + send_wr.send_flags = RDMAV_SEND_INLINE | signal_flag; send_wr.wr_id = (unsigned long)node; send_wr.imm_data = htonl(node->cma_id->qp->qp_num); @@ -202,12 +202,12 @@ static int post_sends(struct cmatest_nod send_wr.wr.ud.remote_qpn = node->remote_qpn; send_wr.wr.ud.remote_qkey = node->remote_qkey; - sge.length = message_size - sizeof(struct ibv_grh); + sge.length = message_size - sizeof(struct rdmav_grh); sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; for (i = 0; i < message_count && !ret; i++) { - ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); + ret = rdmav_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); if (ret) printf("failed to post sends: %d\n", ret); } @@ -305,7 +305,7 @@ err1: static int resolved_handler(struct cmatest_node *node) { - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; int ret; ret = rdma_get_dst_attr(node->cma_id, test.dst_addr, &ah_attr, @@ -315,7 +315,7 @@ static int resolved_handler(struct cmate goto err; } - node->ah = ibv_create_ah(node->pd, &ah_attr); + node->ah = rdmav_create_ah(node->pd, &ah_attr); if (!node->ah) { printf("udaddy: failure creating address handle\n"); goto err; @@ -371,21 +371,21 @@ static void destroy_node(struct cmatest_ return; if (node->ah) - ibv_destroy_ah(node->ah); + rdmav_destroy_ah(node->ah); if (node->cma_id->qp) rdma_destroy_qp(node->cma_id); if (node->cq) - ibv_destroy_cq(node->cq); + rdmav_destroy_cq(node->cq); if (node->mem) { - ibv_dereg_mr(node->mr); + rdmav_dereg_mr(node->mr); free(node->mem); } if (node->pd) - ibv_dealloc_pd(node->pd); + rdmav_dealloc_pd(node->pd); /* Destroy the RDMA ID after all device resources */ rdma_destroy_id(node->cma_id); @@ -429,9 +429,9 @@ static void destroy_nodes(void) free(test.nodes); } -static void create_reply_ah(struct cmatest_node *node, struct ibv_wc *wc) +static void create_reply_ah(struct cmatest_node *node, struct rdmav_wc *wc) { - node->ah = ibv_create_ah_from_wc(node->pd, wc, node->mem, + node->ah = rdmav_create_ah_from_wc(node->pd, wc, node->mem, node->cma_id->port_num); node->remote_qpn = ntohl(wc->imm_data); node->remote_qkey = ntohs(rdma_get_dst_port(node->cma_id)); @@ -439,7 +439,7 @@ static void create_reply_ah(struct cmate static int poll_cqs(void) { - struct ibv_wc wc[8]; + struct rdmav_wc wc[8]; int done, i, ret; for (i = 0; i < connections; i++) { @@ -447,7 +447,7 @@ static int poll_cqs(void) continue; for (done = 0; done < message_count; done += ret) { - ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + ret = rdmav_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { printf("udaddy: failed polling CQ: %d\n", ret); return ret; @@ -511,7 +511,7 @@ static int run_server(void) printf("sending replies\n"); for (i = 0; i < connections; i++) { - ret = post_sends(&test.nodes[i], IBV_SEND_SIGNALED); + ret = post_sends(&test.nodes[i], RDMAV_SEND_SIGNALED); if (ret) goto out; } From krkumar2 at in.ibm.com Thu Aug 3 00:11:11 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 12:41:11 +0530 Subject: [openib-general] [PATCH v3 3/6] libibverbs configuration files changes. 
In-Reply-To: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803071111.6106.72072.sendpatchset@K50wks273950wss.in.ibm.com> Configuration/Makefiles to build libibverbs with the new API. Signed-off-by: Krishna Kumar diff -ruNp ORG/libibverbs/Makefile.am NEW/libibverbs/Makefile.am --- ORG/libibverbs/Makefile.am 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/Makefile.am 2006-08-03 17:15:33.000000000 -0700 @@ -9,7 +9,7 @@ AM_CFLAGS = -g -Wall -D_GNU_SOURCE src_libibverbs_la_CFLAGS = -g -Wall -D_GNU_SOURCE -DDRIVER_PATH=\"$(libdir)/infiniband\" if HAVE_LD_VERSION_SCRIPT - libibverbs_version_script = -Wl,--version-script=$(srcdir)/src/libibverbs.map + libibverbs_version_script = -Wl,--version-script=$(srcdir)/src/librdmaverbs.map else libibverbs_version_script = endif @@ -18,7 +18,7 @@ src_libibverbs_la_SOURCES = src/cmd.c sr src/memory.c src/sysfs.c src/verbs.c src_libibverbs_la_LDFLAGS = -version-info 2 -export-dynamic \ $(libibverbs_version_script) -src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map +src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/librdmaverbs.map bin_PROGRAMS = examples/ibv_devices examples/ibv_devinfo \ examples/ibv_asyncwatch examples/ibv_rc_pingpong examples/ibv_uc_pingpong \ @@ -42,7 +42,8 @@ libibverbsincludedir = $(includedir)/inf libibverbsinclude_HEADERS = include/infiniband/arch.h include/infiniband/driver.h \ include/infiniband/kern-abi.h include/infiniband/opcode.h include/infiniband/verbs.h \ - include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h + include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h \ + include/infiniband/deprecate.h man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1 \ man/ibv_rc_pingpong.1 man/ibv_uc_pingpong.1 man/ibv_ud_pingpong.1 \ @@ -56,8 +57,9 @@ DEBIAN = debian/changelog debian/compat EXTRA_DIST = include/infiniband/driver.h include/infiniband/kern-abi.h \ include/infiniband/opcode.h include/infiniband/verbs.h include/infiniband/marshall.h \ include/infiniband/sa-kern-abi.h include/infiniband/sa.h \ - src/ibverbs.h examples/pingpong.h \ - src/libibverbs.map libibverbs.spec.in $(man_MANS) + include/infiniband/deprecate.h \ + src/rdmaverbs.h examples/pingpong.h \ + src/librdmaverbs.map libibverbs.spec.in $(man_MANS) dist-hook: libibverbs.spec cp libibverbs.spec $(distdir) diff -ruNp ORG/libibverbs/configure.in NEW/libibverbs/configure.in --- ORG/libibverbs/configure.in 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/configure.in 2006-08-02 18:24:49.000000000 -0700 @@ -2,7 +2,7 @@ dnl Process this file with autoconf to p AC_PREREQ(2.57) AC_INIT(libibverbs, 1.1-pre1, openib-general at openib.org) -AC_CONFIG_SRCDIR([src/ibverbs.h]) +AC_CONFIG_SRCDIR([src/rdmaverbs.h]) AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(libibverbs, 1.1-pre1) @@ -33,5 +33,5 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") -AC_CONFIG_FILES([Makefile libibverbs.spec]) +AC_CONFIG_FILES([Makefile librdmaverbs.spec]) AC_OUTPUT diff -ruNp ORG/libibverbs/libibverbs.spec.in NEW/libibverbs/libibverbs.spec.in --- ORG/libibverbs/libibverbs.spec.in 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/libibverbs.spec.in 1969-12-31 16:00:00.000000000 -0800 @@ -1,106 +0,0 @@ -# $Id: libibverbs.spec.in 7484 2006-05-24 21:12:21Z roland $ - -%define ver @VERSION@ - 
-Name: libibverbs -Version: 1.1 -Release: 0.1.pre1%{?dist} -Summary: A library for direct userspace use of InfiniBand - -Group: System Environment/Libraries -License: GPL/BSD -Url: http://openib.org/ -Source: http://openib.org/downloads/libibverbs-1.1-pre1.tar.gz -BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) - -%description -libibverbs is a library that allows userspace processes to use -InfiniBand "verbs" as described in the InfiniBand Architecture -Specification. This includes direct hardware access for fast path -operations. - -For this library to be useful, a device-specific plug-in module should -also be installed. - -%package devel -Summary: Development files for the libibverbs library -Group: System Environment/Libraries - -%description devel -Static libraries and header files for the libibverbs verbs library. - -%package utils -Summary: Examples for the libibverbs library -Group: System Environment/Libraries -Requires: %{name} = %{version}-%{release} - -%description utils -Useful libibverbs1 example programs such as ibv_devinfo, which -displays information about InfiniBand devices. - -%prep -%setup -q -n %{name}-%{ver} - -%build -%configure -make %{?_smp_mflags} - -%install -rm -rf $RPM_BUILD_ROOT -%makeinstall -# remove unpackaged files from the buildroot -rm -f $RPM_BUILD_ROOT%{_libdir}/*.la - -%clean -rm -rf $RPM_BUILD_ROOT - -%post -p /sbin/ldconfig -%postun -p /sbin/ldconfig - -%files -%defattr(-,root,root,-) -%{_libdir}/libibverbs*.so.* -%doc AUTHORS COPYING ChangeLog README - -%files devel -%defattr(-,root,root,-) -%{_libdir}/lib*.so -%{_libdir}/*.a -%{_includedir}/* - -%files utils -%defattr(-,root,root,-) -%{_bindir}/* -%{_mandir}/man1/* - -%changelog -* Mon May 22 2006 Roland Dreier - 1.1-0.1.pre1 -- New upstream release -- Remove dependency on libsysfs, since it is no longer used - -* Thu May 4 2006 Roland Dreier - 1.0.4-1 -- New upstream release - -* Mon Mar 14 2006 Roland Dreier - 1.0.3-1 -- New upstream release - -* Mon Mar 13 2006 Roland Dreier - 1.0.1-1 -- New upstream release - -* Thu Feb 16 2006 Roland Dreier - 1.0-1 -- New upstream release - -* Wed Feb 15 2006 Roland Dreier - 1.0-0.5.rc7 -- New upstream release - -* Sun Jan 22 2006 Roland Dreier - 1.0-0.4.rc6 -- New upstream release - -* Tue Oct 25 2005 Roland Dreier - 1.0-0.3.rc5 -- New upstream release - -* Wed Oct 5 2005 Roland Dreier - 1.0-0.2.rc4 -- Update to upstream 1.0-rc4 release - -* Mon Sep 26 2005 Roland Dreier - 1.0-0.1.rc3 -- Initial attempt at Fedora Extras-compliant spec file diff -ruNp ORG/libibverbs/librdmaverbs.spec.in NEW/libibverbs/librdmaverbs.spec.in --- ORG/libibverbs/librdmaverbs.spec.in 1969-12-31 16:00:00.000000000 -0800 +++ NEW/libibverbs/librdmaverbs.spec.in 2006-08-02 18:24:49.000000000 -0700 @@ -0,0 +1,106 @@ +# $Id: + +%define ver @VERSION@ + +Name: libibverbs +Version: 1.1 +Release: 0.1.pre1%{?dist} +Summary: A library for direct userspace use of InfiniBand + +Group: System Environment/Libraries +License: GPL/BSD +Url: http://openib.org/ +Source: http://openib.org/downloads/libibverbs-1.1-pre1.tar.gz +BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) + +%description +libibverbs is a library that allows userspace processes to use +InfiniBand and iWARP "verbs" as described in the InfiniBand Architecture +Specification and the iWARP documents. This includes direct hardware access +for fast path operations. + +For this library to be useful, a device-specific plug-in module should +also be installed. 
+ +%package devel +Summary: Development files for the libibverbs library +Group: System Environment/Libraries + +%description devel +Static libraries and header files for the libibverbs verbs library. + +%package utils +Summary: Examples for the libibverbs library +Group: System Environment/Libraries +Requires: %{name} = %{version}-%{release} + +%description utils +Useful libibverbs example programs such as ibv_devinfo, which +displays information about InfiniBand devices. + +%prep +%setup -q -n %{name}-%{ver} + +%build +%configure +make %{?_smp_mflags} + +%install +rm -rf $RPM_BUILD_ROOT +%makeinstall +# remove unpackaged files from the buildroot +rm -f $RPM_BUILD_ROOT%{_libdir}/*.la + +%clean +rm -rf $RPM_BUILD_ROOT + +%post -p /sbin/ldconfig +%postun -p /sbin/ldconfig + +%files +%defattr(-,root,root,-) +%{_libdir}/libibverbs*.so.* +%doc AUTHORS COPYING ChangeLog README + +%files devel +%defattr(-,root,root,-) +%{_libdir}/lib*.so +%{_libdir}/*.a +%{_includedir}/* + +%files utils +%defattr(-,root,root,-) +%{_bindir}/* +%{_mandir}/man1/* + +%changelog +* Mon May 22 2006 Roland Dreier - 1.1-0.1.pre1 +- New upstream release +- Remove dependency on libsysfs, since it is no longer used + +* Thu May 4 2006 Roland Dreier - 1.0.4-1 +- New upstream release + +* Mon Mar 14 2006 Roland Dreier - 1.0.3-1 +- New upstream release + +* Mon Mar 13 2006 Roland Dreier - 1.0.1-1 +- New upstream release + +* Thu Feb 16 2006 Roland Dreier - 1.0-1 +- New upstream release + +* Wed Feb 15 2006 Roland Dreier - 1.0-0.5.rc7 +- New upstream release + +* Sun Jan 22 2006 Roland Dreier - 1.0-0.4.rc6 +- New upstream release + +* Tue Oct 25 2005 Roland Dreier - 1.0-0.3.rc5 +- New upstream release + +* Wed Oct 5 2005 Roland Dreier - 1.0-0.2.rc4 +- Update to upstream 1.0-rc4 release + +* Mon Sep 26 2005 Roland Dreier - 1.0-0.1.rc3 +- Initial attempt at Fedora Extras-compliant spec file From monil at voltaire.com Thu Aug 3 02:03:33 2006 From: monil at voltaire.com (Moni Levy) Date: Thu, 3 Aug 2006 12:03:33 +0300 Subject: [openib-general] [openfabrics-ewg] Multicast traffic performace of OFED 1.0 ipoib In-Reply-To: <6.2.0.14.2.20060802075013.0397ca08@esmail.cup.hp.com> References: <6a122cc00608020430k4a520d10xfebb256829feb752@mail.gmail.com> <6.2.0.14.2.20060802075013.0397ca08@esmail.cup.hp.com> Message-ID: <6a122cc00608030203h29c75ee6x1127c8af79da5c40@mail.gmail.com> Mike, On 8/2/06, Michael Krause wrote: > > > Is the performance being measured on an identical topology and hardware set > as before? Multicast by its very nature is sensitive to topology, hardware > components used (buffer depth, latency, etc.) and workload occurring within > the fabric. Loss occurs as a function of congestion or lack of forward > progress resulting in a timeout and thus a toss of a packet. If the > hardware is different or the settings chosen are changed, then the results > would be expected to change. > > It is not clear what you hope to achieve with such tests as there will be > other workloads flowing over the fabric which will create random HOL > blocking which can result in packet loss. Multicast workloads should be > tolerant of such loss. > > Mike I'm sorry about not being clear. My intention in the last sentence was that we got the better (120k-140k PPS) results with our proprietary IB stack and not with a previous openib snapshot. The tests were run on the same setup, which by the way was dedicated only to that traffic. 
I'm aware of the network implications of the test; I was looking for hints about improvements needed in the ipoib implementation. -- Moni > > > > > At 04:30 AM 8/2/2006, Moni Levy wrote: > Hi, > we are doing some performance testing of multicast traffic over > ipoib. The tests are performed by using iperf on dual 1.6G AMD PCI-X > servers with PCI-X Tavor cards with 3.4.FW. Below are the commands that > may be used to run the test. > > Iperf server: > route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0 > /home/qa/testing-tools/iperf-2.0.2/iperf -us -B 224.4.4.4 -i 1 > > Iperf client: > route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0 > /home/qa/testing-tools/iperf-2.0.2/iperf -uc 224.4.4.4 -i 1 -b 100M -t > 400 -l 100 > > We are looking for the max PPS rate (100 byte packet size) without > losses, by changing the BW parameter and looking at the point where we > get no losses reported. The best results we received were around 50k > PPS. I remember that we got some 120k-140k packets of the same size > running without losses. > > We are going to look into it and try to see where the time is spent, > but any ideas are welcome. > > Best regards, > Moni > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > > > From dotanb at mellanox.co.il Thu Aug 3 02:15:23 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 3 Aug 2006 12:15:23 +0300 Subject: [openib-general] [libibcm] linking object files with the libibcm failed on latest driver Message-ID: <200608031215.23529.dotanb@mellanox.co.il> Hi. I tried to add (basic) support for libibcm to one of our tests and I got linking problems with the latest driver (several days ago, I didn't get this error). 
Here are the machine and driver props: ************************************************************* Host Name : sw087 Host Architecture : x86_64 Linux Distribution: Fedora Core release 4 (Stentz) Kernel Version : 2.6.11-1.1369_FC4smp Memory size : 4071672 kB Driver Version : gen2_linux-20060803-0800 (REV=8813) HCA ID(s) : mthca0 HCA model(s) : 23108 FW version(s) : 3.4.000 Board(s) : MT_0030000001 ************************************************************* Here are the compilation errors: gcc main.o cmd_pars.o common.o qp_test.o connect_qp.o data_operation.o -o qp_test -libverbs -libcm -lvl /usr/local/lib64/libibcm.so: undefined reference to `_dlist_mark_move' /usr/local/lib64/libibcm.so: undefined reference to `dlist_destroy' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_get_mnt_path' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_close_class_device' /usr/local/lib64/libibcm.so: undefined reference to `dlist_push' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_open_class' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_get_classdev_attr' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_read_attribute' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_open_class_device' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_get_class_devices' /usr/local/lib64/libibcm.so: undefined reference to `dlist_start' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_close_attribute' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_close_class' /usr/local/lib64/libibcm.so: undefined reference to `dlist_new' /usr/local/lib64/libibcm.so: undefined reference to `sysfs_open_attribute' collect2: ld returned 1 exit status make: *** [qp_test] Error 1 Did you notice this error before? Thanks Dotan From dotanb at mellanox.co.il Thu Aug 3 04:28:19 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 3 Aug 2006 14:28:19 +0300 Subject: [openib-general] [libibcm] does the libibcm support multithreaded applications? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302A3C659@mtlexch01.mtl.com> Hi. I'm trying to use the libibcm in a multithreaded test and I get weird failures (instead of an RTU event I get a DREQ event). Does the libibcm support multithreaded applications? (Every thread has its own CM device and each one of them listens using a different service ID.) thanks Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 From ogerlitz at voltaire.com Thu Aug 3 04:29:03 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 03 Aug 2006 14:29:03 +0300 Subject: [openib-general] some issues related to when/why IPoIB calls netif_carrier_on() etc In-Reply-To: References: Message-ID: <44D1DDFF.3070805@voltaire.com> Roland Dreier wrote: > > 1) what is the exact reason that ib0 is running here, is it as of this > > "magic" configuration of the IPv6 addr that caused it to join to > > the IPv4 and IPv6 broadcast groups? > > No, ipv6 autoconf has nothing to do with it. I think it's because you > did ifconfig ib0 up, which called ipoib_open(), which calls > ipoib_ib_dev_up(), which joins the ipv4 broadcast group. > > Bringing the interface up then starts ipv6 autoconf but that is just a > side issue. You could build a kernel without ipv6 and see what happens. 
I have verified your assumption to be correct: building the kernel without ipv6 support and doing ifconfig ib0 up, I see that the port associated with ib0 joined the ipv4 broadcast group and ib0 is in "running" state. Or. From glebn at voltaire.com Thu Aug 3 04:46:35 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Thu, 3 Aug 2006 14:46:35 +0300 Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support In-Reply-To: References: <20060801131756.GF4681@minantech.com> <20060802114242.GL4681@minantech.com> Message-ID: <20060803114635.GG23626@minantech.com> On Wed, Aug 02, 2006 at 08:08:48AM -0700, Roland Dreier wrote: > > Roland> But one last call for comments: in particular, does anyone > Roland> object to libibverbs being fork-unsafe by default unless > Roland> ibv_fork_init is called? > > Gleb> I am not sure about this one. A library like MPI will have > Gleb> to always call this anyway. On the other side, what if some > Gleb> library calls fork() without the application knowing it? > Gleb> Suddenly the programmer should care about such details. Perhaps > Gleb> opt-out is better than opt-in, and libibverbs should skip > Gleb> ibv_fork_init() only if the application asks for this explicitly? > > But then what should libibverbs do on a kernel that doesn't support > the required madvise(MADV_DONTFORK) call? > If your kernel doesn't support MADV_DONTFORK then you are SOL, and I think that what you are doing now (disabling it) is perfect. It should be clearly documented starting from which kernel version fork() is supported. > It's fine if MPI calls ibv_fork_init() always -- at least then it has > a hint about whether fork() will work. There is nothing MPI can do with this hint. It is not going to use fork() by itself, but because a library never knows what the application is going to do, it will have to call ibv_fork_init() just in case. > > Do you know if there are libraries that call fork()? > Honestly, I can't think of any right now :) But I haven't looked at enough of them. By the way, in a multithreaded program a simple system() call from inside the library will be enough to screw things up. We can also provide an environment variable to control libibverbs behaviour. This way, if the programmer made a wrong assumption, the user will be able to fix it. -- Gleb. From dotanb at mellanox.co.il Thu Aug 3 04:54:40 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 3 Aug 2006 14:54:40 +0300 Subject: [openib-general] [libibcm] linking object files with the libibcm failed on latest driver In-Reply-To: <200608031215.23529.dotanb@mellanox.co.il> References: <200608031215.23529.dotanb@mellanox.co.il> Message-ID: <200608031454.41322.dotanb@mellanox.co.il> Please ignore this email; this failure occurred because of internal code modifications. thanks Dotan On Thursday 03 August 2006 12:15, Dotan Barak wrote: > Hi. > > I tried to add (basic) support for libibcm to one of our tests and I got linking problems with the latest driver > (several days ago, I didn't get this error). 
> > Here are the machine and driver props: > ************************************************************* > Host Name : sw087 > Host Architecture : x86_64 > Linux Distribution: Fedora Core release 4 (Stentz) > Kernel Version : 2.6.11-1.1369_FC4smp > Memory size : 4071672 kB > Driver Version : gen2_linux-20060803-0800 (REV=8813) > HCA ID(s) : mthca0 > HCA model(s) : 23108 > FW version(s) : 3.4.000 > Board(s) : MT_0030000001 > ************************************************************* > > Here are the compilation errors: > > gcc main.o cmd_pars.o common.o qp_test.o connect_qp.o data_operation.o -o qp_test -libverbs -libcm -lvl > /usr/local/lib64/libibcm.so: undefined reference to `_dlist_mark_move' > /usr/local/lib64/libibcm.so: undefined reference to `dlist_destroy' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_get_mnt_path' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_close_class_device' > /usr/local/lib64/libibcm.so: undefined reference to `dlist_push' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_open_class' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_get_classdev_attr' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_read_attribute' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_open_class_device' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_get_class_devices' > /usr/local/lib64/libibcm.so: undefined reference to `dlist_start' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_close_attribute' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_close_class' > /usr/local/lib64/libibcm.so: undefined reference to `dlist_new' > /usr/local/lib64/libibcm.so: undefined reference to `sysfs_open_attribute' > collect2: ld returned 1 exit status > make: *** [qp_test] Error 1 > > > Did you notice this error before? > Thanks > Dotan > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Thu Aug 3 06:19:30 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 08:19:30 -0500 Subject: [openib-general] rdma cm process hang In-Reply-To: <20060802155721.GA20429@osc.edu> References: <20060801213416.GA18941@osc.edu> <1154531379.32560.13.camel@stevo-desktop> <20060802155721.GA20429@osc.edu> Message-ID: <1154611170.29187.7.camel@stevo-desktop> On Wed, 2006-08-02 at 11:57 -0400, Pete Wyckoff wrote: > swise at opengridcomputing.com wrote on Wed, 02 Aug 2006 10:09 -0500: > > This hang is due to 2 things: > > > > 1) the amso card will _never_ timeout a connection that is awaiting an > > MP reply. That is exactly what is happening here. The fix for this > > (timeout mpa connection setup stalls) is a firmware fix and we don't > > have the firmware src. > > > > 2) the IWCM holds a reference on the QP until connection setup either > > succeeds or fails. So that's where we get the stall. The amso driver > > is waiting for the reference on the qp to go to zero, and it never will > > because the amso firmware will never timeout the stalled mpa connection > > setup. > > > > Lemme look more at the amso driver and see if this can be avoided. > > Perhaps the amso driver can blow away the qp and stop the stall. I > > thought thats what it did, but I'll look... > > Thanks for looking. 
I'd just come to the conclusion that it was > waiting on the qp refcnt, but didn't get much farther when your mail > arrived. > I don't know when, or if I'll have time to address this limitation in the ammasso firmware. But there is a way (if anyone wants to implement it): 1) add a timer to the c2_qp struct and start it when c2_llp_connect() is called. 2) if the timer fires, generate a CONNECT_REPLY upcall to the IWCM with status TIMEDOUT. Mark in the qp that the connect timed out. 3) deal with the rare condition that the timer fires at or about the same time the connection really does get established: if the adapter passes up a CCAE_ACTIVE_CONNECT_RESULTS -after- the timer fires but before the qp is destroyed by the consumer, then you must squelch this event and probably destroy the HWQP at least from the adapter's perspective... > Testing on mthca would be a bit more difficult here, but hopefully > that's not an issue now. There's no need. This is an Ammaso-only issue. Steve. From krkumar2 at in.ibm.com Thu Aug 3 00:10:53 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 03 Aug 2006 12:40:53 +0530 Subject: [openib-general] [PATCH v3 2/6] libibverbs source files changes. In-Reply-To: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803071005.6106.96850.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: <20060803071053.6106.45921.sendpatchset@K50wks273950wss.in.ibm.com> Source files in libibverbs defining the new API. Signed-off-by: Krishna Kumar diff -ruNp ORG/libibverbs/src/cmd.c NEW/libibverbs/src/cmd.c --- ORG/libibverbs/src/cmd.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/cmd.c 2006-08-03 17:29:24.000000000 -0700 @@ -45,16 +45,16 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" -static int ibv_cmd_get_context_v2(struct ibv_context *context, - struct ibv_get_context *new_cmd, +static int rdmav_cmd_get_context_v2(struct rdmav_context *context, + struct rdmav_get_context *new_cmd, size_t new_cmd_size, - struct ibv_get_context_resp *resp, + struct rdmav_get_context_resp *resp, size_t resp_size) { - struct ibv_abi_compat_v2 *t; - struct ibv_get_context_v2 *cmd; + struct rdmav_abi_compat_v2 *t; + struct rdmav_get_context_v2 *cmd; size_t cmd_size; uint32_t cq_fd; @@ -65,9 +65,10 @@ static int ibv_cmd_get_context_v2(struct cmd_size = sizeof *cmd + new_cmd_size - sizeof *new_cmd; cmd = alloca(cmd_size); - memcpy(cmd->driver_data, new_cmd->driver_data, new_cmd_size - sizeof *new_cmd); + memcpy(cmd->driver_data, new_cmd->driver_data, + new_cmd_size - sizeof *new_cmd); - IBV_INIT_CMD_RESP(cmd, cmd_size, GET_CONTEXT, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, GET_CONTEXT, resp, resp_size); cmd->cq_fd_tab = (uintptr_t) &cq_fd; if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) @@ -81,14 +82,16 @@ static int ibv_cmd_get_context_v2(struct return 0; } -int ibv_cmd_get_context(struct ibv_context *context, struct ibv_get_context *cmd, - size_t cmd_size, struct ibv_get_context_resp *resp, +int rdmav_cmd_get_context(struct rdmav_context *context, + struct rdmav_get_context *cmd, + size_t cmd_size, struct rdmav_get_context_resp *resp, size_t resp_size) { if (abi_ver <= 2) - return ibv_cmd_get_context_v2(context, cmd, cmd_size, resp, resp_size); + return rdmav_cmd_get_context_v2(context, cmd, cmd_size, resp, + resp_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, GET_CONTEXT, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, GET_CONTEXT, resp, resp_size); if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) 
return errno; @@ -99,14 +102,14 @@ int ibv_cmd_get_context(struct ibv_conte return 0; } -int ibv_cmd_query_device(struct ibv_context *context, - struct ibv_device_attr *device_attr, +int rdmav_cmd_query_device(struct rdmav_context *context, + struct rdmav_device_attr *device_attr, uint64_t *raw_fw_ver, - struct ibv_query_device *cmd, size_t cmd_size) + struct rdmav_query_device *cmd, size_t cmd_size) { - struct ibv_query_device_resp resp; + struct rdmav_query_device_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_DEVICE, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, QUERY_DEVICE, &resp, sizeof resp); if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; @@ -156,13 +159,13 @@ int ibv_cmd_query_device(struct ibv_cont return 0; } -int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr, - struct ibv_query_port *cmd, size_t cmd_size) +int rdmav_cmd_query_port(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr, + struct rdmav_query_port *cmd, size_t cmd_size) { - struct ibv_query_port_resp resp; + struct rdmav_query_port_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_PORT, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, QUERY_PORT, &resp, sizeof resp); cmd->port_num = port_num; if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) @@ -191,11 +194,11 @@ int ibv_cmd_query_port(struct ibv_contex return 0; } -int ibv_cmd_alloc_pd(struct ibv_context *context, struct ibv_pd *pd, - struct ibv_alloc_pd *cmd, size_t cmd_size, - struct ibv_alloc_pd_resp *resp, size_t resp_size) +int rdmav_cmd_alloc_pd(struct rdmav_context *context, struct rdmav_pd *pd, + struct rdmav_alloc_pd *cmd, size_t cmd_size, + struct rdmav_alloc_pd_resp *resp, size_t resp_size) { - IBV_INIT_CMD_RESP(cmd, cmd_size, ALLOC_PD, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, ALLOC_PD, resp, resp_size); if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; @@ -205,11 +208,11 @@ int ibv_cmd_alloc_pd(struct ibv_context return 0; } -int ibv_cmd_dealloc_pd(struct ibv_pd *pd) +int rdmav_cmd_dealloc_pd(struct rdmav_pd *pd) { - struct ibv_dealloc_pd cmd; + struct rdmav_dealloc_pd cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DEALLOC_PD); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DEALLOC_PD); cmd.pd_handle = pd->handle; if (write(pd->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -218,14 +221,14 @@ int ibv_cmd_dealloc_pd(struct ibv_pd *pd return 0; } -int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t length, - uint64_t hca_va, enum ibv_access_flags access, - struct ibv_mr *mr, struct ibv_reg_mr *cmd, +int rdmav_cmd_reg_mr(struct rdmav_pd *pd, void *addr, size_t length, + uint64_t hca_va, enum rdmav_access_flags access, + struct rdmav_mr *mr, struct rdmav_reg_mr *cmd, size_t cmd_size) { - struct ibv_reg_mr_resp resp; + struct rdmav_reg_mr_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, REG_MR, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, REG_MR, &resp, sizeof resp); cmd->start = (uintptr_t) addr; cmd->length = length; @@ -243,11 +246,11 @@ int ibv_cmd_reg_mr(struct ibv_pd *pd, vo return 0; } -int ibv_cmd_dereg_mr(struct ibv_mr *mr) +int rdmav_cmd_dereg_mr(struct rdmav_mr *mr) { - struct ibv_dereg_mr cmd; + struct rdmav_dereg_mr cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DEREG_MR); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DEREG_MR); cmd.mr_handle = mr->handle; if (write(mr->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -256,19 +259,22 @@ int ibv_cmd_dereg_mr(struct ibv_mr *mr) return 0; } 
-static int ibv_cmd_create_cq_v2(struct ibv_context *context, int cqe, - struct ibv_cq *cq, - struct ibv_create_cq *new_cmd, size_t new_cmd_size, - struct ibv_create_cq_resp *resp, size_t resp_size) +static int rdmav_cmd_create_cq_v2(struct rdmav_context *context, int cqe, + struct rdmav_cq *cq, + struct rdmav_create_cq *new_cmd, + size_t new_cmd_size, + struct rdmav_create_cq_resp *resp, + size_t resp_size) { - struct ibv_create_cq_v2 *cmd; + struct rdmav_create_cq_v2 *cmd; size_t cmd_size; cmd_size = sizeof *cmd + new_cmd_size - sizeof *new_cmd; cmd = alloca(cmd_size); - memcpy(cmd->driver_data, new_cmd->driver_data, new_cmd_size - sizeof *new_cmd); + memcpy(cmd->driver_data, new_cmd->driver_data, + new_cmd_size - sizeof *new_cmd); - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_CQ, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, CREATE_CQ, resp, resp_size); cmd->user_handle = (uintptr_t) cq; cmd->cqe = cqe; cmd->event_handler = 0; @@ -282,17 +288,17 @@ static int ibv_cmd_create_cq_v2(struct i return 0; } -int ibv_cmd_create_cq(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, - int comp_vector, struct ibv_cq *cq, - struct ibv_create_cq *cmd, size_t cmd_size, - struct ibv_create_cq_resp *resp, size_t resp_size) +int rdmav_cmd_create_cq(struct rdmav_context *context, int cqe, + struct rdmav_comp_channel *channel, + int comp_vector, struct rdmav_cq *cq, + struct rdmav_create_cq *cmd, size_t cmd_size, + struct rdmav_create_cq_resp *resp, size_t resp_size) { if (abi_ver <= 2) - return ibv_cmd_create_cq_v2(context, cqe, cq, + return rdmav_cmd_create_cq_v2(context, cqe, cq, cmd, cmd_size, resp, resp_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_CQ, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, CREATE_CQ, resp, resp_size); cmd->user_handle = (uintptr_t) cq; cmd->cqe = cqe; cmd->comp_vector = comp_vector; @@ -308,20 +314,20 @@ int ibv_cmd_create_cq(struct ibv_context return 0; } -int ibv_cmd_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +int rdmav_cmd_poll_cq(struct rdmav_cq *ibcq, int ne, struct rdmav_wc *wc) { - struct ibv_poll_cq cmd; - struct ibv_poll_cq_resp *resp; + struct rdmav_poll_cq cmd; + struct rdmav_poll_cq_resp *resp; int i; int rsize; int ret; - rsize = sizeof *resp + ne * sizeof(struct ibv_kern_wc); + rsize = sizeof *resp + ne * sizeof(struct rdmav_kern_wc); resp = malloc(rsize); if (!resp) return -1; - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, POLL_CQ, resp, rsize); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, POLL_CQ, resp, rsize); cmd.cq_handle = ibcq->handle; cmd.ne = ne; @@ -353,11 +359,11 @@ out: return ret; } -int ibv_cmd_req_notify_cq(struct ibv_cq *ibcq, int solicited_only) +int rdmav_cmd_req_notify_cq(struct rdmav_cq *ibcq, int solicited_only) { - struct ibv_req_notify_cq cmd; + struct rdmav_req_notify_cq cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, REQ_NOTIFY_CQ); + RDMAV_INIT_CMD(&cmd, sizeof cmd, REQ_NOTIFY_CQ); cmd.cq_handle = ibcq->handle; cmd.solicited = !!solicited_only; @@ -367,12 +373,12 @@ int ibv_cmd_req_notify_cq(struct ibv_cq return 0; } -int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size) +int rdmav_cmd_resize_cq(struct rdmav_cq *cq, int cqe, + struct rdmav_resize_cq *cmd, size_t cmd_size) { - struct ibv_resize_cq_resp resp; + struct rdmav_resize_cq_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); cmd->cq_handle = cq->handle; cmd->cqe = cqe; @@ -384,11 +390,11 @@ int 
ibv_cmd_resize_cq(struct ibv_cq *cq, return 0; } -static int ibv_cmd_destroy_cq_v1(struct ibv_cq *cq) +static int rdmav_cmd_destroy_cq_v1(struct rdmav_cq *cq) { - struct ibv_destroy_cq_v1 cmd; + struct rdmav_destroy_cq_v1 cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_CQ); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DESTROY_CQ); cmd.cq_handle = cq->handle; if (write(cq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -397,15 +403,15 @@ static int ibv_cmd_destroy_cq_v1(struct return 0; } -int ibv_cmd_destroy_cq(struct ibv_cq *cq) +int rdmav_cmd_destroy_cq(struct rdmav_cq *cq) { - struct ibv_destroy_cq cmd; - struct ibv_destroy_cq_resp resp; + struct rdmav_destroy_cq cmd; + struct rdmav_destroy_cq_resp resp; if (abi_ver == 1) - return ibv_cmd_destroy_cq_v1(cq); + return rdmav_cmd_destroy_cq_v1(cq); - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_CQ, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_CQ, &resp, sizeof resp); cmd.cq_handle = cq->handle; if (write(cq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -420,12 +426,12 @@ int ibv_cmd_destroy_cq(struct ibv_cq *cq return 0; } -int ibv_cmd_create_srq(struct ibv_pd *pd, - struct ibv_srq *srq, struct ibv_srq_init_attr *attr, - struct ibv_create_srq *cmd, size_t cmd_size, - struct ibv_create_srq_resp *resp, size_t resp_size) +int rdmav_cmd_create_srq(struct rdmav_pd *pd, + struct rdmav_srq *srq, struct rdmav_srq_init_attr *attr, + struct rdmav_create_srq *cmd, size_t cmd_size, + struct rdmav_create_srq_resp *resp, size_t resp_size) { - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_SRQ, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, CREATE_SRQ, resp, resp_size); cmd->user_handle = (uintptr_t) srq; cmd->pd_handle = pd->handle; cmd->max_wr = attr->attr.max_wr; @@ -441,8 +447,8 @@ int ibv_cmd_create_srq(struct ibv_pd *pd attr->attr.max_wr = resp->max_wr; attr->attr.max_sge = resp->max_sge; } else { - struct ibv_create_srq_resp_v5 *resp_v5 = - (struct ibv_create_srq_resp_v5 *) resp; + struct rdmav_create_srq_resp_v5 *resp_v5 = + (struct rdmav_create_srq_resp_v5 *) resp; memmove((void *) resp + sizeof *resp, (void *) resp_v5 + sizeof *resp_v5, @@ -452,20 +458,21 @@ int ibv_cmd_create_srq(struct ibv_pd *pd return 0; } -static int ibv_cmd_modify_srq_v3(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask, - struct ibv_modify_srq *new_cmd, +static int rdmav_cmd_modify_srq_v3(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask, + struct rdmav_modify_srq *new_cmd, size_t new_cmd_size) { - struct ibv_modify_srq_v3 *cmd; + struct rdmav_modify_srq_v3 *cmd; size_t cmd_size; cmd_size = sizeof *cmd + new_cmd_size - sizeof *new_cmd; cmd = alloca(cmd_size); - memcpy(cmd->driver_data, new_cmd->driver_data, new_cmd_size - sizeof *new_cmd); + memcpy(cmd->driver_data, new_cmd->driver_data, + new_cmd_size - sizeof *new_cmd); - IBV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); + RDMAV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); cmd->srq_handle = srq->handle; cmd->attr_mask = srq_attr_mask; @@ -480,16 +487,16 @@ static int ibv_cmd_modify_srq_v3(struct return 0; } -int ibv_cmd_modify_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask, - struct ibv_modify_srq *cmd, size_t cmd_size) +int rdmav_cmd_modify_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask, + struct rdmav_modify_srq *cmd, size_t cmd_size) { if (abi_ver == 3) - return ibv_cmd_modify_srq_v3(srq, srq_attr, srq_attr_mask, 
+ return rdmav_cmd_modify_srq_v3(srq, srq_attr, srq_attr_mask, cmd, cmd_size); - IBV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); + RDMAV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); cmd->srq_handle = srq->handle; cmd->attr_mask = srq_attr_mask; @@ -502,12 +509,12 @@ int ibv_cmd_modify_srq(struct ibv_srq *s return 0; } -int ibv_cmd_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, - struct ibv_query_srq *cmd, size_t cmd_size) +int rdmav_cmd_query_srq(struct rdmav_srq *srq, struct rdmav_srq_attr *srq_attr, + struct rdmav_query_srq *cmd, size_t cmd_size) { - struct ibv_query_srq_resp resp; + struct rdmav_query_srq_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_SRQ, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, QUERY_SRQ, &resp, sizeof resp); cmd->srq_handle = srq->handle; if (write(srq->context->cmd_fd, cmd, cmd_size) != cmd_size) @@ -520,11 +527,11 @@ int ibv_cmd_query_srq(struct ibv_srq *sr return 0; } -static int ibv_cmd_destroy_srq_v1(struct ibv_srq *srq) +static int rdmav_cmd_destroy_srq_v1(struct rdmav_srq *srq) { - struct ibv_destroy_srq_v1 cmd; + struct rdmav_destroy_srq_v1 cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_SRQ); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DESTROY_SRQ); cmd.srq_handle = srq->handle; if (write(srq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -533,15 +540,15 @@ static int ibv_cmd_destroy_srq_v1(struct return 0; } -int ibv_cmd_destroy_srq(struct ibv_srq *srq) +int rdmav_cmd_destroy_srq(struct rdmav_srq *srq) { - struct ibv_destroy_srq cmd; - struct ibv_destroy_srq_resp resp; + struct rdmav_destroy_srq cmd; + struct rdmav_destroy_srq_resp resp; if (abi_ver == 1) - return ibv_cmd_destroy_srq_v1(srq); + return rdmav_cmd_destroy_srq_v1(srq); - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_SRQ, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_SRQ, &resp, sizeof resp); cmd.srq_handle = srq->handle; if (write(srq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -555,12 +562,12 @@ int ibv_cmd_destroy_srq(struct ibv_srq * return 0; } -int ibv_cmd_create_qp(struct ibv_pd *pd, - struct ibv_qp *qp, struct ibv_qp_init_attr *attr, - struct ibv_create_qp *cmd, size_t cmd_size, - struct ibv_create_qp_resp *resp, size_t resp_size) +int rdmav_cmd_create_qp(struct rdmav_pd *pd, + struct rdmav_qp *qp, struct rdmav_qp_init_attr *attr, + struct rdmav_create_qp *cmd, size_t cmd_size, + struct rdmav_create_qp_resp *resp, size_t resp_size) { - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, resp, resp_size); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, resp, resp_size); cmd->user_handle = (uintptr_t) qp; cmd->pd_handle = pd->handle; @@ -591,15 +598,15 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, } if (abi_ver == 4) { - struct ibv_create_qp_resp_v4 *resp_v4 = - (struct ibv_create_qp_resp_v4 *) resp; + struct rdmav_create_qp_resp_v4 *resp_v4 = + (struct rdmav_create_qp_resp_v4 *) resp; memmove((void *) resp + sizeof *resp, (void *) resp_v4 + sizeof *resp_v4, resp_size - sizeof *resp); } else if (abi_ver <= 3) { - struct ibv_create_qp_resp_v3 *resp_v3 = - (struct ibv_create_qp_resp_v3 *) resp; + struct rdmav_create_qp_resp_v3 *resp_v3 = + (struct rdmav_create_qp_resp_v3 *) resp; memmove((void *) resp + sizeof *resp, (void *) resp_v3 + sizeof *resp_v3, @@ -609,14 +616,14 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, return 0; } -int ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *init_attr, - struct ibv_query_qp *cmd, size_t cmd_size) +int rdmav_cmd_query_qp(struct rdmav_qp *qp, 
struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *init_attr, + struct rdmav_query_qp *cmd, size_t cmd_size) { - struct ibv_query_qp_resp resp; + struct rdmav_query_qp_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_QP, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, QUERY_QP, &resp, sizeof resp); cmd->qp_handle = qp->handle; cmd->attr_mask = attr_mask; @@ -689,11 +696,11 @@ int ibv_cmd_query_qp(struct ibv_qp *qp, return 0; } -int ibv_cmd_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_modify_qp *cmd, size_t cmd_size) +int rdmav_cmd_modify_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_modify_qp *cmd, size_t cmd_size) { - IBV_INIT_CMD(cmd, cmd_size, MODIFY_QP); + RDMAV_INIT_CMD(cmd, cmd_size, MODIFY_QP); cmd->qp_handle = qp->handle; cmd->attr_mask = attr_mask; @@ -749,11 +756,11 @@ int ibv_cmd_modify_qp(struct ibv_qp *qp, return 0; } -static int ibv_cmd_destroy_qp_v1(struct ibv_qp *qp) +static int rdmav_cmd_destroy_qp_v1(struct rdmav_qp *qp) { - struct ibv_destroy_qp_v1 cmd; + struct rdmav_destroy_qp_v1 cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_QP); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DESTROY_QP); cmd.qp_handle = qp->handle; if (write(qp->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -762,14 +769,14 @@ static int ibv_cmd_destroy_qp_v1(struct return 0; } -int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, - struct ibv_send_wr **bad_wr) +int rdmav_cmd_post_send(struct rdmav_qp *ibqp, struct rdmav_send_wr *wr, + struct rdmav_send_wr **bad_wr) { - struct ibv_post_send *cmd; - struct ibv_post_send_resp resp; - struct ibv_send_wr *i; - struct ibv_kern_send_wr *n, *tmp; - struct ibv_sge *s; + struct rdmav_post_send *cmd; + struct rdmav_post_send_resp resp; + struct rdmav_send_wr *i; + struct rdmav_kern_send_wr *n, *tmp; + struct rdmav_sge *s; unsigned wr_count = 0; unsigned sge_count = 0; int cmd_size; @@ -783,14 +790,14 @@ int ibv_cmd_post_send(struct ibv_qp *ibq cmd_size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s; cmd = alloca(cmd_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, POST_SEND, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, POST_SEND, &resp, sizeof resp); cmd->qp_handle = ibqp->handle; cmd->wr_count = wr_count; cmd->sge_count = sge_count; cmd->wqe_size = sizeof *n; - n = (struct ibv_kern_send_wr *) ((void *) cmd + sizeof *cmd); - s = (struct ibv_sge *) (n + wr_count); + n = (struct rdmav_kern_send_wr *) ((void *) cmd + sizeof *cmd); + s = (struct rdmav_sge *) (n + wr_count); tmp = n; for (i = wr; i; i = i->next) { @@ -799,21 +806,21 @@ int ibv_cmd_post_send(struct ibv_qp *ibq tmp->opcode = i->opcode; tmp->send_flags = i->send_flags; tmp->imm_data = i->imm_data; - if (ibqp->qp_type == IBV_QPT_UD) { + if (ibqp->qp_type == RDMAV_QPT_UD) { tmp->wr.ud.ah = i->wr.ud.ah->handle; tmp->wr.ud.remote_qpn = i->wr.ud.remote_qpn; tmp->wr.ud.remote_qkey = i->wr.ud.remote_qkey; } else { switch(i->opcode) { - case IBV_WR_RDMA_WRITE: - case IBV_WR_RDMA_WRITE_WITH_IMM: - case IBV_WR_RDMA_READ: + case RDMAV_WR_RDMA_WRITE: + case RDMAV_WR_RDMA_WRITE_WITH_IMM: + case RDMAV_WR_RDMA_READ: tmp->wr.rdma.remote_addr = i->wr.rdma.remote_addr; tmp->wr.rdma.rkey = i->wr.rdma.rkey; break; - case IBV_WR_ATOMIC_CMP_AND_SWP: - case IBV_WR_ATOMIC_FETCH_AND_ADD: + case RDMAV_WR_ATOMIC_CMP_AND_SWP: + case RDMAV_WR_ATOMIC_FETCH_AND_ADD: tmp->wr.atomic.remote_addr = i->wr.atomic.remote_addr; 
tmp->wr.atomic.compare_add = @@ -849,14 +856,14 @@ int ibv_cmd_post_send(struct ibv_qp *ibq return ret; } -int ibv_cmd_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr) +int rdmav_cmd_post_recv(struct rdmav_qp *ibqp, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr) { - struct ibv_post_recv *cmd; - struct ibv_post_recv_resp resp; - struct ibv_recv_wr *i; - struct ibv_kern_recv_wr *n, *tmp; - struct ibv_sge *s; + struct rdmav_post_recv *cmd; + struct rdmav_post_recv_resp resp; + struct rdmav_recv_wr *i; + struct rdmav_kern_recv_wr *n, *tmp; + struct rdmav_sge *s; unsigned wr_count = 0; unsigned sge_count = 0; int cmd_size; @@ -870,14 +877,14 @@ int ibv_cmd_post_recv(struct ibv_qp *ibq cmd_size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s; cmd = alloca(cmd_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, POST_RECV, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, POST_RECV, &resp, sizeof resp); cmd->qp_handle = ibqp->handle; cmd->wr_count = wr_count; cmd->sge_count = sge_count; cmd->wqe_size = sizeof *n; - n = (struct ibv_kern_recv_wr *) ((void *) cmd + sizeof *cmd); - s = (struct ibv_sge *) (n + wr_count); + n = (struct rdmav_kern_recv_wr *) ((void *) cmd + sizeof *cmd); + s = (struct rdmav_sge *) (n + wr_count); tmp = n; for (i = wr; i; i = i->next) { @@ -907,14 +914,14 @@ int ibv_cmd_post_recv(struct ibv_qp *ibq return ret; } -int ibv_cmd_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, - struct ibv_recv_wr **bad_wr) +int rdmav_cmd_post_srq_recv(struct rdmav_srq *srq, struct rdmav_recv_wr *wr, + struct rdmav_recv_wr **bad_wr) { - struct ibv_post_srq_recv *cmd; - struct ibv_post_srq_recv_resp resp; - struct ibv_recv_wr *i; - struct ibv_kern_recv_wr *n, *tmp; - struct ibv_sge *s; + struct rdmav_post_srq_recv *cmd; + struct rdmav_post_srq_recv_resp resp; + struct rdmav_recv_wr *i; + struct rdmav_kern_recv_wr *n, *tmp; + struct rdmav_sge *s; unsigned wr_count = 0; unsigned sge_count = 0; int cmd_size; @@ -928,14 +935,14 @@ int ibv_cmd_post_srq_recv(struct ibv_srq cmd_size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s; cmd = alloca(cmd_size); - IBV_INIT_CMD_RESP(cmd, cmd_size, POST_SRQ_RECV, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(cmd, cmd_size, POST_SRQ_RECV, &resp, sizeof resp); cmd->srq_handle = srq->handle; cmd->wr_count = wr_count; cmd->sge_count = sge_count; cmd->wqe_size = sizeof *n; - n = (struct ibv_kern_recv_wr *) ((void *) cmd + sizeof *cmd); - s = (struct ibv_sge *) (n + wr_count); + n = (struct rdmav_kern_recv_wr *) ((void *) cmd + sizeof *cmd); + s = (struct rdmav_sge *) (n + wr_count); tmp = n; for (i = wr; i; i = i->next) { @@ -965,13 +972,13 @@ int ibv_cmd_post_srq_recv(struct ibv_srq return ret; } -int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah, - struct ibv_ah_attr *attr) +int rdmav_cmd_create_ah(struct rdmav_pd *pd, struct rdmav_ah *ah, + struct rdmav_ah_attr *attr) { - struct ibv_create_ah cmd; - struct ibv_create_ah_resp resp; + struct rdmav_create_ah cmd; + struct rdmav_create_ah_resp resp; - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_AH, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_AH, &resp, sizeof resp); cmd.user_handle = (uintptr_t) ah; cmd.pd_handle = pd->handle; cmd.attr.dlid = attr->dlid; @@ -994,11 +1001,11 @@ int ibv_cmd_create_ah(struct ibv_pd *pd, return 0; } -int ibv_cmd_destroy_ah(struct ibv_ah *ah) +int rdmav_cmd_destroy_ah(struct rdmav_ah *ah) { - struct ibv_destroy_ah cmd; + struct rdmav_destroy_ah cmd; - IBV_INIT_CMD(&cmd, sizeof 
cmd, DESTROY_AH); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DESTROY_AH); cmd.ah_handle = ah->handle; if (write(ah->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -1007,15 +1014,15 @@ int ibv_cmd_destroy_ah(struct ibv_ah *ah return 0; } -int ibv_cmd_destroy_qp(struct ibv_qp *qp) +int rdmav_cmd_destroy_qp(struct rdmav_qp *qp) { - struct ibv_destroy_qp cmd; - struct ibv_destroy_qp_resp resp; + struct rdmav_destroy_qp cmd; + struct rdmav_destroy_qp_resp resp; if (abi_ver == 1) - return ibv_cmd_destroy_qp_v1(qp); + return rdmav_cmd_destroy_qp_v1(qp); - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_QP, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_QP, &resp, sizeof resp); cmd.qp_handle = qp->handle; if (write(qp->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) @@ -1029,11 +1036,11 @@ int ibv_cmd_destroy_qp(struct ibv_qp *qp return 0; } -int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +int rdmav_cmd_attach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid) { - struct ibv_attach_mcast cmd; + struct rdmav_attach_mcast cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, ATTACH_MCAST); + RDMAV_INIT_CMD(&cmd, sizeof cmd, ATTACH_MCAST); memcpy(cmd.gid, gid->raw, sizeof cmd.gid); cmd.qp_handle = qp->handle; cmd.mlid = lid; @@ -1044,11 +1051,11 @@ int ibv_cmd_attach_mcast(struct ibv_qp * return 0; } -int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +int rdmav_cmd_detach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid) { - struct ibv_detach_mcast cmd; + struct rdmav_detach_mcast cmd; - IBV_INIT_CMD(&cmd, sizeof cmd, DETACH_MCAST); + RDMAV_INIT_CMD(&cmd, sizeof cmd, DETACH_MCAST); memcpy(cmd.gid, gid->raw, sizeof cmd.gid); cmd.qp_handle = qp->handle; cmd.mlid = lid; diff -ruNp ORG/libibverbs/src/device.c NEW/libibverbs/src/device.c --- ORG/libibverbs/src/device.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/device.c 2006-08-02 23:57:31.000000000 -0700 @@ -48,23 +48,23 @@ #include -#include "ibverbs.h" +#include "rdmaverbs.h" static pthread_mutex_t device_list_lock = PTHREAD_MUTEX_INITIALIZER; static int num_devices; -static struct ibv_device **device_list; +static struct rdmav_device **device_list; -struct ibv_device **ibv_get_device_list(int *num) +struct rdmav_device **rdmav_get_device_list(int *num) { - struct ibv_device **l; + struct rdmav_device **l; int i; pthread_mutex_lock(&device_list_lock); if (!num_devices) - num_devices = ibverbs_init(&device_list); + num_devices = rdmaverbs_init(&device_list); - l = calloc(num_devices + 1, sizeof (struct ibv_device *)); + l = calloc(num_devices + 1, sizeof (struct rdmav_device *)); for (i = 0; i < num_devices; ++i) l[i] = device_list[i]; @@ -76,24 +76,30 @@ struct ibv_device **ibv_get_device_list( return l; } -void ibv_free_device_list(struct ibv_device **list) +/* XXX - to be removed when all apps are converted to new API */ +struct rdmav_device **ibv_get_device_list(int *num) +{ + return rdmav_get_device_list(num); +} + +void rdmav_free_device_list(struct rdmav_device **list) { free(list); } -const char *ibv_get_device_name(struct ibv_device *device) +const char *rdmav_get_device_name(struct rdmav_device *device) { return device->name; } -uint64_t ibv_get_device_guid(struct ibv_device *device) +uint64_t rdmav_get_device_guid(struct rdmav_device *device) { char attr[24]; uint64_t guid = 0; uint16_t parts[4]; int i; - if (ibv_read_sysfs_file(device->ibdev_path, "node_guid", + if (rdmav_read_sysfs_file(device->ibdev_path, "node_guid", attr, sizeof attr) < 
0) return 0; @@ -107,11 +113,11 @@ uint64_t ibv_get_device_guid(struct ibv_ return htonll(guid); } -struct ibv_context *ibv_open_device(struct ibv_device *device) +struct rdmav_context *rdmav_open_device(struct rdmav_device *device) { char *devpath; int cmd_fd; - struct ibv_context *context; + struct rdmav_context *context; asprintf(&devpath, "/dev/infiniband/%s", device->dev_name); @@ -140,14 +146,14 @@ err: return NULL; } -int ibv_close_device(struct ibv_context *context) +int rdmav_close_device(struct rdmav_context *context) { int async_fd = context->async_fd; int cmd_fd = context->cmd_fd; int cq_fd = -1; if (abi_ver <= 2) { - struct ibv_abi_compat_v2 *t = context->abi_compat; + struct rdmav_abi_compat_v2 *t = context->abi_compat; cq_fd = t->channel.fd; free(context->abi_compat); } @@ -162,10 +168,10 @@ int ibv_close_device(struct ibv_context return 0; } -int ibv_get_async_event(struct ibv_context *context, - struct ibv_async_event *event) +int rdmav_get_async_event(struct rdmav_context *context, + struct rdmav_async_event *event) { - struct ibv_kern_async_event ev; + struct rdmav_kern_async_event ev; if (read(context->async_fd, &ev, sizeof ev) != sizeof ev) return -1; @@ -173,23 +179,23 @@ int ibv_get_async_event(struct ibv_conte event->event_type = ev.event_type; switch (event->event_type) { - case IBV_EVENT_CQ_ERR: + case RDMAV_EVENT_CQ_ERR: event->element.cq = (void *) (uintptr_t) ev.element; break; - case IBV_EVENT_QP_FATAL: - case IBV_EVENT_QP_REQ_ERR: - case IBV_EVENT_QP_ACCESS_ERR: - case IBV_EVENT_COMM_EST: - case IBV_EVENT_SQ_DRAINED: - case IBV_EVENT_PATH_MIG: - case IBV_EVENT_PATH_MIG_ERR: - case IBV_EVENT_QP_LAST_WQE_REACHED: + case RDMAV_EVENT_QP_FATAL: + case RDMAV_EVENT_QP_REQ_ERR: + case RDMAV_EVENT_QP_ACCESS_ERR: + case RDMAV_EVENT_COMM_EST: + case RDMAV_EVENT_SQ_DRAINED: + case RDMAV_EVENT_PATH_MIG: + case RDMAV_EVENT_PATH_MIG_ERR: + case RDMAV_EVENT_QP_LAST_WQE_REACHED: event->element.qp = (void *) (uintptr_t) ev.element; break; - case IBV_EVENT_SRQ_ERR: - case IBV_EVENT_SRQ_LIMIT_REACHED: + case RDMAV_EVENT_SRQ_ERR: + case RDMAV_EVENT_SRQ_LIMIT_REACHED: event->element.srq = (void *) (uintptr_t) ev.element; break; @@ -201,12 +207,12 @@ int ibv_get_async_event(struct ibv_conte return 0; } -void ibv_ack_async_event(struct ibv_async_event *event) +void rdmav_ack_async_event(struct rdmav_async_event *event) { switch (event->event_type) { - case IBV_EVENT_CQ_ERR: + case RDMAV_EVENT_CQ_ERR: { - struct ibv_cq *cq = event->element.cq; + struct rdmav_cq *cq = event->element.cq; pthread_mutex_lock(&cq->mutex); ++cq->async_events_completed; @@ -216,16 +222,16 @@ void ibv_ack_async_event(struct ibv_asyn return; } - case IBV_EVENT_QP_FATAL: - case IBV_EVENT_QP_REQ_ERR: - case IBV_EVENT_QP_ACCESS_ERR: - case IBV_EVENT_COMM_EST: - case IBV_EVENT_SQ_DRAINED: - case IBV_EVENT_PATH_MIG: - case IBV_EVENT_PATH_MIG_ERR: - case IBV_EVENT_QP_LAST_WQE_REACHED: + case RDMAV_EVENT_QP_FATAL: + case RDMAV_EVENT_QP_REQ_ERR: + case RDMAV_EVENT_QP_ACCESS_ERR: + case RDMAV_EVENT_COMM_EST: + case RDMAV_EVENT_SQ_DRAINED: + case RDMAV_EVENT_PATH_MIG: + case RDMAV_EVENT_PATH_MIG_ERR: + case RDMAV_EVENT_QP_LAST_WQE_REACHED: { - struct ibv_qp *qp = event->element.qp; + struct rdmav_qp *qp = event->element.qp; pthread_mutex_lock(&qp->mutex); ++qp->events_completed; @@ -235,10 +241,10 @@ void ibv_ack_async_event(struct ibv_asyn return; } - case IBV_EVENT_SRQ_ERR: - case IBV_EVENT_SRQ_LIMIT_REACHED: + case RDMAV_EVENT_SRQ_ERR: + case RDMAV_EVENT_SRQ_LIMIT_REACHED: { - struct ibv_srq *srq = 
event->element.srq; + struct rdmav_srq *srq = event->element.srq; pthread_mutex_lock(&srq->mutex); ++srq->events_completed; diff -ruNp ORG/libibverbs/src/ibverbs.h NEW/libibverbs/src/ibverbs.h --- ORG/libibverbs/src/ibverbs.h 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/ibverbs.h 1969-12-31 16:00:00.000000000 -0800 @@ -1,88 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id: ibverbs.h 4466 2005-12-14 20:44:36Z roland $ - */ - -#ifndef IB_VERBS_H -#define IB_VERBS_H - -#include - -#include - -#define HIDDEN __attribute__((visibility ("hidden"))) - -#define INIT __attribute__((constructor)) -#define FINI __attribute__((destructor)) - -#define PFX "libibverbs: " - -struct ibv_driver { - ibv_driver_init_func init_func; - struct ibv_driver *next; -}; - -struct ibv_abi_compat_v2 { - struct ibv_comp_channel channel; - pthread_mutex_t in_use; -}; - -extern HIDDEN int abi_ver; - -extern HIDDEN int ibverbs_init(struct ibv_device ***list); - -extern HIDDEN int ibv_init_mem_map(void); -extern HIDDEN int ibv_lock_range(void *base, size_t size); -extern HIDDEN int ibv_unlock_range(void *base, size_t size); - -#define IBV_INIT_CMD(cmd, size, opcode) \ - do { \ - if (abi_ver > 2) \ - (cmd)->command = IB_USER_VERBS_CMD_##opcode; \ - else \ - (cmd)->command = IB_USER_VERBS_CMD_##opcode##_V2; \ - (cmd)->in_words = (size) / 4; \ - (cmd)->out_words = 0; \ - } while (0) - -#define IBV_INIT_CMD_RESP(cmd, size, opcode, out, outsize) \ - do { \ - if (abi_ver > 2) \ - (cmd)->command = IB_USER_VERBS_CMD_##opcode; \ - else \ - (cmd)->command = IB_USER_VERBS_CMD_##opcode##_V2; \ - (cmd)->in_words = (size) / 4; \ - (cmd)->out_words = (outsize) / 4; \ - (cmd)->response = (uintptr_t) (out); \ - } while (0) - -#endif /* IB_VERBS_H */ diff -ruNp ORG/libibverbs/src/init.c NEW/libibverbs/src/init.c --- ORG/libibverbs/src/init.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/init.c 2006-08-02 18:24:49.000000000 -0700 @@ -46,24 +46,28 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" #ifndef OPENIB_DRIVER_PATH_ENV # define OPENIB_DRIVER_PATH_ENV "OPENIB_DRIVER_PATH" #endif +#ifndef LIBRDMAVERBS_DRIVER_PATH_ENV +# define 
LIBRDMAVERBS_DRIVER_PATH_ENV "LIBRDMAVERBS_DRIVER_PATH" +#endif + HIDDEN int abi_ver; static char default_path[] = DRIVER_PATH; static const char *user_path; -static struct ibv_driver *driver_list; +static struct rdmav_driver *driver_list; static void load_driver(char *so_path) { void *dlhandle; - ibv_driver_init_func init_func; - struct ibv_driver *driver; + rdmav_driver_init_func init_func; + struct rdmav_driver *driver; dlhandle = dlopen(so_path, RTLD_NOW); if (!dlhandle) { @@ -81,7 +85,8 @@ static void load_driver(char *so_path) driver = malloc(sizeof *driver); if (!driver) { - fprintf(stderr, PFX "Fatal: couldn't allocate driver for %s\n", so_path); + fprintf(stderr, PFX "Fatal: couldn't allocate driver for %s\n", + so_path); dlclose(dlhandle); return; } @@ -122,23 +127,25 @@ static void find_drivers(char *dir) globfree(&so_glob); } -static struct ibv_device *init_drivers(const char *class_path, +static struct rdmav_device *init_drivers(const char *class_path, const char *dev_name) { - struct ibv_driver *driver; - struct ibv_device *dev; + struct rdmav_driver *driver; + struct rdmav_device *dev; int abi_ver = 0; - char sys_path[IBV_SYSFS_PATH_MAX]; - char ibdev_name[IBV_SYSFS_NAME_MAX]; + char sys_path[RDMAV_SYSFS_PATH_MAX]; + char ibdev_name[RDMAV_SYSFS_NAME_MAX]; char value[8]; snprintf(sys_path, sizeof sys_path, "%s/%s", class_path, dev_name); - if (ibv_read_sysfs_file(sys_path, "abi_version", value, sizeof value) > 0) + if (rdmav_read_sysfs_file(sys_path, "abi_version", value, + sizeof value) > 0) abi_ver = strtol(value, NULL, 10); - if (ibv_read_sysfs_file(sys_path, "ibdev", ibdev_name, sizeof ibdev_name) < 0) { + if (rdmav_read_sysfs_file(sys_path, "ibdev", ibdev_name, + sizeof ibdev_name) < 0) { fprintf(stderr, PFX "Warning: no ibdev class attr for %s\n", sys_path); return NULL; @@ -151,8 +158,9 @@ static struct ibv_device *init_drivers(c dev->driver = driver; strcpy(dev->dev_path, sys_path); - snprintf(dev->ibdev_path, IBV_SYSFS_PATH_MAX, "%s/class/infiniband/%s", - ibv_get_sysfs_path(), ibdev_name); + snprintf(dev->ibdev_path, RDMAV_SYSFS_PATH_MAX, + "%s/class/infiniband/%s", + rdmav_get_sysfs_path(), ibdev_name); strcpy(dev->dev_name, dev_name); strcpy(dev->name, ibdev_name); @@ -172,7 +180,7 @@ static int check_abi_version(const char { char value[8]; - if (ibv_read_sysfs_file(path, "class/infiniband_verbs/abi_version", + if (rdmav_read_sysfs_file(path, "class/infiniband_verbs/abi_version", value, sizeof value) < 0) { fprintf(stderr, PFX "Fatal: couldn't read uverbs ABI version.\n"); return -1; @@ -180,32 +188,32 @@ static int check_abi_version(const char abi_ver = strtol(value, NULL, 10); - if (abi_ver < IB_USER_VERBS_MIN_ABI_VERSION || - abi_ver > IB_USER_VERBS_MAX_ABI_VERSION) { + if (abi_ver < RDMAV_USER_VERBS_MIN_ABI_VERSION || + abi_ver > RDMAV_USER_VERBS_MAX_ABI_VERSION) { fprintf(stderr, PFX "Fatal: kernel ABI version %d " "doesn't match library version %d.\n", - abi_ver, IB_USER_VERBS_MAX_ABI_VERSION); + abi_ver, RDMAV_USER_VERBS_MAX_ABI_VERSION); return -1; } return 0; } -HIDDEN int ibverbs_init(struct ibv_device ***list) +HIDDEN int rdmaverbs_init(struct rdmav_device ***list) { const char *sysfs_path; char *wr_path, *dir; - char class_path[IBV_SYSFS_PATH_MAX]; + char class_path[RDMAV_SYSFS_PATH_MAX]; DIR *class_dir; struct dirent *dent; - struct ibv_device *device; - struct ibv_device **new_list; + struct rdmav_device *device; + struct rdmav_device **new_list; int num_devices = 0; int list_size = 0; *list = NULL; - if (ibv_init_mem_map()) + if 
(rdmav_init_mem_map()) return 0; find_drivers(default_path); @@ -215,12 +223,22 @@ HIDDEN int ibverbs_init(struct ibv_devic * environment if we're not running SUID. */ if (getuid() == geteuid()) { - user_path = getenv(OPENIB_DRIVER_PATH_ENV); + const char *user_path_extra; + + user_path = getenv(LIBRDMAVERBS_DRIVER_PATH_ENV); if (user_path) { wr_path = strdupa(user_path); while ((dir = strsep(&wr_path, ";:"))) find_drivers(dir); } + + /* for backwards compatibility */ + user_path_extra = getenv(OPENIB_DRIVER_PATH_ENV); + if (user_path_extra) { + wr_path = strdupa(user_path_extra); + while ((dir = strsep(&wr_path, ";:"))) + find_drivers(dir); + } } /* @@ -230,7 +248,7 @@ HIDDEN int ibverbs_init(struct ibv_devic */ load_driver(NULL); - sysfs_path = ibv_get_sysfs_path(); + sysfs_path = rdmav_get_sysfs_path(); if (!sysfs_path) { fprintf(stderr, PFX "Fatal: couldn't find sysfs mount.\n"); return 0; @@ -258,7 +276,7 @@ HIDDEN int ibverbs_init(struct ibv_devic if (list_size <= num_devices) { list_size = list_size ? list_size * 2 : 1; - new_list = realloc(*list, list_size * sizeof (struct ibv_device *)); + new_list = realloc(*list, list_size * sizeof (struct rdmav_device *)); if (!new_list) goto out; *list = new_list; diff -ruNp ORG/libibverbs/src/libibverbs.map NEW/libibverbs/src/libibverbs.map --- ORG/libibverbs/src/libibverbs.map 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/libibverbs.map 1969-12-31 16:00:00.000000000 -0800 @@ -1,79 +0,0 @@ -IBVERBS_1.0 { - global: - ibv_get_device_list; - ibv_free_device_list; - ibv_get_device_name; - ibv_get_device_guid; - ibv_open_device; - ibv_close_device; - ibv_get_async_event; - ibv_ack_async_event; - ibv_query_device; - ibv_query_port; - ibv_query_gid; - ibv_query_pkey; - ibv_alloc_pd; - ibv_dealloc_pd; - ibv_reg_mr; - ibv_dereg_mr; - ibv_create_comp_channel; - ibv_destroy_comp_channel; - ibv_create_cq; - ibv_resize_cq; - ibv_destroy_cq; - ibv_get_cq_event; - ibv_ack_cq_events; - ibv_create_srq; - ibv_modify_srq; - ibv_query_srq; - ibv_destroy_srq; - ibv_create_qp; - ibv_query_qp; - ibv_modify_qp; - ibv_destroy_qp; - ibv_create_ah; - ibv_init_ah_from_wc; - ibv_create_ah_from_wc; - ibv_destroy_ah; - ibv_attach_mcast; - ibv_detach_mcast; - ibv_cmd_get_context; - ibv_cmd_query_device; - ibv_cmd_query_port; - ibv_cmd_query_gid; - ibv_cmd_query_pkey; - ibv_cmd_alloc_pd; - ibv_cmd_dealloc_pd; - ibv_cmd_reg_mr; - ibv_cmd_dereg_mr; - ibv_cmd_create_cq; - ibv_cmd_poll_cq; - ibv_cmd_req_notify_cq; - ibv_cmd_resize_cq; - ibv_cmd_destroy_cq; - ibv_cmd_create_srq; - ibv_cmd_modify_srq; - ibv_cmd_query_srq; - ibv_cmd_destroy_srq; - ibv_cmd_create_qp; - ibv_cmd_query_qp; - ibv_cmd_modify_qp; - ibv_cmd_destroy_qp; - ibv_cmd_post_send; - ibv_cmd_post_recv; - ibv_cmd_post_srq_recv; - ibv_cmd_create_ah; - ibv_cmd_destroy_ah; - ibv_cmd_attach_mcast; - ibv_cmd_detach_mcast; - ibv_copy_qp_attr_from_kern; - ibv_copy_ah_attr_from_kern; - ibv_copy_path_rec_from_kern; - ibv_copy_path_rec_to_kern; - ibv_rate_to_mult; - mult_to_ibv_rate; - ibv_get_sysfs_path; - ibv_read_sysfs_file; - - local: *; -}; diff -ruNp ORG/libibverbs/src/librdmaverbs.map NEW/libibverbs/src/librdmaverbs.map --- ORG/libibverbs/src/librdmaverbs.map 1969-12-31 16:00:00.000000000 -0800 +++ NEW/libibverbs/src/librdmaverbs.map 2006-08-02 23:50:50.000000000 -0700 @@ -0,0 +1,80 @@ +RDMAVERBS_1.0 { + global: + ibv_get_device_list; + rdmav_get_device_list; + rdmav_free_device_list; + rdmav_get_device_name; + rdmav_get_device_guid; + rdmav_open_device; + rdmav_close_device; + rdmav_get_async_event; + 
rdmav_ack_async_event; + rdmav_query_device; + rdmav_query_port; + rdmav_query_gid; + rdmav_query_pkey; + rdmav_alloc_pd; + rdmav_dealloc_pd; + rdmav_reg_mr; + rdmav_dereg_mr; + rdmav_create_comp_channel; + rdmav_destroy_comp_channel; + rdmav_create_cq; + rdmav_resize_cq; + rdmav_destroy_cq; + rdmav_get_cq_event; + rdmav_ack_cq_events; + rdmav_create_srq; + rdmav_modify_srq; + rdmav_query_srq; + rdmav_destroy_srq; + rdmav_create_qp; + rdmav_query_qp; + rdmav_modify_qp; + rdmav_destroy_qp; + rdmav_create_ah; + rdmav_init_ah_from_wc; + rdmav_create_ah_from_wc; + rdmav_destroy_ah; + rdmav_attach_mcast; + rdmav_detach_mcast; + rdmav_cmd_get_context; + rdmav_cmd_query_device; + rdmav_cmd_query_port; + rdmav_cmd_query_gid; + rdmav_cmd_query_pkey; + rdmav_cmd_alloc_pd; + rdmav_cmd_dealloc_pd; + rdmav_cmd_reg_mr; + rdmav_cmd_dereg_mr; + rdmav_cmd_create_cq; + rdmav_cmd_poll_cq; + rdmav_cmd_req_notify_cq; + rdmav_cmd_resize_cq; + rdmav_cmd_destroy_cq; + rdmav_cmd_create_srq; + rdmav_cmd_modify_srq; + rdmav_cmd_query_srq; + rdmav_cmd_destroy_srq; + rdmav_cmd_create_qp; + rdmav_cmd_query_qp; + rdmav_cmd_modify_qp; + rdmav_cmd_destroy_qp; + rdmav_cmd_post_send; + rdmav_cmd_post_recv; + rdmav_cmd_post_srq_recv; + rdmav_cmd_create_ah; + rdmav_cmd_destroy_ah; + rdmav_cmd_attach_mcast; + rdmav_cmd_detach_mcast; + rdmav_copy_qp_attr_from_kern; + rdmav_copy_ah_attr_from_kern; + rdmav_copy_path_rec_from_kern; + rdmav_copy_path_rec_to_kern; + rdmav_rate_to_mult; + mult_to_rdmav_rate; + rdmav_get_sysfs_path; + rdmav_read_sysfs_file; + + local: *; +}; diff -ruNp ORG/libibverbs/src/marshall.c NEW/libibverbs/src/marshall.c --- ORG/libibverbs/src/marshall.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/marshall.c 2006-08-02 18:24:49.000000000 -0700 @@ -38,8 +38,8 @@ #include -void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, - struct ibv_kern_ah_attr *src) +void rdmav_copy_ah_attr_from_kern(struct rdmav_ah_attr *dst, + struct rdmav_kern_ah_attr *src) { memcpy(dst->grh.dgid.raw, src->grh.dgid, sizeof dst->grh.dgid); dst->grh.flow_label = src->grh.flow_label; @@ -55,8 +55,8 @@ void ibv_copy_ah_attr_from_kern(struct i dst->port_num = src->port_num; } -void ibv_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, - struct ibv_kern_qp_attr *src) +void rdmav_copy_qp_attr_from_kern(struct rdmav_qp_attr *dst, + struct rdmav_kern_qp_attr *src) { dst->cur_qp_state = src->cur_qp_state; dst->path_mtu = src->path_mtu; @@ -73,8 +73,8 @@ void ibv_copy_qp_attr_from_kern(struct i dst->cap.max_recv_sge = src->max_recv_sge; dst->cap.max_inline_data = src->max_inline_data; - ibv_copy_ah_attr_from_kern(&dst->ah_attr, &src->ah_attr); - ibv_copy_ah_attr_from_kern(&dst->alt_ah_attr, &src->alt_ah_attr); + rdmav_copy_ah_attr_from_kern(&dst->ah_attr, &src->ah_attr); + rdmav_copy_ah_attr_from_kern(&dst->alt_ah_attr, &src->alt_ah_attr); dst->pkey_index = src->pkey_index; dst->alt_pkey_index = src->alt_pkey_index; @@ -91,8 +91,8 @@ void ibv_copy_qp_attr_from_kern(struct i dst->alt_timeout = src->alt_timeout; } -void ibv_copy_path_rec_from_kern(struct ibv_sa_path_rec *dst, - struct ibv_kern_path_rec *src) +void rdmav_copy_path_rec_from_kern(struct rdmav_sa_path_rec *dst, + struct rdmav_kern_path_rec *src) { memcpy(dst->dgid.raw, src->dgid, sizeof dst->dgid); memcpy(dst->sgid.raw, src->sgid, sizeof dst->sgid); @@ -116,8 +116,8 @@ void ibv_copy_path_rec_from_kern(struct dst->packet_life_time_selector = src->packet_life_time_selector; } -void ibv_copy_path_rec_to_kern(struct ibv_kern_path_rec *dst, - struct ibv_sa_path_rec *src) 
+void rdmav_copy_path_rec_to_kern(struct rdmav_kern_path_rec *dst, + struct rdmav_sa_path_rec *src) { memcpy(dst->dgid, src->dgid.raw, sizeof src->dgid); memcpy(dst->sgid, src->sgid.raw, sizeof src->sgid); diff -ruNp ORG/libibverbs/src/memory.c NEW/libibverbs/src/memory.c --- ORG/libibverbs/src/memory.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/memory.c 2006-08-02 18:24:49.000000000 -0700 @@ -41,7 +41,7 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" /* * We keep a linked list of page ranges that have been locked along with a @@ -51,21 +51,21 @@ * to avoid the O(n) cost of registering/unregistering memory. */ -struct ibv_mem_node { - struct ibv_mem_node *prev, *next; +struct rdmav_mem_node { + struct rdmav_mem_node *prev, *next; uintptr_t start, end; int refcnt; }; static struct { - struct ibv_mem_node *first; + struct rdmav_mem_node *first; pthread_mutex_t mutex; uintptr_t page_size; } mem_map; -int ibv_init_mem_map(void) +int rdmav_init_mem_map(void) { - struct ibv_mem_node *node = NULL; + struct rdmav_mem_node *node = NULL; node = malloc(sizeof *node); if (!node) @@ -94,9 +94,9 @@ fail: return -1; } -static struct ibv_mem_node *__mm_find_first(uintptr_t start, uintptr_t end) +static struct rdmav_mem_node *__mm_find_first(uintptr_t start, uintptr_t end) { - struct ibv_mem_node *node = mem_map.first; + struct rdmav_mem_node *node = mem_map.first; while (node) { if ((node->start <= start && node->end >= start) || @@ -108,18 +108,18 @@ static struct ibv_mem_node *__mm_find_fi return node; } -static struct ibv_mem_node *__mm_prev(struct ibv_mem_node *node) +static struct rdmav_mem_node *__mm_prev(struct rdmav_mem_node *node) { return node->prev; } -static struct ibv_mem_node *__mm_next(struct ibv_mem_node *node) +static struct rdmav_mem_node *__mm_next(struct rdmav_mem_node *node) { return node->next; } -static void __mm_add(struct ibv_mem_node *node, - struct ibv_mem_node *new) +static void __mm_add(struct rdmav_mem_node *node, + struct rdmav_mem_node *new) { new->prev = node; new->next = node->next; @@ -128,7 +128,7 @@ static void __mm_add(struct ibv_mem_node new->next->prev = new; } -static void __mm_remove(struct ibv_mem_node *node) +static void __mm_remove(struct rdmav_mem_node *node) { /* Never have to remove the first node, so we can use prev */ node->prev->next = node->next; @@ -136,10 +136,10 @@ static void __mm_remove(struct ibv_mem_n node->next->prev = node->prev; } -int ibv_lock_range(void *base, size_t size) +int rdmav_lock_range(void *base, size_t size) { uintptr_t start, end; - struct ibv_mem_node *node, *tmp; + struct rdmav_mem_node *node, *tmp; int ret = 0; if (!size) @@ -202,10 +202,10 @@ out: return ret; } -int ibv_unlock_range(void *base, size_t size) +int rdmav_unlock_range(void *base, size_t size) { uintptr_t start, end; - struct ibv_mem_node *node, *tmp; + struct rdmav_mem_node *node, *tmp; int ret = 0; if (!size) diff -ruNp ORG/libibverbs/src/rdmaverbs.h NEW/libibverbs/src/rdmaverbs.h --- ORG/libibverbs/src/rdmaverbs.h 1969-12-31 16:00:00.000000000 -0800 +++ NEW/libibverbs/src/rdmaverbs.h 2006-08-03 17:29:42.000000000 -0700 @@ -0,0 +1,91 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: + */ + +#ifndef SRC_RDMA_VERBS_H +#define SRC_RDMA_VERBS_H + +#include + +#include +#include + +#define HIDDEN __attribute__((visibility ("hidden"))) + +#define INIT __attribute__((constructor)) +#define FINI __attribute__((destructor)) + +#ifndef PFX +#define PFX "librdmaverbs: " +#endif + +struct rdmav_driver { + rdmav_driver_init_func init_func; + struct rdmav_driver *next; +}; + +struct rdmav_abi_compat_v2 { + struct rdmav_comp_channel channel; + pthread_mutex_t in_use; +}; + +extern HIDDEN int abi_ver; + +extern HIDDEN int rdmaverbs_init(struct rdmav_device ***list); + +extern HIDDEN int rdmav_init_mem_map(void); +extern HIDDEN int rdmav_lock_range(void *base, size_t size); +extern HIDDEN int rdmav_unlock_range(void *base, size_t size); + +#define RDMAV_INIT_CMD(cmd, size, opcode) \ + do { \ + if (abi_ver > 2) \ + (cmd)->command = RDMAV_USER_VERBS_CMD_##opcode; \ + else \ + (cmd)->command = RDMAV_USER_VERBS_CMD_##opcode##_V2; \ + (cmd)->in_words = (size) / 4; \ + (cmd)->out_words = 0; \ + } while (0) + +#define RDMAV_INIT_CMD_RESP(cmd, size, opcode, out, outsize) \ + do { \ + if (abi_ver > 2) \ + (cmd)->command = RDMAV_USER_VERBS_CMD_##opcode; \ + else \ + (cmd)->command = RDMAV_USER_VERBS_CMD_##opcode##_V2; \ + (cmd)->in_words = (size) / 4; \ + (cmd)->out_words = (outsize) / 4; \ + (cmd)->response = (uintptr_t) (out); \ + } while (0) + +#endif /* SRC_RDMA_VERBS_H */ diff -ruNp ORG/libibverbs/src/sysfs.c NEW/libibverbs/src/sysfs.c --- ORG/libibverbs/src/sysfs.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/sysfs.c 2006-08-02 18:24:49.000000000 -0700 @@ -44,11 +44,11 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" static char *sysfs_path; -const char *ibv_get_sysfs_path(void) +const char *rdmav_get_sysfs_path(void) { char *env = NULL; @@ -65,7 +65,7 @@ const char *ibv_get_sysfs_path(void) if (env) { int len; - sysfs_path = strndup(env, IBV_SYSFS_PATH_MAX); + sysfs_path = strndup(env, RDMAV_SYSFS_PATH_MAX); len = strlen(sysfs_path); while (len > 0 && sysfs_path[len - 1] == '/') { --len; @@ -77,7 +77,7 @@ const char *ibv_get_sysfs_path(void) return sysfs_path; } -int ibv_read_sysfs_file(const char *dir, const char *file, +int rdmav_read_sysfs_file(const char *dir, const char *file, char *buf, size_t size) { char *path; 
diff -ruNp ORG/libibverbs/src/verbs.c NEW/libibverbs/src/verbs.c --- ORG/libibverbs/src/verbs.c 2006-07-30 21:18:16.000000000 -0700 +++ NEW/libibverbs/src/verbs.c 2006-08-02 18:24:49.000000000 -0700 @@ -44,54 +44,54 @@ #include #include -#include "ibverbs.h" +#include "rdmaverbs.h" -int ibv_rate_to_mult(enum ibv_rate rate) +int rdmav_rate_to_mult(enum rdmav_rate rate) { switch (rate) { - case IBV_RATE_2_5_GBPS: return 1; - case IBV_RATE_5_GBPS: return 2; - case IBV_RATE_10_GBPS: return 4; - case IBV_RATE_20_GBPS: return 8; - case IBV_RATE_30_GBPS: return 12; - case IBV_RATE_40_GBPS: return 16; - case IBV_RATE_60_GBPS: return 24; - case IBV_RATE_80_GBPS: return 32; - case IBV_RATE_120_GBPS: return 48; + case RDMAV_RATE_2_5_GBPS: return 1; + case RDMAV_RATE_5_GBPS: return 2; + case RDMAV_RATE_10_GBPS: return 4; + case RDMAV_RATE_20_GBPS: return 8; + case RDMAV_RATE_30_GBPS: return 12; + case RDMAV_RATE_40_GBPS: return 16; + case RDMAV_RATE_60_GBPS: return 24; + case RDMAV_RATE_80_GBPS: return 32; + case RDMAV_RATE_120_GBPS: return 48; default: return -1; } } -enum ibv_rate mult_to_ibv_rate(int mult) +enum rdmav_rate mult_to_rdmav_rate(int mult) { switch (mult) { - case 1: return IBV_RATE_2_5_GBPS; - case 2: return IBV_RATE_5_GBPS; - case 4: return IBV_RATE_10_GBPS; - case 8: return IBV_RATE_20_GBPS; - case 12: return IBV_RATE_30_GBPS; - case 16: return IBV_RATE_40_GBPS; - case 24: return IBV_RATE_60_GBPS; - case 32: return IBV_RATE_80_GBPS; - case 48: return IBV_RATE_120_GBPS; - default: return IBV_RATE_MAX; + case 1: return RDMAV_RATE_2_5_GBPS; + case 2: return RDMAV_RATE_5_GBPS; + case 4: return RDMAV_RATE_10_GBPS; + case 8: return RDMAV_RATE_20_GBPS; + case 12: return RDMAV_RATE_30_GBPS; + case 16: return RDMAV_RATE_40_GBPS; + case 24: return RDMAV_RATE_60_GBPS; + case 32: return RDMAV_RATE_80_GBPS; + case 48: return RDMAV_RATE_120_GBPS; + default: return RDMAV_RATE_MAX; } } -int ibv_query_device(struct ibv_context *context, - struct ibv_device_attr *device_attr) +int rdmav_query_device(struct rdmav_context *context, + struct rdmav_device_attr *device_attr) { return context->ops.query_device(context, device_attr); } -int ibv_query_port(struct ibv_context *context, uint8_t port_num, - struct ibv_port_attr *port_attr) +int rdmav_query_port(struct rdmav_context *context, uint8_t port_num, + struct rdmav_port_attr *port_attr) { return context->ops.query_port(context, port_num, port_attr); } -int ibv_query_gid(struct ibv_context *context, uint8_t port_num, - int index, union ibv_gid *gid) +int rdmav_query_gid(struct rdmav_context *context, uint8_t port_num, + int index, union rdmav_gid *gid) { char name[24]; char attr[41]; @@ -100,7 +100,7 @@ int ibv_query_gid(struct ibv_context *co snprintf(name, sizeof name, "ports/%d/gids/%d", port_num, index); - if (ibv_read_sysfs_file(context->device->ibdev_path, name, + if (rdmav_read_sysfs_file(context->device->ibdev_path, name, attr, sizeof attr) < 0) return -1; @@ -114,7 +114,7 @@ int ibv_query_gid(struct ibv_context *co return 0; } -int ibv_query_pkey(struct ibv_context *context, uint8_t port_num, +int rdmav_query_pkey(struct rdmav_context *context, uint8_t port_num, int index, uint16_t *pkey) { char name[24]; @@ -123,7 +123,7 @@ int ibv_query_pkey(struct ibv_context *c snprintf(name, sizeof name, "ports/%d/pkeys/%d", port_num, index); - if (ibv_read_sysfs_file(context->device->ibdev_path, name, + if (rdmav_read_sysfs_file(context->device->ibdev_path, name, attr, sizeof attr) < 0) return -1; @@ -134,9 +134,9 @@ int ibv_query_pkey(struct ibv_context 
*c return 0; } -struct ibv_pd *ibv_alloc_pd(struct ibv_context *context) +struct rdmav_pd *rdmav_alloc_pd(struct rdmav_context *context) { - struct ibv_pd *pd; + struct rdmav_pd *pd; pd = context->ops.alloc_pd(context); if (pd) @@ -145,15 +145,15 @@ struct ibv_pd *ibv_alloc_pd(struct ibv_c return pd; } -int ibv_dealloc_pd(struct ibv_pd *pd) +int rdmav_dealloc_pd(struct rdmav_pd *pd) { return pd->context->ops.dealloc_pd(pd); } -struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, - size_t length, enum ibv_access_flags access) +struct rdmav_mr *rdmav_reg_mr(struct rdmav_pd *pd, void *addr, + size_t length, enum rdmav_access_flags access) { - struct ibv_mr *mr; + struct rdmav_mr *mr; mr = pd->context->ops.reg_mr(pd, addr, length, access); if (mr) { @@ -164,14 +164,14 @@ struct ibv_mr *ibv_reg_mr(struct ibv_pd return mr; } -int ibv_dereg_mr(struct ibv_mr *mr) +int rdmav_dereg_mr(struct rdmav_mr *mr) { return mr->context->ops.dereg_mr(mr); } -static struct ibv_comp_channel *ibv_create_comp_channel_v2(struct ibv_context *context) +static struct rdmav_comp_channel *rdmav_create_comp_channel_v2(struct rdmav_context *context) { - struct ibv_abi_compat_v2 *t = context->abi_compat; + struct rdmav_abi_compat_v2 *t = context->abi_compat; static int warned; if (!pthread_mutex_trylock(&t->in_use)) @@ -187,20 +187,20 @@ static struct ibv_comp_channel *ibv_crea return NULL; } -struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context) +struct rdmav_comp_channel *rdmav_create_comp_channel(struct rdmav_context *context) { - struct ibv_comp_channel *channel; - struct ibv_create_comp_channel cmd; - struct ibv_create_comp_channel_resp resp; + struct rdmav_comp_channel *channel; + struct rdmav_create_comp_channel cmd; + struct rdmav_create_comp_channel_resp resp; if (abi_ver <= 2) - return ibv_create_comp_channel_v2(context); + return rdmav_create_comp_channel_v2(context); channel = malloc(sizeof *channel); if (!channel) return NULL; - IBV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_COMP_CHANNEL, &resp, sizeof resp); + RDMAV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_COMP_CHANNEL, &resp, sizeof resp); if (write(context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) { free(channel); return NULL; @@ -211,17 +211,17 @@ struct ibv_comp_channel *ibv_create_comp return channel; } -static int ibv_destroy_comp_channel_v2(struct ibv_comp_channel *channel) +static int rdmav_destroy_comp_channel_v2(struct rdmav_comp_channel *channel) { - struct ibv_abi_compat_v2 *t = (struct ibv_abi_compat_v2 *) channel; + struct rdmav_abi_compat_v2 *t = (struct rdmav_abi_compat_v2 *) channel; pthread_mutex_unlock(&t->in_use); return 0; } -int ibv_destroy_comp_channel(struct ibv_comp_channel *channel) +int rdmav_destroy_comp_channel(struct rdmav_comp_channel *channel) { if (abi_ver <= 2) - return ibv_destroy_comp_channel_v2(channel); + return rdmav_destroy_comp_channel_v2(channel); close(channel->fd); free(channel); @@ -229,10 +229,12 @@ int ibv_destroy_comp_channel(struct ibv_ return 0; } -struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe, void *cq_context, - struct ibv_comp_channel *channel, int comp_vector) +struct rdmav_cq *rdmav_create_cq(struct rdmav_context *context, int cqe, + void *cq_context, + struct rdmav_comp_channel *channel, + int comp_vector) { - struct ibv_cq *cq = context->ops.create_cq(context, cqe, channel, + struct rdmav_cq *cq = context->ops.create_cq(context, cqe, channel, comp_vector); if (cq) { @@ -247,7 +249,7 @@ struct ibv_cq *ibv_create_cq(struct ibv_ return cq; } -int 
ibv_resize_cq(struct ibv_cq *cq, int cqe) +int rdmav_resize_cq(struct rdmav_cq *cq, int cqe) { if (!cq->context->ops.resize_cq) return ENOSYS; @@ -255,21 +257,20 @@ int ibv_resize_cq(struct ibv_cq *cq, int return cq->context->ops.resize_cq(cq, cqe); } -int ibv_destroy_cq(struct ibv_cq *cq) +int rdmav_destroy_cq(struct rdmav_cq *cq) { return cq->context->ops.destroy_cq(cq); } - -int ibv_get_cq_event(struct ibv_comp_channel *channel, - struct ibv_cq **cq, void **cq_context) +int rdmav_get_cq_event(struct rdmav_comp_channel *channel, + struct rdmav_cq **cq, void **cq_context) { - struct ibv_comp_event ev; + struct rdmav_comp_event ev; if (read(channel->fd, &ev, sizeof ev) != sizeof ev) return -1; - *cq = (struct ibv_cq *) (uintptr_t) ev.cq_handle; + *cq = (struct rdmav_cq *) (uintptr_t) ev.cq_handle; *cq_context = (*cq)->cq_context; if ((*cq)->context->ops.cq_event) @@ -278,7 +279,7 @@ int ibv_get_cq_event(struct ibv_comp_cha return 0; } -void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents) +void rdmav_ack_cq_events(struct rdmav_cq *cq, unsigned int nevents) { pthread_mutex_lock(&cq->mutex); cq->comp_events_completed += nevents; @@ -286,10 +287,10 @@ void ibv_ack_cq_events(struct ibv_cq *cq pthread_mutex_unlock(&cq->mutex); } -struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, - struct ibv_srq_init_attr *srq_init_attr) +struct rdmav_srq *rdmav_create_srq(struct rdmav_pd *pd, + struct rdmav_srq_init_attr *srq_init_attr) { - struct ibv_srq *srq; + struct rdmav_srq *srq; if (!pd->context->ops.create_srq) return NULL; @@ -307,27 +308,27 @@ struct ibv_srq *ibv_create_srq(struct ib return srq; } -int ibv_modify_srq(struct ibv_srq *srq, - struct ibv_srq_attr *srq_attr, - enum ibv_srq_attr_mask srq_attr_mask) +int rdmav_modify_srq(struct rdmav_srq *srq, + struct rdmav_srq_attr *srq_attr, + enum rdmav_srq_attr_mask srq_attr_mask) { return srq->context->ops.modify_srq(srq, srq_attr, srq_attr_mask); } -int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr) +int rdmav_query_srq(struct rdmav_srq *srq, struct rdmav_srq_attr *srq_attr) { return srq->context->ops.query_srq(srq, srq_attr); } -int ibv_destroy_srq(struct ibv_srq *srq) +int rdmav_destroy_srq(struct rdmav_srq *srq) { return srq->context->ops.destroy_srq(srq); } -struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, - struct ibv_qp_init_attr *qp_init_attr) +struct rdmav_qp *rdmav_create_qp(struct rdmav_pd *pd, + struct rdmav_qp_init_attr *qp_init_attr) { - struct ibv_qp *qp = pd->context->ops.create_qp(pd, qp_init_attr); + struct rdmav_qp *qp = pd->context->ops.create_qp(pd, qp_init_attr); if (qp) { qp->context = pd->context; @@ -345,9 +346,9 @@ struct ibv_qp *ibv_create_qp(struct ibv_ return qp; } -int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask, - struct ibv_qp_init_attr *init_attr) +int rdmav_query_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask, + struct rdmav_qp_init_attr *init_attr) { int ret; @@ -355,14 +356,14 @@ int ibv_query_qp(struct ibv_qp *qp, stru if (ret) return ret; - if (attr_mask & IBV_QP_STATE) + if (attr_mask & RDMAV_QP_STATE) qp->state = attr->qp_state; return 0; } -int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, - enum ibv_qp_attr_mask attr_mask) +int rdmav_modify_qp(struct rdmav_qp *qp, struct rdmav_qp_attr *attr, + enum rdmav_qp_attr_mask attr_mask) { int ret; @@ -370,20 +371,20 @@ int ibv_modify_qp(struct ibv_qp *qp, str if (ret) return ret; - if (attr_mask & IBV_QP_STATE) + if (attr_mask & 
RDMAV_QP_STATE) qp->state = attr->qp_state; return 0; } -int ibv_destroy_qp(struct ibv_qp *qp) +int rdmav_destroy_qp(struct rdmav_qp *qp) { return qp->context->ops.destroy_qp(qp); } -struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) +struct rdmav_ah *rdmav_create_ah(struct rdmav_pd *pd, struct rdmav_ah_attr *attr) { - struct ibv_ah *ah = pd->context->ops.create_ah(pd, attr); + struct rdmav_ah *ah = pd->context->ops.create_ah(pd, attr); if (ah) { ah->context = pd->context; @@ -393,22 +394,22 @@ struct ibv_ah *ibv_create_ah(struct ibv_ return ah; } -static int ibv_find_gid_index(struct ibv_context *context, uint8_t port_num, - union ibv_gid *gid) +static int rdmav_find_gid_index(struct rdmav_context *context, uint8_t port_num, + union rdmav_gid *gid) { - union ibv_gid sgid; + union rdmav_gid sgid; int i = 0, ret; do { - ret = ibv_query_gid(context, port_num, i++, &sgid); + ret = rdmav_query_gid(context, port_num, i++, &sgid); } while (!ret && memcmp(&sgid, gid, sizeof *gid)); return ret ? ret : i - 1; } -int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, - struct ibv_wc *wc, struct ibv_grh *grh, - struct ibv_ah_attr *ah_attr) +int rdmav_init_ah_from_wc(struct rdmav_context *context, uint8_t port_num, + struct rdmav_wc *wc, struct rdmav_grh *grh, + struct rdmav_ah_attr *ah_attr) { uint32_t flow_class; int ret; @@ -419,11 +420,11 @@ int ibv_init_ah_from_wc(struct ibv_conte ah_attr->src_path_bits = wc->dlid_path_bits; ah_attr->port_num = port_num; - if (wc->wc_flags & IBV_WC_GRH) { + if (wc->wc_flags & RDMAV_WC_GRH) { ah_attr->is_global = 1; ah_attr->grh.dgid = grh->sgid; - ret = ibv_find_gid_index(context, port_num, &grh->dgid); + ret = rdmav_find_gid_index(context, port_num, &grh->dgid); if (ret < 0) return ret; @@ -436,30 +437,30 @@ int ibv_init_ah_from_wc(struct ibv_conte return 0; } -struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, - struct ibv_grh *grh, uint8_t port_num) +struct rdmav_ah *rdmav_create_ah_from_wc(struct rdmav_pd *pd, struct rdmav_wc *wc, + struct rdmav_grh *grh, uint8_t port_num) { - struct ibv_ah_attr ah_attr; + struct rdmav_ah_attr ah_attr; int ret; - ret = ibv_init_ah_from_wc(pd->context, port_num, wc, grh, &ah_attr); + ret = rdmav_init_ah_from_wc(pd->context, port_num, wc, grh, &ah_attr); if (ret) return NULL; - return ibv_create_ah(pd, &ah_attr); + return rdmav_create_ah(pd, &ah_attr); } -int ibv_destroy_ah(struct ibv_ah *ah) +int rdmav_destroy_ah(struct rdmav_ah *ah) { return ah->context->ops.destroy_ah(ah); } -int ibv_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +int rdmav_attach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid) { return qp->context->ops.attach_mcast(qp, gid, lid); } -int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +int rdmav_detach_mcast(struct rdmav_qp *qp, union rdmav_gid *gid, uint16_t lid) { return qp->context->ops.detach_mcast(qp, gid, lid); } From mshefty at ichips.intel.com Thu Aug 3 09:19:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 03 Aug 2006 09:19:36 -0700 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D1986B.6070302@voltaire.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> <44D049B5.6000505@voltaire.com> <44D0D808.1070003@ichips.intel.com> <44D1986B.6070302@voltaire.com> Message-ID: <44D22218.7000005@ichips.intel.com> Or Gerlitz wrote: > Is it 
correct that with the gen2 code, the remote **CM** will reconnect > on that case? I don't think so. The QP needs to move into timewait, so a new connection request is needed with a different QPN. > I see in cm.c :: cm_rej_handler() that when the state is IB_CM_REQ_SENT > and the reject reason is IB_CM_REJ_STALE_CONN you just move the cm_id > into timewait state, which will cause a retry on the REQ, correct? The cm_id moves into timewait, but that shouldn't cause a retry. The CM should notify the ULP of the reject. The QP cannot be re-used until the cm_id exits the timewait state. - Sean From mshefty at ichips.intel.com Thu Aug 3 09:15:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 03 Aug 2006 09:15:41 -0700 Subject: [openib-general] [libibcm] does the libibcm support multithreaded applications? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302A3C659@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302A3C659@mtlexch01.mtl.com> Message-ID: <44D2212D.4060508@ichips.intel.com> Dotan Barak wrote: > I'm trying to use the libibcm in a multithreaded test and i get weird > failures (instead of RTU event i get a DREQ event). This is possible sequence. If the RTU is lost, or the connecting client aborts before sending the RTU, a DREQ can occur. > Does the libibcm supports multi threading applications? > (every thread have it's own CM device and each one of them listen is > using a different service ID) There's nothing (other than an unknown bug) that should prevent a multi-threaded application from running. - Sean From rdreier at cisco.com Thu Aug 3 09:40:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 09:40:09 -0700 Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support In-Reply-To: <20060803114635.GG23626@minantech.com> (Gleb Natapov's message of "Thu, 3 Aug 2006 14:46:35 +0300") References: <20060801131756.GF4681@minantech.com> <20060802114242.GL4681@minantech.com> <20060803114635.GG23626@minantech.com> Message-ID: Gleb> We can also provide environment variable to control Gleb> libibverbs behaviour. This way if programmer made a wrong Gleb> assumption user will be able to fix it. OK, I added this little snippet: if (getenv("RDMAV_FORK_SAFE") || getenv("IBV_FORK_SAFE")) if (ibv_fork_init()) fprintf(stderr, PFX "Warning: fork()-safety requested " "but init failed\n"); so that setting either RDMAV_FORK_SAFE or IBV_FORK_SAFE environment variables forces the library to try to be fork()-safe. - R. From jlentini at netapp.com Thu Aug 3 10:00:58 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 3 Aug 2006 13:00:58 -0400 (EDT) Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> References: <20060803083723.6346.44450.sendpatchset@K50wks273950wss.in.ibm.com> Message-ID: On Thu, 3 Aug 2006, Krishna Kumar wrote: > 2. Changed rdma_ to rdmav_. This also enabled to retain rdma_create_qp() and > rdma_destroy_qp() routines. I'm glad to see that the generic RDMA verbs are becoming a reality. Exporting QP create/destroy functions from both the RDMA CM library and RDMA verbs library is going to confusion new developers. The names, rdmav_create_qp() and rdma_create_qp(), only differ by 1 character and their arguments only differ by 1 parameter (the RDMA CM's version takes a cma id). Are the rdmav_ versions intended to be generic or are they intended for use with the native communications managers (IB CM and iWARP CM)? 
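For reference, the two prototypes being compared look roughly like this (lifted from the patches posted on this list; the exact types may shift as the rename settles, so treat this as a sketch rather than the final headers):

	/* transport-neutral verbs proposal */
	struct rdmav_qp *rdmav_create_qp(struct rdmav_pd *pd,
					 struct rdmav_qp_init_attr *qp_init_attr);

	/* RDMA CM helper in drivers/infiniband/core/cma.c */
	int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
			   struct ib_qp_init_attr *qp_init_attr);

The extra argument in the CM version is the rdma_cm_id the new QP gets bound to and initialized against.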
Is there a way that the differences could be made clearer? Could one be eliminated? From rdreier at cisco.com Thu Aug 3 10:30:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 10:30:18 -0700 Subject: [openib-general] [PATCH repost] libmthca: stricter checks in mthca_create_srq In-Reply-To: <20060731120712.GI9411@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 31 Jul 2006 15:07:12 +0300") References: <20060731120712.GI9411@mellanox.co.il> Message-ID: thanks, applied From rdreier at cisco.com Thu Aug 3 10:31:35 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 10:31:35 -0700 Subject: [openib-general] [PATCH repost] libmthca: fix compilation on SLES10 In-Reply-To: <20060731120812.GJ9411@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 31 Jul 2006 15:08:12 +0300") References: <20060731120812.GJ9411@mellanox.co.il> Message-ID: Thanks, applied. From glebn at voltaire.com Thu Aug 3 10:36:54 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Thu, 3 Aug 2006 20:36:54 +0300 Subject: [openib-general] [PATCH/RFC] libibverbs and libmthca fork support In-Reply-To: References: <20060801131756.GF4681@minantech.com> <20060802114242.GL4681@minantech.com> <20060803114635.GG23626@minantech.com> Message-ID: <20060803173654.GA23014@minantech.com> On Thu, Aug 03, 2006 at 09:40:09AM -0700, Roland Dreier wrote: > Gleb> We can also provide environment variable to control > Gleb> libibverbs behaviour. This way if programmer made a wrong > Gleb> assumption user will be able to fix it. > > OK, I added this little snippet: > > if (getenv("RDMAV_FORK_SAFE") || getenv("IBV_FORK_SAFE")) > if (ibv_fork_init()) > fprintf(stderr, PFX "Warning: fork()-safety requested " > "but init failed\n"); > > so that setting either RDMAV_FORK_SAFE or IBV_FORK_SAFE environment > variables forces the library to try to be fork()-safe. > Looks good to me. Thanks. -- Gleb. From rdreier at cisco.com Thu Aug 3 10:39:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 10:39:54 -0700 Subject: [openib-general] [PATCH] RFC: srp filesystem data corruption problem/work-around In-Reply-To: <20060802164421.GB19103@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 19:44:21 +0300") References: <44D0D0C1.1060704@mellanox.com> <20060802164421.GB19103@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Thu Aug 3 10:40:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 10:40:51 -0700 Subject: [openib-general] [PATCH v4 0/2][RFC] iWARP Core Support In-Reply-To: <20060802202747.24212.10931.stgit@dell3.ogc.int> (Steve Wise's message of "Wed, 02 Aug 2006 15:27:47 -0500") References: <20060802202747.24212.10931.stgit@dell3.ogc.int> Message-ID: Steve> Here is the iWARP Core Support patchset merged to your Steve> latest for-2.6.19 branch. It has gone through 3 reviews on Steve> lklm and netdev a while ago, and I think its ready to be Steve> pulled in. I agree. I'll read this over and queue it for 2.6.19 unless someone objects. What's the status of the amasso driver? do you think that's ready to merge too? - R. From sean.hefty at intel.com Thu Aug 3 10:37:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 10:37:11 -0700 Subject: [openib-general] [PATCH v2 1/2] sa_query: add generic query interfaces capable of supporting RMPP Message-ID: <000001c6b723$73cd2380$e598070a@amr.corp.intel.com> The following patch adds a generic interface to send MADs to the SA. 
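For orientation, a consumer of the new calls might look roughly like the sketch below. It is pieced together from the signatures in this patch and the iterator usage in the local_sa conversion (patch 2/2, posted in the follow-up mail); the names my_path_callback() and my_query_paths() are made up for illustration and the method/timeout values are arbitrary:

	/* Walk every path record in a (possibly RMPP) SA response. */
	static void my_path_callback(int status,
				     struct ib_mad_recv_wc *mad_recv_wc,
				     void *context)
	{
		struct ib_sa_iter *iter;
		struct ib_sa_path_rec rec;
		void *attr;

		if (!status && mad_recv_wc) {
			iter = ib_sa_iter_create(mad_recv_wc);
			if (!IS_ERR(iter)) {
				while ((attr = ib_sa_iter_next(iter))) {
					ib_sa_unpack_attr(&rec, attr,
							  IB_SA_ATTR_PATH_REC);
					/* ... consume rec ... */
				}
				ib_sa_iter_free(iter);
			}
		}
		if (mad_recv_wc)
			ib_free_recv_mad(mad_recv_wc);
	}

	static int my_query_paths(struct ib_device *device, u8 port_num,
				  struct ib_sa_path_rec *rec,
				  ib_sa_comp_mask comp_mask,
				  struct ib_sa_query **sa_query)
	{
		return ib_sa_send_mad(device, port_num, IB_MGMT_METHOD_GET,
				      rec, IB_SA_ATTR_PATH_REC, comp_mask,
				      1000, 3, GFP_KERNEL,
				      my_path_callback, NULL, sa_query);
	}

A table-type method would be used where more than one record is expected; IB_MGMT_METHOD_GET is shown only because that is what the converted ib_sa_path_rec_get() below passes.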
The primary motivation of adding these calls is to expand the SA query interface to include RMPP responses for users wanting more than a single attribute returned from a query (e.g. multipath record queries). The implementation of existing SA query routines was layered on top of the generic query interface. Signed-off-by: Sean Hefty --- Notes from v1: - "cursor" was renamed to "iter" - ib_sa_iter_next() returns a pointer to the packed attribute - The interface turned out being easier to use having a single function call, ib_sa_iter_next(), to walk through the attributes. Index: include/rdma/ib_sa.h =================================================================== --- include/rdma/ib_sa.h (revision 8647) +++ include/rdma/ib_sa.h (working copy) @@ -254,6 +254,71 @@ struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); +struct ib_sa_iter; + +/** + * ib_sa_iter_create - Create an iterator that may be used to walk through + * a list of returned SA records. + * @mad_recv_wc: A received response from the SA. + * + * This call allocates an iterator that is used to walk through a list of + * SA records. Users must free the iterator by calling ib_sa_iter_free. + */ +struct ib_sa_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_sa_iter_free - Release an iterator. + * @iter: The iterator to free. + */ +void ib_sa_iter_free(struct ib_sa_iter *iter); + +/** + * ib_sa_iter_next - Move an iterator to reference the next attribute and + * return the attribute. + * @iter: The iterator to move. + * + * The referenced attribute will be in wire format. The funtion returns NULL + * if there are no more attributes to return. + */ +void *ib_sa_iter_next(struct ib_sa_iter *iter); + +/** + * ib_sa_send_mad - Send a MAD to the SA. + * @device:device to send query on + * @port_num: port number to send query on + * @method:MAD method to use in the send. + * @attr:Reference to attribute to send in MAD. + * @attr_id:Attribute type identifier. + * @comp_mask:component mask to send in MAD + * @timeout_ms:time to wait for response, if one is expected + * @retries:number of times to retry request + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send a message to the SA. If a response is expected (timeout_ms is + * non-zero), the callback function will be called when the query completes. + * Status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. Mad_recv_wc will reference any returned + * response from the SA. It is the responsibility of the caller to free + * mad_recv_wc by call ib_free_recv_mad() if it is non-NULL. + * + * If the return value of ib_sa_send_mad() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. 
+ */ +int ib_sa_send_mad(struct ib_device *device, u8 port_num, + int method, void *attr, int attr_id, + ib_sa_comp_mask comp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context), + void *context, struct ib_sa_query **query); + int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 8647) +++ core/sa_query.c (working copy) @@ -72,30 +72,42 @@ struct ib_sa_device { }; struct ib_sa_query { - void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); - void (*release)(struct ib_sa_query *); + void (*callback)(int, struct ib_mad_recv_wc *, void *); struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; + void *context; int id; }; struct ib_sa_service_query { void (*callback)(int, struct ib_sa_service_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; }; struct ib_sa_path_query { void (*callback)(int, struct ib_sa_path_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; }; struct ib_sa_mcmember_query { void (*callback)(int, struct ib_sa_mcmember_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; +}; + +struct ib_sa_iter { + struct ib_mad_recv_wc *recv_wc; + struct ib_mad_recv_buf *recv_buf; + int attr_id; + int attr_size; + int attr_offset; + int data_offset; + int data_left; + void *attr; + u8 attr_data[0]; }; static void ib_sa_add_one(struct ib_device *device); @@ -504,9 +516,17 @@ EXPORT_SYMBOL(ib_init_ah_from_mcmember); int ib_sa_pack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { + case IB_SA_ATTR_SERVICE_REC: + ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), + src, dst); + break; case IB_SA_ATTR_PATH_REC: ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); break; + case IB_SA_ATTR_MC_MEMBER_REC: + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + src, dst); + break; default: return -EINVAL; } @@ -517,9 +537,17 @@ EXPORT_SYMBOL(ib_sa_pack_attr); int ib_sa_unpack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { + case IB_SA_ATTR_SERVICE_REC: + ib_unpack(service_rec_table, ARRAY_SIZE(service_rec_table), + src, dst); + break; case IB_SA_ATTR_PATH_REC: ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); break; + case IB_SA_ATTR_MC_MEMBER_REC: + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + src, dst); + break; default: return -EINVAL; } @@ -527,15 +555,20 @@ int ib_sa_unpack_attr(void *dst, void *s } EXPORT_SYMBOL(ib_sa_unpack_attr); -static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent, + int method, void *attr, int attr_id, + ib_sa_comp_mask comp_mask) { unsigned long flags; - memset(mad, 0, sizeof *mad); - mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(attr_id); + mad->sa_hdr.comp_mask = comp_mask; + + ib_sa_pack_attr(mad->data, attr, attr_id); spin_lock_irqsave(&tid_lock, flags); mad->mad_hdr.tid = @@ -589,26 +622,175 @@ retry: return ret ? 
ret : id; } -static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +/* Return size of SA attributes on the wire. */ +static int sa_mad_attr_size(int attr_id) +{ + int size; + + switch (attr_id) { + case IB_SA_ATTR_SERVICE_REC: + size = 176; + break; + case IB_SA_ATTR_PATH_REC: + size = 64; + break; + case IB_SA_ATTR_MC_MEMBER_REC: + size = 52; + break; + default: + size = 0; + break; + } + return size; +} + +struct ib_sa_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc) { - struct ib_sa_path_query *query = - container_of(sa_query, struct ib_sa_path_query, sa_query); + struct ib_sa_iter *iter; + struct ib_sa_mad *mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + int attr_id, attr_size, attr_offset; + + attr_id = be16_to_cpu(mad->mad_hdr.attr_id); + attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + attr_size = sa_mad_attr_size(attr_id); + if (!attr_size || attr_offset < attr_size) + return ERR_PTR(-EINVAL); + + iter = kzalloc(sizeof *iter + attr_size, GFP_KERNEL); + if (!iter) + return ERR_PTR(-ENOMEM); + + iter->data_left = mad_recv_wc->mad_len - IB_MGMT_SA_HDR; + iter->recv_wc = mad_recv_wc; + iter->recv_buf = &mad_recv_wc->recv_buf; + iter->attr_id = attr_id; + iter->attr_offset = attr_offset; + iter->attr_size = attr_size; + return iter; +} +EXPORT_SYMBOL(ib_sa_iter_create); + +void ib_sa_iter_free(struct ib_sa_iter *iter) +{ + kfree(iter); +} +EXPORT_SYMBOL(ib_sa_iter_free); + +void *ib_sa_iter_next(struct ib_sa_iter *iter) +{ + struct ib_sa_mad *mad; + int left, offset = 0; + + while (iter->data_left >= iter->attr_offset) { + while (iter->data_offset < IB_MGMT_SA_DATA) { + mad = (struct ib_sa_mad *) iter->recv_buf->mad; + + left = IB_MGMT_SA_DATA - iter->data_offset; + if (left < iter->attr_size) { + /* copy first piece of the attribute */ + iter->attr = &iter->attr_data; + memcpy(iter->attr, + &mad->data[iter->data_offset], left); + offset = left; + break; + } else if (offset) { + /* copy the second piece of the attribute */ + memcpy(iter->attr + offset, &mad->data[0], + iter->attr_size - offset); + iter->data_offset = iter->attr_size - offset; + offset = 0; + } else { + iter->attr = &mad->data[iter->data_offset]; + iter->data_offset += iter->attr_size; + } + + iter->data_left -= iter->attr_offset; + goto out; + } + iter->data_offset = 0; + iter->recv_buf = list_entry(iter->recv_buf->list.next, + struct ib_mad_recv_buf, list); + } + iter->attr = NULL; +out: + return iter->attr; +} +EXPORT_SYMBOL(ib_sa_iter_next); + +int ib_sa_send_mad(struct ib_device *device, u8 port_num, + int method, void *attr, int attr_id, + ib_sa_comp_mask comp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context), + void *context, struct ib_sa_query **query) +{ + struct ib_sa_query *sa_query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port; + struct ib_mad_agent *agent; + int ret; + + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + + sa_query = kmalloc(sizeof *query, gfp_mask); + if (!sa_query) + return -ENOMEM; + + sa_query->mad_buf = ib_create_send_mad(agent, 1, 0, 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, gfp_mask); + if (!sa_query->mad_buf) { + ret = -ENOMEM; + goto err1; + } - if (mad) { - struct ib_sa_path_rec rec; + sa_query->port = port; + sa_query->callback = callback; + sa_query->context = context; - ib_unpack(path_rec_table, 
ARRAY_SIZE(path_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - query->callback(status, NULL, query->context); + init_mad(sa_query->mad_buf->mad, agent, method, attr, attr_id, + comp_mask); + + ret = send_mad(sa_query, timeout_ms, retries, gfp_mask); + if (ret < 0) + goto err2; + + *query = sa_query; + return ret; + +err2: + ib_free_send_mad(sa_query->mad_buf); +err1: + kfree(query); + return ret; } -static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +static void ib_sa_path_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) { - kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); + struct ib_sa_path_query *query = context; + + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_path_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } /** @@ -647,83 +829,47 @@ int ib_sa_path_rec_get(struct ib_device struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - query->callback = callback; query->context = context; - - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; - query->sa_query.release = ib_sa_path_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = IB_MGMT_METHOD_GET; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); - mad->sa_hdr.comp_mask = comp_mask; - - ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + ret = ib_sa_send_mad(device, port_num, IB_MGMT_METHOD_GET, rec, + IB_SA_ATTR_PATH_REC, comp_mask, timeout_ms, + retries, gfp_mask, ib_sa_path_rec_callback, + query, &query->sa_query); if (ret < 0) - goto err2; + kfree(query); return ret; - -err2: - *sa_query = NULL; - ib_free_send_mad(query->sa_query.mad_buf); - -err1: - kfree(query); - return ret; } EXPORT_SYMBOL(ib_sa_path_rec_get); -static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +static void ib_sa_service_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) { - struct ib_sa_service_query *query = - container_of(sa_query, struct ib_sa_service_query, sa_query); + struct ib_sa_service_query *query = context; - if (mad) { - struct ib_sa_service_rec rec; - - ib_unpack(service_rec_table, ARRAY_SIZE(service_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - query->callback(status, NULL, query->context); -} - -static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) -{ - kfree(container_of(sa_query, struct ib_sa_service_query, sa_query)); + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_service_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + + ib_unpack(service_rec_table, ARRAY_SIZE(service_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } /** @@ -764,89 +910,47 @@ int ib_sa_service_rec_query(struct ib_de struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - - if (method != IB_MGMT_METHOD_GET && - method != IB_MGMT_METHOD_SET && - method != IB_SA_METHOD_DELETE) - return -EINVAL; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - query->callback = callback; query->context = context; - - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - query->sa_query.callback = callback ? 
ib_sa_service_rec_callback : NULL; - query->sa_query.release = ib_sa_service_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = method; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); - mad->sa_hdr.comp_mask = comp_mask; - - ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), - rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + ret = ib_sa_send_mad(device, port_num, method, rec, + IB_SA_ATTR_SERVICE_REC, comp_mask, timeout_ms, + retries, gfp_mask, ib_sa_service_rec_callback, + query, &query->sa_query); if (ret < 0) - goto err2; - - return ret; - -err2: - *sa_query = NULL; - ib_free_send_mad(query->sa_query.mad_buf); + kfree(query); -err1: - kfree(query); return ret; } EXPORT_SYMBOL(ib_sa_service_rec_query); -static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +static void ib_sa_mcmember_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) { - struct ib_sa_mcmember_query *query = - container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + struct ib_sa_mcmember_query *query = context; - if (mad) { - struct ib_sa_mcmember_rec rec; - - ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - query->callback(status, NULL, query->context); -} - -static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) -{ - kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_mcmember_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(mcmember_rec_table, + ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, @@ -861,61 +965,22 @@ int ib_sa_mcmember_rec_query(struct ib_d struct ib_sa_query **sa_query) { struct ib_sa_mcmember_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - query->callback = callback; query->context = context; - - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; - query->sa_query.release = ib_sa_mcmember_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = method; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); - mad->sa_hdr.comp_mask = comp_mask; - - ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + ret = ib_sa_send_mad(device, port_num, method, rec, + IB_SA_ATTR_MC_MEMBER_REC, comp_mask, timeout_ms, + retries, gfp_mask, ib_sa_mcmember_rec_callback, + query, &query->sa_query); if (ret < 0) - goto err2; + kfree(query); return ret; - -err2: - *sa_query = NULL; - ib_free_send_mad(query->sa_query.mad_buf); - -err1: - kfree(query); - return ret; } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); @@ -931,13 +996,13 @@ static void send_handler(struct ib_mad_a /* No callback -- already got recv */ break; case IB_WC_RESP_TIMEOUT_ERR: - query->callback(query, -ETIMEDOUT, NULL); + query->callback(-ETIMEDOUT, NULL, query->context); break; case IB_WC_WR_FLUSH_ERR: - query->callback(query, -EINTR, NULL); + query->callback(-EINTR, NULL, query->context); break; default: - query->callback(query, -EIO, NULL); + query->callback(-EIO, NULL, query->context); break; } @@ -947,7 +1012,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(mad_send_wc->send_buf); kref_put(&query->sm_ah->ref, free_sm_ah); - query->release(query); + kfree(query); } static void recv_handler(struct ib_mad_agent *mad_agent, @@ -959,17 +1024,11 @@ static void recv_handler(struct ib_mad_a mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; query = mad_buf->context[0]; - if (query->callback) { - if (mad_recv_wc->wc->status == IB_WC_SUCCESS) - query->callback(query, - mad_recv_wc->recv_buf.mad->mad_hdr.status ? - -EINVAL : 0, - (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); - else - query->callback(query, -EIO, NULL); - } - - ib_free_recv_mad(mad_recv_wc); + if (query->callback) + query->callback(mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, mad_recv_wc, query->context); + else + ib_free_recv_mad(mad_recv_wc); } static void ib_sa_add_one(struct ib_device *device) From sean.hefty at intel.com Thu Aug 3 10:39:43 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 10:39:43 -0700 Subject: [openib-general] [PATCH v2 2/2] local_sa: use SA iterator routines to walk attributes in RMPP response In-Reply-To: <000001c6b723$73cd2380$e598070a@amr.corp.intel.com> Message-ID: <000101c6b723$ce375930$e598070a@amr.corp.intel.com> Convert local SA to use the new SA iterator routines for walking a list of attributes in an RMPP response returned by the SA. This replaces a local SA specific implementation. Signed-off-by: Sean Hefty --- Index: core/local_sa.c =================================================================== --- core/local_sa.c (revision 8647) +++ core/local_sa.c (working copy) @@ -107,16 +107,6 @@ struct sa_db_device { struct sa_db_port port[0]; }; -/* Define path record format to enable needed checks against MAD data. 
*/ -struct ib_path_rec { - u8 reserved[8]; - u8 dgid[16]; - u8 sgid[16]; - __be16 dlid; - __be16 slid; - u8 reserved2[20]; -}; - struct ib_sa_cursor { struct ib_sa_cursor *next; }; @@ -194,60 +184,27 @@ static int insert_attr(struct index_root static void update_path_rec(struct sa_db_port *port, struct ib_mad_recv_wc *mad_recv_wc) { - struct ib_mad_recv_buf *recv_buf; - struct ib_sa_mad *mad = (void *) mad_recv_wc->recv_buf.mad; + struct ib_sa_iter *iter; struct ib_path_rec_info *path_info; - struct ib_path_rec ib_path, *path = NULL; - int i, attr_size, left, offset = 0; + void *attr; - attr_size = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; - if (attr_size < sizeof ib_path) + iter = ib_sa_iter_create(mad_recv_wc); + if (IS_ERR(iter)) return; down_write(&lock); port->update++; - list_for_each_entry(recv_buf, &mad_recv_wc->rmpp_list, list) { - for (i = 0; i < IB_MGMT_SA_DATA;) { - mad = (struct ib_sa_mad *) recv_buf->mad; - - left = IB_MGMT_SA_DATA - i; - if (left < sizeof ib_path) { - /* copy first piece of the attribute */ - memcpy(&ib_path, &mad->data[i], left); - path = &ib_path; - offset = left; - break; - } else if (offset) { - /* copy the second piece of the attribute */ - memcpy((void*) path + offset, &mad->data[i], - sizeof ib_path - offset); - i += attr_size - offset; - offset = 0; - } else { - path = (void *) &mad->data[i]; - i += attr_size; - } - - if (!path->slid) - goto unlock; - - path_info = kmalloc(sizeof *path_info, GFP_KERNEL); - if (!path_info) - goto unlock; - - ib_sa_unpack_attr(&path_info->rec, path, - IB_SA_ATTR_PATH_REC); - - if (insert_attr(&port->index, port->update, - path_info->rec.dgid.raw, - &path_info->cursor)) { - kfree(path_info); - goto unlock; - } - } + while ((attr = ib_sa_iter_next(iter)) && + (path_info = kmalloc(sizeof *path_info, GFP_KERNEL))) { + + ib_sa_unpack_attr(&path_info->rec, attr, IB_SA_ATTR_PATH_REC); + if (insert_attr(&port->index, port->update, + path_info->rec.dgid.raw, + &path_info->cursor)) + break; } -unlock: up_write(&lock); + ib_sa_iter_free(iter); } static void recv_handler(struct ib_mad_agent *mad_agent, From rdreier at cisco.com Thu Aug 3 10:48:47 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 10:48:47 -0700 Subject: [openib-general] [PATCH] IB/ipoib: remove broken link from Kconfig and documentation In-Reply-To: (Or Gerlitz's message of "Thu, 13 Jul 2006 11:00:39 +0300 (IDT)") References: Message-ID: thanks, applied From rdreier at cisco.com Thu Aug 3 10:52:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 10:52:10 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060802183832.GB20435@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 21:38:32 +0300") References: <20060802183832.GB20435@mellanox.co.il> Message-ID: > + if (atomic_dec_and_test(&client->users)) > + wake_up(&client->wait); > + wait_event(client->wait, atomic_read(&client->users) == 0); I think this is vulnerable to the race we fixed up all over the place a few months ago, where wait_event runs between the atomic_dec_and_test() and the wake_up(). - R. From rdreier at cisco.com Thu Aug 3 10:57:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 10:57:26 -0700 Subject: [openib-general] hotplug support in mthca In-Reply-To: <20060802131331.GB15769@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Aug 2006 16:13:31 +0300") References: <20060802050750.GE9411@mellanox.co.il> <20060802131331.GB15769@mellanox.co.il> Message-ID: OK, I applied this for now. 
Let's try to revisit the revoke method later though. From rdreier at cisco.com Thu Aug 3 11:07:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 11:07:58 -0700 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus to get a few fixes: Ishai Rabinovitz: IB/srp: Fix crash in srp_reconnect_target IB/srp: Work around data corruption bug on Mellanox targets Jack Morgenstein: IB/uverbs: Avoid a crash on device hot remove Michael S. Tsirkin: IB/mthca: Fix mthca_array_clear() thinko Or Gerlitz: IB/ipoib: Remove broken link from Kconfig and documentation Roland Dreier: IB/mthca: Clean up mthca array index mask Sean Hefty: IB/cm: Fix error handling in ib_send_cm_req Documentation/infiniband/ipoib.txt | 2 -- drivers/infiniband/core/cm.c | 4 +++- drivers/infiniband/core/uverbs.h | 2 ++ drivers/infiniband/core/uverbs_main.c | 8 +++++++- drivers/infiniband/hw/mthca/mthca_allocator.c | 15 ++++++++------- drivers/infiniband/ulp/ipoib/Kconfig | 3 +-- drivers/infiniband/ulp/srp/ib_srp.c | 19 +++++++++++++++++-- 7 files changed, 38 insertions(+), 15 deletions(-) diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt index 1870355..864ff32 100644 --- a/Documentation/infiniband/ipoib.txt +++ b/Documentation/infiniband/ipoib.txt @@ -51,8 +51,6 @@ Debugging Information References - IETF IP over InfiniBand (ipoib) Working Group - http://ietf.org/html.charters/ipoib-charter.html Transmission of IP over InfiniBand (IPoIB) (RFC 4391) http://ietf.org/rfc/rfc4391.txt IP over InfiniBand (IPoIB) Architecture (RFC 4392) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index f85c97f..0de335b 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -975,8 +975,10 @@ int ib_send_cm_req(struct ib_cm_id *cm_i cm_id_priv->timewait_info = cm_create_timewait_info(cm_id_priv-> id.local_id); - if (IS_ERR(cm_id_priv->timewait_info)) + if (IS_ERR(cm_id_priv->timewait_info)) { + ret = PTR_ERR(cm_id_priv->timewait_info); goto out; + } ret = cm_init_av_by_path(param->primary_path, &cm_id_priv->av); if (ret) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index bb9bee5..102a59c 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -42,6 +42,7 @@ #define UVERBS_H #include #include #include +#include #include #include @@ -69,6 +70,7 @@ #include struct ib_uverbs_device { struct kref ref; + struct completion comp; int devnum; struct cdev *dev; struct class_device *class_dev; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index e725ccc..4e16314 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -122,7 +122,7 @@ static void ib_uverbs_release_dev(struct struct ib_uverbs_device *dev = container_of(ref, struct ib_uverbs_device, ref); - kfree(dev); + complete(&dev->comp); } void ib_uverbs_release_ucq(struct ib_uverbs_file *file, @@ -740,6 +740,7 @@ static void ib_uverbs_add_one(struct ib_ return; kref_init(&uverbs_dev->ref); + init_completion(&uverbs_dev->comp); spin_lock(&map_lock); uverbs_dev->devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES); @@ -793,6 +794,8 @@ err_cdev: err: kref_put(&uverbs_dev->ref, ib_uverbs_release_dev); + 
wait_for_completion(&uverbs_dev->comp); + kfree(uverbs_dev); return; } @@ -812,7 +815,10 @@ static void ib_uverbs_remove_one(struct spin_unlock(&map_lock); clear_bit(uverbs_dev->devnum, dev_map); + kref_put(&uverbs_dev->ref, ib_uverbs_release_dev); + wait_for_completion(&uverbs_dev->comp); + kfree(uverbs_dev); } static int uverbs_event_get_sb(struct file_system_type *fs_type, int flags, diff --git a/drivers/infiniband/hw/mthca/mthca_allocator.c b/drivers/infiniband/hw/mthca/mthca_allocator.c index 9ba3211..25157f5 100644 --- a/drivers/infiniband/hw/mthca/mthca_allocator.c +++ b/drivers/infiniband/hw/mthca/mthca_allocator.c @@ -108,14 +108,15 @@ void mthca_alloc_cleanup(struct mthca_al * serialize access to the array. */ +#define MTHCA_ARRAY_MASK (PAGE_SIZE / sizeof (void *) - 1) + void *mthca_array_get(struct mthca_array *array, int index) { int p = (index * sizeof (void *)) >> PAGE_SHIFT; - if (array->page_list[p].page) { - int i = index & (PAGE_SIZE / sizeof (void *) - 1); - return array->page_list[p].page[i]; - } else + if (array->page_list[p].page) + return array->page_list[p].page[index & MTHCA_ARRAY_MASK]; + else return NULL; } @@ -130,8 +131,7 @@ int mthca_array_set(struct mthca_array * if (!array->page_list[p].page) return -ENOMEM; - array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = - value; + array->page_list[p].page[index & MTHCA_ARRAY_MASK] = value; ++array->page_list[p].used; return 0; @@ -144,7 +144,8 @@ void mthca_array_clear(struct mthca_arra if (--array->page_list[p].used == 0) { free_page((unsigned long) array->page_list[p].page); array->page_list[p].page = NULL; - } + } else + array->page_list[p].page[index & MTHCA_ARRAY_MASK] = NULL; if (array->page_list[p].used < 0) pr_debug("Array %p index %d page %d with ref count %d < 0\n", diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig index 13d6d01..d74653d 100644 --- a/drivers/infiniband/ulp/ipoib/Kconfig +++ b/drivers/infiniband/ulp/ipoib/Kconfig @@ -6,8 +6,7 @@ config INFINIBAND_IPOIB transports IP packets over InfiniBand so you can use your IB device as a fancy NIC. - The IPoIB protocol is defined by the IETF ipoib working - group: . 
+ See Documentation/infiniband/ipoib.txt for more information config INFINIBAND_IPOIB_DEBUG bool "IP-over-InfiniBand debugging" if EMBEDDED diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..8257d5a 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -77,6 +77,14 @@ MODULE_PARM_DESC(topspin_workarounds, static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; +static int mellanox_workarounds = 1; + +module_param(mellanox_workarounds, int, 0444); +MODULE_PARM_DESC(mellanox_workarounds, + "Enable workarounds for Mellanox SRP target bugs if != 0"); + +static const u8 mellanox_oui[3] = { 0x00, 0x02, 0xc9 }; + static void srp_add_one(struct ib_device *device); static void srp_remove_one(struct ib_device *device); static void srp_completion(struct ib_cq *cq, void *target_ptr); @@ -526,8 +534,10 @@ static int srp_reconnect_target(struct s while (ib_poll_cq(target->cq, 1, &wc) > 0) ; /* nothing */ + spin_lock_irq(target->scsi_host->host_lock); list_for_each_entry_safe(req, tmp, &target->req_queue, list) srp_reset_req(target, req); + spin_unlock_irq(target->scsi_host->host_lock); target->rx_head = 0; target->tx_head = 0; @@ -567,7 +577,7 @@ err: return ret; } -static int srp_map_fmr(struct srp_device *dev, struct scatterlist *scat, +static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat, int sg_cnt, struct srp_request *req, struct srp_direct_buf *buf) { @@ -577,10 +587,15 @@ static int srp_map_fmr(struct srp_device int page_cnt; int i, j; int ret; + struct srp_device *dev = target->srp_host->dev; if (!dev->fmr_pool) return -ENODEV; + if ((sg_dma_address(&scat[0]) & ~dev->fmr_page_mask) && + mellanox_workarounds && !memcmp(&target->ioc_guid, mellanox_oui, 3)) + return -EINVAL; + len = page_cnt = 0; for (i = 0; i < sg_cnt; ++i) { if (sg_dma_address(&scat[i]) & ~dev->fmr_page_mask) { @@ -683,7 +698,7 @@ static int srp_map_data(struct scsi_cmnd buf->va = cpu_to_be64(sg_dma_address(scat)); buf->key = cpu_to_be32(target->srp_host->dev->mr->rkey); buf->len = cpu_to_be32(sg_dma_len(scat)); - } else if (srp_map_fmr(target->srp_host->dev, scat, count, req, + } else if (srp_map_fmr(target, scat, count, req, (void *) cmd->add_data)) { /* * FMR mapping failed, and the scatterlist has more From mshefty at ichips.intel.com Thu Aug 3 11:11:11 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 03 Aug 2006 11:11:11 -0700 Subject: [openib-general] [PATCH v4 2/2] iWARP Core Changes. In-Reply-To: <20060802202752.24212.73349.stgit@dell3.ogc.int> References: <20060802202747.24212.10931.stgit@dell3.ogc.int> <20060802202752.24212.73349.stgit@dell3.ogc.int> Message-ID: <44D23C3F.3000204@ichips.intel.com> Steve Wise wrote: > diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c > index d294bbc..83f84ef 100644 > --- a/drivers/infiniband/core/addr.c > +++ b/drivers/infiniband/core/addr.c > @@ -32,6 +32,7 @@ #include > #include > #include > #include > +#include File is included 3 lines up. > diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c > index e05ca2c..061858c 100644 > #include > #include > #include > -#include /* INIT_WORK, schedule_work(), flush_scheduled_work() */ I'm guessing that the include isn't currently needed, since none of the other changes to the file should have removed its dependency. Should this be put into a separate patch? 
> +static int iw_conn_req_handler(struct iw_cm_id *cm_id, > + struct iw_cm_event *iw_event) > +{ > + struct rdma_cm_id *new_cm_id; > + struct rdma_id_private *listen_id, *conn_id; > + struct sockaddr_in *sin; > + struct net_device *dev = NULL; > + int ret; > + > + listen_id = cm_id->context; > + atomic_inc(&listen_id->dev_remove); > + if (!cma_comp(listen_id, CMA_LISTEN)) { > + ret = -ECONNABORTED; > + goto out; > + } > + > + /* Create a new RDMA id for the new IW CM ID */ > + new_cm_id = rdma_create_id(listen_id->id.event_handler, > + listen_id->id.context, > + RDMA_PS_TCP); > + if (!new_cm_id) { > + ret = -ENOMEM; > + goto out; > + } > + conn_id = container_of(new_cm_id, struct rdma_id_private, id); > + atomic_inc(&conn_id->dev_remove); This is not released in error cases. See below. > + conn_id->state = CMA_CONNECT; > + > + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); > + if (!dev) { > + ret = -EADDRNOTAVAIL; cma_release_remove(conn_id); > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); > + if (ret) { cma_release_remove(conn_id); > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + > + ret = cma_acquire_dev(conn_id); > + if (ret) { cma_release_remove(conn_id); > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + > + conn_id->cm_id.iw = cm_id; > + cm_id->context = conn_id; > + cm_id->cm_handler = cma_iw_handler; > + > + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; > + *sin = iw_event->local_addr; > + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; > + *sin = iw_event->remote_addr; > + > + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, > + iw_event->private_data, > + iw_event->private_data_len); > + if (ret) { > + /* User wants to destroy the CM ID */ > + conn_id->cm_id.iw = NULL; > + cma_exch(conn_id, CMA_DESTROYING); > + cma_release_remove(conn_id); > + rdma_destroy_id(&conn_id->id); > + } > + > +out: > + if (!dev) > + dev_put(dev); Shouldn't this be: if (dev)? > + cma_release_remove(listen_id); > + return ret; > +} > @@ -1357,8 +1552,8 @@ static int cma_resolve_loopback(struct r > ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, &gid); > > if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { > - src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; > - dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; > + src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; trivial spacing change only > +static inline void iw_addr_get_sgid(struct rdma_dev_addr* rda, > + union ib_gid *gid) > +{ > + memcpy(gid, rda->src_dev_addr, sizeof *gid); > +} > + > +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) > +{ > + return (union ib_gid *) rda->dst_dev_addr; > +} Minor personal nit: for consistency with the rest of the file, can you use dev_addr in place of rda? - Sean From swise at opengridcomputing.com Thu Aug 3 11:25:21 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 13:25:21 -0500 Subject: [openib-general] [PATCH v4 0/2][RFC] iWARP Core Support In-Reply-To: References: <20060802202747.24212.10931.stgit@dell3.ogc.int> Message-ID: <1154629521.29187.27.camel@stevo-desktop> On Thu, 2006-08-03 at 10:40 -0700, Roland Dreier wrote: > Steve> Here is the iWARP Core Support patchset merged to your > Steve> latest for-2.6.19 branch. 
It has gone through 3 reviews on > Steve> lklm and netdev a while ago, and I think its ready to be > Steve> pulled in. > > I agree. I'll read this over and queue it for 2.6.19 unless someone > objects. > > What's the status of the amasso driver? do you think that's ready to > merge too? > Yea, I'll re-merge it and post it soon. From swise at opengridcomputing.com Thu Aug 3 11:43:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 13:43:03 -0500 Subject: [openib-general] [PATCH v4 2/2] iWARP Core Changes. In-Reply-To: <44D23C3F.3000204@ichips.intel.com> References: <20060802202747.24212.10931.stgit@dell3.ogc.int> <20060802202752.24212.73349.stgit@dell3.ogc.int> <44D23C3F.3000204@ichips.intel.com> Message-ID: <1154630583.29187.42.camel@stevo-desktop> Roland/Sean, I'll fix all these and retest, then resubmit... Comments below... Steve. On Thu, 2006-08-03 at 11:11 -0700, Sean Hefty wrote: > Steve Wise wrote: > > diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c > > index d294bbc..83f84ef 100644 > > --- a/drivers/infiniband/core/addr.c > > +++ b/drivers/infiniband/core/addr.c > > @@ -32,6 +32,7 @@ #include > > #include > > #include > > #include > > +#include > > File is included 3 lines up. > Oops. I'll fix this. > > diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c > > index e05ca2c..061858c 100644 > > #include > > #include > > #include > > -#include /* INIT_WORK, schedule_work(), flush_scheduled_work() */ > > I'm guessing that the include isn't currently needed, since none of the other > changes to the file should have removed its dependency. Should this be put into > a separate patch? > And this. > > +static int iw_conn_req_handler(struct iw_cm_id *cm_id, > > + struct iw_cm_event *iw_event) > > +{ > > + struct rdma_cm_id *new_cm_id; > > + struct rdma_id_private *listen_id, *conn_id; > > + struct sockaddr_in *sin; > > + struct net_device *dev = NULL; > > + int ret; > > + > > + listen_id = cm_id->context; > > + atomic_inc(&listen_id->dev_remove); > > + if (!cma_comp(listen_id, CMA_LISTEN)) { > > + ret = -ECONNABORTED; > > + goto out; > > + } > > + > > + /* Create a new RDMA id for the new IW CM ID */ > > + new_cm_id = rdma_create_id(listen_id->id.event_handler, > > + listen_id->id.context, > > + RDMA_PS_TCP); > > + if (!new_cm_id) { > > + ret = -ENOMEM; > > + goto out; > > + } > > + conn_id = container_of(new_cm_id, struct rdma_id_private, id); > > + atomic_inc(&conn_id->dev_remove); > > This is not released in error cases. See below. > > > + conn_id->state = CMA_CONNECT; > > + > > + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); > > + if (!dev) { > > + ret = -EADDRNOTAVAIL; > > cma_release_remove(conn_id); > > > + rdma_destroy_id(new_cm_id); > > + goto out; > > + } > > + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); > > + if (ret) { > > cma_release_remove(conn_id); > > > + rdma_destroy_id(new_cm_id); > > + goto out; > > + } > > + > > + ret = cma_acquire_dev(conn_id); > > + if (ret) { > > cma_release_remove(conn_id); > I'll fix these too. 
> > + rdma_destroy_id(new_cm_id); > > + goto out; > > + } > > + > > + conn_id->cm_id.iw = cm_id; > > + cm_id->context = conn_id; > > + cm_id->cm_handler = cma_iw_handler; > > + > > + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; > > + *sin = iw_event->local_addr; > > + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; > > + *sin = iw_event->remote_addr; > > + > > + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, > > + iw_event->private_data, > > + iw_event->private_data_len); > > + if (ret) { > > + /* User wants to destroy the CM ID */ > > + conn_id->cm_id.iw = NULL; > > + cma_exch(conn_id, CMA_DESTROYING); > > + cma_release_remove(conn_id); > > + rdma_destroy_id(&conn_id->id); > > + } > > + > > +out: > > + if (!dev) > > + dev_put(dev); > > Shouldn't this be: if (dev)? > yup. This was added (incorrecty by me) during the review process... > > + cma_release_remove(listen_id); > > + return ret; > > +} > > @@ -1357,8 +1552,8 @@ static int cma_resolve_loopback(struct r > > ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, &gid); > > > > if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { > > - src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; > > - dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; > > + src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > > + dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; > > trivial spacing change only > right. > > +static inline void iw_addr_get_sgid(struct rdma_dev_addr* rda, > > + union ib_gid *gid) > > +{ > > + memcpy(gid, rda->src_dev_addr, sizeof *gid); > > +} > > + > > +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) > > +{ > > + return (union ib_gid *) rda->dst_dev_addr; > > +} > > Minor personal nit: for consistency with the rest of the file, can you use > dev_addr in place of rda? > okay. From mshefty at ichips.intel.com Thu Aug 3 11:57:50 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 03 Aug 2006 11:57:50 -0700 Subject: [openib-general] [PATCH v2 1/2] sa_query: add generic query interfaces capable of supporting RMPP In-Reply-To: <000001c6b723$73cd2380$e598070a@amr.corp.intel.com> References: <000001c6b723$73cd2380$e598070a@amr.corp.intel.com> Message-ID: <44D2472E.5030204@ichips.intel.com> Sean Hefty wrote: > +int ib_sa_send_mad(struct ib_device *device, u8 port_num, > + int method, void *attr, int attr_id, > + ib_sa_comp_mask comp_mask, > + int timeout_ms, int retries, gfp_t gfp_mask, > + void (*callback)(int status, > + struct ib_mad_recv_wc *mad_recv_wc, > + void *context), > + void *context, struct ib_sa_query **query); Now that I've just updated and posted this... I think it makes sense for this call to take a packed attribute as input, versus an unpacked one. Attr_id can then be passed in as be16, which lets this interface, the callback, and it_sa_iter calls all deal with data in wire format. My longer term intent is to create a libibsa to support easier userspace SA interactions. It may be easier to handle packing / unpacking of attributes from userspace. Comments? - Sean From mst at mellanox.co.il Thu Aug 3 12:12:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 Aug 2006 22:12:07 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060803191207.GA26623@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > > + if (atomic_dec_and_test(&client->users)) > > + wake_up(&client->wait); > > > + wait_event(client->wait, atomic_read(&client->users) == 0); > > I think this is vulnerable to the race we fixed up all over the place > a few months ago, where wait_event runs between the > atomic_dec_and_test() and the wake_up(). No, I think there's no problem - user callback has finished running and that's all we care about. But I'd like to point out that using rwsem gets all these corners right automatically. Reconsider that approach? -- MST From mst at mellanox.co.il Thu Aug 3 12:16:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 Aug 2006 22:16:06 +0300 Subject: [openib-general] [PATCH report] IB/ipoib fix flush/start xmit race (from code review) Message-ID: <20060803191606.GB26623@mellanox.co.il> Seems to work fine here. Pls consider for 2.6.18. --- Prevent flush task from freeing the ipoib_neigh pointer, while ipoib_start_xmit is accessing the ipoib_neigh through the pointer is has loaded from the hardware address. Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cf71d2a..31c4b05 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -336,7 +336,8 @@ void ipoib_flush_paths(struct net_device struct ipoib_path *path, *tp; LIST_HEAD(remove_list); - spin_lock_irq(&priv->lock); + spin_lock_irq(&priv->tx_lock); + spin_lock(&priv->lock); list_splice(&priv->path_list, &remove_list); INIT_LIST_HEAD(&priv->path_list); @@ -352,7 +353,8 @@ void ipoib_flush_paths(struct net_device path_free(dev, path); spin_lock_irq(&priv->lock); } - spin_unlock_irq(&priv->lock); + spin_unlock(&priv->lock); + spin_unlock_irq(&priv->tx_lock); } static void path_rec_completion(int status, -- MST From rdreier at cisco.com Thu Aug 3 12:28:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 12:28:04 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060803191207.GA26623@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 Aug 2006 22:12:07 +0300") References: <20060803191207.GA26623@mellanox.co.il> Message-ID: Michael> No, I think there's no problem - user callback has Michael> finished running and that's all we care about. I don't think so -- because then the unregister call can return and the client module free the client struct before the wake_up() runs, which leads to use-after-free. Michael> But I'd like to point out that using rwsem gets all these Michael> corners right automatically. Reconsider that approach? Again, I don't think so -- rwsem just hides the race from you. There is a race with semaphores (access semaphore after waking up waiter) which is the original reason for creating struct completion. - R. From mst at mellanox.co.il Thu Aug 3 12:34:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 Aug 2006 22:34:00 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060803193400.GC26623@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > Michael> No, I think there's no problem - user callback has > Michael> finished running and that's all we care about. > > I don't think so -- because then the unregister call can return and > the client module free the client struct before the wake_up() runs, > which leads to use-after-free. 
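For reference, the struct completion shape that closes this hole is the same one the uverbs hot-remove fix in the GIT PULL above uses. A rough sketch with made-up names (not the actual ipoib/core patch):

	/* needs linux/kref.h and linux/completion.h */
	struct my_client {
		struct kref		ref;
		struct completion	comp;
		/* ... */
	};

	static void my_client_release(struct kref *ref)
	{
		struct my_client *c = container_of(ref, struct my_client, ref);

		complete(&c->comp);	/* signal only -- no kfree() here */
	}

	static void my_client_unregister(struct my_client *c)
	{
		kref_put(&c->ref, my_client_release);
		wait_for_completion(&c->comp);
		kfree(c);	/* safe: complete() holds no reference to *c
				   by the time the waiter is released */
	}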
> > Michael> But I'd like to point out that using rwsem gets all these > Michael> corners right automatically. Reconsider that approach? > > Again, I don't think so -- rwsem just hides the race from you. There > is a race with semaphores (access semaphore after waking up waiter) > which is the original reason for creating struct completion. Hmm, you are right. So I need to replace the wait queue with completion. Thanks. -- MST From swise at opengridcomputing.com Thu Aug 3 14:02:38 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:02:38 -0500 Subject: [openib-general] [PATCH v5 0/2] iWARP Core Support Message-ID: <20060803210238.16228.47335.stgit@dell3.ogc.int> With Seans comments resolved... I also changed iw_addr_get_dgid() to memcpy the gid like the other functions do. ---- This patchset defines the modifications to the Linux infiniband subsystem to support iWARP devices. The patchset consists of 2 patches: 1 - New iWARP CM implementation. 2 - Core changes to support iWARP. Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Thu Aug 3 14:02:42 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:02:42 -0500 Subject: [openib-general] [PATCH v5 2/2] iWARP Core Changes. In-Reply-To: <20060803210238.16228.47335.stgit@dell3.ogc.int> References: <20060803210238.16228.47335.stgit@dell3.ogc.int> Message-ID: <20060803210242.16228.39306.stgit@dell3.ogc.int> Modifications to the existing rdma header files, core files, drivers, and ulp files to support iWARP. --- drivers/infiniband/core/Makefile | 4 drivers/infiniband/core/addr.c | 18 + drivers/infiniband/core/cache.c | 7 - drivers/infiniband/core/cm.c | 3 drivers/infiniband/core/cma.c | 355 +++++++++++++++++++++++--- drivers/infiniband/core/device.c | 6 drivers/infiniband/core/mad.c | 11 + drivers/infiniband/core/sa_query.c | 5 drivers/infiniband/core/smi.c | 18 + drivers/infiniband/core/sysfs.c | 18 + drivers/infiniband/core/ucm.c | 5 drivers/infiniband/core/user_mad.c | 9 - drivers/infiniband/hw/ipath/ipath_verbs.c | 2 drivers/infiniband/hw/mthca/mthca_provider.c | 2 drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 + drivers/infiniband/ulp/srp/ib_srp.c | 2 include/rdma/ib_addr.h | 17 + include/rdma/ib_verbs.h | 39 ++- 18 files changed, 439 insertions(+), 90 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 68e73ec..163d991 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -1,7 +1,7 @@ infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o $(infiniband-y) + ib_cm.o iw_cm.o $(infiniband-y) obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_sa-y := sa_query.o ib_cm-y := cm.o +iw_cm-y := iwcm.o + rdma_cm-y := cma.o ib_addr-y := addr.o diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index d294bbc..399151a 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -60,12 +60,15 @@ static LIST_HEAD(req_list); static DECLARE_WORK(work, process_req, NULL); static struct workqueue_struct *addr_wq; -static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, - unsigned char *dst_dev_addr) +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + const unsigned char *dst_dev_addr) { switch (dev->type) { case ARPHRD_INFINIBAND: - 
dev_addr->dev_type = IB_NODE_CA; + dev_addr->dev_type = RDMA_NODE_IB_CA; + break; + case ARPHRD_ETHER: + dev_addr->dev_type = RDMA_NODE_RNIC; break; default: return -EADDRNOTAVAIL; @@ -77,6 +80,7 @@ static int copy_addr(struct rdma_dev_add memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); return 0; } +EXPORT_SYMBOL(rdma_copy_addr); int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) { @@ -88,7 +92,7 @@ int rdma_translate_ip(struct sockaddr *a if (!dev) return -EADDRNOTAVAIL; - ret = copy_addr(dev_addr, dev, NULL); + ret = rdma_copy_addr(dev_addr, dev, NULL); dev_put(dev); return ret; } @@ -160,7 +164,7 @@ static int addr_resolve_remote(struct so /* If the device does ARP internally, return 'done' */ if (rt->idev->dev->flags & IFF_NOARP) { - copy_addr(addr, rt->idev->dev, NULL); + rdma_copy_addr(addr, rt->idev->dev, NULL); goto put; } @@ -180,7 +184,7 @@ static int addr_resolve_remote(struct so src_in->sin_addr.s_addr = rt->rt_src; } - ret = copy_addr(addr, neigh->dev, neigh->ha); + ret = rdma_copy_addr(addr, neigh->dev, neigh->ha); release: neigh_release(neigh); put: @@ -244,7 +248,7 @@ static int addr_resolve_local(struct soc if (ZERONET(src_ip)) { src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_ip; - ret = copy_addr(addr, dev, dev->dev_addr); + ret = rdma_copy_addr(addr, dev, dev->dev_addr); } else if (LOOPBACK(src_ip)) { ret = rdma_translate_ip((struct sockaddr *)dst_in, addr); if (!ret) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index e05ca2c..425a218 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -32,7 +32,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: cache.c 6885 2006-05-03 18:22:02Z sean.hefty $ */ #include @@ -62,12 +62,13 @@ struct ib_update_work { static inline int start_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : 1; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; } static inline int end_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 
+ 0 : device->phys_port_cnt; } int ib_get_cached_gid(struct ib_device *device, diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index f85c97f..21312fe 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3260,6 +3260,9 @@ static void cm_add_one(struct ib_device int ret; u8 i; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d6f99d5..27e18b0 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -35,6 +35,7 @@ #include #include #include #include +#include #include @@ -43,6 +44,7 @@ #include #include #include #include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -124,6 +126,7 @@ struct rdma_id_private { int query_id; union { struct ib_cm_id *ib; + struct iw_cm_id *iw; } cm_id; u32 seq_num; @@ -259,14 +262,23 @@ static void cma_detach_from_dev(struct r id_priv->cma_dev = NULL; } -static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) +static int cma_acquire_dev(struct rdma_id_private *id_priv) { + enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; struct cma_device *cma_dev; union ib_gid gid; int ret = -ENODEV; - ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid), - + switch (rdma_node_get_transport(dev_type)) { + case RDMA_TRANSPORT_IB: + ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); + break; + case RDMA_TRANSPORT_IWARP: + iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); + break; + default: + return -ENODEV; + } mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { ret = ib_find_cached_gid(cma_dev->device, &gid, @@ -280,16 +292,6 @@ static int cma_acquire_ib_dev(struct rdm return ret; } -static int cma_acquire_dev(struct rdma_id_private *id_priv) -{ - switch (id_priv->id.route.addr.dev_addr.dev_type) { - case IB_NODE_CA: - return cma_acquire_ib_dev(id_priv); - default: - return -ENODEV; - } -} - static void cma_deref_id(struct rdma_id_private *id_priv) { if (atomic_dec_and_test(&id_priv->refcount)) @@ -347,6 +349,16 @@ static int cma_init_ib_qp(struct rdma_id IB_QP_PKEY_INDEX | IB_QP_PORT); } +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); +} + int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, struct ib_qp_init_attr *qp_init_attr) { @@ -362,10 +374,13 @@ int rdma_create_qp(struct rdma_cm_id *id if (IS_ERR(qp)) return PTR_ERR(qp); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_init_ib_qp(id_priv, qp); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_init_iw_qp(id_priv, qp); + break; default: ret = -ENOSYS; break; @@ -451,13 +466,17 @@ int rdma_init_qp_attr(struct rdma_cm_id int ret; id_priv = container_of(id, struct rdma_id_private, id); - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) qp_attr->rq_psn = id_priv->seq_num; break; + case RDMA_TRANSPORT_IWARP: + 
ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr, + qp_attr_mask); + break; default: ret = -ENOSYS; break; @@ -590,8 +609,8 @@ static int cma_notify_user(struct rdma_i static void cma_cancel_route(struct rdma_id_private *id_priv) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->query) ib_sa_cancel_query(id_priv->query_id, id_priv->query); break; @@ -611,11 +630,15 @@ static void cma_destroy_listen(struct rd cma_exch(id_priv, CMA_DESTROYING); if (id_priv->cma_dev) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; + case RDMA_TRANSPORT_IWARP: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) + iw_destroy_cm_id(id_priv->cm_id.iw); + break; default: break; } @@ -690,11 +713,15 @@ void rdma_destroy_id(struct rdma_cm_id * cma_cancel_operation(id_priv, state); if (id_priv->cma_dev) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; + case RDMA_TRANSPORT_IWARP: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) + iw_destroy_cm_id(id_priv->cm_id.iw); + break; default: break; } @@ -869,7 +896,7 @@ static struct rdma_id_private *cma_new_i ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = IB_NODE_CA; + rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; @@ -898,7 +925,7 @@ static int cma_req_handler(struct ib_cm_ } atomic_inc(&conn_id->dev_remove); - ret = cma_acquire_ib_dev(conn_id); + ret = cma_acquire_dev(conn_id); if (ret) { ret = -ENODEV; cma_release_remove(conn_id); @@ -982,6 +1009,128 @@ static void cma_set_compare_data(enum rd } } +static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) +{ + struct rdma_id_private *id_priv = iw_id->context; + enum rdma_cm_event_type event = 0; + struct sockaddr_in *sin; + int ret = 0; + + atomic_inc(&id_priv->dev_remove); + + switch (iw_event->event) { + case IW_CM_EVENT_CLOSE: + event = RDMA_CM_EVENT_DISCONNECTED; + break; + case IW_CM_EVENT_CONNECT_REPLY: + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; + *sin = iw_event->remote_addr; + if (iw_event->status) + event = RDMA_CM_EVENT_REJECTED; + else + event = RDMA_CM_EVENT_ESTABLISHED; + break; + case IW_CM_EVENT_ESTABLISHED: + event = RDMA_CM_EVENT_ESTABLISHED; + break; + default: + BUG_ON(1); + } + + ret = cma_notify_user(id_priv, event, iw_event->status, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + id_priv->cm_id.iw = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } + + cma_release_remove(id_priv); + return ret; +} + +static int iw_conn_req_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct rdma_cm_id *new_cm_id; + struct rdma_id_private *listen_id, *conn_id; + struct sockaddr_in *sin; + struct net_device *dev = NULL; + int ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + /* Create a new RDMA id for the new IW CM ID */ + new_cm_id = rdma_create_id(listen_id->id.event_handler, + listen_id->id.context, + RDMA_PS_TCP); + if (!new_cm_id) { + ret = -ENOMEM; + goto out; + } + conn_id = container_of(new_cm_id, struct rdma_id_private, id); + atomic_inc(&conn_id->dev_remove); + conn_id->state = CMA_CONNECT; + + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); + if (!dev) { + ret = -EADDRNOTAVAIL; + cma_release_remove(conn_id); + rdma_destroy_id(new_cm_id); + goto out; + } + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); + if (ret) { + cma_release_remove(conn_id); + rdma_destroy_id(new_cm_id); + goto out; + } + + ret = cma_acquire_dev(conn_id); + if (ret) { + cma_release_remove(conn_id); + rdma_destroy_id(new_cm_id); + goto out; + } + + conn_id->cm_id.iw = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_iw_handler; + + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; + *sin = iw_event->remote_addr; + + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* User wants to destroy the CM ID */ + conn_id->cm_id.iw = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } + +out: + if (dev) + dev_put(dev); + cma_release_remove(listen_id); + return ret; +} + static int cma_ib_listen(struct rdma_id_private *id_priv) { struct ib_cm_compare_data compare_data; @@ -1011,6 +1160,30 @@ static int cma_ib_listen(struct rdma_id_ return ret; } +static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) +{ + int ret; + struct sockaddr_in *sin; + + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, + iw_conn_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.iw)) + return PTR_ERR(id_priv->cm_id.iw); + + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + id_priv->cm_id.iw->local_addr = *sin; + + ret = iw_cm_listen(id_priv->cm_id.iw, backlog); + + if (ret) { + iw_destroy_cm_id(id_priv->cm_id.iw); + id_priv->cm_id.iw = NULL; + } + + return ret; +} + static int cma_listen_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) { @@ -1087,12 +1260,17 @@ int rdma_listen(struct rdma_cm_id *id, i id_priv->backlog = backlog; if (id->device) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_ib_listen(id_priv); if (ret) goto err; break; + case RDMA_TRANSPORT_IWARP: + ret = cma_iw_listen(id_priv, backlog); + if (ret) + goto err; + break; default: ret = -ENOSYS; goto err; @@ -1231,6 +1409,23 @@ err: } EXPORT_SYMBOL(rdma_set_ib_paths); +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct cma_work *work; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if 
(!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + queue_work(cma_wq, &work->work); + return 0; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1241,10 +1436,13 @@ int rdma_resolve_route(struct rdma_cm_id return -EINVAL; atomic_inc(&id_priv->refcount); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_resolve_iw_route(id_priv, timeout_ms); + break; default: ret = -ENOSYS; break; @@ -1649,6 +1847,47 @@ out: return ret; } +static int cma_connect_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_id *cm_id; + struct sockaddr_in* sin; + int ret; + struct iw_cm_conn_param iw_param; + + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + goto out; + } + + id_priv->cm_id.iw = cm_id; + + sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr; + cm_id->local_addr = *sin; + + sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; + cm_id->remote_addr = *sin; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) { + iw_destroy_cm_id(cm_id); + return ret; + } + + iw_param.ord = conn_param->initiator_depth; + iw_param.ird = conn_param->responder_resources; + iw_param.private_data = conn_param->private_data; + iw_param.private_data_len = conn_param->private_data_len; + if (id_priv->id.qp) + iw_param.qpn = id_priv->qp_num; + else + iw_param.qpn = conn_param->qp_num; + ret = iw_cm_connect(cm_id, &iw_param); +out: + return ret; +} + int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1664,10 +1903,13 @@ int rdma_connect(struct rdma_cm_id *id, id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_connect_ib(id_priv, conn_param); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_connect_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1708,6 +1950,28 @@ static int cma_accept_ib(struct rdma_id_ return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } +static int cma_accept_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_conn_param iw_param; + int ret; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + return ret; + + iw_param.ord = conn_param->initiator_depth; + iw_param.ird = conn_param->responder_resources; + iw_param.private_data = conn_param->private_data; + iw_param.private_data_len = conn_param->private_data_len; + if (id_priv->id.qp) { + iw_param.qpn = id_priv->qp_num; + } else + iw_param.qpn = conn_param->qp_num; + + return iw_cm_accept(id_priv->cm_id.iw, &iw_param); +} + int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1723,13 +1987,16 @@ int rdma_accept(struct rdma_cm_id *id, s id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else ret = cma_rep_recv(id_priv); break; + case 
RDMA_TRANSPORT_IWARP: + ret = cma_accept_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1756,12 +2023,16 @@ int rdma_reject(struct rdma_cm_id *id, c if (!cma_comp(id_priv, CMA_CONNECT)) return -EINVAL; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, private_data, private_data_len); break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_reject(id_priv->cm_id.iw, + private_data, private_data_len); + break; default: ret = -ENOSYS; break; @@ -1780,16 +2051,18 @@ int rdma_disconnect(struct rdma_cm_id *i !cma_comp(id_priv, CMA_DISCONNECT)) return -EINVAL; - ret = cma_modify_qp_err(id); - if (ret) - goto out; - - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: + ret = cma_modify_qp_err(id); + if (ret) + goto out; /* Initiate or respond to a disconnect. */ if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_disconnect(id_priv->cm_id.iw, 0); + break; default: break; } diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index b2f3cb9..7318fba 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: device.c 5943 2006-03-22 00:58:04Z roland $ */ #include @@ -505,7 +505,7 @@ int ib_query_port(struct ib_device *devi u8 port_num, struct ib_port_attr *port_attr) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) @@ -580,7 +580,7 @@ int ib_modify_port(struct ib_device *dev u8 port_num, int port_modify_mask, struct ib_port_modify *port_modify) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 1c3cfbb..b105e6a 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. * @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: mad.c 7294 2006-05-17 18:12:30Z roland $ */ #include #include @@ -2876,7 +2876,10 @@ static void ib_mad_init_device(struct ib { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { start = 0; end = 0; } else { @@ -2923,7 +2926,7 @@ static void ib_mad_remove_device(struct { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; cur_port = 0; } else { diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index aeda484..a7482c8 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -918,7 +918,10 @@ static void ib_sa_add_one(struct ib_devi struct ib_sa_device *sa_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c index 35852e7..b81b2b9 100644 --- a/drivers/infiniband/core/smi.c +++ b/drivers/infiniband/core/smi.c @@ -34,7 +34,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ + * $Id: smi.c 5258 2006-02-01 20:32:40Z sean.hefty $ */ #include @@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->return_path set when received */ @@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == hop_cnt) { /* smp->return_path set when received */ smp->hop_ptr++; - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->hop_ptr--; @@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == 1) { smp->hop_ptr--; /* C14-13:3 -- SMPs destined for SM shouldn't be here */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_slid == IB_LID_PERMISSIVE); } @@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->return_path[hop_ptr] = port_num; @@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp smp->return_path[hop_ptr] = port_num; /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->hop_ptr updated when sending */ @@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp return 1; } /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH); + return (node_type == RDMA_NODE_IB_SWITCH); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ diff --git 
a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 21f9282..cfd2c06 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: sysfs.c 6940 2006-05-04 17:04:55Z roland $ */ #include "core_priv.h" @@ -589,10 +589,16 @@ static ssize_t show_node_type(struct cla return -ENODEV; switch (dev->node_type) { - case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); - case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); - case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); - default: return sprintf(buf, "%d: \n", dev->node_type); + case RDMA_NODE_IB_CA: + return sprintf(buf, "%d: CA\n", dev->node_type); + case RDMA_NODE_RNIC: + return sprintf(buf, "%d: RNIC\n", dev->node_type); + case RDMA_NODE_IB_SWITCH: + return sprintf(buf, "%d: switch\n", dev->node_type); + case RDMA_NODE_IB_ROUTER: + return sprintf(buf, "%d: router\n", dev->node_type); + default: + return sprintf(buf, "%d: \n", dev->node_type); } } @@ -708,7 +714,7 @@ int ib_device_register_sysfs(struct ib_d if (ret) goto err_put; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { ret = add_port(device, 0); if (ret) goto err_put; diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index c1c6fda..936afc8 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ + * $Id: ucm.c 7119 2006-05-11 16:40:38Z sean.hefty $ */ #include @@ -1247,7 +1247,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 1273f88..d6c151b 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: user_mad.c 6041 2006-03-27 21:06:00Z halr $ */ #include @@ -1032,7 +1032,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index d70a9b6..5f41441 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1093,7 +1093,7 @@ static void *ipath_register_ib_device(in (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); - dev->node_type = IB_NODE_CA; + dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_device(dd); dev->class_dev.dev = dev->dma_device; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 230ae21..2103ee8 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1292,7 +1292,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); - dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cf71d2a..8a67b87 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1107,13 +1107,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { @@ -1137,6 +1140,9 @@ static void ipoib_remove_one(struct ib_d struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..df3120a 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1880,7 +1880,7 @@ static void srp_add_one(struct ib_device if (IS_ERR(srp_dev->fmr_pool)) srp_dev->fmr_pool = NULL; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index 0ff6739..46a2032 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -40,7 +40,7 @@ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; - enum ib_node_type dev_type; + enum rdma_node_type dev_type; }; /** @@ -72,6 +72,9 @@ int rdma_resolve_ip(struct sockaddr *src void rdma_addr_cancel(struct rdma_dev_addr *addr); +int 
rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + const unsigned char *dst_dev_addr); + static inline int ip_addr_size(struct sockaddr *addr) { return addr->sa_family == AF_INET6 ? @@ -113,4 +116,16 @@ static inline void ib_addr_set_dgid(stru memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); } +static inline void iw_addr_get_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(gid, dev_addr->src_dev_addr, sizeof *gid); +} + +static inline void iw_addr_get_dgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid); +} + #endif /* IB_ADDR_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index ee1f3a3..4b4c30a 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + * $Id: ib_verbs.h 6885 2006-05-03 18:22:02Z sean.hefty $ */ #if !defined(IB_VERBS_H) @@ -56,12 +56,35 @@ union ib_gid { } global; }; -enum ib_node_type { - IB_NODE_CA = 1, - IB_NODE_SWITCH, - IB_NODE_ROUTER +enum rdma_node_type { + /* IB values map to NodeInfo:NodeType. */ + RDMA_NODE_IB_CA = 1, + RDMA_NODE_IB_SWITCH, + RDMA_NODE_IB_ROUTER, + RDMA_NODE_RNIC }; +enum rdma_transport_type { + RDMA_TRANSPORT_IB, + RDMA_TRANSPORT_IWARP +}; + +static inline enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) +{ + switch (node_type) { + case RDMA_NODE_IB_CA: + case RDMA_NODE_IB_SWITCH: + case RDMA_NODE_IB_ROUTER: + return RDMA_TRANSPORT_IB; + case RDMA_NODE_RNIC: + return RDMA_TRANSPORT_IWARP; + default: + BUG(); + return 0; + } +} + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -78,6 +101,9 @@ enum ib_device_cap_flags { IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_ZERO_STAG = (1<<15), + IB_DEVICE_SEND_W_INV = (1<<16), + IB_DEVICE_MEM_WINDOW = (1<<17) }; enum ib_atomic_cap { @@ -835,6 +861,7 @@ struct ib_cache { u8 *lmc_cache; }; +struct iw_cm_verbs; struct ib_device { struct device *dma_device; @@ -851,6 +878,8 @@ struct ib_device { u32 flags; + struct iw_cm_verbs *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, From swise at opengridcomputing.com Thu Aug 3 14:02:40 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:02:40 -0500 Subject: [openib-general] [PATCH v5 1/2] iWARP Connection Manager. In-Reply-To: <20060803210238.16228.47335.stgit@dell3.ogc.int> References: <20060803210238.16228.47335.stgit@dell3.ogc.int> Message-ID: <20060803210240.16228.18429.stgit@dell3.ogc.int> This patch provides the new files implementing the iWARP Connection Manager. This module is a logical instance of the xx_cm where xx is the transport type (ib or iw). The symbols exported are used by the transport independent rdma_cm module, and are available also for transport dependent ULPs. 
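For reference, a minimal sketch of how a transport dependent ULP might drive
the exported listen-side API (hypothetical "myulp" names, including the
myulp_qpn() helper; error handling trimmed):

#include <linux/err.h>
#include <linux/in.h>
#include <linux/string.h>
#include <linux/types.h>
#include <rdma/iw_cm.h>

/* hypothetical helper returning the QPN of a QP set up elsewhere */
extern u32 myulp_qpn(struct iw_cm_id *cm_id);

static int myulp_cm_handler(struct iw_cm_id *cm_id, struct iw_cm_event *event)
{
	struct iw_cm_conn_param param;

	switch (event->event) {
	case IW_CM_EVENT_CONNECT_REQUEST:
		/* cm_id is a new id cloned from the listening one */
		memset(&param, 0, sizeof param);
		param.ord = 1;
		param.ird = 1;
		param.qpn = myulp_qpn(cm_id);
		return iw_cm_accept(cm_id, &param);
	case IW_CM_EVENT_ESTABLISHED:
		/* connection is up; start posting work on the QP */
		return 0;
	case IW_CM_EVENT_DISCONNECT:
	case IW_CM_EVENT_CLOSE:
		/* returning non-zero here would ask the IWCM to destroy cm_id */
		return 0;
	default:
		return -EINVAL;
	}
}

static struct iw_cm_id *myulp_listen(struct ib_device *device,
				     __be32 addr, __be16 port)
{
	struct iw_cm_id *cm_id;
	int ret;

	cm_id = iw_create_cm_id(device, myulp_cm_handler, NULL);
	if (IS_ERR(cm_id))
		return cm_id;

	cm_id->local_addr.sin_family = AF_INET;
	cm_id->local_addr.sin_addr.s_addr = addr;
	cm_id->local_addr.sin_port = port;

	ret = iw_cm_listen(cm_id, 8);
	if (ret) {
		iw_destroy_cm_id(cm_id);
		return ERR_PTR(ret);
	}
	return cm_id;
}
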
--- drivers/infiniband/core/iwcm.c | 1008 ++++++++++++++++++++++++++++++++++++++++ include/rdma/iw_cm.h | 255 ++++++++++ include/rdma/iw_cm_private.h | 63 +++ 3 files changed, 1326 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c new file mode 100644 index 0000000..fe43c00 --- /dev/null +++ b/drivers/infiniband/core/iwcm.c @@ -0,0 +1,1008 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static struct workqueue_struct *iwcm_wq; +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private *cm_id; + struct list_head list; + struct iw_cm_event event; + struct list_head free_list; +}; + +/* + * The following services provide a mechanism for pre-allocating iwcm_work + * elements. The design pre-allocates them based on the cm_id type: + * LISTENING IDS: Get enough elements preallocated to handle the + * listen backlog. + * ACTIVE IDS: 4: CONNECT_REPLY, ESTABLISHED, DISCONNECT, CLOSE + * PASSIVE IDS: 3: ESTABLISHED, DISCONNECT, CLOSE + * + * Allocating them in connect and listen avoids having to deal + * with allocation failures on the event upcall from the provider (which + * is called in the interrupt context). + * + * One exception is when creating the cm_id for incoming connection requests. + * There are two cases: + * 1) in the event upcall, cm_event_handler(), for a listening cm_id. If + * the backlog is exceeded, then no more connection request events will + * be processed. cm_event_handler() returns -ENOMEM in this case. Its up + * to the provider to reject the connectino request. 
+ * 2) in the connection request workqueue handler, cm_conn_req_handler(). + * If work elements cannot be allocated for the new connect request cm_id, + * then IWCM will call the provider reject method. This is ok since + * cm_conn_req_handler() runs in the workqueue thread context. + */ + +static struct iwcm_work *get_work(struct iwcm_id_private *cm_id_priv) +{ + struct iwcm_work *work; + + if (list_empty(&cm_id_priv->work_free_list)) + return NULL; + work = list_entry(cm_id_priv->work_free_list.next, struct iwcm_work, + free_list); + list_del_init(&work->free_list); + return work; +} + +static void put_work(struct iwcm_work *work) +{ + list_add(&work->free_list, &work->cm_id->work_free_list); +} + +static void dealloc_work_entries(struct iwcm_id_private *cm_id_priv) +{ + struct list_head *e, *tmp; + + list_for_each_safe(e, tmp, &cm_id_priv->work_free_list) + kfree(list_entry(e, struct iwcm_work, free_list)); +} + +static int alloc_work_entries(struct iwcm_id_private *cm_id_priv, int count) +{ + struct iwcm_work *work; + + BUG_ON(!list_empty(&cm_id_priv->work_free_list)); + while (count--) { + work = kmalloc(sizeof(struct iwcm_work), GFP_KERNEL); + if (!work) { + dealloc_work_entries(cm_id_priv); + return -ENOMEM; + } + work->cm_id = cm_id_priv; + INIT_LIST_HEAD(&work->list); + put_work(work); + } + return 0; +} + +/* + * Save private data from incoming connection requests in the + * cm_id_priv so the low level driver doesn't have to. Adjust + * the event ptr to point to the local copy. + */ +static int copy_private_data(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *event) +{ + void *p; + + p = kmalloc(event->private_data_len, GFP_ATOMIC); + if (!p) + return -ENOMEM; + memcpy(p, event->private_data, event->private_data_len); + event->private_data = p; + return 0; +} + +/* + * Release a reference on cm_id. If the last reference is being removed + * and iw_destroy_cm_id is waiting, wake up the waiting thread. 
+ */ +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + int ret = 0; + + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (atomic_dec_and_test(&cm_id_priv->refcount)) { + BUG_ON(!list_empty(&cm_id_priv->work_list)); + if (waitqueue_active(&cm_id_priv->destroy_comp.wait)) { + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, + &cm_id_priv->flags)); + ret = 1; + } + complete(&cm_id_priv->destroy_comp); + } + + return ret; +} + +static void add_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); +} + +static void rem_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + iwcm_deref_id(cm_id_priv); +} + +static int cm_event_handler(struct iw_cm_id *cm_id, struct iw_cm_event *event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = kzalloc(sizeof(*cm_id_priv), GFP_KERNEL); + if (!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->state = IW_CM_STATE_IDLE; + cm_id_priv->id.device = device; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + cm_id_priv->id.event_handler = cm_event_handler; + cm_id_priv->id.add_ref = add_ref; + cm_id_priv->id.rem_ref = rem_ref; + spin_lock_init(&cm_id_priv->lock); + atomic_set(&cm_id_priv->refcount, 1); + init_waitqueue_head(&cm_id_priv->connect_wait); + init_completion(&cm_id_priv->destroy_comp); + INIT_LIST_HEAD(&cm_id_priv->work_list); + INIT_LIST_HEAD(&cm_id_priv->work_free_list); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(iw_create_cm_id); + + +static int iwcm_modify_qp_err(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + if (!qp) + return -EINVAL; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * This is really the RDMAC CLOSING state. It is most similar to the + * IB SQD QP state. + */ +static int iwcm_modify_qp_sqd(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + BUG_ON(qp == NULL); + qp_attr.qp_state = IB_QPS_SQD; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * CM_ID <-- CLOSING + * + * Block if a passive or active connection is currenlty being processed. 
Then + * process the event as follows: + * - If we are ESTABLISHED, move to CLOSING and modify the QP state + * based on the abrupt flag + * - If the connection is already in the CLOSING or IDLE state, the peer is + * disconnecting concurrently with us and we've already seen the + * DISCONNECT event -- ignore the request and return 0 + * - Disconnect on a listening endpoint returns -EINVAL + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + struct ib_qp *qp = NULL; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_CLOSING; + + /* QP could be for user-mode client */ + if (cm_id_priv->qp) + qp = cm_id_priv->qp; + else + ret = -EINVAL; + break; + case IW_CM_STATE_LISTEN: + ret = -EINVAL; + break; + case IW_CM_STATE_CLOSING: + /* remote peer closed first */ + case IW_CM_STATE_IDLE: + /* accept or connect returned !0 */ + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called disconnect before/without calling accept after + * connect_request event delivered. + */ + break; + case IW_CM_STATE_CONN_SENT: + /* Can only get here if wait above fails */ + default: + BUG(); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + if (qp) { + if (abrupt) + ret = iwcm_modify_qp_err(qp); + else + ret = iwcm_modify_qp_sqd(qp); + + /* + * If both sides are disconnecting the QP could + * already be in ERR or SQD states + */ + ret = 0; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +/* + * CM_ID <-- DESTROYING + * + * Clean up all resources associated with the connection and release + * the initial reference taken by iw_create_cm_id. + */ +static void destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall. A + * listening endpoint should never block here. */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_LISTEN: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* destroy the listening endpoint */ + ret = cm_id->device->iwcm->destroy_listen(cm_id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* Abrupt close of the connection */ + (void)iwcm_modify_qp_err(cm_id_priv->qp); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called destroy before/without calling accept after + * receiving connection request event notification. 
+ */ + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_DESTROYING: + default: + BUG(); + break; + } + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + (void)iwcm_deref_id(cm_id_priv); +} + +/* + * This function is only called by the application thread and cannot + * be called by the event thread. The function will wait for all + * references to be released on the cm_id and then kfree the cm_id + * object. + */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); + + destroy_cm_id(cm_id); + + wait_for_completion(&cm_id_priv->destroy_comp); + + dealloc_work_entries(cm_id_priv); + + kfree(cm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +/* + * CM_ID <-- LISTEN + * + * Start listening for connect requests. Generates one CONNECT_REQUEST + * event for each inbound connect request. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + ret = alloc_work_entries(cm_id_priv, backlog); + if (ret) + return ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + cm_id_priv->state = IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret) + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + default: + ret = -EINVAL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +/* + * CM_ID <-- IDLE + * + * Rejects an inbound connection request. No events are generated. + */ +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->reject(cm_id, private_data, + private_data_len); + + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +/* + * CM_ID <-- ESTABLISHED + * + * Accepts an inbound connection request and generates an ESTABLISHED + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block + * until the ESTABLISHED event is received from the provider. 
+ */ +int iw_cm_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + struct ib_qp *qp; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->accept(cm_id, iw_param); + if (ret) { + /* An error on accept precludes provider events */ + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +/* + * Active Side: CM_ID <-- CONN_SENT + * + * If successful, results in the generation of a CONNECT_REPLY + * event. iw_cm_disconnect and iw_cm_destroy will block until the + * CONNECT_REPLY event is received from the provider. + */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + int ret = 0; + unsigned long flags; + struct ib_qp *qp; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + ret = alloc_work_entries(cm_id_priv, 4); + if (ret) + return ret; + + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + cm_id_priv->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, iw_param); + if (ret) { + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + cm_id_priv->state = IW_CM_STATE_IDLE; + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +/* + * Passive Side: new CM_ID <-- CONN_RECV + * + * Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and + * device. 
These are copied when the device is cloned. The event + * contains the new four tuple. + * + * An error on the child should not affect the parent, so this + * function does not return a value. + */ +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + struct iw_cm_id *cm_id; + struct iwcm_id_private *cm_id_priv; + int ret; + + /* The provider should never generate a connection request + * event with a bad status. + */ + BUG_ON(iw_event->status); + + /* We could be destroying the listening id. If so, ignore this + * upcall. */ + spin_lock_irqsave(&listen_id_priv->lock, flags); + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + return; + } + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + + cm_id = iw_create_cm_id(listen_id_priv->id.device, + listen_id_priv->id.cm_handler, + listen_id_priv->id.context); + /* If the cm_id could not be created, ignore the request */ + if (IS_ERR(cm_id)) + return; + + cm_id->provider_data = iw_event->provider_data; + cm_id->local_addr = iw_event->local_addr; + cm_id->remote_addr = iw_event->remote_addr; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + cm_id_priv->state = IW_CM_STATE_CONN_RECV; + + ret = alloc_work_entries(cm_id_priv, 3); + if (ret) { + iw_cm_reject(cm_id, NULL, 0); + iw_destroy_cm_id(cm_id); + return; + } + + /* Call the client CM handler */ + ret = cm_id->cm_handler(cm_id, iw_event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(cm_id); + if (atomic_read(&cm_id_priv->refcount)==0) + kfree(cm_id); + } + + if (iw_event->private_data_len) + kfree(iw_event->private_data); +} + +/* + * Passive Side: CM_ID <-- ESTABLISHED + * + * The provider generated an ESTABLISHED event which means that + * the MPA negotion has completed successfully and we are now in MPA + * FPDU mode. + * + * This event can only be received in the CONN_RECV state. If the + * remote peer closed, the ESTABLISHED event would be received followed + * by the CLOSE event. If the app closes, it will block until we wake + * it up after processing this event. + */ +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + + /* We clear the CONNECT_WAIT bit here to allow the callback + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id + * from a callback handler is not allowed */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * Active Side: CM_ID <-- ESTABLISHED + * + * The app has called connect and is waiting for the established event to + * post it's requests to the server. This event will wake up anyone + * blocked in iw_cm_disconnect or iw_destroy_id. 
+ */ +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + /* Clear the connect wait bit so a callback function calling + * iw_cm_disconnect will not wait and deadlock this thread */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { + cm_id_priv->id.local_addr = iw_event->local_addr; + cm_id_priv->id.remote_addr = iw_event->remote_addr; + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + } else { + /* REJECTED or RESET */ + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + cm_id_priv->state = IW_CM_STATE_IDLE; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + + if (iw_event->private_data_len) + kfree(iw_event->private_data); + + /* Wake up waiters on connect complete */ + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * CM_ID <-- CLOSING + * + * If in the ESTABLISHED state, move to CLOSING. + */ +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) + cm_id_priv->state = IW_CM_STATE_CLOSING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * CM_ID <-- IDLE + * + * If in the ESTBLISHED or CLOSING states, the QP will have have been + * moved by the provider to the ERR state. Disassociate the CM_ID from + * the QP, move to IDLE, and remove the 'connected' reference. + * + * If in some other state, the cm_id was destroyed asynchronously. + * This is the last reference that will result in waking up + * the app thread blocked in iw_destroy_cm_id. + */ +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_DESTROYING: + break; + default: + BUG(); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} + +static int process_event(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + int ret = 0; + + switch (iw_event->event) { + case IW_CM_EVENT_CONNECT_REQUEST: + cm_conn_req_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CONNECT_REPLY: + ret = cm_conn_rep_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_ESTABLISHED: + ret = cm_conn_est_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_DISCONNECT: + cm_disconnect_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CLOSE: + ret = cm_close_handler(cm_id_priv, iw_event); + break; + default: + BUG(); + } + + return ret; +} + +/* + * Process events on the work_list for the cm_id. If the callback + * function requests that the cm_id be deleted, a flag is set in the + * cm_id flags to indicate that when the last reference is + * removed, the cm_id is to be destroyed. 
This is necessary to + * distinguish between an object that will be destroyed by the app + * thread asleep on the destroy_comp list vs. an object destroyed + * here synchronously when the last reference is removed. + */ +static void cm_work_handler(void *arg) +{ + struct iwcm_work *work = arg, lwork; + struct iwcm_id_private *cm_id_priv = work->cm_id; + unsigned long flags; + int empty; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + empty = list_empty(&cm_id_priv->work_list); + while (!empty) { + work = list_entry(cm_id_priv->work_list.next, + struct iwcm_work, list); + list_del_init(&work->list); + empty = list_empty(&cm_id_priv->work_list); + lwork = *work; + put_work(work); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = process_event(cm_id_priv, &work->event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(&cm_id_priv->id); + } + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (iwcm_deref_id(cm_id_priv)) + return; + + if (atomic_read(&cm_id_priv->refcount)==0 && + test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)) { + dealloc_work_entries(cm_id_priv); + kfree(cm_id_priv); + return; + } + spin_lock_irqsave(&cm_id_priv->lock, flags); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * This function is called on interrupt context. Schedule events on + * the iwcm_wq thread to allow callback functions to downcall into + * the CM and/or block. Events are queued to a per-CM_ID + * work_list. If this is the first event on the work_list, the work + * element is also queued on the iwcm_wq thread. + * + * Each event holds a reference on the cm_id. Until the last posted + * event has been delivered and processed, the cm_id cannot be + * deleted. + * + * Returns: + * 0 - the event was handled. + * -ENOMEM - the event was not handled due to lack of resources. 
+ */ +static int cm_event_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct iwcm_work *work; + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + work = get_work(cm_id_priv); + if (!work) { + ret = -ENOMEM; + goto out; + } + + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *iw_event; + + if ((work->event.event == IW_CM_EVENT_CONNECT_REQUEST || + work->event.event == IW_CM_EVENT_CONNECT_REPLY) && + work->event.private_data_len) { + ret = copy_private_data(cm_id_priv, &work->event); + if (ret) { + put_work(work); + goto out; + } + } + + atomic_inc(&cm_id_priv->refcount); + if (list_empty(&cm_id_priv->work_list)) { + list_add_tail(&work->list, &cm_id_priv->work_list); + queue_work(iwcm_wq, &work->work); + } else + list_add_tail(&work->list, &cm_id_priv->work_list); +out: + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_init_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS; + qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE| + IB_ACCESS_REMOTE_READ; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_rts_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = 0; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct iwcm_id_private *cm_id_priv; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + switch (qp_attr->qp_state) { + case IB_QPS_INIT: + case IB_QPS_RTR: + ret = iwcm_init_qp_init_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + case IB_QPS_RTS: + ret = iwcm_init_qp_rts_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} +EXPORT_SYMBOL(iw_cm_init_qp_attr); + +static int __init iw_cm_init(void) +{ + iwcm_wq = create_singlethread_workqueue("iw_cm_wq"); + if (!iwcm_wq) + return -ENOMEM; + + return 0; +} + +static void __exit iw_cm_cleanup(void) +{ + destroy_workqueue(iwcm_wq); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h new file mode 100644 index 0000000..36f44aa --- /dev/null +++ b/include/rdma/iw_cm.h @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef IW_CM_H +#define IW_CM_H + +#include +#include + +struct iw_cm_id; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, /* passive side accept successful */ + IW_CM_EVENT_DISCONNECT, /* orderly shutdown */ + IW_CM_EVENT_CLOSE /* close complete */ +}; +enum iw_cm_event_status { + IW_CM_EVENT_STATUS_OK = 0, /* request successful */ + IW_CM_EVENT_STATUS_ACCEPTED = 0, /* connect request accepted */ + IW_CM_EVENT_STATUS_REJECTED, /* connect request rejected */ + IW_CM_EVENT_STATUS_TIMEOUT, /* the operation timed out */ + IW_CM_EVENT_STATUS_RESET, /* reset from remote peer */ + IW_CM_EVENT_STATUS_EINVAL, /* asynchronous failure for bad parm */ +}; +struct iw_cm_event { + enum iw_cm_event_type event; + enum iw_cm_event_status status; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; + void* provider_data; +}; + +/** + * iw_cm_handler - Function to be called by the IW CM when delivering events + * to the client. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. + */ +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +/** + * iw_event_handler - Function called by the provider when delivering provider + * events to the IW CM. Returns either 0 indicating the event was processed + * or -errno if the event could not be processed. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. 
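+ *
+ * For contrast, the iw_cm_handler typedef above is the callback the
+ * client supplies.  A minimal client handler might look like the sketch
+ * below; it is illustrative only, and establish_qp()/teardown() stand in
+ * for client code rather than anything defined by this API:
+ *
+ *	static int my_cm_handler(struct iw_cm_id *cm_id,
+ *				 struct iw_cm_event *event)
+ *	{
+ *		switch (event->event) {
+ *		case IW_CM_EVENT_CONNECT_REPLY:
+ *			if (event->status == IW_CM_EVENT_STATUS_ACCEPTED)
+ *				establish_qp(cm_id);
+ *			else
+ *				teardown(cm_id);
+ *			break;
+ *		case IW_CM_EVENT_DISCONNECT:
+ *		case IW_CM_EVENT_CLOSE:
+ *			teardown(cm_id);
+ *			break;
+ *		default:
+ *			break;
+ *		}
+ *		return 0;
+ *	}
+ *
+ * A non-zero return from the client handler asks the IW CM to destroy
+ * the cm_id once the callback has unwound.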
+ */
+typedef int (*iw_event_handler)(struct iw_cm_id *cm_id,
+				struct iw_cm_event *event);
+struct iw_cm_id {
+	iw_cm_handler cm_handler;	/* client callback function */
+	void *context;			/* client cb context */
+	struct ib_device *device;
+	struct sockaddr_in local_addr;
+	struct sockaddr_in remote_addr;
+	void *provider_data;		/* provider private data */
+	iw_event_handler event_handler;	/* cb for provider
+					   events */
+	/* Used by provider to add and remove refs on IW cm_id */
+	void (*add_ref)(struct iw_cm_id *);
+	void (*rem_ref)(struct iw_cm_id *);
+};
+
+struct iw_cm_conn_param {
+	const void *private_data;
+	u16 private_data_len;
+	u32 ord;
+	u32 ird;
+	u32 qpn;
+};
+
+struct iw_cm_verbs {
+	void (*add_ref)(struct ib_qp *qp);
+
+	void (*rem_ref)(struct ib_qp *qp);
+
+	struct ib_qp * (*get_qp)(struct ib_device *device,
+				 int qpn);
+
+	int (*connect)(struct iw_cm_id *cm_id,
+		       struct iw_cm_conn_param *conn_param);
+
+	int (*accept)(struct iw_cm_id *cm_id,
+		      struct iw_cm_conn_param *conn_param);
+
+	int (*reject)(struct iw_cm_id *cm_id,
+		      const void *pdata, u8 pdata_len);
+
+	int (*create_listen)(struct iw_cm_id *cm_id,
+			     int backlog);
+
+	int (*destroy_listen)(struct iw_cm_id *cm_id);
+};
+
+/**
+ * iw_create_cm_id - Create an IW CM identifier.
+ *
+ * @device: The IB device on which to create the IW CM identifier.
+ * @cm_handler: User callback invoked to report events associated with the
+ * returned IW CM identifier.
+ * @context: User specified context associated with the id.
+ */
+struct iw_cm_id *iw_create_cm_id(struct ib_device *device,
+				 iw_cm_handler cm_handler, void *context);
+
+/**
+ * iw_destroy_cm_id - Destroy an IW CM identifier.
+ *
+ * @cm_id: The previously created IW CM identifier to destroy.
+ *
+ * The client can assume that no events will be delivered for the CM ID after
+ * this function returns.
+ */
+void iw_destroy_cm_id(struct iw_cm_id *cm_id);
+
+/**
+ * iw_cm_unbind_qp - Unbind the specified IW CM identifier and QP
+ *
+ * @cm_id: The IW CM identifier to unbind from the QP.
+ * @qp: The QP
+ *
+ * This is called by the provider when destroying the QP to ensure
+ * that any references held by the IWCM are released. It may also
+ * be called by the IWCM when destroying a CM_ID so that any
+ * references held by the provider are released.
+ */
+void iw_cm_unbind_qp(struct iw_cm_id *cm_id, struct ib_qp *qp);
+
+/**
+ * iw_cm_get_qp - Return the ib_qp associated with a QPN
+ *
+ * @device: The IB device
+ * @qpn: The queue pair number
+ */
+struct ib_qp *iw_cm_get_qp(struct ib_device *device, int qpn);
+
+/**
+ * iw_cm_listen - Listen for incoming connection requests on the
+ * specified IW CM id.
+ *
+ * @cm_id: The IW CM identifier.
+ * @backlog: The maximum number of outstanding un-accepted inbound listen
+ * requests to queue.
+ *
+ * The source address and port number are specified in the IW CM identifier
+ * structure.
+ */
+int iw_cm_listen(struct iw_cm_id *cm_id, int backlog);
+
+/**
+ * iw_cm_accept - Called to accept an incoming connect request.
+ *
+ * @cm_id: The IW CM identifier associated with the connection request.
+ * @iw_param: Pointer to a structure containing connection establishment
+ * parameters.
+ *
+ * The specified cm_id will have been provided in the event data for a
+ * CONNECT_REQUEST event. Subsequent events related to this connection will be
+ * delivered to the specified IW CM identifier and may occur prior to
+ * the return of this function. If this function returns a non-zero value, the
+ * client can assume that no events will be delivered to the specified IW CM
+ * identifier.
+ */
+int iw_cm_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param);
+
+/**
+ * iw_cm_reject - Reject an incoming connection request.
+ *
+ * @cm_id: Connection identifier associated with the request.
+ * @private_data: Pointer to data to deliver to the remote peer as part of the
+ * reject message.
+ * @private_data_len: The number of bytes in the private_data parameter.
+ *
+ * The client can assume that no events will be delivered to the specified IW
+ * CM identifier following the return of this function. The private_data
+ * buffer is available for reuse when this function returns.
+ */
+int iw_cm_reject(struct iw_cm_id *cm_id, const void *private_data,
+		 u8 private_data_len);
+
+/**
+ * iw_cm_connect - Called to request a connection to a remote peer.
+ *
+ * @cm_id: The IW CM identifier for the connection.
+ * @iw_param: Pointer to a structure containing connection establishment
+ * parameters.
+ *
+ * Events may be delivered to the specified IW CM identifier prior to the
+ * return of this function. If this function returns a non-zero value, the
+ * client can assume that no events will be delivered to the specified IW CM
+ * identifier.
+ */
+int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param);
+
+/**
+ * iw_cm_disconnect - Close the specified connection.
+ *
+ * @cm_id: The IW CM identifier to close.
+ * @abrupt: If 0, the connection will be closed gracefully, otherwise, the
+ * connection will be reset.
+ *
+ * The IW CM identifier is still active until the IW_CM_EVENT_CLOSE event is
+ * delivered.
+ */
+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt);
+
+/**
+ * iw_cm_init_qp_attr - Called to initialize the attributes of the QP
+ * associated with an IW CM identifier.
+ *
+ * @cm_id: The IW CM identifier associated with the QP
+ * @qp_attr: Pointer to the QP attributes structure.
+ * @qp_attr_mask: Pointer to a bit vector specifying which QP attributes are
+ * valid.
+ */
+int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, struct ib_qp_attr *qp_attr,
+		       int *qp_attr_mask);
+
+#endif /* IW_CM_H */
diff --git a/include/rdma/iw_cm_private.h b/include/rdma/iw_cm_private.h
new file mode 100644
index 0000000..fc28e34
--- /dev/null
+++ b/include/rdma/iw_cm_private.h
@@ -0,0 +1,63 @@
+/*
+ * Copyright (c) 2005 Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef IW_CM_PRIVATE_H +#define IW_CM_PRIVATE_H + +#include + +enum iw_cm_state { + IW_CM_STATE_IDLE, /* unbound, inactive */ + IW_CM_STATE_LISTEN, /* listen waiting for connect */ + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ + IW_CM_STATE_ESTABLISHED, /* established */ + IW_CM_STATE_CLOSING, /* disconnect */ + IW_CM_STATE_DESTROYING /* object being deleted */ +}; + +struct iwcm_id_private { + struct iw_cm_id id; + enum iw_cm_state state; + unsigned long flags; + struct ib_qp *qp; + struct completion destroy_comp; + wait_queue_head_t connect_wait; + struct list_head work_list; + spinlock_t lock; + atomic_t refcount; + struct list_head work_free_list; +}; +#define IWCM_F_CALLBACK_DESTROY 1 +#define IWCM_F_CONNECT_WAIT 2 + +#endif /* IW_CM_PRIVATE_H */ From swise at opengridcomputing.com Thu Aug 3 14:07:23 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:07:23 -0500 Subject: [openib-general] [PATCH v4 0/7] Ammasso 1100 iWARP Driver Message-ID: <20060803210723.16572.34829.stgit@dell3.ogc.int> Roland, here is the updated Ammasso 1100 iWARP Driver. ---- This patchset implements the iWARP provider driver for the Ammasso 1100 RNIC. It is dependent on the "iWARP Core Support" patch set. The patchset consists of 7 patches: 1 - Low-level device interface and native stack support 2 - Work request definitions 3 - Provider interface 4 - Memory management 5 - User mode message queue implementation 6 - Verbs queue implementation 7 - Kconfig and Makefile Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Thu Aug 3 14:07:27 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:07:27 -0500 Subject: [openib-general] [PATCH v4 2/7] AMSO1100 WR / Event Definitions. In-Reply-To: <20060803210723.16572.34829.stgit@dell3.ogc.int> References: <20060803210723.16572.34829.stgit@dell3.ogc.int> Message-ID: <20060803210727.16572.44558.stgit@dell3.ogc.int> --- drivers/infiniband/hw/amso1100/c2_ae.h | 108 ++ drivers/infiniband/hw/amso1100/c2_status.h | 158 +++ drivers/infiniband/hw/amso1100/c2_wr.h | 1520 ++++++++++++++++++++++++++++ 3 files changed, 1786 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_ae.h b/drivers/infiniband/hw/amso1100/c2_ae.h new file mode 100644 index 0000000..3a065c3 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.h @@ -0,0 +1,108 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_AE_H_ +#define _C2_AE_H_ + +/* + * WARNING: If you change this file, also bump C2_IVN_BASE + * in common/include/clustercore/c2_ivn.h. + */ + +/* + * Asynchronous Event Identifiers + * + * These start at 0x80 only so it's obvious from inspection that + * they are not work-request statuses. This isn't critical. + * + * NOTE: these event id's must fit in eight bits. + */ +enum c2_event_id { + CCAE_REMOTE_SHUTDOWN = 0x80, + CCAE_ACTIVE_CONNECT_RESULTS, + CCAE_CONNECTION_REQUEST, + CCAE_LLP_CLOSE_COMPLETE, + CCAE_TERMINATE_MESSAGE_RECEIVED, + CCAE_LLP_CONNECTION_RESET, + CCAE_LLP_CONNECTION_LOST, + CCAE_LLP_SEGMENT_SIZE_INVALID, + CCAE_LLP_INVALID_CRC, + CCAE_LLP_BAD_FPDU, + CCAE_INVALID_DDP_VERSION, + CCAE_INVALID_RDMA_VERSION, + CCAE_UNEXPECTED_OPCODE, + CCAE_INVALID_DDP_QUEUE_NUMBER, + CCAE_RDMA_READ_NOT_ENABLED, + CCAE_RDMA_WRITE_NOT_ENABLED, + CCAE_RDMA_READ_TOO_SMALL, + CCAE_NO_L_BIT, + CCAE_TAGGED_INVALID_STAG, + CCAE_TAGGED_BASE_BOUNDS_VIOLATION, + CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION, + CCAE_TAGGED_INVALID_PD, + CCAE_WRAP_ERROR, + CCAE_BAD_CLOSE, + CCAE_BAD_LLP_CLOSE, + CCAE_INVALID_MSN_RANGE, + CCAE_INVALID_MSN_GAP, + CCAE_IRRQ_OVERFLOW, + CCAE_IRRQ_MSN_GAP, + CCAE_IRRQ_MSN_RANGE, + CCAE_IRRQ_INVALID_STAG, + CCAE_IRRQ_BASE_BOUNDS_VIOLATION, + CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION, + CCAE_IRRQ_INVALID_PD, + CCAE_IRRQ_WRAP_ERROR, + CCAE_CQ_SQ_COMPLETION_OVERFLOW, + CCAE_CQ_RQ_COMPLETION_ERROR, + CCAE_QP_SRQ_WQE_ERROR, + CCAE_QP_LOCAL_CATASTROPHIC_ERROR, + CCAE_CQ_OVERFLOW, + CCAE_CQ_OPERATION_ERROR, + CCAE_SRQ_LIMIT_REACHED, + CCAE_QP_RQ_LIMIT_REACHED, + CCAE_SRQ_CATASTROPHIC_ERROR, + CCAE_RNIC_CATASTROPHIC_ERROR +/* WARNING If you add more id's, make sure their values fit in eight bits. */ +}; + +/* + * Resource Indicators and Identifiers + */ +enum c2_resource_indicator { + C2_RES_IND_QP = 1, + C2_RES_IND_EP, + C2_RES_IND_CQ, + C2_RES_IND_SRQ, +}; + +#endif /* _C2_AE_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_status.h b/drivers/infiniband/hw/amso1100/c2_status.h new file mode 100644 index 0000000..6ee4aa9 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_status.h @@ -0,0 +1,158 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_STATUS_H_ +#define _C2_STATUS_H_ + +/* + * Verbs Status Codes + */ +enum c2_status { + C2_OK = 0, /* This must be zero */ + CCERR_INSUFFICIENT_RESOURCES = 1, + CCERR_INVALID_MODIFIER = 2, + CCERR_INVALID_MODE = 3, + CCERR_IN_USE = 4, + CCERR_INVALID_RNIC = 5, + CCERR_INTERRUPTED_OPERATION = 6, + CCERR_INVALID_EH = 7, + CCERR_INVALID_CQ = 8, + CCERR_CQ_EMPTY = 9, + CCERR_NOT_IMPLEMENTED = 10, + CCERR_CQ_DEPTH_TOO_SMALL = 11, + CCERR_PD_IN_USE = 12, + CCERR_INVALID_PD = 13, + CCERR_INVALID_SRQ = 14, + CCERR_INVALID_ADDRESS = 15, + CCERR_INVALID_NETMASK = 16, + CCERR_INVALID_QP = 17, + CCERR_INVALID_QP_STATE = 18, + CCERR_TOO_MANY_WRS_POSTED = 19, + CCERR_INVALID_WR_TYPE = 20, + CCERR_INVALID_SGL_LENGTH = 21, + CCERR_INVALID_SQ_DEPTH = 22, + CCERR_INVALID_RQ_DEPTH = 23, + CCERR_INVALID_ORD = 24, + CCERR_INVALID_IRD = 25, + CCERR_QP_ATTR_CANNOT_CHANGE = 26, + CCERR_INVALID_STAG = 27, + CCERR_QP_IN_USE = 28, + CCERR_OUTSTANDING_WRS = 29, + CCERR_STAG_IN_USE = 30, + CCERR_INVALID_STAG_INDEX = 31, + CCERR_INVALID_SGL_FORMAT = 32, + CCERR_ADAPTER_TIMEOUT = 33, + CCERR_INVALID_CQ_DEPTH = 34, + CCERR_INVALID_PRIVATE_DATA_LENGTH = 35, + CCERR_INVALID_EP = 36, + CCERR_MR_IN_USE = CCERR_STAG_IN_USE, + CCERR_FLUSHED = 38, + CCERR_INVALID_WQE = 39, + CCERR_LOCAL_QP_CATASTROPHIC_ERROR = 40, + CCERR_REMOTE_TERMINATION_ERROR = 41, + CCERR_BASE_AND_BOUNDS_VIOLATION = 42, + CCERR_ACCESS_VIOLATION = 43, + CCERR_INVALID_PD_ID = 44, + CCERR_WRAP_ERROR = 45, + CCERR_INV_STAG_ACCESS_ERROR = 46, + CCERR_ZERO_RDMA_READ_RESOURCES = 47, + CCERR_QP_NOT_PRIVILEGED = 48, + CCERR_STAG_STATE_NOT_INVALID = 49, + CCERR_INVALID_PAGE_SIZE = 50, + CCERR_INVALID_BUFFER_SIZE = 51, + CCERR_INVALID_PBE = 52, + CCERR_INVALID_FBO = 53, + CCERR_INVALID_LENGTH = 54, + CCERR_INVALID_ACCESS_RIGHTS = 55, + CCERR_PBL_TOO_BIG = 56, + CCERR_INVALID_VA = 57, + CCERR_INVALID_REGION = 58, + CCERR_INVALID_WINDOW = 59, + CCERR_TOTAL_LENGTH_TOO_BIG = 60, + CCERR_INVALID_QP_ID = 61, + CCERR_ADDR_IN_USE = 62, + CCERR_ADDR_NOT_AVAIL = 63, + CCERR_NET_DOWN = 64, + CCERR_NET_UNREACHABLE = 65, + CCERR_CONN_ABORTED = 66, + CCERR_CONN_RESET = 67, + CCERR_NO_BUFS = 68, + CCERR_CONN_TIMEDOUT = 69, + CCERR_CONN_REFUSED = 70, + 
CCERR_HOST_UNREACHABLE = 71, + CCERR_INVALID_SEND_SGL_DEPTH = 72, + CCERR_INVALID_RECV_SGL_DEPTH = 73, + CCERR_INVALID_RDMA_WRITE_SGL_DEPTH = 74, + CCERR_INSUFFICIENT_PRIVILEGES = 75, + CCERR_STACK_ERROR = 76, + CCERR_INVALID_VERSION = 77, + CCERR_INVALID_MTU = 78, + CCERR_INVALID_IMAGE = 79, + CCERR_PENDING = 98, /* not an error; user internally by adapter */ + CCERR_DEFER = 99, /* not an error; used internally by adapter */ + CCERR_FAILED_WRITE = 100, + CCERR_FAILED_ERASE = 101, + CCERR_FAILED_VERIFICATION = 102, + CCERR_NOT_FOUND = 103, + +}; + +/* + * CCAE_ACTIVE_CONNECT_RESULTS status result codes. + */ +enum c2_connect_status { + C2_CONN_STATUS_SUCCESS = C2_OK, + C2_CONN_STATUS_NO_MEM = CCERR_INSUFFICIENT_RESOURCES, + C2_CONN_STATUS_TIMEDOUT = CCERR_CONN_TIMEDOUT, + C2_CONN_STATUS_REFUSED = CCERR_CONN_REFUSED, + C2_CONN_STATUS_NETUNREACH = CCERR_NET_UNREACHABLE, + C2_CONN_STATUS_HOSTUNREACH = CCERR_HOST_UNREACHABLE, + C2_CONN_STATUS_INVALID_RNIC = CCERR_INVALID_RNIC, + C2_CONN_STATUS_INVALID_QP = CCERR_INVALID_QP, + C2_CONN_STATUS_INVALID_QP_STATE = CCERR_INVALID_QP_STATE, + C2_CONN_STATUS_REJECTED = CCERR_CONN_RESET, + C2_CONN_STATUS_ADDR_NOT_AVAIL = CCERR_ADDR_NOT_AVAIL, +}; + +/* + * Flash programming status codes. + */ +enum c2_flash_status { + C2_FLASH_STATUS_SUCCESS = 0x0000, + C2_FLASH_STATUS_VERIFY_ERR = 0x0002, + C2_FLASH_STATUS_IMAGE_ERR = 0x0004, + C2_FLASH_STATUS_ECLBS = 0x0400, + C2_FLASH_STATUS_PSLBS = 0x0800, + C2_FLASH_STATUS_VPENS = 0x1000, +}; + +#endif /* _C2_STATUS_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_wr.h b/drivers/infiniband/hw/amso1100/c2_wr.h new file mode 100644 index 0000000..bd9905b --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_wr.h @@ -0,0 +1,1520 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_WR_H_ +#define _C2_WR_H_ + +#ifdef CCDEBUG +#define CCWR_MAGIC 0xb07700b0 +#endif + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +/* Maximum allowed size in bytes of private_data exchange + * on connect. + */ +#define C2_MAX_PRIVATE_DATA_SIZE 200 + +/* + * These types are shared among the adapter, host, and CCIL consumer. 
+ */ +enum c2_cq_notification_type { + C2_CQ_NOTIFICATION_TYPE_NONE = 1, + C2_CQ_NOTIFICATION_TYPE_NEXT, + C2_CQ_NOTIFICATION_TYPE_NEXT_SE +}; + +enum c2_setconfig_cmd { + C2_CFG_ADD_ADDR = 1, + C2_CFG_DEL_ADDR = 2, + C2_CFG_ADD_ROUTE = 3, + C2_CFG_DEL_ROUTE = 4 +}; + +enum c2_getconfig_cmd { + C2_GETCONFIG_ROUTES = 1, + C2_GETCONFIG_ADDRS +}; + +/* + * CCIL Work Request Identifiers + */ +enum c2wr_ids { + CCWR_RNIC_OPEN = 1, + CCWR_RNIC_QUERY, + CCWR_RNIC_SETCONFIG, + CCWR_RNIC_GETCONFIG, + CCWR_RNIC_CLOSE, + CCWR_CQ_CREATE, + CCWR_CQ_QUERY, + CCWR_CQ_MODIFY, + CCWR_CQ_DESTROY, + CCWR_QP_CONNECT, + CCWR_PD_ALLOC, + CCWR_PD_DEALLOC, + CCWR_SRQ_CREATE, + CCWR_SRQ_QUERY, + CCWR_SRQ_MODIFY, + CCWR_SRQ_DESTROY, + CCWR_QP_CREATE, + CCWR_QP_QUERY, + CCWR_QP_MODIFY, + CCWR_QP_DESTROY, + CCWR_NSMR_STAG_ALLOC, + CCWR_NSMR_REGISTER, + CCWR_NSMR_PBL, + CCWR_STAG_DEALLOC, + CCWR_NSMR_REREGISTER, + CCWR_SMR_REGISTER, + CCWR_MR_QUERY, + CCWR_MW_ALLOC, + CCWR_MW_QUERY, + CCWR_EP_CREATE, + CCWR_EP_GETOPT, + CCWR_EP_SETOPT, + CCWR_EP_DESTROY, + CCWR_EP_BIND, + CCWR_EP_CONNECT, + CCWR_EP_LISTEN, + CCWR_EP_SHUTDOWN, + CCWR_EP_LISTEN_CREATE, + CCWR_EP_LISTEN_DESTROY, + CCWR_EP_QUERY, + CCWR_CR_ACCEPT, + CCWR_CR_REJECT, + CCWR_CONSOLE, + CCWR_TERM, + CCWR_FLASH_INIT, + CCWR_FLASH, + CCWR_BUF_ALLOC, + CCWR_BUF_FREE, + CCWR_FLASH_WRITE, + CCWR_INIT, /* WARNING: Don't move this ever again! */ + + + + /* Add new IDs here */ + + + + /* + * WARNING: CCWR_LAST must always be the last verbs id defined! + * All the preceding IDs are fixed, and must not change. + * You can add new IDs, but must not remove or reorder + * any IDs. If you do, YOU will ruin any hope of + * compatability between versions. + */ + CCWR_LAST, + + /* + * Start over at 1 so that arrays indexed by user wr id's + * begin at 1. This is OK since the verbs and user wr id's + * are always used on disjoint sets of queues. + */ + /* + * The order of the CCWR_SEND_XX verbs must + * match the order of the RDMA_OPs + */ + CCWR_SEND = 1, + CCWR_SEND_INV, + CCWR_SEND_SE, + CCWR_SEND_SE_INV, + CCWR_RDMA_WRITE, + CCWR_RDMA_READ, + CCWR_RDMA_READ_INV, + CCWR_MW_BIND, + CCWR_NSMR_FASTREG, + CCWR_STAG_INVALIDATE, + CCWR_RECV, + CCWR_NOP, + CCWR_UNIMPL, +/* WARNING: This must always be the last user wr id defined! */ +}; +#define RDMA_SEND_OPCODE_FROM_WR_ID(x) (x+2) + +/* + * SQ/RQ Work Request Types + */ +enum c2_wr_type { + C2_WR_TYPE_SEND = CCWR_SEND, + C2_WR_TYPE_SEND_SE = CCWR_SEND_SE, + C2_WR_TYPE_SEND_INV = CCWR_SEND_INV, + C2_WR_TYPE_SEND_SE_INV = CCWR_SEND_SE_INV, + C2_WR_TYPE_RDMA_WRITE = CCWR_RDMA_WRITE, + C2_WR_TYPE_RDMA_READ = CCWR_RDMA_READ, + C2_WR_TYPE_RDMA_READ_INV_STAG = CCWR_RDMA_READ_INV, + C2_WR_TYPE_BIND_MW = CCWR_MW_BIND, + C2_WR_TYPE_FASTREG_NSMR = CCWR_NSMR_FASTREG, + C2_WR_TYPE_INV_STAG = CCWR_STAG_INVALIDATE, + C2_WR_TYPE_RECV = CCWR_RECV, + C2_WR_TYPE_NOP = CCWR_NOP, +}; + +struct c2_netaddr { + u32 ip_addr; + u32 netmask; + u32 mtu; +}; + +struct c2_route { + u32 ip_addr; /* 0 indicates the default route */ + u32 netmask; /* netmask associated with dst */ + u32 flags; + union { + u32 ipaddr; /* address of the nexthop interface */ + u8 enaddr[6]; + } nexthop; +}; + +/* + * A Scatter Gather Entry. + */ +struct c2_data_addr { + u32 stag; + u32 length; + u64 to; +}; + +/* + * MR and MW flags used by the consumer, RI, and RNIC. + */ +enum c2_mm_flags { + MEM_REMOTE = 0x0001, /* allow mw binds with remote access. 
*/
+	MEM_VA_BASED = 0x0002,		/* Not Zero-based */
+	MEM_PBL_COMPLETE = 0x0004,	/* PBL array is complete in this msg */
+	MEM_LOCAL_READ = 0x0008,	/* allow local reads */
+	MEM_LOCAL_WRITE = 0x0010,	/* allow local writes */
+	MEM_REMOTE_READ = 0x0020,	/* allow remote reads */
+	MEM_REMOTE_WRITE = 0x0040,	/* allow remote writes */
+	MEM_WINDOW_BIND = 0x0080,	/* binds allowed */
+	MEM_SHARED = 0x0100,		/* set if MR is shared */
+	MEM_STAG_VALID = 0x0200		/* set if STAG is in valid state */
+};
+
+/*
+ * CCIL API ACF flags defined in terms of the low level mem flags.
+ * This minimizes translation needed in the user API.
+ */
+enum c2_acf {
+	C2_ACF_LOCAL_READ = MEM_LOCAL_READ,
+	C2_ACF_LOCAL_WRITE = MEM_LOCAL_WRITE,
+	C2_ACF_REMOTE_READ = MEM_REMOTE_READ,
+	C2_ACF_REMOTE_WRITE = MEM_REMOTE_WRITE,
+	C2_ACF_WINDOW_BIND = MEM_WINDOW_BIND
+};
+
+/*
+ * Image types of objects written to flash
+ */
+#define C2_FLASH_IMG_BITFILE 1
+#define C2_FLASH_IMG_OPTION_ROM 2
+#define C2_FLASH_IMG_VPD 3
+
+/*
+ * To fix bug 1815 we define the max allowable size of the
+ * terminate message (per the IETF spec). Refer to the IETF
+ * protocol specification, section 12.1.6, page 64.
+ * The message is prefixed by 20 bytes of DDP info.
+ *
+ * Then the message has 6 bytes for the terminate control
+ * and DDP segment length info plus a DDP header (either
+ * 14 or 18 bytes) plus 28 bytes for the RDMA header.
+ * Thus the max size is:
+ * 20 + (6 + 18 + 28) = 72
+ */
+#define C2_MAX_TERMINATE_MESSAGE_SIZE (72)
+
+/*
+ * Build String Length. It must be the same as C2_BUILD_STR_LEN in ccil_api.h
+ */
+#define WR_BUILD_STR_LEN 64
+
+/*
+ * WARNING: All of these structs need to align any 64bit types on
+ * 64 bit boundaries! 64bit types include u64.
+ */
+
+/*
+ * Clustercore Work Request Header. Be sensitive to field layout
+ * and alignment.
+ */
+struct c2wr_hdr {
+	/* wqe_count is part of the cqe. It is put here so the
+	 * adapter can write to it while the wr is pending without
+	 * clobbering part of the wr. This word need not be dma'd
+	 * from the host to adapter by libccil, but we copy it anyway
+	 * to make the memcpy to the adapter better aligned.
+	 */
+	u32 wqe_count;
+
+	/* Put these fields next so that later 32- and 64-bit
+	 * quantities are naturally aligned.
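+	 *
+	 * As a worked example (with CCMSGMAGIC not defined), the layout
+	 * is: wqe_count in bytes 0-3, the four u8 fields (id, result,
+	 * sge_count, flags) in bytes 4-7, and the u64 context starting
+	 * at byte 8 on a natural 8-byte boundary, for a 16-byte header.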
+ */ + u8 id; + u8 result; /* adapter -> host */ + u8 sge_count; /* host -> adapter */ + u8 flags; /* host -> adapter */ + + u64 context; +#ifdef CCMSGMAGIC + u32 magic; + u32 pad; +#endif +} __attribute__((packed)); + +/* + *------------------------ RNIC ------------------------ + */ + +/* + * WR_RNIC_OPEN + */ + +/* + * Flags for the RNIC WRs + */ +enum c2_rnic_flags { + RNIC_IRD_STATIC = 0x0001, + RNIC_ORD_STATIC = 0x0002, + RNIC_QP_STATIC = 0x0004, + RNIC_SRQ_SUPPORTED = 0x0008, + RNIC_PBL_BLOCK_MODE = 0x0010, + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, + RNIC_CQ_OVF_DETECTED = 0x0040, + RNIC_PRIV_MODE = 0x0080 +}; + +struct c2wr_rnic_open_req { + struct c2wr_hdr hdr; + u64 user_context; + u16 flags; /* See enum c2_rnic_flags */ + u16 port_num; +} __attribute__((packed)); + +struct c2wr_rnic_open_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +union c2wr_rnic_open { + struct c2wr_rnic_open_req req; + struct c2wr_rnic_open_rep rep; +} __attribute__((packed)); + +struct c2wr_rnic_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +/* + * WR_RNIC_QUERY + */ +struct c2wr_rnic_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 vendor_id; + u32 part_number; + u32 hw_version; + u32 fw_ver_major; + u32 fw_ver_minor; + u32 fw_ver_patch; + char fw_ver_build_str[WR_BUILD_STR_LEN]; + u32 max_qps; + u32 max_qp_depth; + u32 max_srq_depth; + u32 max_send_sgl_depth; + u32 max_rdma_sgl_depth; + u32 max_cqs; + u32 max_cq_depth; + u32 max_cq_event_handlers; + u32 max_mrs; + u32 max_pbl_depth; + u32 max_pds; + u32 max_global_ird; + u32 max_global_ord; + u32 max_qp_ird; + u32 max_qp_ord; + u32 flags; + u32 max_mws; + u32 pbe_range_low; + u32 pbe_range_high; + u32 max_srqs; + u32 page_size; +} __attribute__((packed)); + +union c2wr_rnic_query { + struct c2wr_rnic_query_req req; + struct c2wr_rnic_query_rep rep; +} __attribute__((packed)); + +/* + * WR_RNIC_GETCONFIG + */ + +struct c2wr_rnic_getconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* see c2_getconfig_cmd_t */ + u64 reply_buf; + u32 reply_buf_len; +} __attribute__((packed)) ; + +struct c2wr_rnic_getconfig_rep { + struct c2wr_hdr hdr; + u32 option; /* see c2_getconfig_cmd_t */ + u32 count_len; /* length of the number of addresses configured */ +} __attribute__((packed)) ; + +union c2wr_rnic_getconfig { + struct c2wr_rnic_getconfig_req req; + struct c2wr_rnic_getconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_SETCONFIG + */ +struct c2wr_rnic_setconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* See c2_setconfig_cmd_t */ + /* variable data and pad. 
See c2_netaddr and c2_route */ + u8 data[0]; +} __attribute__((packed)) ; + +struct c2wr_rnic_setconfig_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_setconfig { + struct c2wr_rnic_setconfig_req req; + struct c2wr_rnic_setconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_CLOSE + */ +struct c2wr_rnic_close_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)) ; + +struct c2wr_rnic_close_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_close { + struct c2wr_rnic_close_req req; + struct c2wr_rnic_close_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ CQ ------------------------ + */ +struct c2wr_cq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u64 msg_pool; + u32 rnic_handle; + u32 msg_size; + u32 depth; +} __attribute__((packed)) ; + +struct c2wr_cq_create_rep { + struct c2wr_hdr hdr; + u32 mq_index; + u32 adapter_shared; + u32 cq_handle; +} __attribute__((packed)) ; + +union c2wr_cq_create { + struct c2wr_cq_create_req req; + struct c2wr_cq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; + u32 new_depth; + u64 new_msg_pool; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_modify { + struct c2wr_cq_modify_req req; + struct c2wr_cq_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_destroy { + struct c2wr_cq_destroy_req req; + struct c2wr_cq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ PD ------------------------ + */ +struct c2wr_pd_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_pd_alloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_alloc { + struct c2wr_pd_alloc_req req; + struct c2wr_pd_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_dealloc { + struct c2wr_pd_dealloc_req req; + struct c2wr_pd_dealloc_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ SRQ ------------------------ + */ +struct c2wr_srq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u32 rnic_handle; + u32 srq_depth; + u32 srq_limit; + u32 sgl_depth; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_srq_create_rep { + struct c2wr_hdr hdr; + u32 srq_depth; + u32 sgl_depth; + u32 msg_size; + u32 mq_index; + u32 mq_start; + u32 srq_handle; +} __attribute__((packed)) ; + +union c2wr_srq_create { + struct c2wr_srq_create_req req; + struct c2wr_srq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 srq_handle; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_srq_destroy { + struct c2wr_srq_destroy_req req; + struct c2wr_srq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ QP ------------------------ + */ +enum c2wr_qp_flags { + QP_RDMA_READ = 
0x00000001, /* RDMA read enabled? */ + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ + QP_MW_BIND = 0x00000004, /* MWs enabled */ + QP_ZERO_STAG = 0x00000008, /* enabled? */ + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ + /* enabled? */ +}; + +struct c2wr_qp_create_req { + struct c2wr_hdr hdr; + u64 shared_sq_ht; + u64 shared_rq_ht; + u64 user_context; + u32 rnic_handle; + u32 sq_cq_handle; + u32 rq_cq_handle; + u32 sq_depth; + u32 rq_depth; + u32 srq_handle; + u32 srq_limit; + u32 flags; /* see enum c2wr_qp_flags */ + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_qp_create_rep { + struct c2wr_hdr hdr; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; + u32 qp_handle; +} __attribute__((packed)) ; + +union c2wr_qp_create { + struct c2wr_qp_create_req req; + struct c2wr_qp_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 rnic_handle; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 rdma_write_sgl_depth; + u32 recv_sgl_depth; + u32 ord; + u32 ird; + u16 qp_state; + u16 flags; /* see c2wr_qp_flags_t */ + u32 qp_id; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; + u32 terminate_msg_length; /* 0 if not present */ + u8 data[0]; + /* Terminate Message in-line here. */ +} __attribute__((packed)) ; + +union c2wr_qp_query { + struct c2wr_qp_query_req req; + struct c2wr_qp_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_req { + struct c2wr_hdr hdr; + u64 stream_msg; + u32 stream_msg_length; + u32 rnic_handle; + u32 qp_handle; + u32 next_qp_state; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 llp_ep_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_rep { + struct c2wr_hdr hdr; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; +} __attribute__((packed)) ; + +union c2wr_qp_modify { + struct c2wr_qp_modify_req req; + struct c2wr_qp_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_qp_destroy { + struct c2wr_qp_destroy_req req; + struct c2wr_qp_destroy_rep rep; +} __attribute__((packed)) ; + +/* + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can + * only be posted when a QP is in IDLE state. After the connect request is + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. + * No synchronous reply from adapter to this WR. The results of + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS + * See c2wr_ae_active_connect_results_t + */ +struct c2wr_qp_connect_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; + u32 remote_addr; + u16 remote_port; + u16 pad; + u32 private_data_length; + u8 private_data[0]; /* Private data in-line. 
*/ +} __attribute__((packed)) ; + +struct c2wr_qp_connect { + struct c2wr_qp_connect_req req; + /* no synchronous reply. */ +} __attribute__((packed)) ; + + +/* + *------------------------ MM ------------------------ + */ + +struct c2wr_nsmr_stag_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pbl_depth; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +struct c2wr_nsmr_stag_alloc_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_stag_alloc { + struct c2wr_nsmr_stag_alloc_req req; + struct c2wr_nsmr_stag_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_register { + struct c2wr_nsmr_register_req req; + struct c2wr_nsmr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 flags; + u32 stag_index; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_nsmr_pbl { + struct c2wr_nsmr_pbl_req req; + struct c2wr_nsmr_pbl_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mr_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_mr_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; + u32 pbl_depth; +} __attribute__((packed)) ; + +union c2wr_mr_query { + struct c2wr_mr_query_req req; + struct c2wr_mr_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_mw_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +union c2wr_mw_query { + struct c2wr_mw_query_req req; + struct c2wr_mw_query_rep rep; +} __attribute__((packed)) ; + + +struct c2wr_stag_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_stag_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_stag_dealloc { + struct c2wr_stag_dealloc_req req; + struct c2wr_stag_dealloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + u32 pad1; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_reregister { + struct c2wr_nsmr_reregister_req req; + struct c2wr_nsmr_reregister_rep rep; +} __attribute__((packed)) ; + +struct c2wr_smr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; +} __attribute__((packed)) ; + 
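+/*
+ * Illustration only: every verbs WR in this file follows the same
+ * request/reply pattern.  The host fills in a *_req structure, stamps
+ * the common header with the matching CCWR_* id, posts the request on
+ * the verbs request queue, and reads hdr.result from the *_rep that
+ * comes back.  A sketch for the shared-MR register verb (my_cookie,
+ * rnic_handle, parent_stag_index and pd_id are assumed caller state,
+ * and the code that actually posts and waits is outside this header):
+ *
+ *	struct c2wr_smr_register_req req;
+ *
+ *	memset(&req, 0, sizeof(req));
+ *	c2_wr_set_id(&req, CCWR_SMR_REGISTER);
+ *	req.hdr.context = (u64) (unsigned long) my_cookie;
+ *	req.rnic_handle = rnic_handle;
+ *	req.stag_index  = parent_stag_index;
+ *	req.pd_id       = pd_id;
+ *	... post &req, then on the matching reply ...
+ *	if (c2_wr_get_result(rep) != C2_OK)
+ *		... map the c2_status code to an errno ...
+ */
+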
+struct c2wr_smr_register_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_smr_register { + struct c2wr_smr_register_req req; + struct c2wr_smr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_mw_alloc { + struct c2wr_mw_alloc_req req; + struct c2wr_mw_alloc_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ WRs ----------------------- + */ + +struct c2wr_user_hdr { + struct c2wr_hdr hdr; /* Has status and WR Type */ +} __attribute__((packed)) ; + +enum c2_qp_state { + C2_QP_STATE_IDLE = 0x01, + C2_QP_STATE_CONNECTING = 0x02, + C2_QP_STATE_RTS = 0x04, + C2_QP_STATE_CLOSING = 0x08, + C2_QP_STATE_TERMINATE = 0x10, + C2_QP_STATE_ERROR = 0x20, +}; + +/* Completion queue entry. */ +struct c2wr_ce { + struct c2wr_hdr hdr; /* Has status and WR Type */ + u64 qp_user_context; /* c2_user_qp_t * */ + u32 qp_state; /* Current QP State */ + u32 handle; /* QPID or EP Handle */ + u32 bytes_rcvd; /* valid for RECV WCs */ + u32 stag; +} __attribute__((packed)) ; + + +/* + * Flags used for all post-sq WRs. These must fit in the flags + * field of the struct c2wr_hdr (eight bits). + */ +enum { + SQ_SIGNALED = 0x01, + SQ_READ_FENCE = 0x02, + SQ_FENCE = 0x04, +}; + +/* + * Common fields for all post-sq WRs. Namely the standard header and a + * secondary header with fields common to all post-sq WRs. + */ +struct c2_sq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * Same as above but for post-rq WRs. + */ +struct c2_rq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * use the same struct for all sends. 
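+ *
+ * Illustration only: the scatter/gather list travels in-line in data[]
+ * as an array of struct c2_data_addr, with the entry count carried in
+ * the common header.  Roughly (wr is assumed to point at storage big
+ * enough for the header plus the SGEs, and lkey/dma_addr/len describe
+ * the caller's registered buffer; byte ordering and the meaning of
+ * sge_len are provider details not shown here):
+ *
+ *	struct c2wr_send_req *wr = ...;
+ *	struct c2_data_addr *sge = (struct c2_data_addr *) wr->data;
+ *
+ *	c2_wr_set_id(wr, C2_WR_TYPE_SEND);
+ *	c2_wr_set_flags(wr, SQ_SIGNALED);
+ *	c2_wr_set_sge_count(wr, 1);
+ *	sge->stag   = lkey;
+ *	sge->to     = dma_addr;
+ *	sge->length = len;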
+ */ +struct c2wr_send_req { + struct c2_sq_hdr sq_hdr; + u32 sge_len; + u32 remote_stag; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); + +union c2wr_send { + struct c2wr_send_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_write_req { + struct c2_sq_hdr sq_hdr; + u64 remote_to; + u32 remote_stag; + u32 sge_len; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); + +union c2wr_rdma_write { + struct c2wr_rdma_write_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_read_req { + struct c2_sq_hdr sq_hdr; + u64 local_to; + u64 remote_to; + u32 local_stag; + u32 remote_stag; + u32 length; +} __attribute__((packed)); + +union c2wr_rdma_read { + struct c2wr_rdma_read_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_mw_bind_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 mw_stag_index; + u32 mr_stag_index; + u32 length; + u32 flags; +} __attribute__((packed)); + +union c2wr_mw_bind { + struct c2wr_mw_bind_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_nsmr_fastreg_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 stag_index; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)); + +union c2wr_nsmr_fastreg { + struct c2wr_nsmr_fastreg_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_stag_invalidate_req { + struct c2_sq_hdr sq_hdr; + u8 stag_key; + u8 pad[3]; + u32 stag_index; +} __attribute__((packed)); + +union c2wr_stag_invalidate { + struct c2wr_stag_invalidate_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +union c2wr_sqwr { + struct c2_sq_hdr sq_hdr; + struct c2wr_send_req send; + struct c2wr_send_req send_se; + struct c2wr_send_req send_inv; + struct c2wr_send_req send_se_inv; + struct c2wr_rdma_write_req rdma_write; + struct c2wr_rdma_read_req rdma_read; + struct c2wr_mw_bind_req mw_bind; + struct c2wr_nsmr_fastreg_req nsmr_fastreg; + struct c2wr_stag_invalidate_req stag_inv; +} __attribute__((packed)); + + +/* + * RQ WRs + */ +struct c2wr_rqwr { + struct c2_rq_hdr rq_hdr; + u8 data[0]; /* array of SGEs */ +} __attribute__((packed)); + +union c2wr_recv { + struct c2wr_rqwr req; + struct c2wr_ce rep; +} __attribute__((packed)); + +/* + * All AEs start with this header. Most AEs only need to convey the + * information in the header. Some, like LLP connection events, need + * more info. The union typdef c2wr_ae_t has all the possible AEs. + * + * hdr.context is the user_context from the rnic_open WR. NULL If this + * is not affiliated with an rnic + * + * hdr.id is the AE identifier (eg; CCAE_REMOTE_SHUTDOWN, + * CCAE_LLP_CLOSE_COMPLETE) + * + * resource_type is one of: C2_RES_IND_QP, C2_RES_IND_CQ, C2_RES_IND_SRQ + * + * user_context is the context passed down when the host created the resource. + */ +struct c2wr_ae_hdr { + struct c2wr_hdr hdr; + u64 user_context; /* user context for this res. 
*/ + u32 resource_type; /* see enum c2_resource_indicator */ + u32 resource; /* handle for resource */ + u32 qp_state; /* current QP State */ +} __attribute__((packed)); + +/* + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, + * the adapter moves the QP into RTS state + */ +struct c2wr_ae_active_connect_results { + struct c2wr_ae_hdr ae_hdr; + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +/* + * When connections are established by the stack (and the private data + * MPA frame is received), the adapter will generate an event to the host. + * The details of the connection, any private data, and the new connection + * request handle is passed up via the CCAE_CONNECTION_REQUEST msg on the + * AE queue: + */ +struct c2wr_ae_connection_request { + struct c2wr_ae_hdr ae_hdr; + u32 cr_handle; /* connreq handle (sock ptr) */ + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +union c2wr_ae { + struct c2wr_ae_hdr ae_generic; + struct c2wr_ae_active_connect_results ae_active_connect_results; + struct c2wr_ae_connection_request ae_connection_request; +} __attribute__((packed)); + +struct c2wr_init_req { + struct c2wr_hdr hdr; + u64 hint_count; + u64 q0_host_shared; + u64 q1_host_shared; + u64 q1_host_msg_pool; + u64 q2_host_shared; + u64 q2_host_msg_pool; +} __attribute__((packed)); + +struct c2wr_init_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_init { + struct c2wr_init_req req; + struct c2wr_init_rep rep; +} __attribute__((packed)); + +/* + * For upgrading flash. + */ + +struct c2wr_flash_init_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +struct c2wr_flash_init_rep { + struct c2wr_hdr hdr; + u32 adapter_flash_buf_offset; + u32 adapter_flash_len; +} __attribute__((packed)); + +union c2wr_flash_init { + struct c2wr_flash_init_req req; + struct c2wr_flash_init_rep rep; +} __attribute__((packed)); + +struct c2wr_flash_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 len; +} __attribute__((packed)); + +struct c2wr_flash_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash { + struct c2wr_flash_req req; + struct c2wr_flash_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 size; +} __attribute__((packed)); + +struct c2wr_buf_alloc_rep { + struct c2wr_hdr hdr; + u32 offset; /* 0 if mem not available */ + u32 size; /* 0 if mem not available */ +} __attribute__((packed)); + +union c2wr_buf_alloc { + struct c2wr_buf_alloc_req req; + struct c2wr_buf_alloc_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_free_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; /* Must match value from alloc */ + u32 size; /* Must match value from alloc */ +} __attribute__((packed)); + +struct c2wr_buf_free_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_buf_free { + struct c2wr_buf_free_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_flash_write_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; + u32 size; + u32 type; + u32 flags; +} __attribute__((packed)); + +struct c2wr_flash_write_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash_write { + struct c2wr_flash_write_req req; + struct 
c2wr_flash_write_rep rep; +} __attribute__((packed)); + +/* + * Messages for LLP connection setup. + */ + +/* + * Listen Request. This allocates a listening endpoint to allow passive + * connection setup. Newly established LLP connections are passed up + * via an AE. See c2wr_ae_connection_request_t + */ +struct c2wr_ep_listen_create_req { + struct c2wr_hdr hdr; + u64 user_context; /* returned in AEs. */ + u32 rnic_handle; + u32 local_addr; /* local addr, or 0 */ + u16 local_port; /* 0 means "pick one" */ + u16 pad; + u32 backlog; /* tradional tcp listen bl */ +} __attribute__((packed)); + +struct c2wr_ep_listen_create_rep { + struct c2wr_hdr hdr; + u32 ep_handle; /* handle to new listening ep */ + u16 local_port; /* resulting port... */ + u16 pad; +} __attribute__((packed)); + +union c2wr_ep_listen_create { + struct c2wr_ep_listen_create_req req; + struct c2wr_ep_listen_create_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_ep_listen_destroy { + struct c2wr_ep_listen_destroy_req req; + struct c2wr_ep_listen_destroy_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_query_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +} __attribute__((packed)); + +union c2wr_ep_query { + struct c2wr_ep_query_req req; + struct c2wr_ep_query_rep rep; +} __attribute__((packed)); + + +/* + * The host passes this down to indicate acceptance of a pending iWARP + * connection. The cr_handle was obtained from the CONNECTION_REQUEST + * AE passed up by the adapter. See c2wr_ae_connection_request_t. + */ +struct c2wr_cr_accept_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; /* QP to bind to this LLP conn */ + u32 ep_handle; /* LLP handle to accept */ + u32 private_data_length; + u8 private_data[0]; /* data in-line in msg. */ +} __attribute__((packed)); + +/* + * adapter sends reply when private data is successfully submitted to + * the LLP. + */ +struct c2wr_cr_accept_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_accept { + struct c2wr_cr_accept_req req; + struct c2wr_cr_accept_rep rep; +} __attribute__((packed)); + +/* + * The host sends this down if a given iWARP connection request was + * rejected by the consumer. The cr_handle was obtained from a + * previous c2wr_ae_connection_request_t AE sent by the adapter. + */ +struct c2wr_cr_reject_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; /* LLP handle to reject */ +} __attribute__((packed)); + +/* + * Dunno if this is needed, but we'll add it for now. The adapter will + * send the reject_reply after the LLP endpoint has been destroyed. + */ +struct c2wr_cr_reject_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_reject { + struct c2wr_cr_reject_req req; + struct c2wr_cr_reject_rep rep; +} __attribute__((packed)); + +/* + * console command. Used to implement a debug console over the verbs + * request and reply queues. + */ + +/* + * Console request message. It contains: + * - message hdr with id = CCWR_CONSOLE + * - the physaddr/len of host memory to be used for the reply. + * - the command string. 
eg: "netstat -s" or "zoneinfo" + */ +struct c2wr_console_req { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u64 reply_buf; /* pinned host buf for reply */ + u32 reply_buf_len; /* length of reply buffer */ + u8 command[0]; /* NUL terminated ascii string */ + /* containing the command req */ +} __attribute__((packed)); + +/* + * flags used in the console reply. + */ +enum c2_console_flags { + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ +} __attribute__((packed)); + +/* + * Console reply message. + * hdr.result contains the c2_status_t error if the reply was _not_ generated, + * or C2_OK if the reply was generated. + */ +struct c2wr_console_rep { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u32 flags; +} __attribute__((packed)); + +union c2wr_console { + struct c2wr_console_req req; + struct c2wr_console_rep rep; +} __attribute__((packed)); + + +/* + * Giant union with all WRs. Makes life easier... + */ +union c2wr { + struct c2wr_hdr hdr; + struct c2wr_user_hdr user_hdr; + union c2wr_rnic_open rnic_open; + union c2wr_rnic_query rnic_query; + union c2wr_rnic_getconfig rnic_getconfig; + union c2wr_rnic_setconfig rnic_setconfig; + union c2wr_rnic_close rnic_close; + union c2wr_cq_create cq_create; + union c2wr_cq_modify cq_modify; + union c2wr_cq_destroy cq_destroy; + union c2wr_pd_alloc pd_alloc; + union c2wr_pd_dealloc pd_dealloc; + union c2wr_srq_create srq_create; + union c2wr_srq_destroy srq_destroy; + union c2wr_qp_create qp_create; + union c2wr_qp_query qp_query; + union c2wr_qp_modify qp_modify; + union c2wr_qp_destroy qp_destroy; + struct c2wr_qp_connect qp_connect; + union c2wr_nsmr_stag_alloc nsmr_stag_alloc; + union c2wr_nsmr_register nsmr_register; + union c2wr_nsmr_pbl nsmr_pbl; + union c2wr_mr_query mr_query; + union c2wr_mw_query mw_query; + union c2wr_stag_dealloc stag_dealloc; + union c2wr_sqwr sqwr; + struct c2wr_rqwr rqwr; + struct c2wr_ce ce; + union c2wr_ae ae; + union c2wr_init init; + union c2wr_ep_listen_create ep_listen_create; + union c2wr_ep_listen_destroy ep_listen_destroy; + union c2wr_cr_accept cr_accept; + union c2wr_cr_reject cr_reject; + union c2wr_console console; + union c2wr_flash_init flash_init; + union c2wr_flash flash; + union c2wr_buf_alloc buf_alloc; + union c2wr_buf_free buf_free; + union c2wr_flash_write flash_write; +} __attribute__((packed)); + + +/* + * Accessors for the wr fields that are packed together tightly to + * reduce the wr message size. The wr arguments are void* so that + * either a struct c2wr*, a struct c2wr_hdr*, or a pointer to any of the types + * in the struct c2wr union can be passed in. 
+ */ +static __inline__ u8 c2_wr_get_id(void *wr) +{ + return ((struct c2wr_hdr *) wr)->id; +} +static __inline__ void c2_wr_set_id(void *wr, u8 id) +{ + ((struct c2wr_hdr *) wr)->id = id; +} +static __inline__ u8 c2_wr_get_result(void *wr) +{ + return ((struct c2wr_hdr *) wr)->result; +} +static __inline__ void c2_wr_set_result(void *wr, u8 result) +{ + ((struct c2wr_hdr *) wr)->result = result; +} +static __inline__ u8 c2_wr_get_flags(void *wr) +{ + return ((struct c2wr_hdr *) wr)->flags; +} +static __inline__ void c2_wr_set_flags(void *wr, u8 flags) +{ + ((struct c2wr_hdr *) wr)->flags = flags; +} +static __inline__ u8 c2_wr_get_sge_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->sge_count; +} +static __inline__ void c2_wr_set_sge_count(void *wr, u8 sge_count) +{ + ((struct c2wr_hdr *) wr)->sge_count = sge_count; +} +static __inline__ u32 c2_wr_get_wqe_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->wqe_count; +} +static __inline__ void c2_wr_set_wqe_count(void *wr, u32 wqe_count) +{ + ((struct c2wr_hdr *) wr)->wqe_count = wqe_count; +} + +#endif /* _C2_WR_H_ */ From swise at opengridcomputing.com Thu Aug 3 14:07:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:07:25 -0500 Subject: [openib-general] [PATCH v4 1/7] AMSO1100 Low Level Driver. In-Reply-To: <20060803210723.16572.34829.stgit@dell3.ogc.int> References: <20060803210723.16572.34829.stgit@dell3.ogc.int> Message-ID: <20060803210725.16572.88167.stgit@dell3.ogc.int> This is the core of the driver and includes the hardware probe, low-level device interfaces and native Ethernet support. --- drivers/infiniband/hw/amso1100/c2.c | 1255 ++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2.h | 552 +++++++++++++ drivers/infiniband/hw/amso1100/c2_ae.c | 321 ++++++++ drivers/infiniband/hw/amso1100/c2_intr.c | 209 +++++ drivers/infiniband/hw/amso1100/c2_rnic.c | 664 ++++++++++++++++ 5 files changed, 3001 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c new file mode 100644 index 0000000..4fdbd80 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.c @@ -0,0 +1,1255 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include "c2.h" +#include "c2_provider.h" + +MODULE_AUTHOR("Tom Tucker "); +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; + +static int debug = -1; /* defaults above */ +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + +static int c2_up(struct net_device *netdev); +static int c2_down(struct net_device *netdev); +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +static void c2_tx_interrupt(struct net_device *netdev); +static void c2_rx_interrupt(struct net_device *netdev); +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs); +static void c2_tx_timeout(struct net_device *netdev); +static int c2_change_mtu(struct net_device *netdev, int new_mtu); +static void c2_reset(struct c2_port *c2_port); +static struct net_device_stats *c2_get_stats(struct net_device *netdev); + +static struct pci_device_id c2_pci_table[] = { + {0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID}, + {0} +}; + +MODULE_DEVICE_TABLE(pci, c2_pci_table); + +static void c2_print_macaddr(struct net_device *netdev) +{ + pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, " + "IRQ %u\n", netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], + netdev->irq); +} + +static void c2_set_rxbufsize(struct c2_port *c2_port) +{ + struct net_device *netdev = c2_port->netdev; + + if (netdev->mtu > RX_BUF_SIZE) + c2_port->rx_buf_size = + netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + + NET_IP_ALIGN; + else + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; +} + +/* + * Allocate TX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. 
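As a sketch of the ring layout this comment describes (a user-space model with illustrative names, not the driver's struct c2_element): each element's next pointer chains to the following element and the last element points back to the start, so the whole ring can be walked with the do/while idiom used throughout this driver.

#include <stdio.h>
#include <stdlib.h>

struct element {
	struct element *next;
	int index;			/* stands in for per-descriptor state */
};

/* Allocate count elements and link them into a circular list. */
static struct element *ring_alloc(int count)
{
	struct element *start = calloc(count, sizeof(*start));
	int i;

	if (!start)
		return NULL;
	for (i = 0; i < count; i++) {
		start[i].index = i;
		start[i].next = (i == count - 1) ? start : &start[i + 1];
	}
	return start;
}

int main(void)
{
	struct element *start = ring_alloc(4);
	struct element *elem;

	if (!start)
		return 1;
	elem = start;
	do {				/* exactly one pass around the ring */
		printf("element %d\n", elem->index);
	} while ((elem = elem->next) != start);
	free(start);
	return 0;
}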
+ */ +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_txp_ring) +{ + struct c2_tx_desc *tx_desc; + struct c2_txp_desc __iomem *txp_desc; + struct c2_element *elem; + int i; + + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); + if (!tx_ring->start) + return -ENOMEM; + + elem = tx_ring->start; + tx_desc = vaddr; + txp_desc = mmio_txp_ring; + for (i = 0; i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) { + tx_desc->len = 0; + tx_desc->status = 0; + + /* Set TXP_HTXD_UNINIT */ + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + (void __iomem *) txp_desc + C2_TXP_ADDR); + __raw_writew(0, (void __iomem *) txp_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + (void __iomem *) txp_desc + C2_TXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = tx_desc; + elem->hw_desc = txp_desc; + + if (i == tx_ring->count - 1) { + elem->next = tx_ring->start; + tx_desc->next_offset = base; + } else { + elem->next = elem + 1; + tx_desc->next_offset = + base + (i + 1) * sizeof(*tx_desc); + } + } + + tx_ring->to_use = tx_ring->to_clean = tx_ring->start; + + return 0; +} + +/* + * Allocate RX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. + */ +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_rxp_ring) +{ + struct c2_rx_desc *rx_desc; + struct c2_rxp_desc __iomem *rxp_desc; + struct c2_element *elem; + int i; + + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); + if (!rx_ring->start) + return -ENOMEM; + + elem = rx_ring->start; + rx_desc = vaddr; + rxp_desc = mmio_rxp_ring; + for (i = 0; i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) { + rx_desc->len = 0; + rx_desc->status = 0; + + /* Set RXP_HRXD_UNINIT */ + __raw_writew(cpu_to_be16(RXP_HRXD_OK), + (void __iomem *) rxp_desc + C2_RXP_STATUS); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_COUNT); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + (void __iomem *) rxp_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + (void __iomem *) rxp_desc + C2_RXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = rx_desc; + elem->hw_desc = rxp_desc; + + if (i == rx_ring->count - 1) { + elem->next = rx_ring->start; + rx_desc->next_offset = base; + } else { + elem->next = elem + 1; + rx_desc->next_offset = + base + (i + 1) * sizeof(*rx_desc); + } + } + + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; + + return 0; +} + +/* Setup buffer for receiving */ +static inline int c2_rx_alloc(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; + struct c2_rxp_hdr *rxp_hdr; + + skb = dev_alloc_skb(c2_port->rx_buf_size); + if (unlikely(!skb)) { + pr_debug("%s: out of memory for receive\n", + c2_port->netdev->name); + return -ENOMEM; + } + + /* Zero out the rxp hdr in the sk_buff */ + memset(skb->data, 0, sizeof(*rxp_hdr)); + + skb->dev = c2_port->netdev; + + maplen = c2_port->rx_buf_size; + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, + PCI_DMA_FROMDEVICE); + + /* Set the sk_buff RXP_header to RXP_HRXD_READY */ + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + rxp_hdr->flags = RXP_HRXD_READY; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(cpu_to_be16((u16) maplen - sizeof(*rxp_hdr)), + 
elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + rx_desc->len = maplen; + + return 0; +} + +/* + * Allocate buffers for the Rx ring + * For receive: rx_ring.to_clean is next received frame + */ +static int c2_rx_fill(struct c2_port *c2_port) +{ + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + int ret = 0; + + elem = rx_ring->start; + do { + if (c2_rx_alloc(c2_port, elem)) { + ret = 1; + break; + } + } while ((elem = elem->next) != rx_ring->start); + + rx_ring->to_clean = rx_ring->start; + return ret; +} + +/* Free all buffers in RX ring, assumes receiver stopped */ +static void c2_rx_clean(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + + elem = rx_ring->start; + do { + rx_desc = elem->ht_desc; + rx_desc->len = 0; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + elem->hw_desc + C2_RXP_FLAGS); + + if (elem->skb) { + pci_unmap_single(c2dev->pcidev, elem->mapaddr, + elem->maplen, PCI_DMA_FROMDEVICE); + dev_kfree_skb(elem->skb); + elem->skb = NULL; + } + } while ((elem = elem->next) != rx_ring->start); +} + +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element *elem) +{ + struct c2_tx_desc *tx_desc = elem->ht_desc; + + tx_desc->len = 0; + + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, + PCI_DMA_TODEVICE); + + if (elem->skb) { + dev_kfree_skb_any(elem->skb); + elem->skb = NULL; + } + + return 0; +} + +/* Free all buffers in TX ring, assumes transmitter stopped */ +static void c2_tx_clean(struct c2_port *c2_port) +{ + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + int retry; + unsigned long flags; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + elem = tx_ring->start; + + do { + retry = 0; + do { + txp_htxd.flags = + readw(elem->hw_desc + C2_TXP_FLAGS); + + if (txp_htxd.flags == TXP_HTXD_READY) { + retry = 1; + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(0, + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_DONE), + elem->hw_desc + C2_TXP_FLAGS); + c2_port->netstats.tx_dropped++; + break; + } else { + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + elem->hw_desc + C2_TXP_FLAGS); + } + + c2_tx_free(c2_port->c2dev, elem); + + } while ((elem = elem->next) != tx_ring->start); + } while (retry); + + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; + + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); +} + +/* + * Process transmit descriptors marked 'DONE' by the firmware, + * freeing up their unneeded sk_buffs. 
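The processing loop that follows can be modeled with a small stand-alone consumer sketch (array-based, with illustrative names only): advance from the clean index toward the producer's index, stop at the first slot the firmware has not yet marked DONE, and reclaim everything passed along the way.

#include <stdio.h>

#define RING_SIZE 8
#define SLOT_DONE 1

struct slot {
	int flags;			/* set to SLOT_DONE by the "firmware" */
	int in_use;
};

static struct slot ring[RING_SIZE];
static int to_clean, to_use;

/* Reclaim completed slots, stopping at the first one not yet DONE. */
static void clean_completed(void)
{
	while (to_clean != to_use && ring[to_clean].flags == SLOT_DONE) {
		ring[to_clean].in_use = 0;	/* would free the sk_buff here */
		ring[to_clean].flags = 0;
		to_clean = (to_clean + 1) % RING_SIZE;
	}
}

int main(void)
{
	int i;

	to_use = 4;				/* four descriptors posted */
	for (i = 0; i < 4; i++)
		ring[i].in_use = 1;
	ring[0].flags = ring[1].flags = SLOT_DONE;	/* two completed */

	clean_completed();
	printf("next to clean: %d\n", to_clean);	/* prints 2 */
	return 0;
}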
+ */ +static void c2_tx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + + spin_lock(&c2_port->tx_lock); + + for (elem = tx_ring->to_clean; elem != tx_ring->to_use; + elem = elem->next) { + txp_htxd.flags = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_FLAGS)); + + if (txp_htxd.flags != TXP_HTXD_DONE) + break; + + if (netif_msg_tx_done(c2_port)) { + /* PCI reads are expensive in fast path */ + txp_htxd.len = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_LEN)); + pr_debug("%s: tx done slot %3Zu status 0x%x len " + "%5u bytes\n", + netdev->name, elem - tx_ring->start, + txp_htxd.flags, txp_htxd.len); + } + + c2_tx_free(c2dev, elem); + ++(c2_port->tx_avail); + } + + tx_ring->to_clean = elem; + + if (netif_queue_stopped(netdev) + && c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(netdev); + + spin_unlock(&c2_port->tx_lock); +} + +static void c2_rx_error(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + + if (rxp_hdr->status != RXP_HRXD_OK || + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { + pr_debug("BAD RXP_HRXD\n"); + pr_debug(" rx_desc : %p\n", rx_desc); + pr_debug(" index : %Zu\n", + elem - c2_port->rx_ring.start); + pr_debug(" len : %u\n", rx_desc->len); + pr_debug(" rxp_hdr : %p [PA %p]\n", rxp_hdr, + (void *) __pa((unsigned long) rxp_hdr)); + pr_debug(" flags : 0x%x\n", rxp_hdr->flags); + pr_debug(" status: 0x%x\n", rxp_hdr->status); + pr_debug(" len : %u\n", rxp_hdr->len); + pr_debug(" rsvd : 0x%x\n", rxp_hdr->rsvd); + } + + /* Setup the skb for reuse since we're dropping this pkt */ + elem->skb->tail = elem->skb->data = elem->skb->head; + + /* Zero out the rxp hdr in the sk_buff */ + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); + + /* Write the descriptor to the adapter's rx ring */ + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(cpu_to_be16((u16) elem->maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(elem->mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + pr_debug("packet dropped\n"); + c2_port->netstats.rx_dropped++; +} + +static void c2_rx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + struct c2_rxp_hdr *rxp_hdr; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen, buflen; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + + /* Begin where we left off */ + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; + + for (elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; + elem = elem->next) { + rx_desc = elem->ht_desc; + mapaddr = elem->mapaddr; + maplen = elem->maplen; + skb = elem->skb; + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + + if (rxp_hdr->flags != RXP_HRXD_DONE) + break; + buflen = rxp_hdr->len; + + /* Sanity check the RXP header */ + if (rxp_hdr->status != RXP_HRXD_OK || + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { + c2_rx_error(c2_port, elem); + continue; + } + + /* + * Allocate and map a new skb for replenishing the host + * RX desc + */ + if (c2_rx_alloc(c2_port, 
elem)) { + c2_rx_error(c2_port, elem); + continue; + } + + /* Unmap the old skb */ + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, + PCI_DMA_FROMDEVICE); + + prefetch(skb->data); + + /* + * Skip past the leading 8 bytes comprising of the + * "struct c2_rxp_hdr", prepended by the adapter + * to the usual Ethernet header ("struct ethhdr"), + * to the start of the raw Ethernet packet. + * + * Fix up the various fields in the sk_buff before + * passing it up to netif_rx(). The transfer size + * (in bytes) specified by the adapter len field of + * the "struct rxp_hdr_t" does NOT include the + * "sizeof(struct c2_rxp_hdr)". + */ + skb->data += sizeof(*rxp_hdr); + skb->tail = skb->data + buflen; + skb->len = buflen; + skb->dev = netdev; + skb->protocol = eth_type_trans(skb, netdev); + + netif_rx(skb); + + netdev->last_rx = jiffies; + c2_port->netstats.rx_packets++; + c2_port->netstats.rx_bytes += buflen; + } + + /* Save where we left off */ + rx_ring->to_clean = elem; + c2dev->cur_rx = elem - rx_ring->start; + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + spin_unlock_irqrestore(&c2dev->lock, flags); +} + +/* + * Handle netisr0 TX & RX interrupts. + */ +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + unsigned int netisr0, dmaisr; + int handled = 0; + struct c2_dev *c2dev = (struct c2_dev *) dev_id; + + /* Process CCILNET interrupts */ + netisr0 = readl(c2dev->regs + C2_NISR0); + if (netisr0) { + + /* + * There is an issue with the firmware that always + * provides the status of RX for both TX & RX + * interrupts. So process both queues here. + */ + c2_rx_interrupt(c2dev->netdev); + c2_tx_interrupt(c2dev->netdev); + + /* Clear the interrupt */ + writel(netisr0, c2dev->regs + C2_NISR0); + handled++; + } + + /* Process RNIC interrupts */ + dmaisr = readl(c2dev->regs + C2_DISR); + if (dmaisr) { + writel(dmaisr, c2dev->regs + C2_DISR); + c2_rnic_interrupt(c2dev); + handled++; + } + + if (handled) { + return IRQ_HANDLED; + } else { + return IRQ_NONE; + } +} + +static int c2_up(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_element *elem; + struct c2_rxp_hdr *rxp_hdr; + struct in_device *in_dev; + size_t rx_size, tx_size; + int ret, i; + unsigned int netimr0; + + if (netif_msg_ifup(c2_port)) + pr_debug("%s: enabling interface\n", netdev->name); + + /* Set the Rx buffer size based on MTU */ + c2_set_rxbufsize(c2_port); + + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); + + c2_port->mem_size = tx_size + rx_size; + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size, + &c2_port->dma); + if (c2_port->mem == NULL) { + pr_debug("Unable to allocate memory for " + "host descriptor rings\n"); + return -ENOMEM; + } + + memset(c2_port->mem, 0, c2_port->mem_size); + + /* Create the Rx host descriptor ring */ + if ((ret = + c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma, + c2dev->mmio_rxp_ring))) { + pr_debug("Unable to create RX ring\n"); + goto bail0; + } + + /* Allocate Rx buffers for the host descriptor ring */ + if (c2_rx_fill(c2_port)) { + pr_debug("Unable to fill RX ring\n"); + goto bail1; + } + + /* Create the Tx host descriptor ring */ + if ((ret = c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + rx_size, + c2_port->dma + rx_size, + c2dev->mmio_txp_ring))) { + pr_debug("Unable to create TX ring\n"); + goto bail1; + } 
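/*
 * Note on the layout set up just above: the host descriptor rings share a
 * single coherent DMA allocation.  The RX descriptors occupy the first
 * rx_size bytes of c2_port->mem, and the TX descriptors start at
 * mem + rx_size (dma + rx_size in bus-address space, which is what each
 * descriptor's next_offset refers to).
 */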
+ + /* Set the TX pointer to where we left off */ + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = + c2_port->tx_ring.start + c2dev->cur_tx; + + /* missing: Initialize MAC */ + + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ + for (i = 0, elem = c2_port->rx_ring.start; i < c2_port->rx_ring.count; + i++, elem++) { + rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + rxp_hdr->flags = 0; + __raw_writew(cpu_to_be16(RXP_HRXD_READY), + elem->hw_desc + C2_RXP_FLAGS); + } + + /* Enable network packets */ + netif_start_queue(netdev); + + /* Enable IRQ */ + writel(0, c2dev->regs + C2_IDIS); + netimr0 = readl(c2dev->regs + C2_NIMR0); + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); + writel(netimr0, c2dev->regs + C2_NIMR0); + + /* Tell the stack to ignore arp requests for ipaddrs bound to + * other interfaces. This is needed to prevent the host stack + * from responding to arp requests to the ipaddr bound on the + * rdma interface. + */ + in_dev = in_dev_get(netdev); + in_dev->cnf.arp_ignore = 1; + in_dev_put(in_dev); + + return 0; + + bail1: + c2_rx_clean(c2_port); + kfree(c2_port->rx_ring.start); + + bail0: + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return ret; +} + +static int c2_down(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + + if (netif_msg_ifdown(c2_port)) + pr_debug("%s: disabling interface\n", + netdev->name); + + /* Wait for all the queued packets to get sent */ + c2_tx_interrupt(netdev); + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Disable IRQs by clearing the interrupt mask */ + writel(1, c2dev->regs + C2_IDIS); + writel(0, c2dev->regs + C2_NIMR0); + + /* missing: Stop transmitter */ + + /* missing: Stop receiver */ + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* missing: Turn off LEDs here */ + + /* Free all buffers in the host descriptor rings */ + c2_tx_clean(c2_port); + c2_rx_clean(c2_port); + + /* Free the host descriptor rings */ + kfree(c2_port->rx_ring.start); + kfree(c2_port->tx_ring.start); + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return 0; +} + +static void c2_reset(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + unsigned int cur_rx = c2dev->cur_rx; + + /* Tell the hardware to quiesce */ + C2_SET_CUR_RX(c2dev, cur_rx | C2_PCI_HRX_QUI); + + /* + * The hardware will reset the C2_PCI_HRX_QUI bit once + * the RXP is quiesced. Wait 2 seconds for this. 
+ */ + ssleep(2); + + cur_rx = C2_GET_CUR_RX(c2dev); + + if (cur_rx & C2_PCI_HRX_QUI) + pr_debug("c2_reset: failed to quiesce the hardware!\n"); + + cur_rx &= ~C2_PCI_HRX_QUI; + + c2dev->cur_rx = cur_rx; + + pr_debug("Current RX: %u\n", c2dev->cur_rx); +} + +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + dma_addr_t mapaddr; + u32 maplen; + unsigned long flags; + unsigned int i; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + pr_debug("%s: Tx ring full when queue awake!\n", + netdev->name); + return NETDEV_TX_BUSY; + } + + maplen = skb_headlen(skb); + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_TODEVICE); + + elem = tx_ring->to_use; + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + + /* Loop thru additional data fragments and queue them */ + if (skb_shinfo(skb)->nr_frags) { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + maplen = frag->size; + mapaddr = + pci_map_page(c2dev->pcidev, frag->page, + frag->page_offset, maplen, + PCI_DMA_TODEVICE); + + elem = elem->next; + elem->skb = NULL; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), + elem->hw_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_READY), + elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + } + } + + tx_ring->to_use = elem->next; + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); + + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { + netif_stop_queue(netdev); + if (netif_msg_tx_queued(c2_port)) + pr_debug("%s: transmit queue full\n", + netdev->name); + } + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + netdev->trans_start = jiffies; + + return NETDEV_TX_OK; +} + +static struct net_device_stats *c2_get_stats(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + return &c2_port->netstats; +} + +static void c2_tx_timeout(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + if (netif_msg_timer(c2_port)) + pr_debug("%s: tx timeout\n", netdev->name); + + c2_tx_clean(c2_port); +} + +static int c2_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + if (netif_running(netdev)) { + c2_down(netdev); + + c2_up(netdev); + } + + return ret; +} + +/* Initialize network device */ +static struct net_device *c2_devinit(struct c2_dev *c2dev, + void __iomem * mmio_addr) +{ + struct c2_port *c2_port = NULL; + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); + + if (!netdev) { + pr_debug("c2_port etherdev alloc failed"); + return NULL; + } + + SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, 
&c2dev->pcidev->dev); + + netdev->open = c2_up; + netdev->stop = c2_down; + netdev->hard_start_xmit = c2_xmit_frame; + netdev->get_stats = c2_get_stats; + netdev->tx_timeout = c2_tx_timeout; + netdev->change_mtu = c2_change_mtu; + netdev->watchdog_timeo = C2_TX_TIMEOUT; + netdev->irq = c2dev->pcidev->irq; + + c2_port = netdev_priv(netdev); + c2_port->netdev = netdev; + c2_port->c2dev = c2dev; + c2_port->msg_enable = netif_msg_init(debug, default_msg); + c2_port->tx_ring.count = C2_NUM_TX_DESC; + c2_port->rx_ring.count = C2_NUM_RX_DESC; + + spin_lock_init(&c2_port->tx_lock); + + /* Copy our 48-bit ethernet hardware address */ + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); + + /* Validate the MAC address */ + if (!is_valid_ether_addr(netdev->dev_addr)) { + pr_debug("Invalid MAC Address\n"); + c2_print_macaddr(netdev); + free_netdev(netdev); + return NULL; + } + + c2dev->netdev = netdev; + + return netdev; +} + +static int __devinit c2_probe(struct pci_dev *pcidev, + const struct pci_device_id *ent) +{ + int ret = 0, i; + unsigned long reg0_start, reg0_flags, reg0_len; + unsigned long reg2_start, reg2_flags, reg2_len; + unsigned long reg4_start, reg4_flags, reg4_len; + unsigned kva_map_size; + struct net_device *netdev = NULL; + struct c2_dev *c2dev = NULL; + void __iomem *mmio_regs = NULL; + + printk(KERN_INFO PFX "AMSO1100 Gigabit Ethernet driver v%s loaded\n", + DRV_VERSION); + + /* Enable PCI device */ + ret = pci_enable_device(pcidev); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to enable PCI device\n", + pci_name(pcidev)); + goto bail0; + } + + reg0_start = pci_resource_start(pcidev, BAR_0); + reg0_len = pci_resource_len(pcidev, BAR_0); + reg0_flags = pci_resource_flags(pcidev, BAR_0); + + reg2_start = pci_resource_start(pcidev, BAR_2); + reg2_len = pci_resource_len(pcidev, BAR_2); + reg2_flags = pci_resource_flags(pcidev, BAR_2); + + reg4_start = pci_resource_start(pcidev, BAR_4); + reg4_len = pci_resource_len(pcidev, BAR_4); + reg4_flags = pci_resource_flags(pcidev, BAR_4); + + pr_debug("BAR0 size = 0x%lX bytes\n", reg0_len); + pr_debug("BAR2 size = 0x%lX bytes\n", reg2_len); + pr_debug("BAR4 size = 0x%lX bytes\n", reg4_len); + + /* Make sure PCI base addr are MMIO */ + if (!(reg0_flags & IORESOURCE_MEM) || + !(reg2_flags & IORESOURCE_MEM) || !(reg4_flags & IORESOURCE_MEM)) { + printk(KERN_ERR PFX "PCI regions not an MMIO resource\n"); + ret = -ENODEV; + goto bail1; + } + + /* Check for weird/broken PCI region reporting */ + if ((reg0_len < C2_REG0_SIZE) || + (reg2_len < C2_REG2_SIZE) || (reg4_len < C2_REG4_SIZE)) { + printk(KERN_ERR PFX "Invalid PCI region sizes\n"); + ret = -ENODEV; + goto bail1; + } + + /* Reserve PCI I/O and memory resources */ + ret = pci_request_regions(pcidev, DRV_NAME); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to request regions\n", + pci_name(pcidev)); + goto bail1; + } + + if ((sizeof(dma_addr_t) > 4)) { + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "64b DMA configuration failed\n"); + goto bail2; + } + } else { + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "32b DMA configuration failed\n"); + goto bail2; + } + } + + /* Enables bus-mastering on the device */ + pci_set_master(pcidev); + + /* Remap the adapter PCI registers in BAR4 */ + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + sizeof(struct c2_adapter_pci_regs)); + if (mmio_regs == 0UL) { + printk(KERN_ERR PFX + "Unable to remap adapter PCI registers in BAR4\n"); + ret = -EIO; 
+ goto bail2; + } + + /* Validate PCI regs magic */ + for (i = 0; i < sizeof(c2_magic); i++) { + if (c2_magic[i] != readb(mmio_regs + C2_REGS_MAGIC + i)) { + printk(KERN_ERR PFX "Downlevel Firmware boot loader " + "[%d/%Zd: got 0x%x, exp 0x%x]. Use the cc_flash " + "utility to update your boot loader\n", + i + 1, sizeof(c2_magic), + readb(mmio_regs + C2_REGS_MAGIC + i), + c2_magic[i]); + printk(KERN_ERR PFX "Adapter not claimed\n"); + iounmap(mmio_regs); + ret = -EIO; + goto bail2; + } + } + + /* Validate the adapter version */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { + printk(KERN_ERR PFX "Version mismatch " + "[fw=%u, c2=%u], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)), + C2_VERSION); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Validate the adapter IVN */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)) != C2_IVN) { + printk(KERN_ERR PFX "Downlevel FIrmware level. You should be using " + "the OpenIB device support kit. " + "[fw=0x%x, c2=0x%x], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)), + C2_IVN); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Allocate hardware structure */ + c2dev = (struct c2_dev *) ib_alloc_device(sizeof(*c2dev)); + if (!c2dev) { + printk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", + pci_name(pcidev)); + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail2; + } + + memset(c2dev, 0, sizeof(*c2dev)); + spin_lock_init(&c2dev->lock); + c2dev->pcidev = pcidev; + c2dev->cur_tx = 0; + + /* Get the last RX index */ + c2dev->cur_rx = + (be32_to_cpu(readl(mmio_regs + C2_REGS_HRX_CUR)) - + 0xffffc000) / sizeof(struct c2_rxp_desc); + + /* Request an interrupt line for the driver */ + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, c2dev); + if (ret) { + printk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", + pci_name(pcidev), pcidev->irq); + iounmap(mmio_regs); + goto bail3; + } + + /* Set driver specific data */ + pci_set_drvdata(pcidev, c2dev); + + /* Initialize network device */ + if ((netdev = c2_devinit(c2dev, mmio_regs)) == NULL) { + iounmap(mmio_regs); + goto bail4; + } + + /* Save off the actual size prior to unmapping mmio_regs */ + kva_map_size = be32_to_cpu(readl(mmio_regs + C2_REGS_PCI_WINSIZE)); + + /* Unmap the adapter PCI registers in BAR4 */ + iounmap(mmio_regs); + + /* Register network device */ + ret = register_netdev(netdev); + if (ret) { + printk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", + ret); + goto bail5; + } + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Remap the adapter HRXDQ PA space to kernel VA space */ + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, + C2_RXP_HRXDQ_SIZE); + if (c2dev->mmio_rxp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); + ret = -EIO; + goto bail6; + } + + /* Remap the adapter HTXDQ PA space to kernel VA space */ + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, + C2_TXP_HTXDQ_SIZE); + if (c2dev->mmio_txp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); + ret = -EIO; + goto bail7; + } + + /* Save off the current RX index in the last 4 bytes of the TXP Ring */ + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ + c2dev->regs = ioremap_nocache(reg0_start, reg0_len); + if (c2dev->regs == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR0\n"); + ret = -EIO; + goto bail8; + } + + /* Remap the PCI 
registers in adapter BAR4 to kernel VA space */ + c2dev->pa = reg4_start + C2_PCI_REGS_OFFSET; + c2dev->kva = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + kva_map_size); + if (c2dev->kva == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR4\n"); + ret = -EIO; + goto bail9; + } + + /* Print out the MAC address */ + c2_print_macaddr(netdev); + + ret = c2_rnic_init(c2dev); + if (ret) { + printk(KERN_ERR PFX "c2_rnic_init failed: %d\n", ret); + goto bail10; + } + + c2_register_device(c2dev); + + return 0; + + bail10: + iounmap(c2dev->kva); + + bail9: + iounmap(c2dev->regs); + + bail8: + iounmap(c2dev->mmio_txp_ring); + + bail7: + iounmap(c2dev->mmio_rxp_ring); + + bail6: + unregister_netdev(netdev); + + bail5: + free_netdev(netdev); + + bail4: + free_irq(pcidev->irq, c2dev); + + bail3: + ib_dealloc_device(&c2dev->ibdev); + + bail2: + pci_release_regions(pcidev); + + bail1: + pci_disable_device(pcidev); + + bail0: + return ret; +} + +static void __devexit c2_remove(struct pci_dev *pcidev) +{ + struct c2_dev *c2dev = pci_get_drvdata(pcidev); + struct net_device *netdev = c2dev->netdev; + + /* Unregister with OpenIB */ + c2_unregister_device(c2dev); + + /* Clean up the RNIC resources */ + c2_rnic_term(c2dev); + + /* Remove network device from the kernel */ + unregister_netdev(netdev); + + /* Free network device */ + free_netdev(netdev); + + /* Free the interrupt line */ + free_irq(pcidev->irq, c2dev); + + /* missing: Turn LEDs off here */ + + /* Unmap adapter PA space */ + iounmap(c2dev->kva); + iounmap(c2dev->regs); + iounmap(c2dev->mmio_txp_ring); + iounmap(c2dev->mmio_rxp_ring); + + /* Free the hardware structure */ + ib_dealloc_device(&c2dev->ibdev); + + /* Release reserved PCI I/O and memory resources */ + pci_release_regions(pcidev); + + /* Disable PCI device */ + pci_disable_device(pcidev); + + /* Clear driver specific data */ + pci_set_drvdata(pcidev, NULL); +} + +static struct pci_driver c2_pci_driver = { + .name = DRV_NAME, + .id_table = c2_pci_table, + .probe = c2_probe, + .remove = __devexit_p(c2_remove), +}; + +static int __init c2_init_module(void) +{ + return pci_module_init(&c2_pci_driver); +} + +static void __exit c2_exit_module(void) +{ + pci_unregister_driver(&c2_pci_driver); +} + +module_init(c2_init_module); +module_exit(c2_exit_module); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h new file mode 100644 index 0000000..3b17530 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -0,0 +1,552 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef __C2_H +#define __C2_H + +#include +#include +#include +#include +#include +#include +#include + +#include "c2_provider.h" +#include "c2_mq.h" +#include "c2_status.h" + +#define DRV_NAME "c2" +#define DRV_VERSION "1.1" +#define PFX DRV_NAME ": " + +#define BAR_0 0 +#define BAR_2 2 +#define BAR_4 4 + +#define RX_BUF_SIZE (1536 + 8) +#define ETH_JUMBO_MTU 9000 +#define C2_MAGIC "CEPHEUS" +#define C2_VERSION 4 +#define C2_IVN (18 & 0x7fffffff) + +#define C2_REG0_SIZE (16 * 1024) +#define C2_REG2_SIZE (2 * 1024 * 1024) +#define C2_REG4_SIZE (256 * 1024 * 1024) +#define C2_NUM_TX_DESC 341 +#define C2_NUM_RX_DESC 256 +#define C2_PCI_REGS_OFFSET (0x10000) +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) +#define C2_RXP_HRXDQ_SIZE (4096) +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) +#define C2_TXP_HTXDQ_SIZE (4096) +#define C2_TX_TIMEOUT (6*HZ) + +/* CEPHEUS */ +static const u8 c2_magic[] = { + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 +}; + +enum adapter_pci_regs { + C2_REGS_MAGIC = 0x0000, + C2_REGS_VERS = 0x0008, + C2_REGS_IVN = 0x000C, + C2_REGS_PCI_WINSIZE = 0x0010, + C2_REGS_Q0_QSIZE = 0x0014, + C2_REGS_Q0_MSGSIZE = 0x0018, + C2_REGS_Q0_POOLSTART = 0x001C, + C2_REGS_Q0_SHARED = 0x0020, + C2_REGS_Q1_QSIZE = 0x0024, + C2_REGS_Q1_MSGSIZE = 0x0028, + C2_REGS_Q1_SHARED = 0x0030, + C2_REGS_Q2_QSIZE = 0x0034, + C2_REGS_Q2_MSGSIZE = 0x0038, + C2_REGS_Q2_SHARED = 0x0040, + C2_REGS_ENADDR = 0x004C, + C2_REGS_RDMA_ENADDR = 0x0054, + C2_REGS_HRX_CUR = 0x006C, +}; + +struct c2_adapter_pci_regs { + char reg_magic[8]; + u32 version; + u32 ivn; + u32 pci_window_size; + u32 q0_q_size; + u32 q0_msg_size; + u32 q0_pool_start; + u32 q0_shared; + u32 q1_q_size; + u32 q1_msg_size; + u32 q1_pool_start; + u32 q1_shared; + u32 q2_q_size; + u32 q2_msg_size; + u32 q2_pool_start; + u32 q2_shared; + u32 log_start; + u32 log_size; + u8 host_enaddr[8]; + u8 rdma_enaddr[8]; + u32 crash_entry; + u32 crash_ready[2]; + u32 fw_txd_cur; + u32 fw_hrxd_cur; + u32 fw_rxd_cur; +}; + +enum pci_regs { + C2_HISR = 0x0000, + C2_DISR = 0x0004, + C2_HIMR = 0x0008, + C2_DIMR = 0x000C, + C2_NISR0 = 0x0010, + C2_NISR1 = 0x0014, + C2_NIMR0 = 0x0018, + C2_NIMR1 = 0x001C, + C2_IDIS = 0x0020, +}; + +enum { + C2_PCI_HRX_INT = 1 << 8, + C2_PCI_HTX_INT = 1 << 17, + C2_PCI_HRX_QUI = 1 << 31, +}; + +/* + * Cepheus registers in BAR0. 
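The enum pci_regs offsets above and the struct c2_pci_regs that follows describe the same BAR0 register block. As a quick stand-alone sanity check (using a local copy of the layout and C11 static assertions, nothing from the driver itself), the two can be shown to agree:

#include <stddef.h>
#include <stdint.h>

struct pci_regs {			/* mirrors struct c2_pci_regs */
	uint32_t hostisr;
	uint32_t dmaisr;
	uint32_t hostimr;
	uint32_t dmaimr;
	uint32_t netisr0;
	uint32_t netisr1;
	uint32_t netimr0;
	uint32_t netimr1;
	uint32_t int_disable;
};

/* Offsets taken from enum pci_regs. */
_Static_assert(offsetof(struct pci_regs, hostisr) == 0x0000, "C2_HISR");
_Static_assert(offsetof(struct pci_regs, dmaisr) == 0x0004, "C2_DISR");
_Static_assert(offsetof(struct pci_regs, hostimr) == 0x0008, "C2_HIMR");
_Static_assert(offsetof(struct pci_regs, dmaimr) == 0x000C, "C2_DIMR");
_Static_assert(offsetof(struct pci_regs, netisr0) == 0x0010, "C2_NISR0");
_Static_assert(offsetof(struct pci_regs, netisr1) == 0x0014, "C2_NISR1");
_Static_assert(offsetof(struct pci_regs, netimr0) == 0x0018, "C2_NIMR0");
_Static_assert(offsetof(struct pci_regs, netimr1) == 0x001C, "C2_NIMR1");
_Static_assert(offsetof(struct pci_regs, int_disable) == 0x0020, "C2_IDIS");

int main(void)
{
	return 0;
}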
+ */ +struct c2_pci_regs { + u32 hostisr; + u32 dmaisr; + u32 hostimr; + u32 dmaimr; + u32 netisr0; + u32 netisr1; + u32 netimr0; + u32 netimr1; + u32 int_disable; +}; + +/* TXP flags */ +enum c2_txp_flags { + TXP_HTXD_DONE = 0, + TXP_HTXD_READY = 1 << 0, + TXP_HTXD_UNINIT = 1 << 1, +}; + +/* RXP flags */ +enum c2_rxp_flags { + RXP_HRXD_UNINIT = 0, + RXP_HRXD_READY = 1 << 0, + RXP_HRXD_DONE = 1 << 1, +}; + +/* RXP status */ +enum c2_rxp_status { + RXP_HRXD_ZERO = 0, + RXP_HRXD_OK = 1 << 0, + RXP_HRXD_BUF_OV = 1 << 1, +}; + +/* TXP descriptor fields */ +enum txp_desc { + C2_TXP_FLAGS = 0x0000, + C2_TXP_LEN = 0x0002, + C2_TXP_ADDR = 0x0004, +}; + +/* RXP descriptor fields */ +enum rxp_desc { + C2_RXP_FLAGS = 0x0000, + C2_RXP_STATUS = 0x0002, + C2_RXP_COUNT = 0x0004, + C2_RXP_LEN = 0x0006, + C2_RXP_ADDR = 0x0008, +}; + +struct c2_txp_desc { + u16 flags; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_desc { + u16 flags; + u16 status; + u16 count; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_hdr { + u16 flags; + u16 status; + u16 len; + u16 rsvd; +} __attribute__ ((packed)); + +struct c2_tx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_rx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_alloc { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_array { + struct { + void **page; + int used; + } *page_list; +}; + +/* + * The MQ shared pointer pool is organized as a linked list of + * chunks. Each chunk contains a linked list of free shared pointers + * that can be allocated to a given user mode client. + * + */ +struct sp_chunk { + struct sp_chunk *next; + dma_addr_t dma_addr; + DECLARE_PCI_UNMAP_ADDR(mapping); + u16 head; + u16 shared_ptr[0]; +}; + +struct c2_pd_table { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_qp_table { + struct idr idr; + spinlock_t lock; + int last; +}; + +struct c2_element { + struct c2_element *next; + void *ht_desc; /* host descriptor */ + void __iomem *hw_desc; /* hardware descriptor */ + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; +}; + +struct c2_ring { + struct c2_element *to_clean; + struct c2_element *to_use; + struct c2_element *start; + unsigned long count; +}; + +struct c2_dev { + struct ib_device ibdev; + void __iomem *regs; + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw rings */ + void __iomem *mmio_rxp_ring; + spinlock_t lock; + struct pci_dev *pcidev; + struct net_device *netdev; + struct net_device *pseudo_netdev; + unsigned int cur_tx; + unsigned int cur_rx; + u32 adapter_handle; + int device_cap_flags; + void __iomem *kva; /* KVA device memory */ + unsigned long pa; /* PA device memory */ + void **qptr_array; + + kmem_cache_t *host_msg_cache; + + struct list_head cca_link; /* adapter list */ + struct list_head eh_wakeup_list; /* event wakeup list */ + wait_queue_head_t req_vq_wo; + + /* Cached RNIC properties */ + struct ib_device_attr props; + + struct c2_pd_table pd_table; + struct c2_qp_table qp_table; + int ports; /* num of GigE ports */ + int devnum; + spinlock_t vqlock; /* sync vbs req MQ */ + + /* Verbs Queues */ + struct c2_mq req_vq; /* Verbs Request MQ */ + struct c2_mq rep_vq; /* Verbs Reply MQ */ + struct c2_mq aeq; /* Async Events MQ */ + + /* Kernel client MQs */ + struct sp_chunk *kern_mqsp_pool; + + /* Device updates these values when posting messages to a host + * target queue */ + u16 req_vq_shared; + u16 rep_vq_shared; + u16 aeq_shared; 
+ u16 irq_claimed; + + /* + * Shared host target pages for user-accessible MQs. + */ + int hthead; /* index of first free entry */ + void *htpages; /* kernel vaddr */ + int htlen; /* length of htpages memory */ + void *htuva; /* user mapped vaddr */ + spinlock_t htlock; /* serialize allocation */ + + u64 adapter_hint_uva; /* access to the activity FIFO */ + + // spinlock_t aeq_lock; + // spinlock_t rnic_lock; + + u16 *hint_count; + dma_addr_t hint_count_dma; + u16 hints_read; + + int init; /* TRUE if it's ready */ + char ae_cache_name[16]; + char vq_cache_name[16]; +}; + +struct c2_port { + u32 msg_enable; + struct c2_dev *c2dev; + struct net_device *netdev; + + spinlock_t tx_lock; + u32 tx_avail; + struct c2_ring tx_ring; + struct c2_ring rx_ring; + + void *mem; /* PCI memory for host rings */ + dma_addr_t dma; + unsigned long mem_size; + + u32 rx_buf_size; + + struct net_device_stats netstats; +}; + +/* + * Activity FIFO registers in BAR0. + */ +#define PCI_BAR0_HOST_HINT 0x100 +#define PCI_BAR0_ADAPTER_HINT 0x2000 + +/* + * Ammasso PCI vendor id and Cepheus PCI device id. + */ +#define CQ_ARMED 0x01 +#define CQ_WAIT_FOR_DMA 0x80 + +/* + * The format of a hint is as follows: + * Lower 16 bits are the count of hints for the queue. + * Next 15 bits are the qp_index + * Upper most bit depends on who reads it: + * If read by producer, then it means Full (1) or Not-Full (0) + * If read by consumer, then it means Empty (1) or Not-Empty (0) + */ +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + + +/* + * The following defines the offset in SDRAM for the c2_adapter_pci_regs_t + * struct. + */ +#define C2_ADAPTER_PCI_REGS_OFFSET 0x10000 + +#ifndef readq +static inline u64 readq(const void __iomem * addr) +{ + u64 ret = readl(addr + 4); + ret <<= 32; + ret |= readl(addr); + + return ret; +} +#endif + +#ifndef __raw_writeq +static inline void __raw_writeq(u64 val, void __iomem * addr) +{ + __raw_writel((u32) (val), addr); + __raw_writel((u32) (val >> 32), (addr + 4)); +} +#endif + +#define C2_SET_CUR_RX(c2dev, cur_rx) \ + __raw_writel(cpu_to_be32(cur_rx), c2dev->mmio_txp_ring + 4092) + +#define C2_GET_CUR_RX(c2dev) \ + be32_to_cpu(readl(c2dev->mmio_txp_ring + 4092)) + +static inline struct c2_dev *to_c2dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct c2_dev, ibdev); +} + +static inline int c2_errno(void *reply) +{ + switch (c2_wr_get_result(reply)) { + case C2_OK: + return 0; + case CCERR_NO_BUFS: + case CCERR_INSUFFICIENT_RESOURCES: + case CCERR_ZERO_RDMA_READ_RESOURCES: + return -ENOMEM; + case CCERR_MR_IN_USE: + case CCERR_QP_IN_USE: + return -EBUSY; + case CCERR_ADDR_IN_USE: + return -EADDRINUSE; + case CCERR_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + case CCERR_CONN_RESET: + return -ECONNRESET; + case CCERR_NOT_IMPLEMENTED: + case CCERR_INVALID_WQE: + return -ENOSYS; + case CCERR_QP_NOT_PRIVILEGED: + return -EPERM; + case CCERR_STACK_ERROR: + return -EPROTO; + case CCERR_ACCESS_VIOLATION: + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return -EFAULT; + case CCERR_STAG_STATE_NOT_INVALID: + case CCERR_INVALID_ADDRESS: + case CCERR_INVALID_CQ: + case CCERR_INVALID_EP: + case CCERR_INVALID_MODIFIER: + case CCERR_INVALID_MTU: + case CCERR_INVALID_PD_ID: + case CCERR_INVALID_QP: + case CCERR_INVALID_RNIC: + case CCERR_INVALID_STAG: + return -EINVAL; + default: + return -EAGAIN; + } +} + +/* Device */ +extern int c2_register_device(struct 
c2_dev *c2dev); +extern void c2_unregister_device(struct c2_dev *c2dev); +extern int c2_rnic_init(struct c2_dev *c2dev); +extern void c2_rnic_term(struct c2_dev *c2dev); +extern void c2_rnic_interrupt(struct c2_dev *c2dev); +extern int c2_rnic_query(struct c2_dev *c2dev, struct ib_device_attr *props); +extern int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); +extern int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); + +/* QPs */ +extern int c2_alloc_qp(struct c2_dev *c2dev, struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp); +extern void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp); +extern struct ib_qp *c2_get_qp(struct ib_device *device, int qpn); +extern int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask); +extern int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp, + int ord, int ird); +extern int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr); +extern int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr); +extern void __devinit c2_init_qp_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev); +extern void c2_set_qp_state(struct c2_qp *, int); +extern struct c2_qp *c2_find_qpn(struct c2_dev *c2dev, int qpn); + +/* PDs */ +extern int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd); +extern void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd); +extern int __devinit c2_init_pd_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev); + +/* CQs */ +extern int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq); +extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); +extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); +extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); +extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); + +/* CM */ +extern int c2_llp_connect(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param); +extern int c2_llp_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param); +extern int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, + u8 pdata_len); +extern int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog); +extern int c2_llp_service_destroy(struct iw_cm_id *cm_id); + +/* MM */ +extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 off, u64 *va, enum c2_acf acf, + struct c2_mr *mr); +extern int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index); + +/* AE */ +extern void c2_ae_event(struct c2_dev *c2dev, u32 mq_index); + +/* MQSP Allocator */ +extern int c2_init_mqsp_pool(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **root); +extern void c2_free_mqsp_pool(struct c2_dev *c2dev, struct sp_chunk *root); +extern u16 *c2_alloc_mqsp(struct c2_dev *c2dev, struct sp_chunk *head, + dma_addr_t *dma_addr, gfp_t gfp_mask); +extern void c2_free_mqsp(u16 * mqsp); +#endif diff --git a/drivers/infiniband/hw/amso1100/c2_ae.c b/drivers/infiniband/hw/amso1100/c2_ae.c new file mode 100644 index 0000000..495e614 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.c @@ -0,0 +1,321 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. 
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include +#include "c2_status.h" +#include "c2_ae.h" + +static int c2_convert_cm_status(u32 c2_status) +{ + switch (c2_status) { + case C2_CONN_STATUS_SUCCESS: + return 0; + case C2_CONN_STATUS_REJECTED: + return -ENETRESET; + case C2_CONN_STATUS_REFUSED: + return -ECONNREFUSED; + case C2_CONN_STATUS_TIMEDOUT: + return -ETIMEDOUT; + case C2_CONN_STATUS_NETUNREACH: + return -ENETUNREACH; + case C2_CONN_STATUS_HOSTUNREACH: + return -EHOSTUNREACH; + case C2_CONN_STATUS_INVALID_RNIC: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP_STATE: + return -EINVAL; + case C2_CONN_STATUS_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + default: + printk(KERN_ERR PFX + "%s - Unable to convert CM status: %d\n", + __FUNCTION__, c2_status); + return -EIO; + } +} + +#ifdef DEBUG +static const char* to_event_str(int event) +{ + static const char* event_str[] = { + "CCAE_REMOTE_SHUTDOWN", + "CCAE_ACTIVE_CONNECT_RESULTS", + "CCAE_CONNECTION_REQUEST", + "CCAE_LLP_CLOSE_COMPLETE", + "CCAE_TERMINATE_MESSAGE_RECEIVED", + "CCAE_LLP_CONNECTION_RESET", + "CCAE_LLP_CONNECTION_LOST", + "CCAE_LLP_SEGMENT_SIZE_INVALID", + "CCAE_LLP_INVALID_CRC", + "CCAE_LLP_BAD_FPDU", + "CCAE_INVALID_DDP_VERSION", + "CCAE_INVALID_RDMA_VERSION", + "CCAE_UNEXPECTED_OPCODE", + "CCAE_INVALID_DDP_QUEUE_NUMBER", + "CCAE_RDMA_READ_NOT_ENABLED", + "CCAE_RDMA_WRITE_NOT_ENABLED", + "CCAE_RDMA_READ_TOO_SMALL", + "CCAE_NO_L_BIT", + "CCAE_TAGGED_INVALID_STAG", + "CCAE_TAGGED_BASE_BOUNDS_VIOLATION", + "CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION", + "CCAE_TAGGED_INVALID_PD", + "CCAE_WRAP_ERROR", + "CCAE_BAD_CLOSE", + "CCAE_BAD_LLP_CLOSE", + "CCAE_INVALID_MSN_RANGE", + "CCAE_INVALID_MSN_GAP", + "CCAE_IRRQ_OVERFLOW", + "CCAE_IRRQ_MSN_GAP", + "CCAE_IRRQ_MSN_RANGE", + "CCAE_IRRQ_INVALID_STAG", + "CCAE_IRRQ_BASE_BOUNDS_VIOLATION", + "CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION", + "CCAE_IRRQ_INVALID_PD", + "CCAE_IRRQ_WRAP_ERROR", + "CCAE_CQ_SQ_COMPLETION_OVERFLOW", + "CCAE_CQ_RQ_COMPLETION_ERROR", + "CCAE_QP_SRQ_WQE_ERROR", + "CCAE_QP_LOCAL_CATASTROPHIC_ERROR", + "CCAE_CQ_OVERFLOW", + "CCAE_CQ_OPERATION_ERROR", + "CCAE_SRQ_LIMIT_REACHED", + "CCAE_QP_RQ_LIMIT_REACHED", + 
"CCAE_SRQ_CATASTROPHIC_ERROR", + "CCAE_RNIC_CATASTROPHIC_ERROR" + }; + + if (event < CCAE_REMOTE_SHUTDOWN || + event > CCAE_RNIC_CATASTROPHIC_ERROR) + return ""; + + event -= CCAE_REMOTE_SHUTDOWN; + return event_str[event]; +} + +const char *to_qp_state_str(int state) +{ + switch (state) { + case C2_QP_STATE_IDLE: + return "C2_QP_STATE_IDLE"; + case C2_QP_STATE_CONNECTING: + return "C2_QP_STATE_CONNECTING"; + case C2_QP_STATE_RTS: + return "C2_QP_STATE_RTS"; + case C2_QP_STATE_CLOSING: + return "C2_QP_STATE_CLOSING"; + case C2_QP_STATE_TERMINATE: + return "C2_QP_STATE_TERMINATE"; + case C2_QP_STATE_ERROR: + return "C2_QP_STATE_ERROR"; + default: + return ""; + }; +} +#endif + +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_mq *mq = c2dev->qptr_array[mq_index]; + union c2wr *wr; + void *resource_user_context; + struct iw_cm_event cm_event; + struct ib_event ib_event; + enum c2_resource_indicator resource_indicator; + enum c2_event_id event_id; + unsigned long flags; + int status; + + /* + * retreive the message + */ + wr = c2_mq_consume(mq); + if (!wr) + return; + + memset(&ib_event, 0, sizeof(ib_event)); + memset(&cm_event, 0, sizeof(cm_event)); + + event_id = c2_wr_get_id(wr); + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); + resource_user_context = + (void *) (unsigned long) wr->ae.ae_generic.user_context; + + status = cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); + + pr_debug("event received c2_dev=%p, event_id=%d, " + "resource_indicator=%d, user_context=%p, status = %d\n", + c2dev, event_id, resource_indicator, resource_user_context, + status); + + switch (resource_indicator) { + case C2_RES_IND_QP:{ + + struct c2_qp *qp = (struct c2_qp *)resource_user_context; + struct iw_cm_id *cm_id = qp->cm_id; + struct c2wr_ae_active_connect_results *res; + + if (!cm_id) { + pr_debug("event received, but cm_id is , qp=%p!\n", + qp); + goto ignore_it; + } + pr_debug("%s: event = %s, user_context=%llx, " + "resource_type=%x, " + "resource=%x, qp_state=%s\n", + __FUNCTION__, + to_event_str(event_id), + be64_to_cpu(wr->ae.ae_generic.user_context), + be32_to_cpu(wr->ae.ae_generic.resource_type), + be32_to_cpu(wr->ae.ae_generic.resource), + to_qp_state_str(be32_to_cpu(wr->ae.ae_generic.qp_state))); + + c2_set_qp_state(qp, be32_to_cpu(wr->ae.ae_generic.qp_state)); + + switch (event_id) { + case CCAE_ACTIVE_CONNECT_RESULTS: + res = &wr->ae.ae_active_connect_results; + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.local_addr.sin_addr.s_addr = res->laddr; + cm_event.remote_addr.sin_addr.s_addr = res->raddr; + cm_event.local_addr.sin_port = res->lport; + cm_event.remote_addr.sin_port = res->rport; + if (status == 0) { + cm_event.private_data_len = + be32_to_cpu(res->private_data_length); + cm_event.private_data = res->private_data; + } else { + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.private_data_len = 0; + cm_event.private_data = NULL; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + case CCAE_TERMINATE_MESSAGE_RECEIVED: + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: + ib_event.device = &c2dev->ibdev; + ib_event.element.qp = &qp->ibqp; + ib_event.event = IB_EVENT_QP_REQ_ERR; + + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&ib_event, + qp->ibqp. 
+ qp_context); + break; + case CCAE_BAD_CLOSE: + case CCAE_LLP_CLOSE_COMPLETE: + case CCAE_LLP_CONNECTION_RESET: + case CCAE_LLP_CONNECTION_LOST: + BUG_ON(cm_id->event_handler==(void*)0x6b6b6b6b); + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.event = IW_CM_EVENT_CLOSE; + cm_event.status = 0; + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + default: + BUG_ON(1); + pr_debug("%s:%d Unexpected event_id=%d on QP=%p, " + "CM_ID=%p\n", + __FUNCTION__, __LINE__, + event_id, qp, cm_id); + break; + } + break; + } + + case C2_RES_IND_EP:{ + + struct c2wr_ae_connection_request *req = + &wr->ae.ae_connection_request; + struct iw_cm_id *cm_id = + (struct iw_cm_id *)resource_user_context; + + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); + if (event_id != CCAE_CONNECTION_REQUEST) { + pr_debug("%s: Invalid event_id: %d\n", + __FUNCTION__, event_id); + break; + } + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; + cm_event.local_addr.sin_addr.s_addr = req->laddr; + cm_event.remote_addr.sin_addr.s_addr = req->raddr; + cm_event.local_addr.sin_port = req->lport; + cm_event.remote_addr.sin_port = req->rport; + cm_event.private_data_len = + be32_to_cpu(req->private_data_length); + cm_event.private_data = req->private_data; + + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + } + + case C2_RES_IND_CQ:{ + struct c2_cq *cq = + (struct c2_cq *) resource_user_context; + + pr_debug("IB_EVENT_CQ_ERR\n"); + ib_event.device = &c2dev->ibdev; + ib_event.element.cq = &cq->ibcq; + ib_event.event = IB_EVENT_CQ_ERR; + + if (cq->ibcq.event_handler) + cq->ibcq.event_handler(&ib_event, + cq->ibcq.cq_context); + } + + default: + printk("Bad resource indicator = %d\n", + resource_indicator); + break; + } + + ignore_it: + c2_mq_free(mq); +} diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c new file mode 100644 index 0000000..454e3e0 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_intr.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include +#include "c2_vq.h" + +static void handle_mq(struct c2_dev *c2dev, u32 index); +static void handle_vq(struct c2_dev *c2dev, u32 mq_index); + +/* + * Handle RNIC interrupts + */ +void c2_rnic_interrupt(struct c2_dev *c2dev) +{ + unsigned int mq_index; + + while (c2dev->hints_read != be16_to_cpu(*c2dev->hint_count)) { + mq_index = readl(c2dev->regs + PCI_BAR0_HOST_HINT); + if (mq_index & 0x80000000) { + break; + } + + c2dev->hints_read++; + handle_mq(c2dev, mq_index); + } + +} + +/* + * Top level MQ handler + */ +static void handle_mq(struct c2_dev *c2dev, u32 mq_index) +{ + if (c2dev->qptr_array[mq_index] == NULL) { + pr_debug(KERN_INFO "handle_mq: stray activity for mq_index=%d\n", + mq_index); + return; + } + + switch (mq_index) { + case (0): + /* + * An index of 0 in the activity queue + * indicates the req vq now has messages + * available... + * + * Wake up any waiters waiting on req VQ + * message availability. + */ + wake_up(&c2dev->req_vq_wo); + break; + case (1): + handle_vq(c2dev, mq_index); + break; + case (2): + /* We have to purge the VQ in case there are pending + * accept reply requests that would result in the + * generation of an ESTABLISHED event. If we don't + * generate these first, a CLOSE event could end up + * being delivered before the ESTABLISHED event. + */ + handle_vq(c2dev, 1); + + c2_ae_event(c2dev, mq_index); + break; + default: + /* There is no event synchronization between CQ events + * and AE or CM events. In fact, CQE could be + * delivered for all of the I/O up to and including the + * FLUSH for a peer disconenct prior to the ESTABLISHED + * event being delivered to the app. The reason for this + * is that CM events are delivered on a thread, while AE + * and CM events are delivered on interrupt context. + */ + c2_cq_event(c2dev, mq_index); + break; + } + + return; +} + +/* + * Handles verbs WR replies. + */ +static void handle_vq(struct c2_dev *c2dev, u32 mq_index) +{ + void *adapter_msg, *reply_msg; + struct c2wr_hdr *host_msg; + struct c2wr_hdr tmp; + struct c2_mq *reply_vq; + struct c2_vq_req *req; + struct iw_cm_event cm_event; + int err; + + reply_vq = (struct c2_mq *) c2dev->qptr_array[mq_index]; + + /* + * get next msg from mq_index into adapter_msg. + * don't free it yet. + */ + adapter_msg = c2_mq_consume(reply_vq); + if (adapter_msg == NULL) { + return; + } + + host_msg = vq_repbuf_alloc(c2dev); + + /* + * If we can't get a host buffer, then we'll still + * wakeup the waiter, we just won't give him the msg. + * It is assumed the waiter will deal with this... + */ + if (!host_msg) { + pr_debug("handle_vq: no repbufs!\n"); + + /* + * just copy the WR header into a local variable. + * this allows us to still demux on the context + */ + host_msg = &tmp; + memcpy(host_msg, adapter_msg, sizeof(tmp)); + reply_msg = NULL; + } else { + memcpy(host_msg, adapter_msg, reply_vq->msg_size); + reply_msg = host_msg; + } + + /* + * consume the msg from the MQ + */ + c2_mq_free(reply_vq); + + /* + * wakeup the waiter. + */ + req = (struct c2_vq_req *) (unsigned long) host_msg->context; + if (req == NULL) { + /* + * We should never get here, as the adapter should + * never send us a reply that we're not expecting. 
+ */ + vq_repbuf_free(c2dev, host_msg); + pr_debug("handle_vq: UNEXPECTEDLY got NULL req\n"); + return; + } + + err = c2_errno(reply_msg); + if (!err) switch (req->event) { + case IW_CM_EVENT_ESTABLISHED: + c2_set_qp_state(req->qp, + C2_QP_STATE_RTS); + case IW_CM_EVENT_CLOSE: + + /* + * Move the QP to RTS if this is + * the established event + */ + cm_event.event = req->event; + cm_event.status = 0; + cm_event.local_addr = req->cm_id->local_addr; + cm_event.remote_addr = req->cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + req->cm_id->event_handler(req->cm_id, &cm_event); + break; + default: + break; + } + + req->reply_msg = (u64) (unsigned long) (reply_msg); + atomic_set(&req->reply_ready, 1); + wake_up(&req->wait_object); + + /* + * If the request was cancelled, then this put will + * free the vq_req memory...and reply_msg!!! + */ + vq_req_put(c2dev, req); +} diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c new file mode 100644 index 0000000..4d9cc57 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -0,0 +1,664 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include +#include "c2.h" +#include "c2_vq.h" + +/* Device capabilities */ +#define C2_MIN_PAGESIZE 1024 + +#define C2_MAX_MRS 32768 +#define C2_MAX_QPS 16000 +#define C2_MAX_WQE_SZ 256 +#define C2_MAX_QP_WR ((128*1024)/C2_MAX_WQE_SZ) +#define C2_MAX_SGES 4 +#define C2_MAX_SGE_RD 1 +#define C2_MAX_CQS 32768 +#define C2_MAX_CQES 4096 +#define C2_MAX_PDS 16384 + +/* + * Send the adapter INIT message to the amso1100 + */ +static int c2_adapter_init(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + int err; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_INIT); + wr.hdr.context = 0; + wr.hint_count = cpu_to_be64(c2dev->hint_count_dma); + wr.q0_host_shared = cpu_to_be64(c2dev->req_vq.shared_dma); + wr.q1_host_shared = cpu_to_be64(c2dev->rep_vq.shared_dma); + wr.q1_host_msg_pool = cpu_to_be64(c2dev->rep_vq.host_dma); + wr.q2_host_shared = cpu_to_be64(c2dev->aeq.shared_dma); + wr.q2_host_msg_pool = cpu_to_be64(c2dev->aeq.host_dma); + + /* Post the init message */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + + return err; +} + +/* + * Send the adapter TERM message to the amso1100 + */ +static void c2_adapter_term(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_TERM); + wr.hdr.context = 0; + + /* Post the init message */ + vq_send_wr(c2dev, (union c2wr *) & wr); + c2dev->init = 0; + + return; +} + +/* + * Query the adapter + */ +int c2_rnic_query(struct c2_dev *c2dev, + struct ib_device_attr *props) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_query_req wr; + struct c2wr_rnic_query_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_RNIC_QUERY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_query_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) + err = -ENOMEM; + + err = c2_errno(reply); + if (err) + goto bail2; + + props->fw_ver = + ((u64)be32_to_cpu(reply->fw_ver_major) << 32) | + ((be32_to_cpu(reply->fw_ver_minor) && 0xFFFF) << 16) | + (be32_to_cpu(reply->fw_ver_patch) && 0xFFFF); + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); + props->max_mr_size = 0xFFFFFFFF; + props->page_size_cap = ~(C2_MIN_PAGESIZE-1); + props->vendor_id = be32_to_cpu(reply->vendor_id); + props->vendor_part_id = be32_to_cpu(reply->part_number); + props->hw_ver = be32_to_cpu(reply->hw_version); + props->max_qp = be32_to_cpu(reply->max_qps); + props->max_qp_wr = be32_to_cpu(reply->max_qp_depth); + props->device_cap_flags = c2dev->device_cap_flags; + props->max_sge = C2_MAX_SGES; + props->max_sge_rd = C2_MAX_SGE_RD; + props->max_cq = be32_to_cpu(reply->max_cqs); + props->max_cqe = be32_to_cpu(reply->max_cq_depth); + props->max_mr = be32_to_cpu(reply->max_mrs); + props->max_pd = be32_to_cpu(reply->max_pds); + props->max_qp_rd_atom = be32_to_cpu(reply->max_qp_ird); + props->max_ee_rd_atom = 0; + props->max_res_rd_atom = be32_to_cpu(reply->max_global_ird); + props->max_qp_init_rd_atom = be32_to_cpu(reply->max_qp_ord); + props->max_ee_init_rd_atom = 0; + 
props->atomic_cap = IB_ATOMIC_NONE; + props->max_ee = 0; + props->max_rdd = 0; + props->max_mw = be32_to_cpu(reply->max_mws); + props->max_raw_ipv6_qp = 0; + props->max_raw_ethy_qp = 0; + props->max_mcast_grp = 0; + props->max_mcast_qp_attach = 0; + props->max_total_mcast_qp_attach = 0; + props->max_ah = 0; + props->max_fmr = 0; + props->max_map_per_fmr = 0; + props->max_srq = 0; + props->max_srq_wr = 0; + props->max_srq_sge = 0; + props->max_pkeys = 0; + props->local_ca_ack_delay = 0; + + bail2: + vq_repbuf_free(c2dev, reply); + + bail1: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Add an IP address to the RNIC interface + */ +int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_ADD_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Delete an IP address from the RNIC interface + */ +int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_DEL_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Open a single RNIC instance to use with all + * low level openib calls + */ +static int c2_rnic_open(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_open_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_OPEN); + wr.rnic_open.req.hdr.context = (unsigned long) (vq_req); + 
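+ /*
+  * The WR header context carries the vq_req pointer; handle_vq()
+  * recovers it from the adapter's reply and uses it to wake the
+  * thread blocked in vq_wait_for_reply() below.
+  */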
wr.rnic_open.req.flags = cpu_to_be16(RNIC_PRIV_MODE); + wr.rnic_open.req.port_num = cpu_to_be16(0); + wr.rnic_open.req.user_context = (unsigned long) c2dev; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_open_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = reply->rnic_handle; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Close the RNIC instance + */ +static int c2_rnic_close(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_close_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_CLOSE); + wr.rnic_close.req.hdr.context = (unsigned long) vq_req; + wr.rnic_close.req.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_close_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Called by c2_probe to initialize the RNIC. This principally + * involves initalizing the various limits and resouce pools that + * comprise the RNIC instance. + */ +int c2_rnic_init(struct c2_dev *c2dev) +{ + int err; + u32 qsize, msgsize; + void *q1_pages; + void *q2_pages; + void __iomem *mmio_regs; + + /* Device capabilities */ + c2dev->device_cap_flags = + (IB_DEVICE_RESIZE_MAX_WR | + IB_DEVICE_CURR_QP_STATE_MOD | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + /* Allocate the qptr_array */ + c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); + if (!c2dev->qptr_array) { + return -ENOMEM; + } + + /* Inialize the qptr_array */ + memset(c2dev->qptr_array, 0, C2_MAX_CQS * sizeof(void *)); + c2dev->qptr_array[0] = (void *) &c2dev->req_vq; + c2dev->qptr_array[1] = (void *) &c2dev->rep_vq; + c2dev->qptr_array[2] = (void *) &c2dev->aeq; + + /* Initialize data structures */ + init_waitqueue_head(&c2dev->req_vq_wo); + spin_lock_init(&c2dev->vqlock); + spin_lock_init(&c2dev->lock); + + /* Allocate MQ shared pointer pool for kernel clients. User + * mode client pools are hung off the user context + */ + err = c2_init_mqsp_pool(c2dev, GFP_KERNEL, &c2dev->kern_mqsp_pool); + if (err) { + goto bail0; + } + + /* Allocate shared pointers for Q0, Q1, and Q2 from + * the shared pointer pool. 
+ */ + + c2dev->hint_count = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->hint_count_dma, + GFP_KERNEL); + c2dev->req_vq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->req_vq.shared_dma, + GFP_KERNEL); + c2dev->rep_vq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->rep_vq.shared_dma, + GFP_KERNEL); + c2dev->aeq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->aeq.shared_dma, GFP_KERNEL); + if (!c2dev->hint_count || !c2dev->req_vq.shared || + !c2dev->rep_vq.shared || !c2dev->aeq.shared) { + err = -ENOMEM; + goto bail1; + } + + mmio_regs = c2dev->kva; + /* Initialize the Verbs Request Queue */ + c2_mq_req_init(&c2dev->req_vq, 0, + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_QSIZE)), + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_MSGSIZE)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_POOLSTART)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_SHARED)), + C2_MQ_ADAPTER_TARGET); + + /* Initialize the Verbs Reply Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_MSGSIZE)); + q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q1_pages) { + err = -ENOMEM; + goto bail1; + } + c2dev->rep_vq.host_dma = dma_map_single(c2dev->ibdev.dma_device, + (void *)q1_pages, qsize * msgsize, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&c2dev->rep_vq, mapping, c2dev->rep_vq.host_dma); + pr_debug("%s rep_vq va %p dma %llx\n", __FUNCTION__, q1_pages, + (u64)c2dev->rep_vq.host_dma); + c2_mq_rep_init(&c2dev->rep_vq, + 1, + qsize, + msgsize, + q1_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the Asynchronus Event Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_MSGSIZE)); + q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q2_pages) { + err = -ENOMEM; + goto bail2; + } + c2dev->aeq.host_dma = dma_map_single(c2dev->ibdev.dma_device, + (void *)q2_pages, qsize * msgsize, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&c2dev->aeq, mapping, c2dev->aeq.host_dma); + pr_debug("%s aeq va %p dma %llx\n", __FUNCTION__, q1_pages, + (u64)c2dev->rep_vq.host_dma); + c2_mq_rep_init(&c2dev->aeq, + 2, + qsize, + msgsize, + q2_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the verbs request allocator */ + err = vq_init(c2dev); + if (err) + goto bail3; + + /* Enable interrupts on the adapter */ + writel(0, c2dev->regs + C2_IDIS); + + /* create the WR init message */ + err = c2_adapter_init(c2dev); + if (err) + goto bail4; + c2dev->init++; + + /* open an adapter instance */ + err = c2_rnic_open(c2dev); + if (err) + goto bail4; + + /* Initialize cached the adapter limits */ + if (c2_rnic_query(c2dev, &c2dev->props)) + goto bail5; + + /* Initialize the PD pool */ + err = c2_init_pd_table(c2dev); + if (err) + goto bail5; + + /* Initialize the QP pool */ + c2_init_qp_table(c2dev); + return 0; + + bail5: + c2_rnic_close(c2dev); + bail4: + vq_term(c2dev); + bail3: + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->aeq, mapping), + c2dev->aeq.q_size * c2dev->aeq.msg_size, + DMA_FROM_DEVICE); + kfree(q2_pages); + bail2: + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->rep_vq, mapping), + c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, + DMA_FROM_DEVICE); + kfree(q1_pages); + bail1: + c2_free_mqsp_pool(c2dev, c2dev->kern_mqsp_pool); + bail0: + 
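+ /* Undo the vmalloc() of qptr_array done at the top of c2_rnic_init() */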
vfree(c2dev->qptr_array); + + return err; +} + +/* + * Called by c2_remove to cleanup the RNIC resources. + */ +void c2_rnic_term(struct c2_dev *c2dev) +{ + + /* Close the open adapter instance */ + c2_rnic_close(c2dev); + + /* Send the TERM message to the adapter */ + c2_adapter_term(c2dev); + + /* Disable interrupts on the adapter */ + writel(1, c2dev->regs + C2_IDIS); + + /* Free the QP pool */ + c2_cleanup_qp_table(c2dev); + + /* Free the PD pool */ + c2_cleanup_pd_table(c2dev); + + /* Free the verbs request allocator */ + vq_term(c2dev); + + /* Unmap and free the asynchronus event queue */ + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->aeq, mapping), + c2dev->aeq.q_size * c2dev->aeq.msg_size, + DMA_FROM_DEVICE); + kfree(c2dev->aeq.msg_pool.host); + + /* Unmap and free the verbs reply queue */ + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->rep_vq, mapping), + c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, + DMA_FROM_DEVICE); + kfree(c2dev->rep_vq.msg_pool.host); + + /* Free the MQ shared pointer pool */ + c2_free_mqsp_pool(c2dev, c2dev->kern_mqsp_pool); + + /* Free the qptr_array */ + vfree(c2dev->qptr_array); + + return; +} From swise at opengridcomputing.com Thu Aug 3 14:07:32 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:07:32 -0500 Subject: [openib-general] [PATCH v4 4/7] AMSO1100 Memory Management. In-Reply-To: <20060803210723.16572.34829.stgit@dell3.ogc.int> References: <20060803210723.16572.34829.stgit@dell3.ogc.int> Message-ID: <20060803210732.16572.35910.stgit@dell3.ogc.int> --- drivers/infiniband/hw/amso1100/c2_alloc.c | 144 +++++++++++ drivers/infiniband/hw/amso1100/c2_mm.c | 375 +++++++++++++++++++++++++++++ 2 files changed, 519 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_alloc.c b/drivers/infiniband/hw/amso1100/c2_alloc.c new file mode 100644 index 0000000..013b152 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_alloc.c @@ -0,0 +1,144 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "c2.h" + +static int c2_alloc_mqsp_chunk(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **head) +{ + int i; + struct sp_chunk *new_head; + + new_head = (struct sp_chunk *) __get_free_page(gfp_mask); + if (new_head == NULL) + return -ENOMEM; + + new_head->dma_addr = dma_map_single(c2dev->ibdev.dma_device, new_head, + PAGE_SIZE, DMA_FROM_DEVICE); + pci_unmap_addr_set(new_head, mapping, new_head->dma_addr); + + new_head->next = NULL; + new_head->head = 0; + + /* build list where each index is the next free slot */ + for (i = 0; + i < (PAGE_SIZE - sizeof(struct sp_chunk) - + sizeof(u16)) / sizeof(u16) - 1; + i++) { + new_head->shared_ptr[i] = i + 1; + } + /* terminate list */ + new_head->shared_ptr[i] = 0xFFFF; + + *head = new_head; + return 0; +} + +int c2_init_mqsp_pool(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **root) +{ + return c2_alloc_mqsp_chunk(c2dev, gfp_mask, root); +} + +void c2_free_mqsp_pool(struct c2_dev *c2dev, struct sp_chunk *root) +{ + struct sp_chunk *next; + + while (root) { + next = root->next; + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(root, mapping), PAGE_SIZE, + DMA_FROM_DEVICE); + __free_page((struct page *) root); + root = next; + } +} + +u16 *c2_alloc_mqsp(struct c2_dev *c2dev, struct sp_chunk *head, + dma_addr_t *dma_addr, gfp_t gfp_mask) +{ + u16 mqsp; + + while (head) { + mqsp = head->head; + if (mqsp != 0xFFFF) { + head->head = head->shared_ptr[mqsp]; + break; + } else if (head->next == NULL) { + if (c2_alloc_mqsp_chunk(c2dev, gfp_mask, &head->next) == + 0) { + head = head->next; + mqsp = head->head; + head->head = head->shared_ptr[mqsp]; + break; + } else + return NULL; + } else + head = head->next; + } + if (head) { + *dma_addr = head->dma_addr + + ((unsigned long) &(head->shared_ptr[mqsp]) - + (unsigned long) head); + pr_debug("%s addr %p dma_addr %llx\n", __FUNCTION__, + &(head->shared_ptr[mqsp]), (u64)*dma_addr); + return &(head->shared_ptr[mqsp]); + } + return NULL; +} + +void c2_free_mqsp(u16 * mqsp) +{ + struct sp_chunk *head; + u16 idx; + + /* The chunk containing this ptr begins at the page boundary */ + head = (struct sp_chunk *) ((unsigned long) mqsp & PAGE_MASK); + + /* Link head to new mqsp */ + *mqsp = head->head; + + /* Compute the shared_ptr index */ + idx = ((unsigned long) mqsp & ~PAGE_MASK) >> 1; + idx -= (unsigned long) &(((struct sp_chunk *) 0)->shared_ptr[0]) >> 1; + + /* Point this index at the head */ + head->shared_ptr[idx] = head->head; + + /* Point head at this index */ + head->head = idx; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mm.c b/drivers/infiniband/hw/amso1100/c2_mm.c new file mode 100644 index 0000000..314ec07 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mm.c @@ -0,0 +1,375 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_vq.h" + +#define PBL_VIRT 1 +#define PBL_PHYS 2 + +/* + * Send all the PBL messages to convey the remainder of the PBL + * Wait for the adapter's reply on the last one. + * This is indicated by setting the MEM_PBL_COMPLETE in the flags. + * + * NOTE: vq_req is _not_ freed by this function. The VQ Host + * Reply buffer _is_ freed by this function. + */ +static int +send_pbl_messages(struct c2_dev *c2dev, u32 stag_index, + unsigned long va, u32 pbl_depth, + struct c2_vq_req *vq_req, int pbl_type) +{ + u32 pbe_count; /* amt that fits in a PBL msg */ + u32 count; /* amt in this PBL MSG. */ + struct c2wr_nsmr_pbl_req *wr; /* PBL WR ptr */ + struct c2wr_nsmr_pbl_rep *reply; /* reply ptr */ + int err, pbl_virt, pbl_index, i; + + switch (pbl_type) { + case PBL_VIRT: + pbl_virt = 1; + break; + case PBL_PHYS: + pbl_virt = 0; + break; + default: + return -EINVAL; + break; + } + + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_pbl_req)) / sizeof(u64); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + return -ENOMEM; + } + c2_wr_set_id(wr, CCWR_NSMR_PBL); + + /* + * Only the last PBL message will generate a reply from the verbs, + * so we set the context to 0 indicating there is no kernel verbs + * handler blocked awaiting this reply. + */ + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->stag_index = stag_index; /* already swapped */ + wr->flags = 0; + pbl_index = 0; + while (pbl_depth) { + count = min(pbe_count, pbl_depth); + wr->addrs_length = cpu_to_be32(count); + + /* + * If this is the last message, then reference the + * vq request struct cuz we're gonna wait for a reply. + * also make this PBL msg as the last one. + */ + if (count == pbl_depth) { + /* + * reference the request struct. dereferenced in the + * int handler. + */ + vq_req_get(c2dev, vq_req); + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); + + /* + * This is the last PBL message. + * Set the context to our VQ Request Object so we can + * wait for the reply. + */ + wr->hdr.context = (unsigned long) vq_req; + } + + /* + * If pbl_virt is set then va is a virtual address + * that describes a virtually contiguous memory + * allocation. The wr needs the start of each virtual page + * to be converted to the corresponding physical address + * of the page. If pbl_virt is not set then va is an array + * of physical addresses and there is no conversion to do. + * Just fill in the wr with what is in the array. 
+ */ + for (i = 0; i < count; i++) { + if (pbl_virt) { + va += PAGE_SIZE; + } else { + wr->paddrs[i] = + cpu_to_be64(((u64 *)va)[pbl_index + i]); + } + } + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + if (count <= pbe_count) { + vq_req_put(c2dev, vq_req); + } + goto bail0; + } + pbl_depth -= count; + pbl_index += count; + } + + /* + * Now wait for the reply... + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_nsmr_pbl_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + kfree(wr); + return err; +} + +#define C2_PBL_MAX_DEPTH 131072 +int +c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 offset, u64 *va, enum c2_acf acf, + struct c2_mr *mr) +{ + struct c2_vq_req *vq_req; + struct c2wr_nsmr_register_req *wr; + struct c2wr_nsmr_register_rep *reply; + u16 flags; + int i, pbe_count, count; + int err; + + if (!va || !length || !addr_list || !pbl_depth) + return -EINTR; + + /* + * Verify PBL depth is within rnic max + */ + if (pbl_depth > C2_PBL_MAX_DEPTH) { + return -EINTR; + } + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + /* + * build the WR + */ + c2_wr_set_id(wr, CCWR_NSMR_REGISTER); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + + flags = (acf | MEM_VA_BASED | MEM_REMOTE); + + /* + * compute how many pbes can fit in the message + */ + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_register_req)) / sizeof(u64); + + if (pbl_depth <= pbe_count) { + flags |= MEM_PBL_COMPLETE; + } + wr->flags = cpu_to_be16(flags); + wr->stag_key = 0; //stag_key; + wr->va = cpu_to_be64(*va); + wr->pd_id = mr->pd->pd_id; + wr->pbe_size = cpu_to_be32(page_size); + wr->length = cpu_to_be32(length); + wr->pbl_depth = cpu_to_be32(pbl_depth); + wr->fbo = cpu_to_be32(offset); + count = min(pbl_depth, pbe_count); + wr->addrs_length = cpu_to_be32(count); + + /* + * fill out the PBL for this message + */ + for (i = 0; i < count; i++) { + wr->paddrs[i] = cpu_to_be64(addr_list[i]); + } + + /* + * regerence the request struct + */ + vq_req_get(c2dev, vq_req); + + /* + * send the WR to the adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + /* + * wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail1; + } + + /* + * process reply + */ + reply = + (struct c2wr_nsmr_register_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + if ((err = c2_errno(reply))) { + goto bail2; + } + //*p_pb_entries = be32_to_cpu(reply->pbl_depth); + mr->ibmr.lkey = mr->ibmr.rkey = be32_to_cpu(reply->stag_index); + vq_repbuf_free(c2dev, reply); + + /* + * if there are still more PBEs we need to send them to + * the adapter and wait for a reply on the final one. + * reuse vq_req for this purpose. 
+ */ + pbl_depth -= count; + if (pbl_depth) { + + vq_req->reply_msg = (unsigned long) NULL; + atomic_set(&vq_req->reply_ready, 0); + err = send_pbl_messages(c2dev, + cpu_to_be32(mr->ibmr.lkey), + (unsigned long) &addr_list[i], + pbl_depth, vq_req, PBL_PHYS); + if (err) { + goto bail1; + } + } + + vq_req_free(c2dev, vq_req); + kfree(wr); + + return err; + + bail2: + vq_repbuf_free(c2dev, reply); + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index) +{ + struct c2_vq_req *vq_req; /* verbs request object */ + struct c2wr_stag_dealloc_req wr; /* work request */ + struct c2wr_stag_dealloc_rep *reply; /* WR reply */ + int err; + + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_STAG_DEALLOC); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.stag_index = cpu_to_be32(stag_index); + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_stag_dealloc_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} From swise at opengridcomputing.com Thu Aug 3 14:07:29 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:07:29 -0500 Subject: [openib-general] [PATCH v4 3/7] AMSO1100 OpenFabrics Provider. In-Reply-To: <20060803210723.16572.34829.stgit@dell3.ogc.int> References: <20060803210723.16572.34829.stgit@dell3.ogc.int> Message-ID: <20060803210729.16572.21837.stgit@dell3.ogc.int> --- drivers/infiniband/hw/amso1100/c2_cm.c | 452 ++++++++++++ drivers/infiniband/hw/amso1100/c2_cq.c | 433 ++++++++++++ drivers/infiniband/hw/amso1100/c2_pd.c | 89 ++ drivers/infiniband/hw/amso1100/c2_provider.c | 869 +++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_provider.h | 181 +++++ drivers/infiniband/hw/amso1100/c2_qp.c | 975 ++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_user.h | 82 ++ 7 files changed, 3081 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_cm.c b/drivers/infiniband/hw/amso1100/c2_cm.c new file mode 100644 index 0000000..018d11f --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cm.c @@ -0,0 +1,452 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_wr.h" +#include "c2_vq.h" +#include + +int c2_llp_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct ib_qp *ibqp; + struct c2_qp *qp; + struct c2wr_qp_connect_req *wr; /* variable size needs a malloc. */ + struct c2_vq_req *vq_req; + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Associate QP <--> CM_ID */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + /* + * only support the max private_data length + */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail0; + } + /* + * Set the rdma read limits + */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* + * Create and send a WR_QP_CONNECT... + */ + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + c2_wr_set_id(wr, CCWR_QP_CONNECT); + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->qp_handle = qp->adapter_handle; + + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; + wr->remote_port = cm_id->remote_addr.sin_port; + + /* + * Move any private data from the callers's buf into + * the WR. + */ + if (iw_param->private_data) { + wr->private_data_length = + cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], iw_param->private_data, + iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* + * Send WR to adapter. NOTE: There is no synch reply from + * the adapter. + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + vq_req_free(c2dev, vq_req); + + bail1: + kfree(wr); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog) +{ + struct c2_dev *c2dev; + struct c2wr_ep_listen_create_req wr; + struct c2wr_ep_listen_create_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. 
+ */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; + wr.local_port = cm_id->local_addr.sin_port; + wr.backlog = cpu_to_be32(backlog); + wr.user_context = (u64) (unsigned long) cm_id; + + /* + * Reference the request struct. Dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = + (struct c2wr_ep_listen_create_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + if ((err = c2_errno(reply)) != 0) + goto bail1; + + /* + * Keep the adapter handle. Used in subsequent destroy + */ + cm_id->provider_data = (void*)(unsigned long) reply->ep_handle; + + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + + +int c2_llp_service_destroy(struct iw_cm_id *cm_id) +{ + + struct c2_dev *c2dev; + struct c2wr_ep_listen_destroy_req wr; + struct c2wr_ep_listen_destroy_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32)(unsigned long)cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply=(struct c2wr_ep_listen_destroy_rep *)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + if ((err = c2_errno(reply)) != 0) + goto bail1; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_llp_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp; + struct ib_qp *ibqp; + struct c2wr_cr_accept_req *wr; /* variable length WR */ + struct c2_vq_req *vq_req; + struct c2wr_cr_accept_rep *reply; /* VQ Reply msg ptr. */ + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Set the RDMA read limits */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* Allocate verbs request. 
*/ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + vq_req->qp = qp; + vq_req->cm_id = cm_id; + vq_req->event = IW_CM_EVENT_ESTABLISHED; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail2; + } + + /* Build the WR */ + c2_wr_set_id(wr, CCWR_CR_ACCEPT); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->ep_handle = (u32) (unsigned long) cm_id->provider_data; + wr->qp_handle = qp->adapter_handle; + + /* Replace the cr_handle with the QP after accept */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + cm_id->provider_data = qp; + + /* Validate private_data length */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail2; + } + + if (iw_param->private_data) { + wr->private_data_length = cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], + iw_param->private_data, iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* Reference the request struct. Dereferenced in the int handler. */ + vq_req_get(c2dev, vq_req); + + /* Send WR to adapter */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + /* Wait for reply from adapter */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + /* Check that reply is present */ + reply = (struct c2wr_cr_accept_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + if (!err) + c2_set_qp_state(qp, C2_QP_STATE_RTS); + bail2: + kfree(wr); + bail1: + vq_req_free(c2dev, vq_req); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + struct c2_dev *c2dev; + struct c2wr_cr_reject_req wr; + struct c2_vq_req *vq_req; + struct c2wr_cr_reject_rep *reply; + int err; + + c2dev = to_c2dev(cm_id->device); + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_CR_REJECT); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32) (unsigned long) cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = (struct c2wr_cr_reject_rep *) (unsigned long) + vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + err = c2_errno(reply); + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + + bail0: + vq_req_free(c2dev, vq_req); + return err; +} diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c new file mode 100644 index 0000000..d24da05 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -0,0 +1,433 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. 
+ * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_vq.h" +#include "c2_status.h" + +#define C2_CQ_MSG_SIZE ((sizeof(struct c2wr_ce) + 32-1) & ~(32-1)) + +struct c2_cq *c2_cq_get(struct c2_dev *c2dev, int cqn) +{ + struct c2_cq *cq; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + cq = c2dev->qptr_array[cqn]; + if (!cq) { + spin_unlock_irqrestore(&c2dev->lock, flags); + return NULL; + } + atomic_inc(&cq->refcount); + spin_unlock_irqrestore(&c2dev->lock, flags); + return cq; +} + +void c2_cq_put(struct c2_cq *cq) +{ + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void c2_cq_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_cq *cq; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) { + printk("discarding events on destroyed CQN=%d\n", mq_index); + return; + } + + (*cq->ibcq.comp_handler) (&cq->ibcq, cq->ibcq.cq_context); + c2_cq_put(cq); +} + +void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index) +{ + struct c2_cq *cq; + struct c2_mq *q; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) + return; + + spin_lock_irq(&cq->lock); + q = &cq->mq; + if (q && !c2_mq_empty(q)) { + u16 priv = q->priv; + struct c2wr_ce *msg; + + while (priv != be16_to_cpu(*q->shared)) { + msg = (struct c2wr_ce *) + (q->msg_pool.host + priv * q->msg_size); + if (msg->qp_user_context == (u64) (unsigned long) qp) { + msg->qp_user_context = (u64) 0; + } + priv = (priv + 1) % q->q_size; + } + } + spin_unlock_irq(&cq->lock); + c2_cq_put(cq); +} + +static inline enum ib_wc_status c2_cqe_status_to_openib(u8 status) +{ + switch (status) { + case C2_OK: + return IB_WC_SUCCESS; + case CCERR_FLUSHED: + return IB_WC_WR_FLUSH_ERR; + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return IB_WC_LOC_PROT_ERR; + case CCERR_ACCESS_VIOLATION: + return IB_WC_LOC_ACCESS_ERR; + case CCERR_TOTAL_LENGTH_TOO_BIG: + return IB_WC_LOC_LEN_ERR; + case CCERR_INVALID_WINDOW: + return IB_WC_MW_BIND_ERR; + default: + return IB_WC_GENERAL_ERR; + } +} + + 
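+/*
+ * Pull one completion message off the CQ's message queue and translate
+ * it into an ib_wc entry.  Messages whose qp_user_context was zeroed by
+ * c2_cq_clean() (the QP has already been destroyed) are skipped.
+ * Returns -EAGAIN when the queue is empty.
+ */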
+static inline int c2_poll_one(struct c2_dev *c2dev, + struct c2_cq *cq, struct ib_wc *entry) +{ + struct c2wr_ce *ce; + struct c2_qp *qp; + int is_recv = 0; + + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) { + return -EAGAIN; + } + + /* + * if the qp returned is null then this qp has already + * been freed and we are unable process the completion. + * try pulling the next message + */ + while ((qp = + (struct c2_qp *) (unsigned long) ce->qp_user_context) == NULL) { + c2_mq_free(&cq->mq); + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) + return -EAGAIN; + } + + entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce)); + entry->wr_id = ce->hdr.context; + entry->qp_num = ce->handle; + entry->wc_flags = 0; + entry->slid = 0; + entry->sl = 0; + entry->src_qp = 0; + entry->dlid_path_bits = 0; + entry->pkey_index = 0; + + switch (c2_wr_get_id(ce)) { + case C2_WR_TYPE_SEND: + entry->opcode = IB_WC_SEND; + break; + case C2_WR_TYPE_RDMA_WRITE: + entry->opcode = IB_WC_RDMA_WRITE; + break; + case C2_WR_TYPE_RDMA_READ: + entry->opcode = IB_WC_RDMA_READ; + break; + case C2_WR_TYPE_BIND_MW: + entry->opcode = IB_WC_BIND_MW; + break; + case C2_WR_TYPE_RECV: + entry->byte_len = be32_to_cpu(ce->bytes_rcvd); + entry->opcode = IB_WC_RECV; + is_recv = 1; + break; + default: + break; + } + + /* consume the WQEs */ + if (is_recv) + c2_mq_lconsume(&qp->rq_mq, 1); + else + c2_mq_lconsume(&qp->sq_mq, + be32_to_cpu(c2_wr_get_wqe_count(ce)) + 1); + + /* free the message */ + c2_mq_free(&cq->mq); + + return 0; +} + +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) +{ + struct c2_dev *c2dev = to_c2dev(ibcq->device); + struct c2_cq *cq = to_c2cq(ibcq); + unsigned long flags; + int npolled, err; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + + err = c2_poll_one(c2dev, cq, entry + npolled); + if (err) + break; + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return npolled; +} + +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct c2_mq_shared __iomem *shared; + struct c2_cq *cq; + + cq = to_c2cq(ibcq); + shared = cq->mq.peer; + + if (notify == IB_CQ_NEXT_COMP) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type); + else if (notify == IB_CQ_SOLICITED) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, &shared->notification_type); + else + return -EINVAL; + + writeb(CQ_WAIT_FOR_DMA | CQ_ARMED, &shared->armed); + + /* + * Now read back shared->armed to make the PCI + * write synchronous. This is necessary for + * correct cq notification semantics. 
+ */ + readb(&shared->armed); + + return 0; +} + +static void c2_free_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq) +{ + + dma_unmap_single(c2dev->ibdev.dma_device, pci_unmap_addr(mq, mapping), + mq->q_size * mq->msg_size, DMA_FROM_DEVICE); + free_pages((unsigned long) mq->msg_pool.host, + get_order(mq->q_size * mq->msg_size)); +} + +static int c2_alloc_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq, int q_size, + int msg_size) +{ + unsigned long pool_start; + + pool_start = __get_free_pages(GFP_KERNEL, + get_order(q_size * msg_size)); + if (!pool_start) + return -ENOMEM; + + c2_mq_rep_init(mq, + 0, /* index (currently unknown) */ + q_size, + msg_size, + (u8 *) pool_start, + NULL, /* peer (currently unknown) */ + C2_MQ_HOST_TARGET); + + mq->host_dma = dma_map_single(c2dev->ibdev.dma_device, + (void *)pool_start, + q_size * msg_size, DMA_FROM_DEVICE); + pci_unmap_addr_set(mq, mapping, mq->host_dma); + + return 0; +} + +int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq) +{ + struct c2wr_cq_create_req wr; + struct c2wr_cq_create_rep *reply; + unsigned long peer_pa; + struct c2_vq_req *vq_req; + int err; + + might_sleep(); + + cq->ibcq.cqe = entries - 1; + cq->is_kernel = !ctx; + + /* Allocate a shared pointer */ + cq->mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &cq->mq.shared_dma, GFP_KERNEL); + if (!cq->mq.shared) + return -ENOMEM; + + /* Allocate pages for the message pool */ + err = c2_alloc_cq_buf(c2dev, &cq->mq, entries + 1, C2_CQ_MSG_SIZE); + if (err) + goto bail0; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_CREATE); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.msg_size = cpu_to_be32(cq->mq.msg_size); + wr.depth = cpu_to_be32(cq->mq.q_size); + wr.shared_ht = cpu_to_be64(cq->mq.shared_dma); + wr.msg_pool = cpu_to_be64(cq->mq.host_dma); + wr.user_context = (u64) (unsigned long) (cq); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + reply = (struct c2wr_cq_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + if ((err = c2_errno(reply)) != 0) + goto bail3; + + cq->adapter_handle = reply->cq_handle; + cq->mq.index = be32_to_cpu(reply->mq_index); + + peer_pa = c2dev->pa + be32_to_cpu(reply->adapter_shared); + cq->mq.peer = ioremap_nocache(peer_pa, PAGE_SIZE); + if (!cq->mq.peer) { + err = -ENOMEM; + goto bail3; + } + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + /* + * Use the MQ index allocated by the adapter to + * store the CQ in the qptr_array + */ + cq->cqn = cq->mq.index; + c2dev->qptr_array[cq->cqn] = cq; + + return 0; + + bail3: + vq_repbuf_free(c2dev, reply); + bail2: + vq_req_free(c2dev, vq_req); + bail1: + c2_free_cq_buf(c2dev, &cq->mq); + bail0: + c2_free_mqsp(cq->mq.shared); + + return err; +} + +void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq) +{ + int err; + struct c2_vq_req *vq_req; + struct c2wr_cq_destroy_req wr; + struct c2wr_cq_destroy_rep *reply; + + might_sleep(); + + /* Clear CQ from the qptr array */ + spin_lock_irq(&c2dev->lock); + c2dev->qptr_array[cq->mq.index] = NULL; + atomic_dec(&cq->refcount); + spin_unlock_irq(&c2dev->lock); 
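+ /*
+  * The reference taken in c2_init_cq() was dropped above; wait for
+  * any event handlers that still hold a reference via c2_cq_get()
+  * to call c2_cq_put() before freeing the CQ resources.
+  */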
+ + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + goto bail0; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.cq_handle = cq->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = (struct c2wr_cq_destroy_rep *) (unsigned long) (vq_req->reply_msg); + + vq_repbuf_free(c2dev, reply); + bail1: + vq_req_free(c2dev, vq_req); + bail0: + if (cq->is_kernel) { + c2_free_cq_buf(c2dev, &cq->mq); + } + + return; +} diff --git a/drivers/infiniband/hw/amso1100/c2_pd.c b/drivers/infiniband/hw/amso1100/c2_pd.c new file mode 100644 index 0000000..b9a647a --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_pd.c @@ -0,0 +1,89 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "c2.h" +#include "c2_provider.h" + +int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd) +{ + u32 obj; + int ret = 0; + + spin_lock(&c2dev->pd_table.lock); + obj = find_next_zero_bit(c2dev->pd_table.table, c2dev->pd_table.max, + c2dev->pd_table.last); + if (obj >= c2dev->pd_table.max) + obj = find_first_zero_bit(c2dev->pd_table.table, + c2dev->pd_table.max); + if (obj < c2dev->pd_table.max) { + pd->pd_id = obj; + __set_bit(obj, c2dev->pd_table.table); + c2dev->pd_table.last = obj+1; + if (c2dev->pd_table.last >= c2dev->pd_table.max) + c2dev->pd_table.last = 0; + } else + ret = -ENOMEM; + spin_unlock(&c2dev->pd_table.lock); + return ret; +} + +void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd) +{ + spin_lock(&c2dev->pd_table.lock); + __clear_bit(pd->pd_id, c2dev->pd_table.table); + spin_unlock(&c2dev->pd_table.lock); +} + +int __devinit c2_init_pd_table(struct c2_dev *c2dev) +{ + + c2dev->pd_table.last = 0; + c2dev->pd_table.max = c2dev->props.max_pd; + spin_lock_init(&c2dev->pd_table.lock); + c2dev->pd_table.table = kmalloc(BITS_TO_LONGS(c2dev->props.max_pd) * + sizeof(long), GFP_KERNEL); + if (!c2dev->pd_table.table) + return -ENOMEM; + bitmap_zero(c2dev->pd_table.table, c2dev->props.max_pd); + return 0; +} + +void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev) +{ + kfree(c2dev->pd_table.table); +} diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c new file mode 100644 index 0000000..58dc0c5 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -0,0 +1,869 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include "c2.h" +#include "c2_provider.h" +#include "c2_user.h" + +static int c2_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct c2_dev *c2dev = to_c2dev(ibdev); + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + *props = c2dev->props; + return 0; +} + +static int c2_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 1; + props->active_speed = 1; + + return 0; +} + +static int c2_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return 0; +} + +static int c2_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + *pkey = 0; + return 0; +} + +static int c2_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct c2_dev *c2dev = to_c2dev(ibdev); + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), c2dev->pseudo_netdev->dev_addr, 6); + + return 0; +} + +/* Allocate the user context data structure. This keeps track + * of all objects associated with a particular user-mode client. 
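+ * The uverbs core associates the returned ib_ucontext with the
+ * client's open device file and passes it back on subsequent verb
+ * calls; to_c2ucontext() recovers the enclosing c2_ucontext.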
+ */ +static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct c2_ucontext *context; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + + return &context->ibucontext; +} + +static int c2_dealloc_ucontext(struct ib_ucontext *context) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + kfree(context); + return 0; +} + +static int c2_mmap_uar(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct c2_pd *pd; + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + pd = kmalloc(sizeof(*pd), GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = c2_pd_alloc(to_c2dev(ibdev), !context, pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + if (context) { + if (ib_copy_to_udata(udata, &pd->pd_id, sizeof(__u32))) { + c2_pd_free(to_c2dev(ibdev), pd); + kfree(pd); + return ERR_PTR(-EFAULT); + } + } + + return &pd->ibpd; +} + +static int c2_dealloc_pd(struct ib_pd *pd) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + c2_pd_free(to_c2dev(pd->device), to_c2pd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *c2_ah_create(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return ERR_PTR(-ENOSYS); +} + +static int c2_ah_destroy(struct ib_ah *ah) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static void c2_add_ref(struct ib_qp *ibqp) +{ + struct c2_qp *qp; + BUG_ON(!ibqp); + qp = to_c2qp(ibqp); + atomic_inc(&qp->refcount); +} + +static void c2_rem_ref(struct ib_qp *ibqp) +{ + struct c2_qp *qp; + BUG_ON(!ibqp); + qp = to_c2qp(ibqp); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +struct ib_qp *c2_get_qp(struct ib_device *device, int qpn) +{ + struct c2_dev* c2dev = to_c2dev(device); + struct c2_qp *qp; + + qp = c2_find_qpn(c2dev, qpn); + pr_debug("%s Returning QP=%p for QPN=%d, device=%p, refcount=%d\n", + __FUNCTION__, qp, qpn, device, + (qp?atomic_read(&qp->refcount):0)); + + return (qp?&qp->ibqp:NULL); +} + +static struct ib_qp *c2_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + struct c2_qp *qp; + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + switch (init_attr->qp_type) { + case IB_QPT_RC: + qp = kzalloc(sizeof(*qp), GFP_KERNEL); + if (!qp) { + pr_debug("%s: Unable to allocate QP\n", __FUNCTION__); + return ERR_PTR(-ENOMEM); + } + spin_lock_init(&qp->lock); + if (pd->uobject) { + /* userspace specific */ + } + + err = c2_alloc_qp(to_c2dev(pd->device), + to_c2pd(pd), init_attr, qp); + + if (err && pd->uobject) { + /* userspace specific */ + } + + break; + default: + pr_debug("%s: Invalid QP type: %d\n", __FUNCTION__, + init_attr->qp_type); + return ERR_PTR(-EINVAL); + break; + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + return &qp->ibqp; +} + +static int c2_destroy_qp(struct ib_qp *ib_qp) +{ + struct c2_qp *qp = to_c2qp(ib_qp); + + pr_debug("%s:%u qp=%p,qp->state=%d\n", + __FUNCTION__, __LINE__,ib_qp,qp->state); + c2_free_qp(to_c2dev(ib_qp->device), qp); + kfree(qp); + return 0; +} + +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct c2_cq *cq; + int err; + + cq = 
kmalloc(sizeof(*cq), GFP_KERNEL); + if (!cq) { + pr_debug("%s: Unable to allocate CQ\n", __FUNCTION__); + return ERR_PTR(-ENOMEM); + } + + err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq); + if (err) { + pr_debug("%s: error initializing CQ\n", __FUNCTION__); + kfree(cq); + return ERR_PTR(err); + } + + return &cq->ibcq; +} + +static int c2_destroy_cq(struct ib_cq *ib_cq) +{ + struct c2_cq *cq = to_c2cq(ib_cq); + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + c2_free_cq(to_c2dev(ib_cq->device), cq); + kfree(cq); + + return 0; +} + +static inline u32 c2_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? C2_ACF_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? C2_ACF_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? C2_ACF_LOCAL_WRITE : 0) | + C2_ACF_LOCAL_READ | C2_ACF_WINDOW_BIND; +} + +static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, int acc, u64 * iova_start) +{ + struct c2_mr *mr; + u64 *page_list; + u32 total_len; + int err, i, j, k, page_shift, pbl_depth; + + pbl_depth = 0; + total_len = 0; + + page_shift = PAGE_SHIFT; + /* + * If there is only 1 buffer we assume this could + * be a map of all phy mem...use a 32k page_shift. + */ + if (num_phys_buf == 1) + page_shift += 3; + + for (i = 0; i < num_phys_buf; i++) { + + if (buffer_list[i].addr & ~PAGE_MASK) { + pr_debug("Unaligned Memory Buffer: 0x%x\n", + (unsigned int) buffer_list[i].addr); + return ERR_PTR(-EINVAL); + } + + if (!buffer_list[i].size) { + pr_debug("Invalid Buffer Size\n"); + return ERR_PTR(-EINVAL); + } + + total_len += buffer_list[i].size; + pbl_depth += ALIGN(buffer_list[i].size, + (1 << page_shift)) >> page_shift; + } + + page_list = vmalloc(sizeof(u64) * pbl_depth); + if (!page_list) { + pr_debug("couldn't vmalloc page_list of size %zd\n", + (sizeof(u64) * pbl_depth)); + return ERR_PTR(-ENOMEM); + } + + for (i = 0, j = 0; i < num_phys_buf; i++) { + + int naddrs; + + naddrs = ALIGN(buffer_list[i].size, + (1 << page_shift)) >> page_shift; + for (k = 0; k < naddrs; k++) + page_list[j++] = (buffer_list[i].addr + + (k << page_shift)); + } + + mr = kmalloc(sizeof(*mr), GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + mr->pd = to_c2pd(ib_pd); + pr_debug("%s - page shift %d, pbl_depth %d, total_len %u, " + "*iova_start %llx, first pa %llx, last pa %llx\n", + __FUNCTION__, page_shift, pbl_depth, total_len, + *iova_start, page_list[0], page_list[pbl_depth-1]); + err = c2_nsmr_register_phys_kern(to_c2dev(ib_pd->device), page_list, + (1 << page_shift), pbl_depth, + total_len, 0, iova_start, + c2_convert_access(acc), mr); + vfree(page_list); + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva = 0; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* AMSO1100 limit */ + bl.size = 0xffffffff; + bl.addr = 0; + return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); +} + +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + u64 *pages; + u64 kva = 0; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct c2_pd *c2pd = to_c2pd(pd); + struct c2_mr *c2mr; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + shift = ffs(region->page_size) - 1; + + c2mr = kmalloc(sizeof(*c2mr), GFP_KERNEL); + if (!c2mr) + return ERR_PTR(-ENOMEM); + c2mr->pd = c2pd; + + n = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) + n += 
chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + i = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) { + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = + sg_dma_address(&chunk->page_list[j]) + + (region->page_size * k); + } + } + } + + kva = (u64)region->virt_base; + err = c2_nsmr_register_phys_kern(to_c2dev(pd->device), + pages, + region->page_size, + i, + region->length, + region->offset, + &kva, + c2_convert_access(acc), + c2mr); + kfree(pages); + if (err) { + kfree(c2mr); + return ERR_PTR(err); + } + return &c2mr->ibmr; + +err: + kfree(c2mr); + return ERR_PTR(err); +} + +static int c2_dereg_mr(struct ib_mr *ib_mr) +{ + struct c2_mr *mr = to_c2mr(ib_mr); + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); + if (err) + pr_debug("c2_stag_dealloc failed: %d\n", err); + else + kfree(mr); + + return err; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%x\n", dev->props.hw_ver); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%x.%x.%x\n", + (int) (dev->props.fw_ver >> 32), + (int) (dev->props.fw_ver >> 16) & 0xffff, + (int) (dev->props.fw_ver & 0xffff)); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "AMSO1100\n"); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID"); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *c2_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +static int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask) +{ + int err; + + err = + c2_qp_modify(to_c2dev(ibqp->device), to_c2qp(ibqp), attr, + attr_mask); + + return err; +} + +static int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* Request a connection */ + return c2_llp_connect(cm_id, iw_param); +} + +static int c2_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* Accept the new 
connection */ + return c2_llp_accept(cm_id, iw_param); +} + +static int c2_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + err = c2_llp_reject(cm_id, pdata, pdata_len); + return err; +} + +static int c2_service_create(struct iw_cm_id *cm_id, int backlog) +{ + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + err = c2_llp_service_create(cm_id, backlog); + pr_debug("%s:%u err=%d\n", + __FUNCTION__, __LINE__, + err); + return err; +} + +static int c2_service_destroy(struct iw_cm_id *cm_id) +{ + int err; + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + err = c2_llp_service_destroy(cm_id); + + return err; +} + +static int c2_pseudo_up(struct net_device *netdev) +{ + struct in_device *ind; + struct c2_dev *c2dev = netdev->priv; + + ind = in_dev_get(netdev); + if (!ind) + return 0; + + pr_debug("adding...\n"); + for_ifa(ind) { +#ifdef DEBUG + u8 *ip = (u8 *) & ifa->ifa_address; + + pr_debug("%s: %d.%d.%d.%d\n", + ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]); +#endif + c2_add_addr(c2dev, ifa->ifa_address, ifa->ifa_mask); + } + endfor_ifa(ind); + in_dev_put(ind); + + return 0; +} + +static int c2_pseudo_down(struct net_device *netdev) +{ + struct in_device *ind; + struct c2_dev *c2dev = netdev->priv; + + ind = in_dev_get(netdev); + if (!ind) + return 0; + + pr_debug("deleting...\n"); + for_ifa(ind) { +#ifdef DEBUG + u8 *ip = (u8 *) & ifa->ifa_address; + + pr_debug("%s: %d.%d.%d.%d\n", + ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]); +#endif + c2_del_addr(c2dev, ifa->ifa_address, ifa->ifa_mask); + } + endfor_ifa(ind); + in_dev_put(ind); + + return 0; +} + +static int c2_pseudo_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + kfree_skb(skb); + return NETDEV_TX_OK; +} + +static int c2_pseudo_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + /* TODO: Tell rnic about new rmda interface mtu */ + return ret; +} + +static void setup(struct net_device *netdev) +{ + SET_MODULE_OWNER(netdev); + netdev->open = c2_pseudo_up; + netdev->stop = c2_pseudo_down; + netdev->hard_start_xmit = c2_pseudo_xmit_frame; + netdev->get_stats = NULL; + netdev->tx_timeout = NULL; + netdev->set_mac_address = NULL; + netdev->change_mtu = c2_pseudo_change_mtu; + netdev->watchdog_timeo = 0; + netdev->type = ARPHRD_ETHER; + netdev->mtu = 1500; + netdev->hard_header_len = ETH_HLEN; + netdev->addr_len = ETH_ALEN; + netdev->tx_queue_len = 0; + netdev->flags |= IFF_NOARP; + return; +} + +static struct net_device *c2_pseudo_netdev_init(struct c2_dev *c2dev) +{ + char name[IFNAMSIZ]; + struct net_device *netdev; + + /* change ethxxx to iwxxx */ + strcpy(name, "iw"); + strcat(name, &c2dev->netdev->name[3]); + netdev = alloc_netdev(sizeof(*netdev), name, setup); + if (!netdev) { + printk(KERN_ERR PFX "%s - etherdev alloc failed", + __FUNCTION__); + return NULL; + } + + netdev->priv = c2dev; + + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + memcpy_fromio(netdev->dev_addr, c2dev->kva + C2_REGS_RDMA_ENADDR, 6); + + /* Print out the MAC address */ + pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X\n", + netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5]); + +#if 0 + /* Disable network packets */ + netif_stop_queue(netdev); +#endif + return netdev; +} + +int c2_register_device(struct c2_dev *dev) +{ + int ret; + int i; + + /* 
Register pseudo network device */ + dev->pseudo_netdev = c2_pseudo_netdev_init(dev); + if (dev->pseudo_netdev) { + ret = register_netdev(dev->pseudo_netdev); + if (ret) { + printk(KERN_ERR PFX + "Unable to register netdev, ret = %d\n", ret); + free_netdev(dev->pseudo_netdev); + return ret; + } + } + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX); + dev->ibdev.owner = THIS_MODULE; + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + + dev->ibdev.node_type = RDMA_NODE_RNIC; + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6); + dev->ibdev.phys_port_cnt = 1; + dev->ibdev.dma_device = &dev->pcidev->dev; + dev->ibdev.class_dev.dev = &dev->pcidev->dev; + dev->ibdev.query_device = c2_query_device; + dev->ibdev.query_port = c2_query_port; + dev->ibdev.modify_port = c2_modify_port; + dev->ibdev.query_pkey = c2_query_pkey; + dev->ibdev.query_gid = c2_query_gid; + dev->ibdev.alloc_ucontext = c2_alloc_ucontext; + dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext; + dev->ibdev.mmap = c2_mmap_uar; + dev->ibdev.alloc_pd = c2_alloc_pd; + dev->ibdev.dealloc_pd = c2_dealloc_pd; + dev->ibdev.create_ah = c2_ah_create; + dev->ibdev.destroy_ah = c2_ah_destroy; + dev->ibdev.create_qp = c2_create_qp; + dev->ibdev.modify_qp = c2_modify_qp; + dev->ibdev.destroy_qp = c2_destroy_qp; + dev->ibdev.create_cq = c2_create_cq; + dev->ibdev.destroy_cq = c2_destroy_cq; + dev->ibdev.poll_cq = c2_poll_cq; + dev->ibdev.get_dma_mr = c2_get_dma_mr; + dev->ibdev.reg_phys_mr = c2_reg_phys_mr; + dev->ibdev.reg_user_mr = c2_reg_user_mr; + dev->ibdev.dereg_mr = c2_dereg_mr; + + dev->ibdev.alloc_fmr = NULL; + dev->ibdev.unmap_fmr = NULL; + dev->ibdev.dealloc_fmr = NULL; + dev->ibdev.map_phys_fmr = NULL; + + dev->ibdev.attach_mcast = c2_multicast_attach; + dev->ibdev.detach_mcast = c2_multicast_detach; + dev->ibdev.process_mad = c2_process_mad; + + dev->ibdev.req_notify_cq = c2_arm_cq; + dev->ibdev.post_send = c2_post_send; + dev->ibdev.post_recv = c2_post_receive; + + dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL); + dev->ibdev.iwcm->add_ref = c2_add_ref; + dev->ibdev.iwcm->rem_ref = c2_rem_ref; + dev->ibdev.iwcm->get_qp = c2_get_qp; + dev->ibdev.iwcm->connect = c2_connect; + dev->ibdev.iwcm->accept = c2_accept; + dev->ibdev.iwcm->reject = c2_reject; + dev->ibdev.iwcm->create_listen = c2_service_create; + dev->ibdev.iwcm->destroy_listen = c2_service_destroy; + + ret = ib_register_device(&dev->ibdev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + c2_class_attributes[i]); + if (ret) { + unregister_netdev(dev->pseudo_netdev); + free_netdev(dev->pseudo_netdev); + ib_unregister_device(&dev->ibdev); + 
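+			/*
+			 * A sysfs attribute could not be created: the netdev
+			 * and IB device registrations have been unwound
+			 * above, so just propagate the error.
+			 */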
return ret; + } + } + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return 0; +} + +void c2_unregister_device(struct c2_dev *dev) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + unregister_netdev(dev->pseudo_netdev); + free_netdev(dev->pseudo_netdev); + ib_unregister_device(&dev->ibdev); +} diff --git a/drivers/infiniband/hw/amso1100/c2_provider.h b/drivers/infiniband/hw/amso1100/c2_provider.h new file mode 100644 index 0000000..0fb6f1c --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_provider.h @@ -0,0 +1,181 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_PROVIDER_H +#define C2_PROVIDER_H +#include + +#include +#include + +#include "c2_mq.h" +#include + +#define C2_MPT_FLAG_ATOMIC (1 << 14) +#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define C2_MPT_FLAG_REMOTE_READ (1 << 12) +#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define C2_MPT_FLAG_LOCAL_READ (1 << 10) + +struct c2_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + + +/* The user context keeps track of objects allocated for a + * particular user-mode client. */ +struct c2_ucontext { + struct ib_ucontext ibucontext; +}; + +struct c2_mtt; + +/* All objects associated with a PD are kept in the + * associated user context if present. 
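+ * The pd_id is handed out by the bitmap allocator in c2_pd.c
+ * (c2_pd_alloc()) and, for userspace clients, copied back through
+ * ib_udata in c2_alloc_pd().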
+ */ +struct c2_pd { + struct ib_pd ibpd; + u32 pd_id; +}; + +struct c2_mr { + struct ib_mr ibmr; + struct c2_pd *pd; +}; + +struct c2_av; + +enum c2_ah_type { + C2_AH_ON_HCA, + C2_AH_PCI_POOL, + C2_AH_KMALLOC +}; + +struct c2_ah { + struct ib_ah ibah; +}; + +struct c2_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int is_kernel; + wait_queue_head_t wait; + + u32 adapter_handle; + struct c2_mq mq; +}; + +struct c2_wq { + spinlock_t lock; +}; +struct iw_cm_id; +struct c2_qp { + struct ib_qp ibqp; + struct iw_cm_id *cm_id; + spinlock_t lock; + atomic_t refcount; + wait_queue_head_t wait; + int qpn; + + u32 adapter_handle; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u8 state; + + struct c2_mq sq_mq; + struct c2_mq rq_mq; +}; + +struct c2_cr_query_attrs { + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +}; + +static inline struct c2_pd *to_c2pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct c2_pd, ibpd); +} + +static inline struct c2_ucontext *to_c2ucontext(struct ib_ucontext *ibucontext) +{ + return container_of(ibucontext, struct c2_ucontext, ibucontext); +} + +static inline struct c2_mr *to_c2mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct c2_mr, ibmr); +} + + +static inline struct c2_ah *to_c2ah(struct ib_ah *ibah) +{ + return container_of(ibah, struct c2_ah, ibah); +} + +static inline struct c2_cq *to_c2cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct c2_cq, ibcq); +} + +static inline struct c2_qp *to_c2qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct c2_qp, ibqp); +} + +static inline int is_rnic_addr(struct net_device *netdev, u32 addr) +{ + struct in_device *ind; + int ret = 0; + + ind = in_dev_get(netdev); + if (!ind) + return 0; + + for_ifa(ind) { + if (ifa->ifa_address == addr) { + ret = 1; + break; + } + } + endfor_ifa(ind); + in_dev_put(ind); + return ret; +} +#endif /* C2_PROVIDER_H */ diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c new file mode 100644 index 0000000..76a60bc --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_qp.c @@ -0,0 +1,975 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#include "c2.h" +#include "c2_vq.h" +#include "c2_status.h" + +#define C2_MAX_ORD_PER_QP 128 +#define C2_MAX_IRD_PER_QP 128 + +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + +#define NO_SUPPORT -1 +static const u8 c2_opcode[] = { + [IB_WR_SEND] = C2_WR_TYPE_SEND, + [IB_WR_SEND_WITH_IMM] = NO_SUPPORT, + [IB_WR_RDMA_WRITE] = C2_WR_TYPE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = NO_SUPPORT, + [IB_WR_RDMA_READ] = C2_WR_TYPE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = NO_SUPPORT, + [IB_WR_ATOMIC_FETCH_AND_ADD] = NO_SUPPORT, +}; + +static int to_c2_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + return C2_QP_STATE_IDLE; + case IB_QPS_RTS: + return C2_QP_STATE_RTS; + case IB_QPS_SQD: + return C2_QP_STATE_CLOSING; + case IB_QPS_SQE: + return C2_QP_STATE_CLOSING; + case IB_QPS_ERR: + return C2_QP_STATE_ERROR; + default: + return -1; + } +} + +int to_ib_state(enum c2_qp_state c2_state) +{ + switch (c2_state) { + case C2_QP_STATE_IDLE: + return IB_QPS_RESET; + case C2_QP_STATE_CONNECTING: + return IB_QPS_RTR; + case C2_QP_STATE_RTS: + return IB_QPS_RTS; + case C2_QP_STATE_CLOSING: + return IB_QPS_SQD; + case C2_QP_STATE_ERROR: + return IB_QPS_ERR; + case C2_QP_STATE_TERMINATE: + return IB_QPS_SQE; + default: + return -1; + } +} + +const char *to_ib_state_str(int ib_state) +{ + static const char *state_str[] = { + "IB_QPS_RESET", + "IB_QPS_INIT", + "IB_QPS_RTR", + "IB_QPS_RTS", + "IB_QPS_SQD", + "IB_QPS_SQE", + "IB_QPS_ERR" + }; + if (ib_state < IB_QPS_RESET || + ib_state > IB_QPS_ERR) + return ""; + + ib_state -= IB_QPS_RESET; + return state_str[ib_state]; +} + +void c2_set_qp_state(struct c2_qp *qp, int c2_state) +{ + int new_state = to_ib_state(c2_state); + + pr_debug("%s: qp[%p] state modify %s --> %s\n", + __FUNCTION__, + qp, + to_ib_state_str(qp->state), + to_ib_state_str(new_state)); + qp->state = new_state; +} + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask) +{ + struct c2wr_qp_modify_req wr; + struct c2wr_qp_modify_rep *reply; + struct c2_vq_req *vq_req; + unsigned long flags; + u8 next_state; + int err; + + pr_debug("%s:%d qp=%p, %s --> %s\n", + __FUNCTION__, __LINE__, + qp, + to_ib_state_str(qp->state), + to_ib_state_str(attr->qp_state)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_QP_MODIFY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + wr.ord = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.ird = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + + if (attr_mask & IB_QP_STATE) { + /* Ensure the state is valid */ + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + + wr.next_qp_state = 
cpu_to_be32(to_c2_state(attr->qp_state)); + + if (attr->qp_state == IB_QPS_ERR) { + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id && qp->state == IB_QPS_RTS) { + pr_debug("Generating CLOSE event for QP-->ERR, " + "qp=%p, cm_id=%p\n",qp,qp->cm_id); + /* Generate an CLOSE event */ + vq_req->cm_id = qp->cm_id; + vq_req->event = IW_CM_EVENT_CLOSE; + } + spin_unlock_irqrestore(&qp->lock, flags); + } + next_state = attr->qp_state; + + } else if (attr_mask & IB_QP_CUR_STATE) { + + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + wr.next_qp_state = + cpu_to_be32(to_c2_state(attr->cur_qp_state)); + + next_state = attr->cur_qp_state; + + } else { + err = 0; + goto bail0; + } + + /* reference the request struct */ + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + reply = (struct c2wr_qp_modify_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + if (!err) + qp->state = next_state; +#ifdef DEBUG + else + pr_debug("%s: c2_errno=%d\n", __FUNCTION__, err); +#endif + /* + * If we're going to error and generating the event here, then + * we need to remove the reference because there will be no + * close event generated by the adapter + */ + spin_lock_irqsave(&qp->lock, flags); + if (vq_req->event==IW_CM_EVENT_CLOSE && qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + + pr_debug("%s:%d qp=%p, cur_state=%s\n", + __FUNCTION__, __LINE__, + qp, + to_ib_state_str(qp->state)); + return err; +} + +int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp, + int ord, int ird) +{ + struct c2wr_qp_modify_req wr; + struct c2wr_qp_modify_rep *reply; + struct c2_vq_req *vq_req; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_QP_MODIFY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + wr.ord = cpu_to_be32(ord); + wr.ird = cpu_to_be32(ird); + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.next_qp_state = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + + /* reference the request struct */ + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + reply = (struct c2wr_qp_modify_rep *) (unsigned long) + vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +static int destroy_qp(struct c2_dev *c2dev, struct c2_qp *qp) +{ + struct c2_vq_req *vq_req; + struct c2wr_qp_destroy_req wr; + struct c2wr_qp_destroy_rep *reply; + unsigned long flags; + int err; + + /* + * Allocate a verb request message + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Initialize the WR + */ + c2_wr_set_id(&wr, CCWR_QP_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + + /* 
+ * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id && qp->state == IB_QPS_RTS) { + pr_debug("destroy_qp: generating CLOSE event for QP-->ERR, " + "qp=%p, cm_id=%p\n",qp,qp->cm_id); + /* Generate an CLOSE event */ + vq_req->qp = qp; + vq_req->cm_id = qp->cm_id; + vq_req->event = IW_CM_EVENT_CLOSE; + } + spin_unlock_irqrestore(&qp->lock, flags); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_qp_destroy_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +static int c2_alloc_qpn(struct c2_dev *c2dev, struct c2_qp *qp) +{ + int ret; + + do { + spin_lock_irq(&c2dev->qp_table.lock); + ret = idr_get_new_above(&c2dev->qp_table.idr, qp, + c2dev->qp_table.last++, &qp->qpn); + spin_unlock_irq(&c2dev->qp_table.lock); + } while ((ret == -EAGAIN) && + idr_pre_get(&c2dev->qp_table.idr, GFP_KERNEL)); + return ret; +} + +static void c2_free_qpn(struct c2_dev *c2dev, int qpn) +{ + spin_lock_irq(&c2dev->qp_table.lock); + idr_remove(&c2dev->qp_table.idr, qpn); + spin_unlock_irq(&c2dev->qp_table.lock); +} + +struct c2_qp *c2_find_qpn(struct c2_dev *c2dev, int qpn) +{ + unsigned long flags; + struct c2_qp *qp; + + spin_lock_irqsave(&c2dev->qp_table.lock, flags); + qp = idr_find(&c2dev->qp_table.idr, qpn); + spin_unlock_irqrestore(&c2dev->qp_table.lock, flags); + return qp; +} + +int c2_alloc_qp(struct c2_dev *c2dev, + struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp) +{ + struct c2wr_qp_create_req wr; + struct c2wr_qp_create_rep *reply; + struct c2_vq_req *vq_req; + struct c2_cq *send_cq = to_c2cq(qp_attrs->send_cq); + struct c2_cq *recv_cq = to_c2cq(qp_attrs->recv_cq); + unsigned long peer_pa; + u32 q_size, msg_size, mmap_size; + void __iomem *mmap; + int err; + + err = c2_alloc_qpn(c2dev, qp); + if (err) + return err; + qp->ibqp.qp_num = qp->qpn; + qp->ibqp.qp_type = IB_QPT_RC; + + /* Allocate the SQ and RQ shared pointers */ + qp->sq_mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &qp->sq_mq.shared_dma, GFP_KERNEL); + if (!qp->sq_mq.shared) { + err = -ENOMEM; + goto bail0; + } + + qp->rq_mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &qp->rq_mq.shared_dma, GFP_KERNEL); + if (!qp->rq_mq.shared) { + err = -ENOMEM; + goto bail1; + } + + /* Allocate the verbs request */ + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + err = -ENOMEM; + goto bail2; + } + + /* Initialize the work request */ + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_QP_CREATE); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.sq_cq_handle = send_cq->adapter_handle; + wr.rq_cq_handle = recv_cq->adapter_handle; + wr.sq_depth = cpu_to_be32(qp_attrs->cap.max_send_wr + 1); + wr.rq_depth = cpu_to_be32(qp_attrs->cap.max_recv_wr + 1); + wr.srq_handle = 0; + wr.flags = cpu_to_be32(QP_RDMA_READ | QP_RDMA_WRITE | QP_MW_BIND | + QP_ZERO_STAG | QP_RDMA_READ_RESPONSE); + wr.send_sgl_depth = 
cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.recv_sgl_depth = cpu_to_be32(qp_attrs->cap.max_recv_sge); + wr.rdma_write_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.shared_sq_ht = cpu_to_be64(qp->sq_mq.shared_dma); + wr.shared_rq_ht = cpu_to_be64(qp->rq_mq.shared_dma); + wr.ord = cpu_to_be32(C2_MAX_ORD_PER_QP); + wr.ird = cpu_to_be32(C2_MAX_IRD_PER_QP); + wr.pd_id = pd->pd_id; + wr.user_context = (unsigned long) qp; + + vq_req_get(c2dev, vq_req); + + /* Send the WR to the adapter */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail3; + } + + /* Wait for the verb reply */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail3; + } + + /* Process the reply */ + reply = (struct c2wr_qp_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail3; + } + + if ((err = c2_wr_get_result(reply)) != 0) { + goto bail4; + } + + /* Fill in the kernel QP struct */ + atomic_set(&qp->refcount, 1); + qp->adapter_handle = reply->qp_handle; + qp->state = IB_QPS_RESET; + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; + + /* Initialize the SQ MQ */ + q_size = be32_to_cpu(reply->sq_depth); + msg_size = be32_to_cpu(reply->sq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->sq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail5; + } + + c2_mq_req_init(&qp->sq_mq, + be32_to_cpu(reply->sq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + /* Initialize the RQ mq */ + q_size = be32_to_cpu(reply->rq_depth); + msg_size = be32_to_cpu(reply->rq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->rq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail6; + } + + c2_mq_req_init(&qp->rq_mq, + be32_to_cpu(reply->rq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + + bail6: + iounmap(qp->sq_mq.peer); + bail5: + destroy_qp(c2dev, qp); + bail4: + vq_repbuf_free(c2dev, reply); + bail3: + vq_req_free(c2dev, vq_req); + bail2: + c2_free_mqsp(qp->rq_mq.shared); + bail1: + c2_free_mqsp(qp->sq_mq.shared); + bail0: + c2_free_qpn(c2dev, qp->qpn); + return err; +} + +void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp) +{ + struct c2_cq *send_cq; + struct c2_cq *recv_cq; + + send_cq = to_c2cq(qp->ibqp.send_cq); + recv_cq = to_c2cq(qp->ibqp.recv_cq); + + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. + */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + c2_free_qpn(c2dev, qp->qpn); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + + /* + * Destory qp in the rnic... + */ + destroy_qp(c2dev, qp); + + /* + * Mark any unreaped CQEs as null and void. + */ + c2_cq_clean(c2dev, qp, send_cq->cqn); + if (send_cq != recv_cq) + c2_cq_clean(c2dev, qp, recv_cq->cqn); + /* + * Unmap the MQs and return the shared pointers + * to the message pool. 
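+	 * Finally drop the reference taken in c2_alloc_qp() and wait for
+	 * the iwcm (c2_add_ref()/c2_rem_ref()) to release any references
+	 * it still holds before the caller frees the qp structure.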
+ */ + iounmap(qp->sq_mq.peer); + iounmap(qp->rq_mq.peer); + c2_free_mqsp(qp->sq_mq.shared); + c2_free_mqsp(qp->rq_mq.shared); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); +} + +/* + * Function: move_sgl + * + * Description: + * Move an SGL from the user's work request struct into a CCIL Work Request + * message, swapping to WR byte order and ensure the total length doesn't + * overflow. + * + * IN: + * dst - ptr to CCIL Work Request message SGL memory. + * src - ptr to the consumers SGL memory. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int +move_sgl(struct c2_data_addr * dst, struct ib_sge *src, int count, u32 * p_len, + u8 * actual_count) +{ + u32 tot = 0; /* running total */ + u8 acount = 0; /* running total non-0 len sge's */ + + while (count > 0) { + /* + * If the addition of this SGE causes the + * total SGL length to exceed 2^32-1, then + * fail-n-bail. + * + * If the current total plus the next element length + * wraps, then it will go negative and be less than the + * current total... + */ + if ((tot + src->length) < tot) { + return -EINVAL; + } + /* + * Bug: 1456 (as well as 1498 & 1643) + * Skip over any sge's supplied with len=0 + */ + if (src->length) { + tot += src->length; + dst->stag = cpu_to_be32(src->lkey); + dst->to = cpu_to_be64(src->addr); + dst->length = cpu_to_be32(src->length); + dst++; + acount++; + } + src++; + count--; + } + + if (acount == 0) { + /* + * Bug: 1476 (as well as 1498, 1456 and 1643) + * Setup the SGL in the WR to make it easier for the RNIC. + * This way, the FW doesn't have to deal with special cases. + * Setting length=0 should be sufficient. + */ + dst->stag = 0; + dst->to = 0; + dst->length = 0; + } + + *p_len = tot; + *actual_count = acount; + return 0; +} + +/* + * Function: c2_activity (private function) + * + * Description: + * Post an mq index to the host->adapter activity fifo. + * + * IN: + * c2dev - ptr to c2dev structure + * mq_index - mq index to post + * shared - value most recently written to shared + * + * OUT: + * + * Return: + * none + */ +static inline void c2_activity(struct c2_dev *c2dev, u32 mq_index, u16 shared) +{ + /* + * First read the register to see if the FIFO is full, and if so, + * spin until it's not. This isn't perfect -- there is no + * synchronization among the clients of the register, but in + * practice it prevents multiple CPU from hammering the bus + * with PCI RETRY. Note that when this does happen, the card + * cannot get on the bus and the card and system hang in a + * deadlock -- thus the need for this code. [TOT] + */ + while (readl(c2dev->regs + PCI_BAR0_ADAPTER_HINT) & 0x80000000) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(0); + } + + __raw_writel(C2_HINT_MAKE(mq_index, shared), + c2dev->regs + PCI_BAR0_ADAPTER_HINT); +} + +/* + * Function: qp_wr_post + * + * Description: + * This in-line function allocates a MQ msg, then moves the host-copy of + * the completed WR into msg. Then it posts the message. + * + * IN: + * q - ptr to user MQ. + * wr - ptr to host-copy of the WR. + * qp - ptr to user qp + * size - Number of bytes to post. Assumed to be divisible by 4. + * + * OUT: none + * + * Return: + * CCIL status codes. 
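+ *
+ * Note:
+ *   A complete post thus consists of building the host copy of the WR,
+ *   calling qp_wr_post() to copy it into the adapter MQ, and then
+ *   calling c2_activity() to publish the MQ index to the adapter's
+ *   hint FIFO; see c2_post_send() and c2_post_receive() below.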
+ */ +static int qp_wr_post(struct c2_mq *q, union c2wr * wr, struct c2_qp *qp, u32 size) +{ + union c2wr *msg; + + msg = c2_mq_alloc(q); + if (msg == NULL) { + return -EINVAL; + } +#ifdef CCMSGMAGIC + ((c2wr_hdr_t *) wr)->magic = cpu_to_be32(CCWR_MAGIC); +#endif + + /* + * Since all header fields in the WR are the same as the + * CQE, set the following so the adapter need not. + */ + c2_wr_set_result(wr, CCERR_PENDING); + + /* + * Copy the wr down to the adapter + */ + memcpy((void *) msg, (void *) wr, size); + + c2_mq_produce(q); + return 0; +} + + +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + u32 flags; + u32 tot_len; + u8 actual_sge_count; + u32 msg_size; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + while (ib_wr) { + + flags = 0; + wr.sqwr.sq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + if (ib_wr->send_flags & IB_SEND_SIGNALED) { + flags |= SQ_SIGNALED; + } + + switch (ib_wr->opcode) { + case IB_WR_SEND: + if (ib_wr->send_flags & IB_SEND_SOLICITED) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); + msg_size = sizeof(struct c2wr_send_req); + } else { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND); + msg_size = sizeof(struct c2wr_send_req); + } + + wr.sqwr.send.remote_stag = 0; + msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + if (ib_wr->num_sge > qp->send_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + err = move_sgl((struct c2_data_addr *) & (wr.sqwr.send.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_WRITE: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_WRITE); + msg_size = sizeof(struct c2wr_rdma_write_req) + + (sizeof(struct c2_data_addr) * ib_wr->num_sge); + if (ib_wr->num_sge > qp->rdma_write_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + wr.sqwr.rdma_write.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_write.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + err = move_sgl((struct c2_data_addr *) + & (wr.sqwr.rdma_write.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_READ: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_READ); + msg_size = sizeof(struct c2wr_rdma_read_req); + + /* IWarp only suppots 1 sge for RDMA reads */ + if (ib_wr->num_sge > 1) { + err = -EINVAL; + break; + } + + /* + * Move the local and remote stag/to/len into the WR. + */ + wr.sqwr.rdma_read.local_stag = + cpu_to_be32(ib_wr->sg_list->lkey); + wr.sqwr.rdma_read.local_to = + cpu_to_be64(ib_wr->sg_list->addr); + wr.sqwr.rdma_read.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_read.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + wr.sqwr.rdma_read.length = + cpu_to_be32(ib_wr->sg_list->length); + break; + default: + /* error */ + msg_size = 0; + err = -EINVAL; + break; + } + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + /* + * Store flags + */ + c2_wr_set_flags(&wr, flags); + + /* + * Post the puppy! 
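+		 * (copy the finished WR into the send MQ; the adapter is
+		 * then notified via c2_activity() below)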
+ */ + err = qp_wr_post(&qp->sq_mq, &wr, qp, msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO. + */ + c2_activity(c2dev, qp->sq_mq.index, qp->sq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + /* + * Try and post each work request + */ + while (ib_wr) { + u32 tot_len; + u8 actual_sge_count; + + if (ib_wr->num_sge > qp->recv_sgl_depth) { + err = -EINVAL; + break; + } + + /* + * Create local host-copy of the WR + */ + wr.rqwr.rq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + c2_wr_set_id(&wr, CCWR_RECV); + c2_wr_set_flags(&wr, 0); + + /* sge_count is limited to eight bits. */ + BUG_ON(ib_wr->num_sge >= 256); + err = move_sgl((struct c2_data_addr *) & (wr.rqwr.data), + ib_wr->sg_list, + ib_wr->num_sge, &tot_len, &actual_sge_count); + c2_wr_set_sge_count(&wr, actual_sge_count); + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + err = qp_wr_post(&qp->rq_mq, &wr, qp, qp->rq_mq.msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO + */ + c2_activity(c2dev, qp->rq_mq.index, qp->rq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +void __devinit c2_init_qp_table(struct c2_dev *c2dev) +{ + spin_lock_init(&c2dev->qp_table.lock); + idr_init(&c2dev->qp_table.idr); +} + +void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev) +{ + idr_destroy(&c2dev->qp_table.idr); +} diff --git a/drivers/infiniband/hw/amso1100/c2_user.h b/drivers/infiniband/hw/amso1100/c2_user.h new file mode 100644 index 0000000..7e9e7ad --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_user.h @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_USER_H +#define C2_USER_H + +#include + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct c2_alloc_ucontext_resp { + __u32 qp_tab_size; + __u32 uarc_size; +}; + +struct c2_alloc_pd_resp { + __u32 pdn; + __u32 reserved; +}; + +struct c2_create_cq { + __u32 lkey; + __u32 pdn; + __u64 arm_db_page; + __u64 set_db_page; + __u32 arm_db_index; + __u32 set_db_index; +}; + +struct c2_create_cq_resp { + __u32 cqn; + __u32 reserved; +}; + +struct c2_create_qp { + __u32 lkey; + __u32 reserved; + __u64 sq_db_page; + __u64 rq_db_page; + __u32 sq_db_index; + __u32 rq_db_index; +}; + +#endif /* C2_USER_H */ From swise at opengridcomputing.com Thu Aug 3 14:07:38 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:07:38 -0500 Subject: [openib-general] [PATCH v4 7/7] AMSO1100 Makefiles and Kconfig changes. In-Reply-To: <20060803210723.16572.34829.stgit@dell3.ogc.int> References: <20060803210723.16572.34829.stgit@dell3.ogc.int> Message-ID: <20060803210738.16572.74032.stgit@dell3.ogc.int> --- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/amso1100/Kbuild | 10 ++++++++++ drivers/infiniband/hw/amso1100/Kconfig | 15 +++++++++++++++ drivers/infiniband/hw/amso1100/README | 11 +++++++++++ 5 files changed, 38 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 69a53d4..47f1de5 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -36,6 +36,7 @@ config INFINIBAND_ADDR_TRANS source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/hw/ipath/Kconfig" +source "drivers/infiniband/hw/amso1100/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index c7ff58c..ec330d5 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -1,6 +1,7 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ obj-$(CONFIG_IPATH_CORE) += hw/ipath/ +obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ diff --git a/drivers/infiniband/hw/amso1100/Kbuild b/drivers/infiniband/hw/amso1100/Kbuild new file mode 100644 index 0000000..e1f10ab --- /dev/null +++ b/drivers/infiniband/hw/amso1100/Kbuild @@ -0,0 +1,10 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o + +iw_c2-y := c2.o c2_provider.o c2_rnic.o c2_alloc.o c2_mq.o c2_ae.o c2_vq.o \ + c2_intr.o c2_cq.o c2_qp.o c2_cm.o c2_mm.o c2_pd.o diff --git a/drivers/infiniband/hw/amso1100/Kconfig b/drivers/infiniband/hw/amso1100/Kconfig new file mode 100644 index 0000000..809cb14 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/Kconfig @@ -0,0 +1,15 @@ +config INFINIBAND_AMSO1100 + tristate "Ammasso 1100 HCA support" + depends on PCI && 
INET && INFINIBAND + ---help--- + This is a low-level driver for the Ammasso 1100 host + channel adapter (HCA). + +config INFINIBAND_AMSO1100_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_AMSO1100 + default n + ---help--- + This option causes the amso1100 driver to produce a bunch of + debug messages. Select this if you are developing the driver + or trying to diagnose a problem. diff --git a/drivers/infiniband/hw/amso1100/README b/drivers/infiniband/hw/amso1100/README new file mode 100644 index 0000000..1331353 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/README @@ -0,0 +1,11 @@ +This is the OpenFabrics provider driver for the +AMSO1100 1Gb RNIC adapter. + +This adapter is available in limited quantities +for development purposes from Open Grid Computing. + +This driver requires the IWCM and CMA mods necessary +to support iWARP. + +Contact tom at opengridcomputing.com for more information. + From swise at opengridcomputing.com Thu Aug 3 14:07:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:07:36 -0500 Subject: [openib-general] [PATCH v4 6/7] AMSO1100: Privileged Verbs Queues. In-Reply-To: <20060803210723.16572.34829.stgit@dell3.ogc.int> References: <20060803210723.16572.34829.stgit@dell3.ogc.int> Message-ID: <20060803210736.16572.57805.stgit@dell3.ogc.int> --- drivers/infiniband/hw/amso1100/c2_vq.c | 260 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_vq.h | 63 ++++++++ 2 files changed, 323 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c new file mode 100644 index 0000000..445b1ed --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.c @@ -0,0 +1,260 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include "c2_vq.h" +#include "c2_provider.h" + +/* + * Verbs Request Objects: + * + * VQ Request Objects are allocated by the kernel verbs handlers. + * They contain a wait object, a refcnt, an atomic bool indicating that the + * adapter has replied, and a copy of the verb reply work request. 
+ * A pointer to the VQ Request Object is passed down in the context + * field of the work request message, and reflected back by the adapter + * in the verbs reply message. The function handle_vq() in the interrupt + * path will use this pointer to: + * 1) append a copy of the verbs reply message + * 2) mark that the reply is ready + * 3) wake up the kernel verbs handler blocked awaiting the reply. + * + * + * The kernel verbs handlers do a "get" to put a 2nd reference on the + * VQ Request object. If the kernel verbs handler exits before the adapter + * can respond, this extra reference will keep the VQ Request object around + * until the adapter's reply can be processed. The reason we need this is + * because a pointer to this object is stuffed into the context field of + * the verbs work request message, and reflected back in the reply message. + * It is used in the interrupt handler (handle_vq()) to wake up the appropriate + * kernel verb handler that is blocked awaiting the verb reply. + * So handle_vq() will do a "put" on the object when it's done accessing it. + * NOTE: If we guarantee that the kernel verb handler will never bail before + * getting the reply, then we don't need these refcnts. + * + * + * VQ Request objects are freed by the kernel verbs handlers only + * after the verb has been processed, or when the adapter fails and + * does not reply. + * + * + * Verbs Reply Buffers: + * + * VQ Reply bufs are local host memory copies of an + * outstanding Verb Request reply + * message. They are always allocated by the kernel verbs handlers, and _may_ be + * freed by either the kernel verbs handler -or- the interrupt handler. The + * kernel verbs handler _must_ free the repbuf, then free the vq request object + * in that order. + */ + +int vq_init(struct c2_dev *c2dev) +{ + sprintf(c2dev->vq_cache_name, "c2-vq:dev%c", + (char) ('0' + c2dev->devnum)); + c2dev->host_msg_cache = + kmem_cache_create(c2dev->vq_cache_name, c2dev->rep_vq.msg_size, 0, + SLAB_HWCACHE_ALIGN, NULL, NULL); + if (c2dev->host_msg_cache == NULL) { + return -ENOMEM; + } + return 0; +} + +void vq_term(struct c2_dev *c2dev) +{ + kmem_cache_destroy(c2dev->host_msg_cache); +} + +/* vq_req_alloc - allocate a VQ Request Object and initialize it. + * The refcnt is set to 1. + */ +struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev) +{ + struct c2_vq_req *r; + + r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL); + if (r) { + init_waitqueue_head(&r->wait_object); + r->reply_msg = (u64) NULL; + r->event = 0; + r->cm_id = NULL; + r->qp = NULL; + atomic_set(&r->refcnt, 1); + atomic_set(&r->reply_ready, 0); + } + return r; +} + + +/* vq_req_free - free the VQ Request Object. It is assumed the verbs handler + * has already freed the VQ Reply Buffer if it existed. + */ +void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + r->reply_msg = (u64) NULL; + if (atomic_dec_and_test(&r->refcnt)) { + kfree(r); + } +} + +/* vq_req_get - reference a VQ Request Object. Done + * only in the kernel verbs handlers. + */ +void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + atomic_inc(&r->refcnt); +} + + +/* vq_req_put - dereference and potentially free a VQ Request Object. + * + * This is only called by handle_vq() on the + * interrupt when it is done processing + * a verb reply message. If the associated + * kernel verbs handler has already bailed, + * then this put will actually free the VQ + * Request object _and_ the VQ Reply Buffer + * if it exists.
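+ *
+ * Illustrative sketch (not part of the submitted patch) of how these
+ * helpers are expected to pair up in a kernel verbs handler; the wr
+ * and reply variables are placeholders:
+ *
+ *	vq_req = vq_req_alloc(c2dev);            refcnt == 1
+ *	wr.hdr.context = (unsigned long) vq_req;
+ *	vq_req_get(c2dev, vq_req);               2nd ref, held for the reply path
+ *	err = vq_send_wr(c2dev, (union c2wr *) &wr);
+ *	if (err) {
+ *		vq_req_put(c2dev, vq_req);       no reply will ever arrive
+ *		vq_req_free(c2dev, vq_req);
+ *		return err;
+ *	}
+ *	err = vq_wait_for_reply(c2dev, vq_req);
+ *	reply = (void *) (unsigned long) vq_req->reply_msg;
+ *	... use reply ...
+ *	vq_repbuf_free(c2dev, reply);            free the repbuf first,
+ *	vq_req_free(c2dev, vq_req);              then the request object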
+ */ +void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + if (atomic_dec_and_test(&r->refcnt)) { + if (r->reply_msg != (u64) NULL) + vq_repbuf_free(c2dev, + (void *) (unsigned long) r->reply_msg); + kfree(r); + } +} + + +/* + * vq_repbuf_alloc - allocate a VQ Reply Buffer. + */ +void *vq_repbuf_alloc(struct c2_dev *c2dev) +{ + return kmem_cache_alloc(c2dev->host_msg_cache, SLAB_ATOMIC); +} + +/* + * vq_send_wr - post a verbs request message to the Verbs Request Queue. + * If a message is not available in the MQ, then block until one is available. + * NOTE: handle_mq() on the interrupt context will wake up threads blocked here. + * When the adapter drains the Verbs Request Queue, + * it inserts MQ index 0 in to the + * adapter->host activity fifo and interrupts the host. + */ +int vq_send_wr(struct c2_dev *c2dev, union c2wr *wr) +{ + void *msg; + wait_queue_t __wait; + + /* + * grab adapter vq lock + */ + spin_lock(&c2dev->vqlock); + + /* + * allocate msg + */ + msg = c2_mq_alloc(&c2dev->req_vq); + + /* + * If we cannot get a msg, then we'll wait + * When a messages are available, the int handler will wake_up() + * any waiters. + */ + while (msg == NULL) { + pr_debug("%s:%d no available msg in VQ, waiting...\n", + __FUNCTION__, __LINE__); + init_waitqueue_entry(&__wait, current); + add_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_unlock(&c2dev->vqlock); + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + if (!c2_mq_full(&c2dev->req_vq)) { + break; + } + if (!signal_pending(current)) { + schedule_timeout(1 * HZ); /* 1 second... */ + continue; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + return -EINTR; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_lock(&c2dev->vqlock); + msg = c2_mq_alloc(&c2dev->req_vq); + } + + /* + * copy wr into adapter msg + */ + memcpy(msg, wr, c2dev->req_vq.msg_size); + + /* + * post msg + */ + c2_mq_produce(&c2dev->req_vq); + + /* + * release adapter vq lock + */ + spin_unlock(&c2dev->vqlock); + return 0; +} + + +/* + * vq_wait_for_reply - block until the adapter posts a Verb Reply Message. + */ +int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req) +{ + if (!wait_event_timeout(req->wait_object, + atomic_read(&req->reply_ready), + 60*HZ)) + return -ETIMEDOUT; + + return 0; +} + +/* + * vq_repbuf_free - Free a Verbs Reply Buffer. + */ +void vq_repbuf_free(struct c2_dev *c2dev, void *reply) +{ + kmem_cache_free(c2dev->host_msg_cache, reply); +} diff --git a/drivers/infiniband/hw/amso1100/c2_vq.h b/drivers/infiniband/hw/amso1100/c2_vq.h new file mode 100644 index 0000000..3380562 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_VQ_H_ +#define _C2_VQ_H_ +#include +#include "c2.h" +#include "c2_wr.h" +#include "c2_provider.h" + +struct c2_vq_req { + u64 reply_msg; /* ptr to reply msg */ + wait_queue_head_t wait_object; /* wait object for vq reqs */ + atomic_t reply_ready; /* set when reply is ready */ + atomic_t refcnt; /* used to cancel WRs... */ + int event; + struct iw_cm_id *cm_id; + struct c2_qp *qp; +}; + +extern int vq_init(struct c2_dev *c2dev); +extern void vq_term(struct c2_dev *c2dev); + +extern struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev); +extern void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *req); +extern void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *req); +extern void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *req); +extern int vq_send_wr(struct c2_dev *c2dev, union c2wr * wr); + +extern void *vq_repbuf_alloc(struct c2_dev *c2dev); +extern void vq_repbuf_free(struct c2_dev *c2dev, void *reply); + +extern int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req); +#endif /* _C2_VQ_H_ */ From swise at opengridcomputing.com Thu Aug 3 14:07:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 Aug 2006 16:07:34 -0500 Subject: [openib-general] [PATCH v4 5/7] AMSO1100 Message Queues. In-Reply-To: <20060803210723.16572.34829.stgit@dell3.ogc.int> References: <20060803210723.16572.34829.stgit@dell3.ogc.int> Message-ID: <20060803210734.16572.82940.stgit@dell3.ogc.int> --- drivers/infiniband/hw/amso1100/c2_mq.c | 175 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_mq.h | 107 ++++++++++++++++++++ 2 files changed, 282 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_mq.c b/drivers/infiniband/hw/amso1100/c2_mq.c new file mode 100644 index 0000000..96bbe9a --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.c @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_mq.h" + +void *c2_mq_alloc(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (c2_mq_full(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = + (struct c2wr_hdr *) (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(~CCWR_MAGIC)); + m->magic = cpu_to_be32(CCWR_MAGIC); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_produce(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (!c2_mq_full(q)) { + q->priv = (q->priv + 1) % q->q_size; + q->hint_count++; + /* Update peer's offset. */ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + +void *c2_mq_consume(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (c2_mq_empty(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = (struct c2wr_hdr *) + (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(CCWR_MAGIC)); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_free(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (!c2_mq_empty(q)) { + +#ifdef CCMSGMAGIC + { + struct c2wr_hdr __iomem *m = (struct c2wr_hdr __iomem *) + (q->msg_pool.adapter + q->priv * q->msg_size); + __raw_writel(cpu_to_be32(~CCWR_MAGIC), &m->magic); + } +#endif + q->priv = (q->priv + 1) % q->q_size; + /* Update peer's offset. */ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + + +void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + while (wqe_count--) { + BUG_ON(c2_mq_empty(q)); + *q->shared = cpu_to_be16((be16_to_cpu(*q->shared)+1) % q->q_size); + } +} + + +u32 c2_mq_count(struct c2_mq *q) +{ + s32 count; + + if (q->type == C2_MQ_HOST_TARGET) { + count = be16_to_cpu(*q->shared) - q->priv; + } else { + count = q->priv - be16_to_cpu(*q->shared); + } + + if (count < 0) { + count += q->q_size; + } + + return (u32) count; +} + +void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! */ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.adapter = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} +void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! 
*/ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.host = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mq.h b/drivers/infiniband/hw/amso1100/c2_mq.h new file mode 100644 index 0000000..9b1296e --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.h @@ -0,0 +1,107 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef _C2_MQ_H_ +#define _C2_MQ_H_ +#include +#include +#include "c2_wr.h" + +enum c2_shared_regs { + + C2_SHARED_ARMED = 0x10, + C2_SHARED_NOTIFY = 0x18, + C2_SHARED_SHARED = 0x40, +}; + +struct c2_mq_shared { + u16 unused1; + u8 armed; + u8 notification_type; + u32 unused2; + u16 shared; + /* Pad to 64 bytes. */ + u8 pad[64 - sizeof(u16) - 2 * sizeof(u8) - sizeof(u32) - sizeof(u16)]; +}; + +enum c2_mq_type { + C2_MQ_HOST_TARGET = 1, + C2_MQ_ADAPTER_TARGET = 2, +}; + +/* + * c2_mq_t is for kernel-mode MQs like the VQs Cand the AEQ. + * c2_user_mq_t (which is the same format) is for user-mode MQs... 
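+ *
+ * Rough sketch of the index convention (inferred from the
+ * c2_mq_empty()/c2_mq_full() helpers below, not from any separate
+ * documentation): 'priv' is this side's index, '*shared' is the
+ * peer's, and one slot is always left unused, so
+ *
+ *	empty:	priv == be16_to_cpu(*shared)
+ *	full:	priv == (be16_to_cpu(*shared) + q_size - 1) % q_size
+ *
+ * e.g. with q_size == 4 and *shared == 1, the queue is full once
+ * priv wraps around to 0.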
+ */ +#define C2_MQ_MAGIC 0x4d512020 /* 'MQ ' */ +struct c2_mq { + u32 magic; + union { + u8 *host; + u8 __iomem *adapter; + } msg_pool; + dma_addr_t host_dma; + DECLARE_PCI_UNMAP_ADDR(mapping); + u16 hint_count; + u16 priv; + struct c2_mq_shared __iomem *peer; + u16 *shared; + dma_addr_t shared_dma; + u32 q_size; + u32 msg_size; + u32 index; + enum c2_mq_type type; +}; + +static __inline__ int c2_mq_empty(struct c2_mq *q) +{ + return q->priv == be16_to_cpu(*q->shared); +} + +static __inline__ int c2_mq_full(struct c2_mq *q) +{ + return q->priv == (be16_to_cpu(*q->shared) + q->q_size - 1) % q->q_size; +} + +extern void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count); +extern void *c2_mq_alloc(struct c2_mq *q); +extern void c2_mq_produce(struct c2_mq *q); +extern void *c2_mq_consume(struct c2_mq *q); +extern void c2_mq_free(struct c2_mq *q); +extern u32 c2_mq_count(struct c2_mq *q); +extern void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type); +extern void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type); + +#endif /* _C2_MQ_H_ */ From mst at mellanox.co.il Thu Aug 3 15:13:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 Aug 2006 01:13:13 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: References: Message-ID: <20060803221313.GA10301@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Fwd: issues in ipoib > > Michael> No, I think there's no problem - user callback has > Michael> finished running and that's all we care about. > > I don't think so -- because then the unregister call can return and > the client module free the client struct before the wake_up() runs, > which leads to use-after-free. OK, here's a version using completion which should address this issue. How does this look? -- Require registration with SA module, to prevent module text from going away while sa query callback is still running, and update all users. Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d6f99d5..bf668b3 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -60,6 +60,10 @@ static struct ib_client cma_client = { .remove = cma_remove_one }; +static struct ib_sa_client cma_sa_client = { + .name = "cma" +}; + static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); @@ -1140,7 +1144,7 @@ static int cma_query_ib_route(struct rdm path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; - id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->query_id = ib_sa_path_rec_get(&cma_sa_client, id_priv->id.device, id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, @@ -1910,6 +1914,8 @@ static int cma_init(void) ret = ib_register_client(&cma_client); if (ret) goto err; + + ib_sa_register_client(&cma_sa_client); return 0; err: @@ -1919,6 +1925,7 @@ err: static void cma_cleanup(void) { + ib_sa_unregister_client(&cma_sa_client); ib_unregister_client(&cma_client); destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index aeda484..ea03677 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -78,6 +78,7 @@ struct ib_sa_query { struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; + struct ib_sa_client *client; int id; }; @@ -532,6 +533,17 @@ retry: return ret ? ret : id; } +static inline void ib_sa_client_get(struct ib_sa_client *client) +{ + atomic_inc(&client->users); +} + +static inline void ib_sa_client_put(struct ib_sa_client *client) +{ + if (atomic_dec_and_test(&client->users)) + complete(&client->completion); +} + static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, int status, struct ib_sa_mad *mad) @@ -539,6 +551,7 @@ static void ib_sa_path_rec_callback(stru struct ib_sa_path_query *query = container_of(sa_query, struct ib_sa_path_query, sa_query); + ib_sa_client_get(sa_query->client); if (mad) { struct ib_sa_path_rec rec; @@ -547,6 +560,7 @@ static void ib_sa_path_rec_callback(stru query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query->client); } static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) @@ -556,6 +570,7 @@ static void ib_sa_path_rec_release(struc /** * ib_sa_path_rec_get - Start a Path get query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:Path Record to send in query @@ -578,7 +593,8 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. */ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -619,6 +635,7 @@ int ib_sa_path_rec_get(struct ib_device mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; query->sa_query.port = port; @@ -653,6 +670,7 @@ static void ib_sa_service_rec_callback(s struct ib_sa_service_query *query = container_of(sa_query, struct ib_sa_service_query, sa_query); + ib_sa_client_get(sa_query->client); if (mad) { struct ib_sa_service_rec rec; @@ -661,6 +679,7 @@ static void ib_sa_service_rec_callback(s query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query->client); } static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) @@ -670,6 +689,7 @@ static void ib_sa_service_rec_release(st /** * ib_sa_service_rec_query - Start Service Record operation + * @client:client object used to track the query * @device:device to send request on * @port_num: port number to send request on * @method:SA method - should be get, set, or delete @@ -694,7 +714,8 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. */ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -740,6 +761,7 @@ int ib_sa_service_rec_query(struct ib_de mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; query->sa_query.port = port; @@ -775,6 +797,7 @@ static void ib_sa_mcmember_rec_callback( struct ib_sa_mcmember_query *query = container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + ib_sa_client_get(sa_query->client); if (mad) { struct ib_sa_mcmember_rec rec; @@ -783,6 +806,7 @@ static void ib_sa_mcmember_rec_callback( query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query->client); } static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) @@ -790,7 +814,8 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -832,6 +857,7 @@ int ib_sa_mcmember_rec_query(struct ib_d mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + query->sa_query.client = client; query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; query->sa_query.port = port; @@ -866,6 +892,7 @@ static void send_handler(struct ib_mad_a struct ib_sa_query *query = mad_send_wc->send_buf->context[0]; unsigned long flags; + ib_sa_client_get(query->client); if (query->callback) switch (mad_send_wc->status) { case IB_WC_SUCCESS: @@ -881,6 +908,7 @@ static void send_handler(struct ib_mad_a query->callback(query, -EIO, NULL); break; } + ib_sa_client_put(query->client); spin_lock_irqsave(&idr_lock, flags); idr_remove(&query_idr, query->id); @@ -900,6 +928,7 @@ static void recv_handler(struct ib_mad_a mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; query = mad_buf->context[0]; + ib_sa_client_get(query->client); if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, @@ -909,6 +938,7 @@ static void recv_handler(struct ib_mad_a else query->callback(query, -EIO, NULL); } + ib_sa_client_put(query->client); ib_free_recv_mad(mad_recv_wc); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 474aa21..28a9f0f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -390,4 +390,5 @@ #define IPOIB_GID_RAW_ARG(gid) ((u8 *)(g #define IPOIB_GID_ARG(gid) IPOIB_GID_RAW_ARG((gid).raw) +extern struct ib_sa_client ipoib_sa_client; #endif /* _IPOIB_H */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cf71d2a..ca10724 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -91,6 +91,10 @@ static struct ib_client ipoib_client = { .remove = ipoib_remove_one }; +struct ib_sa_client ipoib_sa_client = { + .name = "ipoib" +}; + int ipoib_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -459,7 +463,7 @@ static int path_rec_start(struct net_dev init_completion(&path->done); path->query_id = - ib_sa_path_rec_get(priv->ca, priv->port, + ib_sa_path_rec_get(&ipoib_sa_client, priv->ca, priv->port, &path->pathrec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | @@ -1185,6 +1189,8 @@ static int __init ipoib_init_module(void if (ret) goto err_wq; + ib_sa_register_client(&ipoib_sa_client); + return 0; err_wq: @@ -1198,6 +1204,7 @@ err_fs: static void __exit ipoib_cleanup_module(void) { + ib_sa_unregister_client(&ipoib_sa_client); ib_unregister_client(&ipoib_client); ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index b5e6a7b..f688323 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -360,7 +360,7 @@ #endif init_completion(&mcast->done); - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | @@ -484,8 +484,8 @@ static void ipoib_mcast_join(struct net_ init_completion(&mcast->done); - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, - mcast->backoff * 1000, GFP_ATOMIC, + ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, + comp_mask, mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_join_complete, mcast, &mcast->query); @@ -680,7 +680,7 @@ static int ipoib_mcast_leave(struct net_ * Just make one shot at leaving and don't wait for a reply; * if 
we fail, too bad. */ - ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + ret = ib_sa_mcmember_rec_delete(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..0856d78 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -88,6 +88,10 @@ static struct ib_client srp_client = { .remove = srp_remove_one }; +static struct ib_sa_client srp_sa_client = { + .name = "srp" +}; + static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) { return (struct srp_target_port *) host->hostdata; @@ -259,7 +263,8 @@ static int srp_lookup_path(struct srp_ta init_completion(&target->done); - target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev->dev, + target->path_query_id = ib_sa_path_rec_get(&srp_sa_client, + target->srp_host->dev->dev, target->srp_host->port, &target->path, IB_SA_PATH_REC_DGID | diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index c99e442..084baf4 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -37,6 +37,8 @@ #ifndef IB_SA_H #define IB_SA_H #include +#include +#include #include #include @@ -250,11 +252,18 @@ struct ib_sa_service_rec { u64 data64[2]; }; +struct ib_sa_client { + char *name; + atomic_t users; + struct completion completion; +}; + struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -264,7 +273,8 @@ int ib_sa_path_rec_get(struct ib_device void *context, struct ib_sa_query **query); -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -275,7 +285,8 @@ int ib_sa_mcmember_rec_query(struct ib_d void *context, struct ib_sa_query **query); -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, @@ -288,6 +299,7 @@ int ib_sa_service_rec_query(struct ib_de /** * ib_sa_mcmember_rec_set - Start an MCMember set query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -311,7 +323,8 @@ int ib_sa_service_rec_query(struct ib_de * cancel the query. 
*/ static inline int -ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_set(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -321,7 +334,7 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, timeout_ms, gfp_mask, callback, @@ -330,6 +343,7 @@ ib_sa_mcmember_rec_set(struct ib_device /** * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -353,7 +367,8 @@ ib_sa_mcmember_rec_set(struct ib_device * cancel the query. */ static inline int -ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_delete(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -363,7 +378,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, timeout_ms, gfp_mask, callback, @@ -378,4 +393,25 @@ int ib_init_ah_from_path(struct ib_devic struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); +/** + * ib_sa_register_client - register SA client object + * @client:client object used to track queries + */ +static inline void ib_sa_register_client(struct ib_sa_client *client) +{ + atomic_set(&client->users, 1); + init_completion(&client->completion); +} + +/** + * ib_sa_unregister_client - unregister SA client object + * @client:client object used to track queries + */ +static inline void ib_sa_unregister_client(struct ib_sa_client *client) +{ + if (atomic_dec_and_test(&client->users)) + complete(&client->completion); + wait_for_completion(&client->completion); +} + #endif /* IB_SA_H */ -- MST From sean.hefty at intel.com Thu Aug 3 15:25:07 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 15:25:07 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060803221313.GA10301@mellanox.co.il> Message-ID: <000201c6b74b$aced4960$e598070a@amr.corp.intel.com> > static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, > int status, > struct ib_sa_mad *mad) >@@ -539,6 +551,7 @@ static void ib_sa_path_rec_callback(stru > struct ib_sa_path_query *query = > container_of(sa_query, struct ib_sa_path_query, sa_query); > >+ ib_sa_client_get(sa_query->client); It makes more sense to me to increment the reference count when the query is initiated. We don't know that sa_query->client is still valid here unless a reference was taken earlier. I like this approach, but would like to see it expanded to track the requests, to avoid duplicating this work in every client. Unregistration would then cancel all outstanding queries issued by the user. 
- Sean From sean.hefty at intel.com Thu Aug 3 15:39:54 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 15:39:54 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000201c6b74b$aced4960$e598070a@amr.corp.intel.com> Message-ID: <000301c6b74d$bd7082f0$e598070a@amr.corp.intel.com> >I like this approach, but would like to see it expanded to track the requests, >to avoid duplicating this work in every client. Unregistration would then >cancel all outstanding queries issued by the user. To add to this, I'm trying to expose the SA interfaces in userspace, and it would be ideal to have each userspace application contained to a single ib_sa_client. - Sean From mst at mellanox.co.il Thu Aug 3 16:01:40 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 Aug 2006 02:01:40 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000201c6b74b$aced4960$e598070a@amr.corp.intel.com> References: <000201c6b74b$aced4960$e598070a@amr.corp.intel.com> Message-ID: <20060803230140.GA10604@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: Fwd: issues in ipoib > > > static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, > > int status, > > struct ib_sa_mad *mad) > >@@ -539,6 +551,7 @@ static void ib_sa_path_rec_callback(stru > > struct ib_sa_path_query *query = > > container_of(sa_query, struct ib_sa_path_query, sa_query); > > > >+ ib_sa_client_get(sa_query->client); > > It makes more sense to me to increment the reference count when the query is > initiated. We can do that, I don't have a strong opinion either way. Roland? > We don't know that sa_query->client is still valid here unless a > reference was taken earlier. Yes we do. Client must wait till all queries complete before unregistering. > I like this approach, but would like to see it expanded to track the requests, > to avoid duplicating this work in every client. Unregistration would then > cancel all outstanding queries issued by the user. I would not object to this on principle, but let's go there by small steps - for now, let's get the API fixed and solve the race with module unloading that we have for 2.6.18. OK? -- MST From greg.lindahl at qlogic.com Thu Aug 3 16:05:00 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 3 Aug 2006 16:05:00 -0700 Subject: [openib-general] [PATCH 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: <20060801112544.GA20058@lst.de> References: <000001c6b40d$c2b7ae70$cdcc180a@amr.corp.intel.com> <1154357568.1066.4.camel@stevo-desktop> <20060731151538.GA23392@greglaptop.hsd1.ca.comcast.net> <1154359451.3078.4.camel@stevo-desktop> <20060731162333.GD23392@greglaptop.hsd1.ca.comcast.net> <20060731173742.GF1098@greglaptop.internal.keyresearch.com> <20060801112544.GA20058@lst.de> Message-ID: <20060803230500.GD1857@greglaptop.internal.keyresearch.com> On Tue, Aug 01, 2006 at 01:25:44PM +0200, Christoph Hellwig wrote: > Exactly. Please don't even try to put brand names (especially if > they're as stupid as this) in. We don't call our wireless stack > centrino just because intel contributed to it either. Centrino: Intel-only brand name WiFi: trade association brand-name, not joined by all players 802.11{a,b,g}: technical name of technologies wireless: an overly generic name that people might think should include bluetooth, wireless usb, etc etc. OpenFabrics is not a single company brand name, it is the name of the community that's actually implementing this software stack, like 'Gnome' or 'KDE'. 
BTW, I've had meetings with about 5 startups that began like, "We have an rdma device, but it's not actually RDMA as defined by that IEEE Committee." And these devices don't work like that definition. So there's considerable difference of opinion as to what RDMA means. -- greg From rdreier at cisco.com Thu Aug 3 16:07:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 16:07:29 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060803230140.GA10604@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 4 Aug 2006 02:01:40 +0300") References: <000201c6b74b$aced4960$e598070a@amr.corp.intel.com> <20060803230140.GA10604@mellanox.co.il> Message-ID: Michael> We can do that, I don't have a strong opinion either way. Michael> Roland? There's a similar unfixable race with trying to count queries this way: there's a window between when a client module calls the query function and when the sa_query module actually increments the reference count. So I think it's better to keep track of what we can control in the sa_query module, and leave clients responsible for making sure they wait for all queries that they started. - R. From mst at mellanox.co.il Thu Aug 3 16:07:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 Aug 2006 02:07:47 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000301c6b74d$bd7082f0$e598070a@amr.corp.intel.com> References: <000301c6b74d$bd7082f0$e598070a@amr.corp.intel.com> Message-ID: <20060803230747.GB10604@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: [openib-general] Fwd: issues in ipoib > > >I like this approach, but would like to see it expanded to track the requests, > >to avoid duplicating this work in every client. Unregistration would then > >cancel all outstanding queries issued by the user. > > To add to this, I'm trying to expose the SA interfaces in userspace, and it > would be ideal to have each userspace application contained to a single > ib_sa_client. Is this for some kind of diagnostic/management tool? It seems ultimately a wrong thing to expose to unpriviledged users. -- MST From sean.hefty at intel.com Thu Aug 3 16:12:03 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 16:12:03 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060803230747.GB10604@mellanox.co.il> Message-ID: <000401c6b752$3b1cc3e0$e598070a@amr.corp.intel.com> >Is this for some kind of diagnostic/management tool? >It seems ultimately a wrong thing to expose to unpriviledged users. It's required in order to establish a connection. How would you propose userspace applications get their path records? - Sean From sean.hefty at intel.com Thu Aug 3 16:17:30 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 16:17:30 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060803230140.GA10604@mellanox.co.il> Message-ID: <000501c6b752$fe03eff0$e598070a@amr.corp.intel.com> >> We don't know that sa_query->client is still valid here unless a >> reference was taken earlier. > >Yes we do. >Client must wait till all queries complete before unregistering. > >> I like this approach, but would like to see it expanded to track the >requests, >> to avoid duplicating this work in every client. Unregistration would then >> cancel all outstanding queries issued by the user. > >I would not object to this on principle, but let's go there by small >steps - for now, let's get the API fixed and solve the race with module >unloading that we have for 2.6.18. OK? 
I don't see the need to rush for 2.6.18. No one is hitting this problem, and it still applies to the ib_cm, ib_addr, and rdma_cm. I'm working on the userspace SA interface now, which means that I'll end up re-working the patch in a week. Tracking queries will require changes to the structure that are best hidden from the users. Hiding those changes requires reworking the proposed ib_sa_register_client API to create and return struct ib_sa_client, rather than it being provided by the caller. - Sean From sean.hefty at intel.com Thu Aug 3 16:20:25 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 16:20:25 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: Message-ID: <000601c6b753$666fa840$e598070a@amr.corp.intel.com> >There's a similar unfixable race with trying to count queries this >way: there's a window between when a client module calls the query >function and when the sa_query module actually increments the >reference count. So I think it's better to keep track of what we can >control in the sa_query module, and leave clients responsible for >making sure they wait for all queries that they started. Can you clarify which way you meant with "this" way? Are you referring to the patch Michael submitted, or incrementing the count when a query function is called? - Sean From mst at mellanox.co.il Thu Aug 3 16:31:40 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 Aug 2006 02:31:40 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000401c6b752$3b1cc3e0$e598070a@amr.corp.intel.com> References: <000401c6b752$3b1cc3e0$e598070a@amr.corp.intel.com> Message-ID: <20060803233140.GA10934@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: [openib-general] Fwd: issues in ipoib > > >Is this for some kind of diagnostic/management tool? > >It seems ultimately a wrong thing to expose to unpriviledged users. > > It's required in order to establish a connection. How would you propose > userspace applications get their path records? > CMA? Or something along these lines ... -- MST From rdreier at cisco.com Thu Aug 3 16:34:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 16:34:12 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000601c6b753$666fa840$e598070a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 3 Aug 2006 16:20:25 -0700") References: <000601c6b753$666fa840$e598070a@amr.corp.intel.com> Message-ID: Sean> Can you clarify which way you meant with "this" way? Are Sean> you referring to the patch Michael submitted, or Sean> incrementing the count when a query function is called? Incrementing the count inside sa_query when a query function is called. No matter what you do, there's always that window between "started to call a query function" and "bumped the reference count" when another thread might get you in trouble. - R. From sean.hefty at intel.com Thu Aug 3 16:39:57 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 16:39:57 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: Message-ID: <000701c6b756$2100f860$e598070a@amr.corp.intel.com> >Incrementing the count inside sa_query when a query function is >called. No matter what you do, there's always that window between >"started to call a query function" and "bumped the reference count" >when another thread might get you in trouble. I'm not seeing the problem yet. A user can't issue a query function at the same time they call deregister. There's nothing we can ever do against that. 
Is this what you're referring to? - Sean From rdreier at cisco.com Thu Aug 3 16:48:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 16:48:45 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000701c6b756$2100f860$e598070a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 3 Aug 2006 16:39:57 -0700") References: <000701c6b756$2100f860$e598070a@amr.corp.intel.com> Message-ID: Sean> I'm not seeing the problem yet. A user can't issue a query Sean> function at the same time they call deregister. There's Sean> nothing we can ever do against that. Is this what you're Sean> referring to? Yes. But if the user already has to keep track of when the deregister of its SA client starts, then what is gained by taking a reference when a query starts? - R. From rdreier at cisco.com Thu Aug 3 16:49:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 Aug 2006 16:49:59 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000501c6b752$fe03eff0$e598070a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 3 Aug 2006 16:17:30 -0700") References: <000501c6b752$fe03eff0$e598070a@amr.corp.intel.com> Message-ID: Sean> I don't see the need to rush for 2.6.18. No one is hitting Sean> this problem, and it still applies to the ib_cm, ib_addr, Sean> and rdma_cm. I have to agree -- this is a pretty big change, and as far as I know there is no evidence that anyone has ever hit this race in the real world. - R. From mst at mellanox.co.il Thu Aug 3 16:52:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 Aug 2006 02:52:54 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000501c6b752$fe03eff0$e598070a@amr.corp.intel.com> References: <000501c6b752$fe03eff0$e598070a@amr.corp.intel.com> Message-ID: <20060803235254.GB10934@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: Fwd: issues in ipoib > > >> We don't know that sa_query->client is still valid here unless a > >> reference was taken earlier. > > > >Yes we do. > >Client must wait till all queries complete before unregistering. > > > >> I like this approach, but would like to see it expanded to track the > >requests, > >> to avoid duplicating this work in every client. Unregistration would then > >> cancel all outstanding queries issued by the user. > > > >I would not object to this on principle, but let's go there by small > >steps - for now, let's get the API fixed and solve the race with module > >unloading that we have for 2.6.18. OK? > > I don't see the need to rush for 2.6.18. No one is hitting this problem, Well, I know OFED 1.0 shipped with a different fix for this race so that's one reason people do not complain :) I don't think we should leave races that we know about and that are easy to fix in development kernels - these are not -stable rules. When there's a module unload crash at customer's site, I do not want to spend time trying to puzzle out whether this could or could not be related to this window - I want to have it covered, and think about something else. > and it > still applies to the ib_cm, ib_addr, and rdma_cm. > This race only applies to ib_addr - CM cleans up after itself flushing out callbacks when CM ID is destroyed. As you can see the patch is a trivial fix - it's just too late in the night here to code more, but I will do this on Sunday Roland, what is your stance? Can this fix be merged for 2.6.18? > re-working the patch in a week. Tracking queries will require changes to the > structure that are best hidden from the users. 
Hiding those changes requires > reworking the proposed ib_sa_register_client API to create and return struct > ib_sa_client, rather than it being provided by the caller. In kernel API's need not be stable. Since the code is not even yet written, why try to anticipate its needs now? -- MST From sean.hefty at intel.com Thu Aug 3 17:02:56 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 17:02:56 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: Message-ID: <000801c6b759$56e74fd0$e598070a@amr.corp.intel.com> >Yes. But if the user already has to keep track of when the deregister >of its SA client starts, then what is gained by taking a reference >when a query starts? It seems cleaner to me. When a user calls query(), they provide a pointer to their sa_client. We take a reference to that pointer and store it in the sa_query structure. We now expect that pointer to be valid at a later point, so we can increment the reference count on it. Why not increment the reference count when we take the actual reference and save off the pointer? The benefit is that when the user later tries to deregister, deregister will block while there's an outstanding query. This eliminates the need for clients to track their queries, cancel all of them, then wait for them to complete before calling deregister - which would involve another reference count and completion structure on the part of the client. Thinking about this more, I can see where a user would want to create one struct ib_sa_client per device to simplify their life handling device removal. - Sean From mst at mellanox.co.il Thu Aug 3 17:12:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 Aug 2006 03:12:29 +0300 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <000801c6b759$56e74fd0$e598070a@amr.corp.intel.com> References: <000801c6b759$56e74fd0$e598070a@amr.corp.intel.com> Message-ID: <20060804001229.GA11296@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: Fwd: issues in ipoib > > >Yes. But if the user already has to keep track of when the deregister > >of its SA client starts, then what is gained by taking a reference > >when a query starts? > > It seems cleaner to me. When a user calls query(), they provide a pointer to > their sa_client. We take a reference to that pointer and store it in the > sa_query structure. We now expect that pointer to be valid at a later point, so > we can increment the reference count on it. Why not increment the reference > count when we take the actual reference and save off the pointer? > > The benefit is that when the user later tries to deregister, deregister will > block while there's an outstanding query. This eliminates the need for clients > to track their queries, cancel all of them, then wait for them to complete > before calling deregister - which would involve another reference count and > completion structure on the part of the client. > > Thinking about this more, I can see where a user would want to create one struct > ib_sa_client per device to simplify their life handling device removal. > > - Sean > Here's a patch for you to play with. I'll drop this for now. -- Require registration with SA module, to prevent module text from going away while sa query callback is still running, and update all users. Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d6f99d5..bf668b3 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -60,6 +60,10 @@ static struct ib_client cma_client = { .remove = cma_remove_one }; +static struct ib_sa_client cma_sa_client = { + .name = "cma" +}; + static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); @@ -1140,7 +1144,7 @@ static int cma_query_ib_route(struct rdm path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; - id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->query_id = ib_sa_path_rec_get(&cma_sa_client, id_priv->id.device, id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, @@ -1910,6 +1914,8 @@ static int cma_init(void) ret = ib_register_client(&cma_client); if (ret) goto err; + + ib_sa_register_client(&cma_sa_client); return 0; err: @@ -1919,6 +1925,7 @@ err: static void cma_cleanup(void) { + ib_sa_unregister_client(&cma_sa_client); ib_unregister_client(&cma_client); destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index aeda484..43b0323 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -78,6 +78,7 @@ struct ib_sa_query { struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; + struct ib_sa_client *client; int id; }; @@ -532,6 +533,20 @@ retry: return ret ? ret : id; } +static inline void ib_sa_client_get(struct ib_sa_query *query, + struct ib_sa_client *client) +{ + query->client = client; + atomic_inc(&client->users); +} + +static inline void ib_sa_client_put(struct ib_sa_query *query) +{ + struct ib_sa_client *client = query->client; + if (atomic_dec_and_test(&client->users)) + complete(&client->completion); +} + static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, int status, struct ib_sa_mad *mad) @@ -547,6 +562,7 @@ static void ib_sa_path_rec_callback(stru query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query); } static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) @@ -556,6 +572,7 @@ static void ib_sa_path_rec_release(struc /** * ib_sa_path_rec_get - Start a Path get query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:Path Record to send in query @@ -578,7 +595,8 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. */ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -619,6 +637,7 @@ int ib_sa_path_rec_get(struct ib_device mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; query->sa_query.port = port; @@ -638,6 +657,7 @@ int ib_sa_path_rec_get(struct ib_device err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -661,6 +681,7 @@ static void ib_sa_service_rec_callback(s query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query); } static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) @@ -670,6 +691,7 @@ static void ib_sa_service_rec_release(st /** * ib_sa_service_rec_query - Start Service Record operation + * @client:client object used to track the query * @device:device to send request on * @port_num: port number to send request on * @method:SA method - should be get, set, or delete @@ -694,7 +716,8 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. */ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -740,6 +763,7 @@ int ib_sa_service_rec_query(struct ib_de mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; query->sa_query.port = port; @@ -760,6 +784,7 @@ int ib_sa_service_rec_query(struct ib_de err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -783,6 +808,7 @@ static void ib_sa_mcmember_rec_callback( query->callback(status, &rec, query->context); } else query->callback(status, NULL, query->context); + ib_sa_client_put(sa_query); } static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) @@ -790,7 +816,8 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -832,6 +859,7 @@ int ib_sa_mcmember_rec_query(struct ib_d mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; query->sa_query.port = port; @@ -852,6 +880,7 @@ int ib_sa_mcmember_rec_query(struct ib_d err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -881,6 +910,7 @@ static void send_handler(struct ib_mad_a query->callback(query, -EIO, NULL); break; } + ib_sa_client_put(query); spin_lock_irqsave(&idr_lock, flags); idr_remove(&query_idr, query->id); @@ -909,7 +939,7 @@ static void recv_handler(struct ib_mad_a else query->callback(query, -EIO, NULL); } - + ib_sa_client_put(query); ib_free_recv_mad(mad_recv_wc); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 474aa21..28a9f0f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -390,4 +390,5 @@ #define IPOIB_GID_RAW_ARG(gid) ((u8 *)(g #define IPOIB_GID_ARG(gid) IPOIB_GID_RAW_ARG((gid).raw) +extern struct ib_sa_client ipoib_sa_client; #endif /* _IPOIB_H */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cf71d2a..ca10724 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -91,6 +91,10 @@ static struct ib_client ipoib_client = { .remove = ipoib_remove_one }; +struct ib_sa_client ipoib_sa_client = { + .name = "ipoib" +}; + int ipoib_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -459,7 +463,7 @@ static int path_rec_start(struct net_dev init_completion(&path->done); path->query_id = - ib_sa_path_rec_get(priv->ca, priv->port, + ib_sa_path_rec_get(&ipoib_sa_client, priv->ca, priv->port, &path->pathrec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | @@ -1185,6 +1189,8 @@ static int __init ipoib_init_module(void if (ret) goto err_wq; + ib_sa_register_client(&ipoib_sa_client); + return 0; err_wq: @@ -1198,6 +1204,7 @@ err_fs: static void __exit ipoib_cleanup_module(void) { + ib_sa_unregister_client(&ipoib_sa_client); ib_unregister_client(&ipoib_client); ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index b5e6a7b..f688323 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -360,7 +360,7 @@ #endif init_completion(&mcast->done); - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | @@ -484,8 +484,8 @@ static void ipoib_mcast_join(struct net_ init_completion(&mcast->done); - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, - mcast->backoff * 1000, GFP_ATOMIC, + ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, + comp_mask, mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_join_complete, mcast, &mcast->query); @@ -680,7 +680,7 @@ static int ipoib_mcast_leave(struct net_ * Just make one shot at leaving and don't wait for a reply; * if we fail, too bad. 
*/ - ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + ret = ib_sa_mcmember_rec_delete(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8f472e7..0856d78 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -88,6 +88,10 @@ static struct ib_client srp_client = { .remove = srp_remove_one }; +static struct ib_sa_client srp_sa_client = { + .name = "srp" +}; + static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) { return (struct srp_target_port *) host->hostdata; @@ -259,7 +263,8 @@ static int srp_lookup_path(struct srp_ta init_completion(&target->done); - target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev->dev, + target->path_query_id = ib_sa_path_rec_get(&srp_sa_client, + target->srp_host->dev->dev, target->srp_host->port, &target->path, IB_SA_PATH_REC_DGID | diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index c99e442..084baf4 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -37,6 +37,8 @@ #ifndef IB_SA_H #define IB_SA_H #include +#include +#include #include #include @@ -250,11 +252,18 @@ struct ib_sa_service_rec { u64 data64[2]; }; +struct ib_sa_client { + char *name; + atomic_t users; + struct completion completion; +}; + struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -264,7 +273,8 @@ int ib_sa_path_rec_get(struct ib_device void *context, struct ib_sa_query **query); -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -275,7 +285,8 @@ int ib_sa_mcmember_rec_query(struct ib_d void *context, struct ib_sa_query **query); -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, @@ -288,6 +299,7 @@ int ib_sa_service_rec_query(struct ib_de /** * ib_sa_mcmember_rec_set - Start an MCMember set query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -311,7 +323,8 @@ int ib_sa_service_rec_query(struct ib_de * cancel the query. 
*/ static inline int -ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_set(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -321,7 +334,7 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, timeout_ms, gfp_mask, callback, @@ -330,6 +343,7 @@ ib_sa_mcmember_rec_set(struct ib_device /** * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @client:client object used to track the query * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -353,7 +367,8 @@ ib_sa_mcmember_rec_set(struct ib_device * cancel the query. */ static inline int -ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_delete(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, @@ -363,7 +378,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, timeout_ms, gfp_mask, callback, @@ -378,4 +393,25 @@ int ib_init_ah_from_path(struct ib_devic struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); +/** + * ib_sa_register_client - register SA client object + * @client:client object used to track queries + */ +static inline void ib_sa_register_client(struct ib_sa_client *client) +{ + atomic_set(&client->users, 1); + init_completion(&client->completion); +} + +/** + * ib_sa_unregister_client - unregister SA client object + * @client:client object used to track queries + */ +static inline void ib_sa_unregister_client(struct ib_sa_client *client) +{ + if (atomic_dec_and_test(&client->users)) + complete(&client->completion); + wait_for_completion(&client->completion); +} + #endif /* IB_SA_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h -- MST From sean.hefty at intel.com Thu Aug 3 18:09:42 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Aug 2006 18:09:42 -0700 Subject: [openib-general] Fwd: issues in ipoib In-Reply-To: <20060804001229.GA11296@mellanox.co.il> Message-ID: <000001c6b762$ab4ee430$30f9070a@amr.corp.intel.com> >Here's a patch for you to play with. I'll drop this for now. Thanks - I'll add the userspace SA support above this and update it with any changes. - Sean From bugzilla-daemon at openib.org Fri Aug 4 09:37:16 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 4 Aug 2006 09:37:16 -0700 (PDT) Subject: [openib-general] [Bug 185] New: ib0: multicast join failed for Message-ID: <20060804163716.E6A622283D8@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=185 Summary: ib0: multicast join failed for Product: OpenFabrics Linux Version: gen2 Platform: Other OS/Version: 2.6.9 Status: NEW Severity: major Priority: P2 Component: IPoIB AssignedTo: bugzilla at openib.org ReportedBy: mdidomenico at silverstorm.com I've just built two Quad Itanium servers with Mellanox Cougar 128 SDR cards, running RHEL4ud3 Advanced Server. I downloaded the OFED v1.0 version from the website. 
Compiling and Installation went fine, but i'm unable to get the cards to talk. I'm also getting ipath library failures... might be a seperate issue though. See the attached log... ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Fri Aug 4 09:37:40 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 4 Aug 2006 09:37:40 -0700 (PDT) Subject: [openib-general] [Bug 185] ib0: multicast join failed for Message-ID: <20060804163740.E10D9228423@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=185 ------- Comment #1 from mdidomenico at silverstorm.com 2006-08-04 09:37 ------- Created an attachment (id=34) --> (http://openib.org/bugzilla/attachment.cgi?id=34&action=view) ib_debug_info.log ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From Thomas.Talpey at netapp.com Fri Aug 4 07:57:31 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 04 Aug 2006 10:57:31 -0400 Subject: [openib-general] NFS/RDMA for Linux: client and server update release 6 Message-ID: <7.0.1.0.2.20060804100938.0407a008@netapp.com> Network Appliance is pleased to announce release 6 of the NFS/RDMA client and server for Linux 2.6.17. This update to the May 22 release fixes known issues, improves usability and server stability, and supports NFSv4. The code supports both Infiniband and iWARP transports over the standard openfabrics Linux facility. This code is running successfully at multiple user locations. A special thanks goes to Helen Chen and her team at Sandia Labs for their help in resolving multiple usability and stability issues. The code in the current release was used to produce the results reported in their presentation at the recent Commodity Cluster Computing Symposium in Baltimore. Tom Talpey, for the NFS/RDMA project. 
--- Changes since RC5 2.6.17.* kernel/transport switch target (also fixes IPv6 issues) NFS-RDMA client: support NFSv4 NFS-RDMA server: kconfig changes fully uses dma_map()/dma_unmap() api fix race between connection acceptance and first client request fix I/O thread not going to sleep fix two issues in export cache handling fix data corruption with certain pathological client alignments nfsrdmamount command: support NFSv4 runtime warnings on certain systems addressed From bugzilla-daemon at openib.org Fri Aug 4 10:57:09 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 4 Aug 2006 10:57:09 -0700 (PDT) Subject: [openib-general] [Bug 186] New: ibnetdiscover(4029): unaligned access Message-ID: <20060804175709.1A78B2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=186 Summary: ibnetdiscover(4029): unaligned access Product: OpenFabrics Linux Version: gen2 Platform: Other OS/Version: 2.6.9 Status: NEW Severity: normal Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: mdidomenico at silverstorm.com My Itanium servers are producing these messages [root at tse82 bin]# ./ibaddr ibaddr(4040): unaligned access to 0x60000fffffffb6c4, ip=0x2000000000080971 ibaddr(4040): unaligned access to 0x60000fffffffb844, ip=0x2000000000080971 GID 0xfe800000000000000005ad0000013d95 LID start 0x3 end 0x3 ibnetdiscover(4029): unaligned access to 0x600000000000811c, ip=0x2000000000080971 ibnetdiscover(4029): unaligned access to 0x6000000000008114, ip=0x2000000000080971 ibnetdiscover(4029): unaligned access to 0x6000000000008124, ip=0x2000000000080971 ibnetdiscover(4029): unaligned access to 0x60000000000082ac, ip=0x2000000000080971 ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Fri Aug 4 11:34:47 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 4 Aug 2006 11:34:47 -0700 (PDT) Subject: [openib-general] [Bug 185] ib0: multicast join failed for Message-ID: <20060804183447.D28DB2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=185 mdidomenico at silverstorm.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID ------- Comment #2 from mdidomenico at silverstorm.com 2006-08-04 11:34 ------- I'm using a SilverStorm switch w/ Embedded Subnet Manager. For IPoIB, you need to run the SST CLI command smSetDefBcGroup command to create the default multicast instance. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From pw at osc.edu Fri Aug 4 12:50:23 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 4 Aug 2006 15:50:23 -0400 Subject: [openib-general] error return values from userspace calls Message-ID: <20060804195023.GB25527@osc.edu> I'm going batty trying to figure out how to convert return codes from the various ibv_ and rdma_ functions into printable error strings. Consider ibv_modify_qp. It goes through the device function and into ibv_cmd_modify_qp, which then returns 0 or the positive errno that glibc sets on a failed write(), like +EINVAL. So I can call strerror(ret) and get something meaningful. Fine. But ibv_query_device. At least the mthca version returns negative error codes, like -ENOMEM. The ibv_cmd version returns the positive glibc write errno again. Now I should take the absolute value then call strerror? 
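For illustration only (ibv_err_str() is a made-up name, not part of libibverbs or any of the libraries discussed), a caller can paper over the three conventions with a small wrapper; note that a plain -1 cannot be told apart from a genuine -EPERM, which is part of the problem:

#include <errno.h>
#include <string.h>

/*
 * Sketch of a defensive wrapper, not a libibverbs function: normalize
 * the three styles seen here (positive errno, negative errno, and
 * -1 with errno set) into something printable.
 */
const char *ibv_err_str(int ret)
{
	if (ret == -1)
		return strerror(errno);	/* syscall style: -1, errno set */
	if (ret < 0)
		return strerror(-ret);	/* negative errno, e.g. -ENOMEM */
	return strerror(ret);		/* positive errno, e.g. EINVAL */
}

Until the libraries converge on one style, something like this is about the best a consumer can do.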
Then ibv_get_async_event. If the read() fails, it returns -1 but there presumably is a glibc errno that could be looked at. Check for return of -1, and if so, do strerror(errno)? Then there's ibv_poll_cq that returns +npolled on sucess, -1 on empty, -2 on error. Why not EAGAIN instead of these custom codes? Maybe some of this is bugs. Any general consensus on what the official pattern should be? It would make coders' lives easier if there were a single error-reporting style. The RDMA CM functions follow a different convention pretty consistently. They return negative errnos like -EINVAL on error. Except they do not interpret system call return values that show up in errno: int rdma_get_cm_event(..) { .. if (!event) return -EINVAL; .. ret = write(channel->fd, msg, size); if (ret != size) { free(evt); return (ret > 0) ? -ENODATA : ret; } So my user code tries to react as: ret = rdma_get_cm_event(..); if (ret) { if (ret == -1) /* assume a kernel error, not EPERM */ printf("died in the kernel: %s\n", strerror(errno)); else printf("user library unhappy: %s\n", strerror(-ret)); } Does this seem like the right approach to looking at the return values? Maybe these should all be changed to work like their ibv_ relatives? -- Pete From pw at osc.edu Fri Aug 4 13:08:34 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 4 Aug 2006 16:08:34 -0400 Subject: [openib-general] rdma cm process hang In-Reply-To: <1154611170.29187.7.camel@stevo-desktop> References: <20060801213416.GA18941@osc.edu> <1154531379.32560.13.camel@stevo-desktop> <20060802155721.GA20429@osc.edu> <1154611170.29187.7.camel@stevo-desktop> Message-ID: <20060804200834.GC25527@osc.edu> swise at opengridcomputing.com wrote on Thu, 03 Aug 2006 08:19 -0500: > I don't know when, or if I'll have time to address this limitation in > the ammasso firmware. But there is a way (if anyone wants to implement > it): > > 1) add a timer to the c2_qp struct and start it when c2_llp_connect() is > called. > > 2) if the timer fires, generate a CONNECT_REPLY upcall to the IWCM with > status TIMEDOUT. Mark in the qp that the connect timed out. > > 3) deal with the rare condition that the timer fires at or about the > same time the connection really does get established: if the adapter > passes up a CCAE_ACTIVE_CONNECT_RESULTS -after- the timer fires but > before the qp is destroyed by the consumer, then you must squelch this > event and probably destroy the HWQP at least from the adapter's > perspective... Here's a first cut. It fixes one source of process hangs I had been running into. A couple of issues: - What is the proper connect timeout? Old BSD used 24 sec. Modern linux seems to be around 3 minutes based on experimentation. - I used the new hrtimer code. Backports will need major changes to use the old timer interface. - I also added code in c2_free_qp() to kill the connection (and release the qp ref) to handle the ctrl-C case, or any other event that would cause the QP to go away while a connection was outstanding. - No attempt is made to cleanup hardware state. That code could go in connect_timer_expire(), although there may be issues on who is holding what locks when the timer expires. This against r8688. -- Pete Attempt to work around buggy Ammasso firmware that does not timeout active connection requests. Also actively cancels an outstanding connections when the QP is being freed. 
Signed-off-by: Pete Wyckoff Index: linux-kernel/infiniband/hw/amso1100/c2_qp.c =================================================================== --- linux-kernel/infiniband/hw/amso1100/c2_qp.c (revision 8688) +++ linux-kernel/infiniband/hw/amso1100/c2_qp.c (working copy) @@ -517,6 +517,8 @@ c2dev->qp_table.map[qp->qpn] = qp; spin_unlock_irq(&c2dev->qp_table.lock); + hrtimer_init(&qp->connect_timer, CLOCK_MONOTONIC, HRTIMER_REL); + return 0; bail6: @@ -545,6 +547,13 @@ recv_cq = to_c2cq(qp->ibqp.recv_cq); /* + * If the timer was still active, a connection attempt is outstanding. + * Call the expire function directly to release the ref on the qp. + */ + if (hrtimer_cancel(&qp->connect_timer)) + qp->connect_timer.function(&qp->connect_timer); + + /* * Lock CQs here, so that CQ polling code can do QP lookup * without taking a lock. */ Index: linux-kernel/infiniband/hw/amso1100/c2_ae.c =================================================================== --- linux-kernel/infiniband/hw/amso1100/c2_ae.c (revision 8688) +++ linux-kernel/infiniband/hw/amso1100/c2_ae.c (working copy) @@ -226,6 +226,14 @@ cm_event.private_data_len = 0; cm_event.private_data = NULL; } + + /* + * Cancel the connect timeout; but if it already + * ran, throw away this hardware connect result + * that raced against it. + */ + if (hrtimer_cancel(&qp->connect_timer) == 0) + goto ignore_it; if (cm_event.private_data_len) { /* copy private data */ pdata = Index: linux-kernel/infiniband/hw/amso1100/c2_provider.h =================================================================== --- linux-kernel/infiniband/hw/amso1100/c2_provider.h (revision 8688) +++ linux-kernel/infiniband/hw/amso1100/c2_provider.h (working copy) @@ -120,6 +120,7 @@ struct c2_mq sq_mq; struct c2_mq rq_mq; + struct hrtimer connect_timer; }; struct c2_cr_query_attrs { Index: linux-kernel/infiniband/hw/amso1100/c2_cm.c =================================================================== --- linux-kernel/infiniband/hw/amso1100/c2_cm.c (revision 8688) +++ linux-kernel/infiniband/hw/amso1100/c2_cm.c (working copy) @@ -36,6 +36,22 @@ #include "c2_vq.h" #include +static int connect_timer_expire(struct hrtimer *timer) +{ + struct c2_qp *qp; + + qp = container_of(timer, struct c2_qp, connect_timer); + if (qp->cm_id && qp->cm_id->event_handler) { + struct iw_cm_event cm_event = { + .event = IW_CM_EVENT_CONNECT_REPLY, + .status = IW_CM_EVENT_STATUS_TIMEOUT, + }; + dprintk("%s: sending connect timeout event\n", __func__); + qp->cm_id->event_handler(qp->cm_id, &cm_event); + } + return HRTIMER_NORESTART; +} + int c2_llp_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) { struct c2_dev *c2dev = to_c2dev(cm_id->device); @@ -123,6 +139,15 @@ cm_id->provider_data = NULL; qp->cm_id = NULL; cm_id->rem_ref(cm_id); + } else { + /* + * Start connect timer. Since buggy firmware will not + * time out active connections, ever, this timer is used + * to force expiry after 30 sec. + */ + qp->connect_timer.function = connect_timer_expire; + hrtimer_start(&qp->connect_timer, ktime_set(30, 0), + HRTIMER_REL); } return err; } From pw at osc.edu Fri Aug 4 13:14:23 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 4 Aug 2006 16:14:23 -0400 Subject: [openib-general] hack around IWCM double-close problem Message-ID: <20060804201423.GA25697@osc.edu> In some cases using Ammasso devices, a CONNECTION_LOST message may come from c2_ae_event while something else is in the middle of c2_destroy_qp. 
If that happens, a BUG_ON triggers in cm_close_handler as the QP goes to IDLE after the first of the two calls. Perhaps this is a problem with the Ammasso driver, but this little hack-around hid it for me. Signed-off-by: Pete Wyckoff Index: linux-kernel/infiniband/core/iwcm.c =================================================================== --- linux-kernel/infiniband/core/iwcm.c (revision 8688) +++ linux-kernel/infiniband/core/iwcm.c (working copy) @@ -673,6 +673,12 @@ case IW_CM_STATE_DESTROYING: spin_unlock_irqrestore(&cm_id_priv->lock, flags); break; + case IW_CM_STATE_IDLE: + /* protect against double-close from concurrent c2_destroy_qp + * and c2_ae_event CONNECTION_LOST */ + printk(KERN_INFO "%s: in IDLE state, ignoring\n", __func__); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + break; default: BUG_ON(1); } From pw at osc.edu Fri Aug 4 13:15:35 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 4 Aug 2006 16:15:35 -0400 Subject: [openib-general] trivial patch to fix RDMA CM prototype Message-ID: <20060804201535.GB25697@osc.edu> Fix RDMA CM declaration. Signed-off-by: Pete Wyckoff Index: userspace/librdmacm/include/rdma/rdma_cma.h =================================================================== --- userspace/librdmacm/include/rdma/rdma_cma.h (revision 8688) +++ userspace/librdmacm/include/rdma/rdma_cma.h (working copy) @@ -111,7 +111,7 @@ * rdma_create_event_channel - Open a channel used to report communication * events. */ -struct rdma_event_channel *rdma_create_event_channel(); +struct rdma_event_channel *rdma_create_event_channel(void); /** * rdma_destroy_event_channel - Close the event communication channel. From pw at osc.edu Fri Aug 4 13:17:26 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 4 Aug 2006 16:17:26 -0400 Subject: [openib-general] trivial patch to fix cmatose usage message Message-ID: <20060804201726.GC25697@osc.edu> Do not type the characters "dst_ip=" explicitly, just the IP address, in both the trunk and the iwarp branch. Signed-off-by: Pete Wyckoff Index: userspace/librdmacm/examples/cmatose.c =================================================================== --- userspace/librdmacm/examples/cmatose.c (revision 8688) +++ userspace/librdmacm/examples/cmatose.c (working copy) @@ -55,7 +55,7 @@ /* * To execute: * Server: rdma_cmatose - * Client: rdma_cmatose "dst_ip=ip" + * Client: rdma_cmatose */ struct cmatest_node { From pw at osc.edu Fri Aug 4 13:20:27 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 4 Aug 2006 16:20:27 -0400 Subject: [openib-general] work around another Ammasso CM process hang Message-ID: <20060804202027.GD25697@osc.edu> This patch adds code in the TERMINATE_MESSAGE_RECEIVED handler to send a close to the CM as well as to the verbs async handler. Otherwise, the process will hang at QP destroy time due to an extra refcount. Perhaps there is a better fix. iwarp branch. Signed-off-by: Pete Wyckoff Index: linux-kernel/infiniband/hw/amso1100/c2_ae.c =================================================================== --- linux-kernel/infiniband/hw/amso1100/c2_ae.c (revision 8688) +++ linux-kernel/infiniband/hw/amso1100/c2_ae.c (working copy) @@ -259,6 +259,21 @@ qp->ibqp.event_handler(&ib_event, qp->ibqp. qp_context); + /* + * The message above goes to the async handler. + * Also tell the CM else we end up with a dangling + * refcount to the qp and cannot ever free it. 
+ */ + spin_lock_irqsave(&qp->lock, flags); + if (cm_id) { + cm_id->rem_ref(cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.event = IW_CM_EVENT_CLOSE; + cm_event.status = 0; + if (cm_id && cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); break; case CCAE_BAD_CLOSE: case CCAE_LLP_CLOSE_COMPLETE: From jlentini at netapp.com Fri Aug 4 13:28:32 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 4 Aug 2006 16:28:32 -0400 (EDT) Subject: [openib-general] [PATCH] remove unnecessary include Message-ID: The ib_mad module does not use any kthread functions, but mad_priv.h includes kthread.h Signed-off-by: James Lentini Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 8752) +++ core/mad_priv.h (working copy) @@ -39,7 +39,6 @@ #include #include -#include #include #include #include From rdreier at cisco.com Fri Aug 4 13:29:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 04 Aug 2006 13:29:08 -0700 Subject: [openib-general] on vacation... Message-ID: I'll be offline from now until Tuesday evening, so if you have something that needs my attention, please be patient until Wednesday. - R. From mshefty at ichips.intel.com Fri Aug 4 13:37:35 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Aug 2006 13:37:35 -0700 Subject: [openib-general] trivial patch to fix RDMA CM prototype In-Reply-To: <20060804201535.GB25697@osc.edu> References: <20060804201535.GB25697@osc.edu> Message-ID: <44D3B00F.2070805@ichips.intel.com> Thanks - applied. From mshefty at ichips.intel.com Fri Aug 4 13:41:43 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Aug 2006 13:41:43 -0700 Subject: [openib-general] trivial patch to fix cmatose usage message In-Reply-To: <20060804201726.GC25697@osc.edu> References: <20060804201726.GC25697@osc.edu> Message-ID: <44D3B107.7040601@ichips.intel.com> thanks - committed From sean.hefty at intel.com Fri Aug 4 13:44:00 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 4 Aug 2006 13:44:00 -0700 Subject: [openib-general] [PATCH] remove unnecessary include In-Reply-To: Message-ID: <000001c6b806$b7660970$8698070a@amr.corp.intel.com> Thanks - applied. From mshefty at ichips.intel.com Fri Aug 4 13:56:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Aug 2006 13:56:17 -0700 Subject: [openib-general] error return values from userspace calls In-Reply-To: <20060804195023.GB25527@osc.edu> References: <20060804195023.GB25527@osc.edu> Message-ID: <44D3B471.3040802@ichips.intel.com> Pete Wyckoff wrote: > Maybe some of this is bugs. Any general consensus on what the > official pattern should be? It would make coders' lives easier if > there were a single error-reporting style. We should have a single error-reporting style. > The RDMA CM functions follow a different convention pretty > consistently. They return negative errnos like -EINVAL on error. > Except they do not interpret system call return values that show up > in errno: Michael or someone pointed this out to me a while ago. I just haven't had time to correct this yet. - Sean From jlentini at netapp.com Fri Aug 4 14:41:07 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 4 Aug 2006 17:41:07 -0400 (EDT) Subject: [openib-general] [PATCH] remove another unnecessary include Message-ID: I spotted another unnecessary include in the ib_mad module. I'd guess that mad.c was used as a starting point for mad_rmpp.c and this include wasn't removed. 
Signed-off-by: James Lentini Index: core/mad_rmpp.c =================================================================== --- core/mad_rmpp.c (revision 8826) +++ core/mad_rmpp.c (working copy) @@ -33,8 +33,6 @@ * $Id: mad_rmpp.c 1921 2005-03-02 22:58:44Z sean.hefty $ */ -#include - #include "mad_priv.h" #include "mad_rmpp.h" From mshefty at ichips.intel.com Fri Aug 4 14:28:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Aug 2006 14:28:56 -0700 Subject: [openib-general] (SPAM?) Re: [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <44D10928.2080605@ichips.intel.com> References: <44D03AF1.8080300@voltaire.com> <44D0D644.4000607@ichips.intel.com> <15ddcffd0608021305o6b320d70y26b02a4c341e3ade@mail.gmail.com> <44D10928.2080605@ichips.intel.com> Message-ID: <44D3BC18.4040501@ichips.intel.com> Sean Hefty wrote: >>Moreover, the solution is one of: >>- the patch you sent >>- enforcing the ULP to call rdma_establish (or cm_establish for direct >>CM consumers) else a repeatedly lost RTU case is not handled. > > or both, or we do nothing and let the connection fail After implementing rdma_establish(), the solutions that I see are: * Dispatch COMM_EST to the IB CM. This is transparent to the users. A user cannot send a reply until they are told that the connection has been established. * Provide rdma_establish(). As it turns out, clients may still need to wait until they are told that the connection has been established. Before sends can be posted to the QP, it must be transitioned to RTS, which may sleep. * Transition the QP to RTS before sending the REP. This may be a slight spec violation. COMM_EST events are not generated. Users can reply to messages immediately. A lost RTU will result in tearing down the connection. A user could disconnect the connection before it's seen as established by the IB CM, which isn't handled currently. * Combine the previous two solutions. Rdma_establish() would set the connection state, but the QP is already in RTS. The "best" solution is debatable, but I'm leaning towards the last option. Also note that in all cases there's a race where the IB CM can time out a connection at the same time that a message shows up at the receive queue. - Sean From jlentini at netapp.com Fri Aug 4 14:48:44 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 4 Aug 2006 17:48:44 -0400 (EDT) Subject: [openib-general] [PATCH] use correct include Message-ID: Directly include the file we really want. Signed-off-by: James Lentini Index: core/ping.c =================================================================== --- core/ping.c (revision 8826) +++ core/ping.c (working copy) @@ -37,7 +37,7 @@ * $Id$ */ -#include +#include #include #include From sean.hefty at intel.com Fri Aug 4 14:55:10 2006 From: sean.hefty at intel.com (Hefty, Sean) Date: Fri, 4 Aug 2006 14:55:10 -0700 Subject: [openib-general] [PATCH] remove another unnecessary include Message-ID: Thanks - applied. From sean.hefty at intel.com Fri Aug 4 14:58:49 2006 From: sean.hefty at intel.com (Hefty, Sean) Date: Fri, 4 Aug 2006 14:58:49 -0700 Subject: [openib-general] [PATCH] use correct include Message-ID: Thanks - applied. 
- Sean From sean.hefty at intel.com Fri Aug 4 16:55:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 4 Aug 2006 16:55:16 -0700 Subject: [openib-general] [PATCH] RDMA / IB CM: support immediately sending replies to received messages In-Reply-To: <44D3BC18.4040501@ichips.intel.com> Message-ID: <000101c6b821$6f783be0$8698070a@amr.corp.intel.com> This is what the patch would look like that would support sending replies immediately after polling a receive before the QP has "finished" connecting. The changes require that the ib_cm support returning QP attributes for the RTS transition after a REQ has been received, and add an rdma_establish() call to the RDMA CM. I'd like to decide on which approach to use by the end of next week, so I can commit any changes and update userspace accordingly. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_cm.h =================================================================== --- include/rdma/rdma_cm.h (revision 8822) +++ include/rdma/rdma_cm.h (working copy) @@ -256,6 +256,16 @@ int rdma_listen(struct rdma_cm_id *id, i int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); /** + * rdma_cm_establish - Forces a connection state to established. + * @id: Connection identifier to transition to established. + * + * This routine should be invoked by users who receive messages on a + * QP before being notified that the connection has been established by the + * RDMA CM. + */ +int rdma_establish(struct rdma_cm_id *id); + +/** * rdma_reject - Called to reject a connection request or response. */ int rdma_reject(struct rdma_cm_id *id, const void *private_data, Index: core/cm.c =================================================================== --- core/cm.c (revision 8823) +++ core/cm.c (working copy) @@ -3207,6 +3207,10 @@ static int cm_init_qp_rts_attr(struct cm spin_lock_irqsave(&cm_id_priv->lock, flags); switch (cm_id_priv->id.state) { + /* Allow transition to RTS before sending REP */ + case IB_CM_REQ_RCVD: + case IB_CM_MRA_REQ_SENT: + case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: case IB_CM_REP_SENT: Index: core/cma.c =================================================================== --- core/cma.c (revision 8822) +++ core/cma.c (working copy) @@ -840,22 +840,6 @@ static int cma_verify_rep(struct rdma_id return 0; } -static int cma_rtu_recv(struct rdma_id_private *id_priv) -{ - int ret; - - ret = cma_modify_qp_rts(&id_priv->id); - if (ret) - goto reject; - - return 0; -reject: - cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, - NULL, 0, NULL, 0); - return ret; -} - static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) { struct rdma_id_private *id_priv = cm_id->context; @@ -886,9 +870,8 @@ static int cma_ib_handler(struct ib_cm_i private_data_len = IB_CM_REP_PRIVATE_DATA_SIZE; break; case IB_CM_RTU_RECEIVED: - status = cma_rtu_recv(id_priv); - event = status ? 
RDMA_CM_EVENT_CONNECT_ERROR : - RDMA_CM_EVENT_ESTABLISHED; + case IB_CM_USER_ESTABLISHED: + event = RDMA_CM_EVENT_ESTABLISHED; break; case IB_CM_DREQ_ERROR: status = -ETIMEDOUT; /* fall through */ @@ -1981,11 +1964,25 @@ static int cma_accept_ib(struct rdma_id_ struct rdma_conn_param *conn_param) { struct ib_cm_rep_param rep; - int ret; + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; - ret = cma_modify_qp_rtr(&id_priv->id); - if (ret) - return ret; + if (id_priv->id.qp) { + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + goto out; + + qp_attr.qp_state = IB_QPS_RTS; + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, &qp_attr, + &qp_attr_mask); + if (ret) + goto out; + + qp_attr.max_rd_atomic = conn_param->initiator_depth; + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); + if (ret) + goto out; + } memset(&rep, 0, sizeof rep); rep.qp_num = id_priv->qp_num; @@ -2000,7 +1997,9 @@ static int cma_accept_ib(struct rdma_id_ rep.rnr_retry_count = conn_param->rnr_retry_count; rep.srq = id_priv->srq ? 1 : 0; - return ib_send_cm_rep(id_priv->cm_id.ib, &rep); + ret = ib_send_cm_rep(id_priv->cm_id.ib, &rep); +out: + return ret; } static int cma_send_sidr_rep(struct rdma_id_private *id_priv, @@ -2058,6 +2057,27 @@ reject: } EXPORT_SYMBOL(rdma_accept); +int rdma_establish(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_CONNECT)) + return -EINVAL; + + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: + ret = ib_cm_establish(id_priv->cm_id.ib); + break; + default: + ret = 0; + break; + } + return ret; +} +EXPORT_SYMBOL(rdma_establish); + int rdma_reject(struct rdma_cm_id *id, const void *private_data, u8 private_data_len) { From motokochan at mail2world.com Sat Aug 5 07:08:03 2006 From: motokochan at mail2world.com (motokochan at mail2world.com) Date: Sat, 5 Aug 2006 08:08:03 -0600 (MDT) Subject: [openib-general] =?iso-2022-jp?B?gXmMwJLogXqOwI3dgnWCaIJvj5eQq4KyjneWvA==?= Message-ID: 20031201065209.69408mail@mail.rokungo3518767618612798663173.com ���݂u�h�o�������㕅��Ȃ���ۑ����T���Ă��܂��B �A�����@�͂����� http://www.koikoitrain.com/adultwomango ����ł͋X�������肢���܂��B From mst at mellanox.co.il Sat Aug 5 13:46:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 5 Aug 2006 23:46:34 +0300 Subject: [openib-general] error return values from userspace calls In-Reply-To: <20060804195023.GB25527@osc.edu> References: <20060804195023.GB25527@osc.edu> Message-ID: <20060805204634.GA3059@mellanox.co.il> Quoting r. Pete Wyckoff : > Then there's ibv_poll_cq that returns +npolled on sucess, -1 on > empty, -2 on error. Why not EAGAIN instead of these custom codes? Seems like thre's some mistake. Should be 0 on empty, -1 on error (I'm not sure there *can* be an error from ibv_poll_cq, need to look it up). -- MST From pw at osc.edu Sat Aug 5 15:39:30 2006 From: pw at osc.edu (Pete Wyckoff) Date: Sat, 5 Aug 2006 18:39:30 -0400 Subject: [openib-general] error return values from userspace calls In-Reply-To: <20060805204634.GA3059@mellanox.co.il> References: <20060804195023.GB25527@osc.edu> <20060805204634.GA3059@mellanox.co.il> Message-ID: <20060805223930.GA28695@osc.edu> mst at mellanox.co.il wrote on Sat, 05 Aug 2006 23:46 +0300: > Quoting r. Pete Wyckoff : > > Then there's ibv_poll_cq that returns +npolled on sucess, -1 on > > empty, -2 on error. Why not EAGAIN instead of these custom codes? > > Seems like thre's some mistake. 
> Should be 0 on empty, -1 on error (I'm not sure there *can* be an error from > ibv_poll_cq, need to look it up). Oh, you're right of course. 0 on empty, but perhaps -2 on error. I saw this in libmthca/src/cq.c and jumped to the conclusion that empty == -1: enum { CQ_OK = 0, CQ_EMPTY = -1, CQ_POLL_ERR = -2 }; But CQ_EMPTY never propagates back to the user. However, mthca_poll_cq does: return err == CQ_POLL_ERR ? err : npolled; which returns the -2. amso_poll_cq uses ibv_cmd_poll_cq, which returns -1 on failed userspace malloc (but nothing in errno), -1 and errno for some kernel interface problem, 0 for empty, or +npolled on success. -- Pete From bunk at stusta.de Sat Aug 5 15:53:33 2006 From: bunk at stusta.de (Adrian Bunk) Date: Sun, 6 Aug 2006 00:53:33 +0200 Subject: [openib-general] [patch 02/45] IB/mthca: restore missing PCI registers after reset In-Reply-To: <20060717162531.GC4829@kroah.com> References: <20060717160652.408007000@blue.kroah.org> <20060717162531.GC4829@kroah.com> Message-ID: <20060805225333.GY25692@stusta.de> It seems this patch should also be included in the 2.6.16.x branch, or do I miss anything? TIA Adrian On Mon, Jul 17, 2006 at 09:25:31AM -0700, Greg KH wrote: > -stable review patch. If anyone has any objections, please let us know. > > ------------------ > mthca does not restore the following PCI-X/PCI Express registers after reset: > PCI-X device: PCI-X command register > PCI-X bridge: upstream and downstream split transaction registers > PCI Express : PCI Express device control and link control registers > > This causes instability and/or bad performance on systems where one of > these registers is set to a non-default value by BIOS. > > Signed-off-by: Michael S. Tsirkin > Signed-off-by: Chris Wright > Signed-off-by: Greg Kroah-Hartman > > --- > drivers/infiniband/hw/mthca/mthca_reset.c | 59 ++++++++++++++++++++++++++++++ > 1 file changed, 59 insertions(+) > > --- linux-2.6.17.2.orig/drivers/infiniband/hw/mthca/mthca_reset.c > +++ linux-2.6.17.2/drivers/infiniband/hw/mthca/mthca_reset.c > @@ -49,6 +49,12 @@ int mthca_reset(struct mthca_dev *mdev) > u32 *hca_header = NULL; > u32 *bridge_header = NULL; > struct pci_dev *bridge = NULL; > + int bridge_pcix_cap = 0; > + int hca_pcie_cap = 0; > + int hca_pcix_cap = 0; > + > + u16 devctl; > + u16 linkctl; > > #define MTHCA_RESET_OFFSET 0xf0010 > #define MTHCA_RESET_VALUE swab32(1) > @@ -110,6 +116,9 @@ int mthca_reset(struct mthca_dev *mdev) > } > } > > + hca_pcix_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); > + hca_pcie_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); > + > if (bridge) { > bridge_header = kmalloc(256, GFP_KERNEL); > if (!bridge_header) { > @@ -129,6 +138,13 @@ int mthca_reset(struct mthca_dev *mdev) > goto out; > } > } > + bridge_pcix_cap = pci_find_capability(bridge, PCI_CAP_ID_PCIX); > + if (!bridge_pcix_cap) { > + err = -ENODEV; > + mthca_err(mdev, "Couldn't locate HCA bridge " > + "PCI-X capability, aborting.\n"); > + goto out; > + } > } > > /* actually hit reset */ > @@ -178,6 +194,20 @@ int mthca_reset(struct mthca_dev *mdev) > good: > /* Now restore the PCI headers */ > if (bridge) { > + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0x8, > + bridge_header[(bridge_pcix_cap + 0x8) / 4])) { > + err = -ENODEV; > + mthca_err(mdev, "Couldn't restore HCA bridge Upstream " > + "split transaction control, aborting.\n"); > + goto out; > + } > + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0xc, > + bridge_header[(bridge_pcix_cap + 0xc) / 4])) { > + err = -ENODEV; > + 
mthca_err(mdev, "Couldn't restore HCA bridge Downstream " > + "split transaction control, aborting.\n"); > + goto out; > + } > /* > * Bridge control register is at 0x3e, so we'll > * naturally restore it last in this loop. > @@ -203,6 +233,35 @@ good: > } > } > > + if (hca_pcix_cap) { > + if (pci_write_config_dword(mdev->pdev, hca_pcix_cap, > + hca_header[hca_pcix_cap / 4])) { > + err = -ENODEV; > + mthca_err(mdev, "Couldn't restore HCA PCI-X " > + "command register, aborting.\n"); > + goto out; > + } > + } > + > + if (hca_pcie_cap) { > + devctl = hca_header[(hca_pcie_cap + PCI_EXP_DEVCTL) / 4]; > + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_DEVCTL, > + devctl)) { > + err = -ENODEV; > + mthca_err(mdev, "Couldn't restore HCA PCI Express " > + "Device Control register, aborting.\n"); > + goto out; > + } > + linkctl = hca_header[(hca_pcie_cap + PCI_EXP_LNKCTL) / 4]; > + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_LNKCTL, > + linkctl)) { > + err = -ENODEV; > + mthca_err(mdev, "Couldn't restore HCA PCI Express " > + "Link control register, aborting.\n"); > + goto out; > + } > + } > + > for (i = 0; i < 16; ++i) { > if (i * 4 == PCI_COMMAND) > continue; > > -- > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ From mst at mellanox.co.il Sat Aug 5 22:04:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 6 Aug 2006 08:04:25 +0300 Subject: [openib-general] [patch 02/45] IB/mthca: restore missing PCI registers after reset In-Reply-To: <20060805225333.GY25692@stusta.de> References: <20060805225333.GY25692@stusta.de> Message-ID: <20060806050425.GA6418@mellanox.co.il> Quoting r. Adrian Bunk : > Subject: Re: [patch 02/45] IB/mthca: restore missing PCI registers after reset > > It seems this patch should also be included in the 2.6.16.x branch, or > do I miss anything? Right, good idea. -- MST From dotanb at mellanox.co.il Sun Aug 6 05:44:11 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 6 Aug 2006 15:44:11 +0300 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail Message-ID: <200608061544.11205.dotanb@mellanox.co.il> Hi. I compiled code of a test that we wrote and i get compilaton error, here is the compilation error: cc -c -g -O2 -Wall -W -Werror -I/usr/mst/include -I/usr/local/include/infiniband async_event_test.c In file included from /usr/include/vl_gen2u_str.h:40, from /usr/include/vl.h:49, from async_event_test.c:38: /usr/local/include/infiniband/cm.h:209: error: syntax error before numeric constant make: *** [async_event_test.o] Error 1 when i looked at a preprocessed code of the test i noticed the following code: enum ib_cm_sidr_status { 0, 1, 2, 3, 4, IB_SIDR_UNSUPPORTED_VERSION }; it seems that the enumerations values were replaced with integers. when i searched for the values that were enumerated in the headre files i found the following defines in ib_types.h: #define IB_SIDR_SUCCESS 0 #define IB_SIDR_UNSUPPORTED 1 #define IB_SIDR_REJECT 2 #define IB_SIDR_NO_QP 3 #define IB_SIDR_REDIRECT 4 I think that the problem was that ib_types.h was included in a file that includes the cm.h and the preprocessor replaced the enumeration names with the integer values. who can check this issue? 
thanks Dotan From dotanb at mellanox.co.il Sun Aug 6 06:50:52 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 6 Aug 2006 16:50:52 +0300 Subject: [openib-general] [libibcm] does the libibcm support multithreaded applications? In-Reply-To: <44D2212D.4060508@ichips.intel.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302A3C659@mtlexch01.mtl.com> <44D2212D.4060508@ichips.intel.com> Message-ID: <200608061650.52846.dotanb@mellanox.co.il> On Thursday 03 August 2006 19:15, Sean Hefty wrote: > Dotan Barak wrote: > > I'm trying to use the libibcm in a multithreaded test and i get weird > > failures (instead of RTU event i get a DREQ event). > > This is possible sequence. If the RTU is lost, or the connecting client aborts > before sending the RTU, a DREQ can occur. > > > Does the libibcm supports multi threading applications? > > (every thread have it's own CM device and each one of them listen is > > using a different service ID) > > There's nothing (other than an unknown bug) that should prevent a multi-threaded > application from running. I have a multithreaded test (qp_test) and I tried to add libibcm support to it: every thread calls the ib_cm_get_device function and gets a cm_device_handle. I checked the handles and it seems that both threads get the same CM device handle, which sometimes causes thread X to receive an event that was meant for thread Y. How should a multithreaded application work with libibcm? thanks Dotan From bugzilla-daemon at openib.org Sun Aug 6 08:20:23 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sun, 6 Aug 2006 08:20:23 -0700 (PDT) Subject: [openib-general] [Bug 186] ibnetdiscover(4029): unaligned access Message-ID: <20060806152023.2E1A52283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=186 tziporet at mellanox.co.il changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|bugzilla at openib.org |halr at voltaire.com ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Sun Aug 6 08:25:25 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sun, 6 Aug 2006 08:25:25 -0700 (PDT) Subject: [openib-general] [Bug 185] ib0: multicast join failed for Message-ID: <20060806152525.94C872283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=185 ------- Comment #3 from tziporet at mellanox.co.il 2006-08-06 08:25 ------- Note that the ipath driver issue is a known one, since the ipath driver is not supported on Itanium. Please use a conf file that disables the ipath driver (ib_ipath=n) ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From devesh28 at gmail.com Mon Aug 7 00:06:38 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Mon, 7 Aug 2006 12:36:38 +0530 Subject: [openib-general] cq polling order Message-ID: <309a667c0608070006g77688258ud13630828730e81c@mail.gmail.com> Hello everybody, I have a query regarding cq poll concept. Consider the following situation: Consumer has posted 2 SEND operations after that it posted 1 RDMA_READ operation and againg 1 SEND operation ( Not posted with Barrier Fence, Is it expected from consumer that it can post without Barrier Fence after RDMA READ?
), now while polling in what order completions should be returned by verbs? Is it expected by consumer that completions will be polled in posting order or they can come out of order? Polling order 2 SEND_COMP, 1 RDMA_READ_COMP, 1 SEND_COMP OR Polling order 3 SEND_COMP, 1 RDMA_READ_COMP which is expected? -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Mon Aug 7 00:38:25 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 7 Aug 2006 10:38:25 +0300 Subject: [openib-general] cq polling order In-Reply-To: <309a667c0608070006g77688258ud13630828730e81c@mail.gmail.com> References: <309a667c0608070006g77688258ud13630828730e81c@mail.gmail.com> Message-ID: <200608071038.25701.dotanb@mellanox.co.il> Hi. On Monday 07 August 2006 10:06, Devesh Sharma wrote: > Hello everybody, > I have a query regarding cq poll concept. > Consider the following situation: > > Consumer has posted 2 SEND operations after that it posted 1 RDMA_READ > operation and againg 1 SEND operation ( Not posted with Barrier Fence, Is it > expected from consumer that it can post without Barrier Fence after RDMA > READ? ), now while polling in what order completions should be returned by > verbs? > > Is it expected by consumer that completions will be polled in posting order > or they can come out of order? > Polling order 2 SEND_COMP, 1 RDMA_READ_COMP, 1 SEND_COMP > OR > Polling order 3 SEND_COMP, 1 RDMA_READ_COMP > > which is expected? The order the completions must be the same as the order of the WR that you posted to the Send Queue. so, you should expect the first option. as you mentioned, if the fence bit is enable on a specific SR, the HCA need to wait until all the previous SR will be finished (according to the fence laws in the IB spec ..). Dotan From devesh28 at gmail.com Mon Aug 7 01:14:03 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Mon, 7 Aug 2006 13:44:03 +0530 Subject: [openib-general] cq polling order In-Reply-To: <200608071038.25701.dotanb@mellanox.co.il> References: <309a667c0608070006g77688258ud13630828730e81c@mail.gmail.com> <200608071038.25701.dotanb@mellanox.co.il> Message-ID: <309a667c0608070114m14ee7992ka6f866358b78a980@mail.gmail.com> Hi, Dotan Thanks for quick reply. On 8/7/06, Dotan Barak wrote: > > Hi. > > On Monday 07 August 2006 10:06, Devesh Sharma wro > > Hello everybody, > > I have a query regarding cq poll concept. > > Consider the following situation: > > > > Consumer has posted 2 SEND operations after that it posted 1 RDMA_READ > > operation and againg 1 SEND operation ( Not posted with Barrier Fence, > Is it > > expected from consumer that it can post without Barrier Fence after RDMA > > READ? ), now while polling in what order completions should be returned > by > > verbs? > > > > Is it expected by consumer that completions will be polled in posting > order > > or they can come out of order? > > Polling order 2 SEND_COMP, 1 RDMA_READ_COMP, 1 SEND_COMP > > OR > > Polling order 3 SEND_COMP, 1 RDMA_READ_COMP > > > > which is expected? > > The order the completions must be the same as the order of the WR that you > posted to the Send Queue. > so, you should expect the first option. But RDMA Read may complete out of order. Is it means that HCA Driver should be implemented such that order is maintained in such situations? as you mentioned, if the fence bit is enable on a specific SR, the HCA need > to wait until all the previous SR > will be finished (according to the fence laws in the IB spec ..). 
If consumer is specifing Fence then there is No problem of polling order, Since, HCA will not start processing next WR untill all WR prior to this WR completes, Hence polling will inherently be done in order, But issue is if Cosumer dose not uses FENCE flag in next send operation after RDMA READ. Dotan > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Mon Aug 7 01:55:35 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 7 Aug 2006 11:55:35 +0300 Subject: [openib-general] cq polling order In-Reply-To: <309a667c0608070114m14ee7992ka6f866358b78a980@mail.gmail.com> References: <309a667c0608070006g77688258ud13630828730e81c@mail.gmail.com> <200608071038.25701.dotanb@mellanox.co.il> <309a667c0608070114m14ee7992ka6f866358b78a980@mail.gmail.com> Message-ID: <200608071155.35589.dotanb@mellanox.co.il> On Monday 07 August 2006 11:14, Devesh Sharma wrote: > Hi, Dotan Thanks for quick reply. > > On 8/7/06, Dotan Barak wrote: > > > > Hi. > > > > On Monday 07 August 2006 10:06, Devesh Sharma wro > > > Hello everybody, > > > I have a query regarding cq poll concept. > > > Consider the following situation: > > > > > > Consumer has posted 2 SEND operations after that it posted 1 RDMA_READ > > > operation and againg 1 SEND operation ( Not posted with Barrier Fence, > > Is it > > > expected from consumer that it can post without Barrier Fence after RDMA > > > READ? ), now while polling in what order completions should be returned > > by > > > verbs? > > > > > > Is it expected by consumer that completions will be polled in posting > > order > > > or they can come out of order? > > > Polling order 2 SEND_COMP, 1 RDMA_READ_COMP, 1 SEND_COMP > > > OR > > > Polling order 3 SEND_COMP, 1 RDMA_READ_COMP > > > > > > which is expected? > > > > The order the completions must be the same as the order of the WR that you > > posted to the Send Queue. > > so, you should expect the first option. > > > But RDMA Read may complete out of order. Is it means that HCA Driver should > be implemented such that order is maintained in such situations? > In the IB spec 10.8.5: "... Work Completions are always returned in the order submitted to a given Work Queue with respect to other Work Requests on that Work Queue." so, the order of the completions in the CQ should be the same order of the posts WR on that Wort Queue. > as you mentioned, if the fence bit is enable on a specific SR, the HCA need > > to wait until all the previous SR > > will be finished (according to the fence laws in the IB spec ..). > > > If consumer is specifing Fence then there is No problem of polling order, > Since, HCA will not start processing next WR untill all WR prior to this WR > completes, Hence polling will inherently be done in order, But issue is if > Cosumer dose not uses FENCE flag in next send operation after RDMA READ. I believe that the order of the completions in the CQ is not being effected by the FENCE bit. The FENCE bit will effect the processing timing: if one will post RDMA Read and then Send With/Without FENCE and he will use the same local memory addresses, the difference that he will see is in the data (there is a race between the RDMA Read and the Send: if the Send will start after the Read was finished or not). Dotan From mst at mellanox.co.il Mon Aug 7 02:27:11 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 7 Aug 2006 12:27:11 +0300 Subject: [openib-general] [PATCH repost] IB/ipoib: fix race reported by Eitan Rabin Message-ID: <20060807092711.GA4778@mellanox.co.il> Please consider the following for 2.6.18. Prevent flush task from freeing the ipoib_neigh pointer, while ipoib_start_xmit is accessing the ipoib_neigh through the pointer is has loaded from the hardware address. Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cf71d2a..31c4b05 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -336,7 +336,8 @@ void ipoib_flush_paths(struct net_device struct ipoib_path *path, *tp; LIST_HEAD(remove_list); - spin_lock_irq(&priv->lock); + spin_lock_irq(&priv->tx_lock); + spin_lock(&priv->lock); list_splice(&priv->path_list, &remove_list); INIT_LIST_HEAD(&priv->path_list); @@ -352,7 +353,8 @@ void ipoib_flush_paths(struct net_device path_free(dev, path); spin_lock_irq(&priv->lock); } - spin_unlock_irq(&priv->lock); + spin_unlock(&priv->lock); + spin_unlock_irq(&priv->tx_lock); } static void path_rec_completion(int status, -- MST From ogerlitz at voltaire.com Mon Aug 7 02:55:55 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 07 Aug 2006 12:55:55 +0300 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D22218.7000005@ichips.intel.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> <44D049B5.6000505@voltaire.com> <44D0D808.1070003@ichips.intel.com> <44D1986B.6070302@voltaire.com> <44D22218.7000005@ichips.intel.com> Message-ID: <44D70E2B.60205@voltaire.com> Sean Hefty wrote: > Or Gerlitz wrote: >> Is it correct that with the gen2 code, the remote **CM** will >> reconnect on that case? > I don't think so. The QP needs to move into timewait, so a new > connection request is needed with a different QPN. Just to make sure, you replaced "CM id" with QP and QPN in the sentence above, correct? its fine to recycle QPs, eg when re-connect that follows sending a REQ and getting a REJ whose reason is stale connection is carried out?! >> I see in cm.c :: cm_rej_handler() that when the state is >> IB_CM_REQ_SENT and the reject reason is IB_CM_REJ_STALE_CONN you just >> move the cm_id into timewait state, which will cause a retry on the >> REQ, correct? > The cm_id moves into timewait, but that shouldn't cause a retry. The CM > should notify the ULP of the reject. The QP cannot be re-used until the > cm_id exits the timewait state. OK, i understand it now better, however: Your reply made me rethink the issue of reject reasons handling in the CMA framework and i have figured out we might have a little hole here... This is as of include/rdma/ib_cm.h :: ib_cm_rej_reason enum values being delivered up by the CMA to the ULP in the status field of the cma event but the ULP is not aware to this enum... One solution i can think of is sorting out the values of ib_cm_rej_reason to few categories: 0) ones that are of no interest to the CMA nor to the ULP above it but rather only to the local CM (are there any?) 1) ones that *must* be handled internally by the CMA (are there any?) 
2) ones that *can* be handled internally by the CMA (eg stale-conn) 3) ones that can be translated to errno value (eg invalid-sid (8) to econnrefused) and set the cma event status field to the errno value 4) ones that do not fall into none of 0-3 above, for those set the status field to ENOPROTO and once a CMA app that gets ENOPROTO and has an issue with it would show up, we will see what can we do. Please let me know what you think Or. From ogerlitz at voltaire.com Mon Aug 7 03:52:27 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 07 Aug 2006 13:52:27 +0300 Subject: [openib-general] [PATCH] RDMA / IB CM: support immediately sending replies to received messages In-Reply-To: <000101c6b821$6f783be0$8698070a@amr.corp.intel.com> References: <000101c6b821$6f783be0$8698070a@amr.corp.intel.com> Message-ID: <44D71B6B.3000007@voltaire.com> Sean Hefty wrote: > This is what the patch would look like that would support sending > replies immediately after polling a receive before the QP has > "finished" connecting. The changes require that the ib_cm support > returning QP attributes for the RTS transition after a REQ has > been received, and add an rdma_establish() call to the RDMA CM. Sean, OK, just to make sure i don't miss anything here: 1) the CM moves the QP state to RTS before sending the REP and hence per the QP its fine to do TX before getting the ESTABLISHED event. 2) if a ULP calls rdma_establish (eg when polling an RX from CQ --> QP --> un ESTABLISHED CMA ID or from the COMM_EST case of the qp async event handler) it is ensured they will get ESTABLISHED event so its up to them if to wait for the event before processing the RX or not. > I'd like to decide on which approach to use by the end of next > week, so I can commit any changes and update userspace accordingly. OK, fair enough. Personally i preferred the patch set that implemented everything within the ib stack and just had this little requirement on the ULP to hold on with doing TX before getting the ESTABLISHED event, but this one makes sense as well. I will try to see if i can think of races that are bad enough to prevent us from moving the QP state to RTS *before* sending the REP. Also, i understand that for APM support a patch in the spirit of what you were suggesting (ie track local QPNs and affiliated QP async events in the CM) would be merged anyway, correct? Or. From krkumar2 at in.ibm.com Mon Aug 7 05:04:32 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 7 Aug 2006 17:34:32 +0530 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: Message-ID: Hi James, > Are the rdmav_ versions intended to be generic or are they intended > for use with the native communications managers (IB CM and iWARP CM)? It is true that the rdmav_ version is generic. But using the rdmav_ routines means knowing the type of transport interface (eg, I need to do ibv_modify_qp, etc). Using CMA makes that invisible as the application does an rdma_create_id, create_qp, bind/listen/connect, etc, and then proceeds to post wr 's to the qp using ibv_post_send(), etc. > Is there a way that the differences could be made clearer? Could one > be eliminated? If I understand right, the CMA interface was designed and added to let the application use the IB/iwarp devices without knowing the underlying interface. So it would not make sense to remove this interface (and definitely not the rdmav_() interfaces as that is used by rdma_() interface). 
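As a rough illustration of the CMA flow described above (transport selection hidden behind the rdma_ calls, data path still plain verbs), here is an active-side sketch that is not taken from the proposal itself; the pd, qp_attr, wr and destination address are assumed to be prepared by the caller, and all event/status checking beyond a bare wait is omitted.

/* Active-side sketch; error handling and event validation are omitted. */
#include <sys/socket.h>
#include <rdma/rdma_cma.h>
#include <infiniband/verbs.h>

static void wait_for(struct rdma_event_channel *ch)
{
	struct rdma_cm_event *ev;

	rdma_get_cm_event(ch, &ev);	/* real code checks ev->event and ev->status */
	rdma_ack_cm_event(ev);
}

int connect_and_send(struct sockaddr *dst, struct ibv_pd *pd,
		     struct ibv_qp_init_attr *qp_attr, struct ibv_send_wr *wr)
{
	struct rdma_event_channel *ch = rdma_create_event_channel();
	struct rdma_conn_param param = { .responder_resources = 1,
					 .initiator_depth = 1 };
	struct rdma_cm_id *id;
	struct ibv_send_wr *bad;

	rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
	rdma_resolve_addr(id, NULL, dst, 2000);
	wait_for(ch);				/* RDMA_CM_EVENT_ADDR_RESOLVED */
	rdma_resolve_route(id, 2000);
	wait_for(ch);				/* RDMA_CM_EVENT_ROUTE_RESOLVED */
	rdma_create_qp(id, pd, qp_attr);	/* QP created without knowing the transport */
	rdma_connect(id, &param);
	wait_for(ch);				/* RDMA_CM_EVENT_ESTABLISHED */

	/* the data path is still ordinary verbs */
	return ibv_post_send(id->qp, wr, &bad);
}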
Thanks, - KK From jlentini at netapp.com Mon Aug 7 07:19:30 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 7 Aug 2006 10:19:30 -0400 (EDT) Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: References: Message-ID: On Mon, 7 Aug 2006, Krishna Kumar2 wrote: > Hi James, > > > Are the rdmav_ versions intended to be generic or are they intended > > for use with the native communications managers (IB CM and iWARP CM)? > > It is true that the rdmav_ version is generic. But using the rdmav_ > routines means knowing the type of transport interface (eg, I need > to do ibv_modify_qp, etc). Is there a benefit to having rdmav_create_qp() take generic parameters if the application needs to understand the type of QP (IB, iWARP, etc.) created and the transport specific communication manager calls that are needed to manipulate it? Would it make more sense if the QP create command was also transport specific? > Using CMA makes that invisible as the application does an > rdma_create_id, create_qp, bind/listen/connect, etc, and then > proceeds to post wr 's to the qp using ibv_post_send(), etc. > > > Is there a way that the differences could be made clearer? Could one > > be eliminated? > > If I understand right, the CMA interface was designed and added to > let the application use the IB/iwarp devices without knowing the > underlying interface. Correct. > So it would not make sense to remove this interface (and definitely > not the rdmav_() interfaces as that is used by rdma_() interface). We should definitely keep both intefaces. From jlentini at netapp.com Mon Aug 7 09:32:47 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 7 Aug 2006 12:32:47 -0400 (EDT) Subject: [openib-general] OFED 1.0 kernel sources Message-ID: What subversion revision are the OFED 1.0 kernel sources supposed to correspond to? In the OFED package's source directory, there is a SOURCES/openib-1.0/src/linux-kernel/infiniband/core/ping.c file. However the makefile doesn't build this code and the code itself references symbols from a later version of the CMA (e.g. RDMA_TRANSPORT_IB). james From swise at opengridcomputing.com Mon Aug 7 09:43:39 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 07 Aug 2006 11:43:39 -0500 Subject: [openib-general] rdma cm process hang In-Reply-To: <20060804200834.GC25527@osc.edu> References: <20060801213416.GA18941@osc.edu> <1154531379.32560.13.camel@stevo-desktop> <20060802155721.GA20429@osc.edu> <1154611170.29187.7.camel@stevo-desktop> <20060804200834.GC25527@osc.edu> Message-ID: <1154969019.17232.10.camel@stevo-desktop> Pete, this looks good. I'm going to hold off on pulling this in until I get a word from Roland on if he's pulling in the driver patch to 2.6.19. I'd rather get what I have in to Roland's git tree, then add this as a 1-off patch. Roland, what do you think? Thanks, Steve. On Fri, 2006-08-04 at 16:08 -0400, Pete Wyckoff wrote: > swise at opengridcomputing.com wrote on Thu, 03 Aug 2006 08:19 -0500: > > I don't know when, or if I'll have time to address this limitation in > > the ammasso firmware. But there is a way (if anyone wants to implement > > it): > > > > 1) add a timer to the c2_qp struct and start it when c2_llp_connect() is > > called. > > > > 2) if the timer fires, generate a CONNECT_REPLY upcall to the IWCM with > > status TIMEDOUT. Mark in the qp that the connect timed out. 
> > > > 3) deal with the rare condition that the timer fires at or about the > > same time the connection really does get established: if the adapter > > passes up a CCAE_ACTIVE_CONNECT_RESULTS -after- the timer fires but > > before the qp is destroyed by the consumer, then you must squelch this > > event and probably destroy the HWQP at least from the adapter's > > perspective... > > Here's a first cut. It fixes one source of process hangs I had been > running into. A couple of issues: > > - What is the proper connect timeout? Old BSD used 24 sec. > Modern linux seems to be around 3 minutes based on > experimentation. > > - I used the new hrtimer code. Backports will need major > changes to use the old timer interface. > > - I also added code in c2_free_qp() to kill the connection (and > release the qp ref) to handle the ctrl-C case, or any other > event that would cause the QP to go away while a connection was > outstanding. > > - No attempt is made to cleanup hardware state. That code could > go in connect_timer_expire(), although there may be issues on > who is holding what locks when the timer expires. > > This against r8688. > > -- Pete > > Attempt to work around buggy Ammasso firmware that does not timeout > active connection requests. Also actively cancels an outstanding > connections when the QP is being freed. > > Signed-off-by: Pete Wyckoff > > Index: linux-kernel/infiniband/hw/amso1100/c2_qp.c > =================================================================== > --- linux-kernel/infiniband/hw/amso1100/c2_qp.c (revision 8688) > +++ linux-kernel/infiniband/hw/amso1100/c2_qp.c (working copy) > @@ -517,6 +517,8 @@ > c2dev->qp_table.map[qp->qpn] = qp; > spin_unlock_irq(&c2dev->qp_table.lock); > > + hrtimer_init(&qp->connect_timer, CLOCK_MONOTONIC, HRTIMER_REL); > + > return 0; > > bail6: > @@ -545,6 +547,13 @@ > recv_cq = to_c2cq(qp->ibqp.recv_cq); > > /* > + * If the timer was still active, a connection attempt is outstanding. > + * Call the expire function directly to release the ref on the qp. > + */ > + if (hrtimer_cancel(&qp->connect_timer)) > + qp->connect_timer.function(&qp->connect_timer); > + > + /* > * Lock CQs here, so that CQ polling code can do QP lookup > * without taking a lock. > */ > Index: linux-kernel/infiniband/hw/amso1100/c2_ae.c > =================================================================== > --- linux-kernel/infiniband/hw/amso1100/c2_ae.c (revision 8688) > +++ linux-kernel/infiniband/hw/amso1100/c2_ae.c (working copy) > @@ -226,6 +226,14 @@ > cm_event.private_data_len = 0; > cm_event.private_data = NULL; > } > + > + /* > + * Cancel the connect timeout; but if it already > + * ran, throw away this hardware connect result > + * that raced against it. 
> + */ > + if (hrtimer_cancel(&qp->connect_timer) == 0) > + goto ignore_it; > if (cm_event.private_data_len) { > /* copy private data */ > pdata = > Index: linux-kernel/infiniband/hw/amso1100/c2_provider.h > =================================================================== > --- linux-kernel/infiniband/hw/amso1100/c2_provider.h (revision 8688) > +++ linux-kernel/infiniband/hw/amso1100/c2_provider.h (working copy) > @@ -120,6 +120,7 @@ > > struct c2_mq sq_mq; > struct c2_mq rq_mq; > + struct hrtimer connect_timer; > }; > > struct c2_cr_query_attrs { > Index: linux-kernel/infiniband/hw/amso1100/c2_cm.c > =================================================================== > --- linux-kernel/infiniband/hw/amso1100/c2_cm.c (revision 8688) > +++ linux-kernel/infiniband/hw/amso1100/c2_cm.c (working copy) > @@ -36,6 +36,22 @@ > #include "c2_vq.h" > #include > > +static int connect_timer_expire(struct hrtimer *timer) > +{ > + struct c2_qp *qp; > + > + qp = container_of(timer, struct c2_qp, connect_timer); > + if (qp->cm_id && qp->cm_id->event_handler) { > + struct iw_cm_event cm_event = { > + .event = IW_CM_EVENT_CONNECT_REPLY, > + .status = IW_CM_EVENT_STATUS_TIMEOUT, > + }; > + dprintk("%s: sending connect timeout event\n", __func__); > + qp->cm_id->event_handler(qp->cm_id, &cm_event); > + } > + return HRTIMER_NORESTART; > +} > + > int c2_llp_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) > { > struct c2_dev *c2dev = to_c2dev(cm_id->device); > @@ -123,6 +139,15 @@ > cm_id->provider_data = NULL; > qp->cm_id = NULL; > cm_id->rem_ref(cm_id); > + } else { > + /* > + * Start connect timer. Since buggy firmware will not > + * time out active connections, ever, this timer is used > + * to force expiry after 30 sec. > + */ > + qp->connect_timer.function = connect_timer_expire; > + hrtimer_start(&qp->connect_timer, ktime_set(30, 0), > + HRTIMER_REL); > } > return err; > } From swise at opengridcomputing.com Mon Aug 7 09:46:30 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 07 Aug 2006 11:46:30 -0500 Subject: [openib-general] hack around IWCM double-close problem In-Reply-To: <20060804201423.GA25697@osc.edu> References: <20060804201423.GA25697@osc.edu> Message-ID: <1154969190.17232.13.camel@stevo-desktop> The BUG_ON is there to catch these problems. The iwarp driver is never supposed to post events twice. I'll look into this, but I think the fix should be in the amso driver... Tom, what do you think? Steve. On Fri, 2006-08-04 at 16:14 -0400, Pete Wyckoff wrote: > In some cases using Ammasso devices, a CONNECTION_LOST message may > come from c2_ae_event while something else is in the middle of > c2_destroy_qp. If that happens, a BUG_ON triggers in > cm_close_handler as the QP goes to IDLE after the first of the two > calls. Perhaps this is a problem with the Ammasso driver, but this > little hack-around hid it for me. 
> > Signed-off-by: Pete Wyckoff > > Index: linux-kernel/infiniband/core/iwcm.c > =================================================================== > --- linux-kernel/infiniband/core/iwcm.c (revision 8688) > +++ linux-kernel/infiniband/core/iwcm.c (working copy) > @@ -673,6 +673,12 @@ > case IW_CM_STATE_DESTROYING: > spin_unlock_irqrestore(&cm_id_priv->lock, flags); > break; > + case IW_CM_STATE_IDLE: > + /* protect against double-close from concurrent c2_destroy_qp > + * and c2_ae_event CONNECTION_LOST */ > + printk(KERN_INFO "%s: in IDLE state, ignoring\n", __func__); > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + break; > default: > BUG_ON(1); > } > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From bunk at stusta.de Mon Aug 7 10:04:51 2006 From: bunk at stusta.de (Adrian Bunk) Date: Mon, 7 Aug 2006 19:04:51 +0200 Subject: [openib-general] [patch 02/45] IB/mthca: restore missing PCI registers after reset In-Reply-To: <20060806050425.GA6418@mellanox.co.il> References: <20060805225333.GY25692@stusta.de> <20060806050425.GA6418@mellanox.co.il> Message-ID: <20060807170451.GH3691@stusta.de> On Sun, Aug 06, 2006 at 08:04:25AM +0300, Michael S. Tsirkin wrote: > Quoting r. Adrian Bunk : > > Subject: Re: [patch 02/45] IB/mthca: restore missing PCI registers after reset > > > > It seems this patch should also be included in the 2.6.16.x branch, or > > do I miss anything? > > Right, good idea. Thanks for this information, I've applied it. > MST cu Adrian -- Gentoo kernels are 42 times more popular than SUSE kernels among KLive users (a service by SUSE contractor Andrea Arcangeli that gathers data about kernels from many users worldwide). There are three kinds of lies: Lies, Damn Lies, and Statistics. Benjamin Disraeli From mshefty at ichips.intel.com Mon Aug 7 10:05:32 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 07 Aug 2006 10:05:32 -0700 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D70E2B.60205@voltaire.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> <44D049B5.6000505@voltaire.com> <44D0D808.1070003@ichips.intel.com> <44D1986B.6070302@voltaire.com> <44D22218.7000005@ichips.intel.com> <44D70E2B.60205@voltaire.com> Message-ID: <44D772DC.7010101@ichips.intel.com> Or Gerlitz wrote: > 0) ones that are of no interest to the CMA nor to the ULP above it but > rather only to the local CM (are there any?) > > 1) ones that *must* be handled internally by the CMA (are there any?) > > 2) ones that *can* be handled internally by the CMA (eg stale-conn) > > 3) ones that can be translated to errno value (eg invalid-sid (8) to > econnrefused) and set the cma event status field to the errno value > > 4) ones that do not fall into none of 0-3 above, for those set the > status field to ENOPROTO and once a CMA app that gets ENOPROTO and has > an issue with it would show up, we will see what can we do. > > Please let me know what you think I think we'd have to take each reject code and see if it makes sense to change what's done with it. I'd rather expose the underlying reject reason to the user, than convert it to some other status code that may mask the real reason for the reject. 
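As an illustration of that preference, a kernel-side sketch (not from the thread) of a ULP event handler consuming the raw ib_cm_rej_reason that cma.c delivers in the event status field; the errno choices below are examples only, not an agreed mapping.

/* Sketch: how a ULP might act on the raw reject reason passed up by the
 * RDMA CM; the mapping below is illustrative only. */
#include <linux/errno.h>
#include <rdma/ib_cm.h>
#include <rdma/rdma_cm.h>

static int my_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	if (event->event != RDMA_CM_EVENT_REJECTED)
		return 0;

	switch (event->status) {
	case IB_CM_REJ_STALE_CONN:
		/* tear the QP down and retry with a fresh QPN */
		return -EAGAIN;
	case IB_CM_REJ_INVALID_SERVICE_ID:
		/* nobody is listening on that service ID */
		return -ECONNREFUSED;
	default:
		/* surface the unexpected reason rather than masking it */
		return -EPROTO;
	}
}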
- Sean From bos at pathscale.com Mon Aug 7 10:22:37 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 07 Aug 2006 10:22:37 -0700 Subject: [openib-general] OFED 1.0 kernel sources In-Reply-To: References: Message-ID: <1154971357.26375.50.camel@sardonyx> On Mon, 2006-08-07 at 12:32 -0400, James Lentini wrote: > What subversion revision are the OFED 1.0 kernel sources > supposed to correspond to? None. The OFED 1.0 kernel sources were pulled from Roland's git tree as of a few months ago, then patched by the build script. References: <000101c6b821$6f783be0$8698070a@amr.corp.intel.com> <44D71B6B.3000007@voltaire.com> Message-ID: <44D776AD.3080606@ichips.intel.com> Or Gerlitz wrote: > 1) the CM moves the QP state to RTS before sending the REP and hence per > the QP its fine to do TX before getting the ESTABLISHED event. Correct. > 2) if a ULP calls rdma_establish (eg when polling an RX from CQ --> QP > --> un ESTABLISHED CMA ID or from the COMM_EST case of the qp async > event handler) it is ensured they will get ESTABLISHED event so its up > to them if to wait for the event before processing the RX or not. Note that since the QP is in the RTS state, the IB COMM_EST event will _not_ be generated. This is the key difference between these approaches. The user would need to call rdma_establish when polling a RX from the CQ. Failure to do so would result in a connection failure if the RTU were not received. Calling rdma_establish would result in an RDMA CM ESTABLISH event. > OK, fair enough. Personally i preferred the patch set that implemented > everything within the ib stack and just had this little requirement on > the ULP to hold on with doing TX before getting the ESTABLISHED event, > but this one makes sense as well. My preference is to provide whichever solution makes it easier on the majority of the ULPs (ignoring spec changes at the moment). I don't know which fix ULPs will find easier to work with. Maybe we can come up with a compromise where we expose rdma_establish, but transitioning to RTS before the REP is sent requires that the ULP do the transition...? > Also, i understand that for APM support a patch in the spirit of what > you were suggesting (ie track local QPNs and affiliated QP async events > in the CM) would be merged anyway, correct? Correct. - Sean From mshefty at ichips.intel.com Mon Aug 7 10:37:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 07 Aug 2006 10:37:28 -0700 Subject: [openib-general] [libibcm] does the libibcm support multithreaded applications? In-Reply-To: <200608061650.52846.dotanb@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302A3C659@mtlexch01.mtl.com> <44D2212D.4060508@ichips.intel.com> <200608061650.52846.dotanb@mellanox.co.il> Message-ID: <44D77A58.9020904@ichips.intel.com> Dotan Barak wrote: > I have a multithreaded test (qp_test) and i tried to add support to the > libibcm: every thread is calling the ib_cm_get_device function and get a > cm_device_handle. I checked the handles and it seems that both of the threads > get the same CM device handle, thing which causes to thread X sometimes get > the event which i wanted to send to thread Y. This is correct based on the current implementation. The library opens the CM files once. > How should a multithreaded application need to work with the libibcm? I understand what the problem is, and I think you're right. If ib_cm_get_device() returned a new ib_cm_device, you could more easily control event processing. 
I will fix this up when I remove the dependency on libsysfs from the libibcm. I am probably at least 2 weeks away from starting on this though. - Sean From linhd at owenmedia.com Mon Aug 7 10:59:29 2006 From: linhd at owenmedia.com (Linh Dinh) Date: Mon, 7 Aug 2006 10:59:29 -0700 Subject: [openib-general] OpenFabrics Alliance and InfiniBand Trade Association Developer Conference - Sept. 25, 2006 Message-ID: Hi Everyone, Mark your calendar: the InfiniBand Trade Association (IBTA) and the OpenFabrics Alliance (OFA) will host a joint developers conference on September 25, 2006 at the Moscone Center West in San Francisco. The event is being held in co-location with the Fall 2006 Intel Developer Forum. If you are an application developer, systems vendor, hardware/software solution provider or end user of the technology, please join us for presentations and collaborative sessions that will highlight the recent advancements of the InfiniBand specification and available software solutions. The one-day conference begins at 8:30 a.m. with keynotes from Jim Pappas, director of initiative marketing at Intel Corporation, and Krish Ramakrishnan, vice president and general manager of Server Switching at Cisco. In addition, we have an exciting day planned including: * End users sharing experiences on real-life deployment and usage of the technology * Highlights of the recent advancements of the InfiniBand specification by IBTA * Updates on available InfiniBand-supported software solutions from OFA and industry partners * Collaborative sessions and discussions about future joint developments between IBTA and OFA Attendees who register by September 1st can do so for the early-bird rate of $149. Afterwards, the standard registration fee is $199. To register for the event, please visit: www.acteva.com/go/IBTAOFADevCon06 Special discount offered to those registering for IDF: If you haven't yet registered for IDF, we invite you to take advantage of an exclusive discount being offered to those attending the IBTA and OFA conference. Attendees may purchase conference passes to IDF at a discounted rate of $750 - a savings of $745 off the standard rate. To register for IDF and receive this discount, please visit: www.intel.com/idf/us/fall2006/registration IBTA Member Bulk Code: FCAGRBTA The Intel Developer Forum in San Francisco offers attendees over 130 hours of technology training to choose from, led by top Intel and industry engineers who provide critical training that will help you solve your day-to-day, real-time problems. Linh Dinh For InfiniBand Trade Association & OpenFabrics Alliance 206-322-1167, ext. 115 linhd at owenmedia.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Aug 7 11:02:58 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 07 Aug 2006 11:02:58 -0700 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <200608061544.11205.dotanb@mellanox.co.il> References: <200608061544.11205.dotanb@mellanox.co.il> Message-ID: <44D78052.7010600@ichips.intel.com> Dotan Barak wrote: > enum ib_cm_sidr_status { 0, 1, 2, 3, 4, IB_SIDR_UNSUPPORTED_VERSION }; > > it seems that the enumerations values were replaced with integers. 
> > when i searched for the values that were enumerated in the headre files i > found the following defines in ib_types.h: > > #define IB_SIDR_SUCCESS 0 #define > IB_SIDR_UNSUPPORTED 1 #define > IB_SIDR_REJECT 2 #define > IB_SIDR_NO_QP 3 #define > IB_SIDR_REDIRECT 4 > > > I think that the problem was that ib_types.h was included in a file that > includes the cm.h and the preprocessor replaced the enumeration names with > the integer values. > > who can check this issue? I think the solution is to remove CM definitions out of ib_types.h. What is the reason for including ib_types.h and cm.h? ib_types looks like an internal opensm include file. - Sean From tziporet at mellanox.co.il Mon Aug 7 12:32:44 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 7 Aug 2006 22:32:44 +0300 Subject: [openib-general] OFED 1.0 kernel sources Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7624@mtlexch01.mtl.com> best to take the sources from the release tarball Tziporet -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Bryan O'Sullivan Sent: Monday, August 07, 2006 8:23 PM To: James Lentini Cc: openib-general Subject: Re: [openib-general] OFED 1.0 kernel sources On Mon, 2006-08-07 at 12:32 -0400, James Lentini wrote: > What subversion revision are the OFED 1.0 kernel sources > supposed to correspond to? None. The OFED 1.0 kernel sources were pulled from Roland's git tree as of a few months ago, then patched by the build script. References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7624@mtlexch01.mtl.com> Message-ID: On Mon, 7 Aug 2006, Tziporet Koren wrote: > best to take the sources from the release tarball That is what I was doing. The problem is that the source file for the ib_ping module is in the release, but not the matching kernel sources. > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Bryan O'Sullivan > Sent: Monday, August 07, 2006 8:23 PM > To: James Lentini > Cc: openib-general > Subject: Re: [openib-general] OFED 1.0 kernel sources > > On Mon, 2006-08-07 at 12:32 -0400, James Lentini wrote: > > What subversion revision are the OFED 1.0 kernel sources > > supposed to correspond to? > > None. The OFED 1.0 kernel sources were pulled from Roland's git tree as > of a few months ago, then patched by the build script. From sean.hefty at intel.com Mon Aug 7 15:32:40 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 Aug 2006 15:32:40 -0700 Subject: [openib-general] [PATCH] sa_query: require SA query registration In-Reply-To: <20060804001229.GA11296@mellanox.co.il> Message-ID: <000101c6ba71$646ccac0$e598070a@amr.corp.intel.com> Require registration with SA module, to prevent module text from going away while sa query callback is still running, and update all users. Signed-off-by: Michael S. Tsirkin Signed-off-by: Sean Hefty --- Changes to the previous post include: * Move struct ib_sa_client definition internal to SA module to better encapsulate future extensions. We can debate whether this is good or bad. * Fix duplicate dereferences on client objects. * Add registration/unregistration to SRP. I did not add tracking to cancel queries automatically. Adding this wouldn't change the API - it only speeds up unregistration. I have a separate patch for the util directory (not posted). 
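For reference, the usage pattern this patch imposes on a ULP looks roughly like the following sketch; the module and function names here are hypothetical, and only the ib_sa_* calls come from the patch itself.

/* Hypothetical ULP showing the register/query/unregister flow added by
 * this patch; device/port selection and real work are omitted. */
#include <linux/module.h>
#include <linux/err.h>
#include <rdma/ib_sa.h>

static struct ib_sa_client *my_sa_client;

static void my_path_rec_cb(int status, struct ib_sa_path_rec *resp, void *ctx)
{
	/* may still run after module exit begins, but never after
	 * ib_sa_unregister_client() has returned */
}

static int __init my_init(void)
{
	my_sa_client = ib_sa_register_client();
	if (IS_ERR(my_sa_client))
		return PTR_ERR(my_sa_client);
	/*
	 * Later, per query (arguments assumed to exist):
	 * id = ib_sa_path_rec_get(my_sa_client, device, port_num, &rec,
	 *			    comp_mask, timeout_ms, retries, GFP_KERNEL,
	 *			    my_path_rec_cb, context, &query);
	 */
	return 0;
}

static void __exit my_exit(void)
{
	/* drops the last reference and waits for outstanding queries */
	ib_sa_unregister_client(my_sa_client);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");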
Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 8843) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -91,6 +91,8 @@ static struct ib_client ipoib_client = { .remove = ipoib_remove_one }; +static struct ib_sa_client *ipoib_sa_client; + int ipoib_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -460,7 +462,7 @@ static int path_rec_start(struct net_dev init_completion(&path->done); path->query_id = - ib_sa_path_rec_get(priv->ca, priv->port, + ib_sa_path_rec_get(ipoib_sa_client, priv->ca, priv->port, &path->pathrec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | @@ -1185,12 +1187,21 @@ static int __init ipoib_init_module(void goto err_fs; } + ipoib_sa_client = ib_sa_register_client(); + if (IS_ERR(ipoib_sa_client)) { + ret = PTR_ERR(ipoib_sa_client); + goto err_wq; + } + ret = ib_register_client(&ipoib_client); if (ret) - goto err_wq; + goto err_sa; return 0; +err_sa: + ib_sa_unregister_client(ipoib_sa_client); + err_wq: destroy_workqueue(ipoib_workqueue); @@ -1202,6 +1213,7 @@ err_fs: static void __exit ipoib_cleanup_module(void) { + ib_sa_unregister_client(ipoib_sa_client); ib_unregister_client(&ipoib_client); ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 8843) +++ ulp/srp/ib_srp.c (working copy) @@ -103,6 +103,8 @@ static struct ib_client srp_client = { .remove = srp_remove_one }; +static struct ib_sa_client *srp_sa_client; + static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) { return (struct srp_target_port *) host->hostdata; @@ -2002,10 +2004,17 @@ static int __init srp_init_module(void) return ret; } + srp_sa_client = ib_sa_register_client(); + if (IS_ERR(srp_sa_client)) { + class_unregister(&srp_class); + return PTR_ERR(srp_sa_client); + } + ret = ib_register_client(&srp_client); if (ret) { printk(KERN_ERR PFX "couldn't register IB client\n"); class_unregister(&srp_class); + ib_sa_unregister_client(srp_sa_client); return ret; } @@ -2014,6 +2023,7 @@ static int __init srp_init_module(void) static void __exit srp_cleanup_module(void) { + ib_sa_unregister_client(srp_sa_client); ib_unregister_client(&srp_client); class_unregister(&srp_class); } Index: include/rdma/ib_sa.h =================================================================== --- include/rdma/ib_sa.h (revision 8843) +++ include/rdma/ib_sa.h (working copy) @@ -250,11 +250,25 @@ struct ib_sa_service_rec { u64 data64[2]; }; +struct ib_sa_client; + +/** + * ib_sa_register_client - Register an SA client. + */ +struct ib_sa_client *ib_sa_register_client(void); + +/** + * ib_sa_unregister_client - Deregister an SA client. + * @client: Client object to deregister. 
+ */ +void ib_sa_unregister_client(struct ib_sa_client *client); + struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -264,7 +278,8 @@ int ib_sa_path_rec_get(struct ib_device void *context, struct ib_sa_query **query); -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -275,7 +290,8 @@ int ib_sa_mcmember_rec_query(struct ib_d void *context, struct ib_sa_query **query); -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, @@ -288,6 +304,7 @@ int ib_sa_service_rec_query(struct ib_de /** * ib_sa_mcmember_rec_set - Start an MCMember set query + * @client:SA client * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -312,7 +329,8 @@ int ib_sa_service_rec_query(struct ib_de * cancel the query. */ static inline int -ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_set(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -322,7 +340,7 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, timeout_ms, retries, gfp_mask, callback, @@ -331,6 +349,7 @@ ib_sa_mcmember_rec_set(struct ib_device /** * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @client:SA client * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -355,7 +374,8 @@ ib_sa_mcmember_rec_set(struct ib_device * cancel the query. 
*/ static inline int -ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_delete(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -365,7 +385,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, timeout_ms, retries, gfp_mask, callback, Index: core/multicast.c =================================================================== --- core/multicast.c (revision 8843) +++ core/multicast.c (working copy) @@ -63,6 +63,7 @@ static struct ib_client mcast_client = { .remove = mcast_remove_one }; +static struct ib_sa_client *sa_client; static struct ib_event_handler event_handler; static struct workqueue_struct *mcast_wq; static union ib_gid mgid0; @@ -305,8 +306,8 @@ static int send_join(struct mcast_group int ret; group->last_join = member; - ret = ib_sa_mcmember_rec_set(port->dev->device, port->port_num, - &member->multicast.rec, + ret = ib_sa_mcmember_rec_set(sa_client, port->dev->device, + port->port_num, &member->multicast.rec, member->multicast.comp_mask, retry_timer, retries, GFP_KERNEL, join_handler, group, &group->query); @@ -326,7 +327,8 @@ static int send_leave(struct mcast_group rec = group->rec; rec.join_state = leave_state; - ret = ib_sa_mcmember_rec_delete(port->dev->device, port->port_num, &rec, + ret = ib_sa_mcmember_rec_delete(sa_client, port->dev->device, + port->port_num, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_JOIN_STATE, @@ -770,18 +772,27 @@ static int __init mcast_init(void) if (!mcast_wq) return -ENOMEM; + sa_client = ib_sa_register_client(); + if (IS_ERR(sa_client)) { + ret = PTR_ERR(sa_client); + goto err1; + } + ret = ib_register_client(&mcast_client); if (ret) - goto err; + goto err2; return 0; -err: +err2: + ib_sa_unregister_client(sa_client); +err1: destroy_workqueue(mcast_wq); return ret; } static void __exit mcast_cleanup(void) { + ib_sa_unregister_client(sa_client); ib_unregister_client(&mcast_client); destroy_workqueue(mcast_wq); } Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 8843) +++ core/sa_query.c (working copy) @@ -33,6 +33,9 @@ * $Id$ */ +#include + +#include #include #include #include @@ -71,9 +74,15 @@ struct ib_sa_device { struct ib_sa_port port[0]; }; +struct ib_sa_client { + atomic_t users; + struct completion comp; +}; + struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); + struct ib_sa_client *client; struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; @@ -413,6 +422,46 @@ static void ib_sa_event(struct ib_event_ } } +struct ib_sa_client *ib_sa_register_client() +{ + struct ib_sa_client *client; + + client = kmalloc(sizeof *client, GFP_KERNEL); + if (!client) + return ERR_PTR(-ENOMEM); + + atomic_set(&client->users, 1); + init_completion(&client->comp); + return client; +} +EXPORT_SYMBOL(ib_sa_register_client); + +static void ib_sa_client_get(struct ib_sa_query *query, + struct ib_sa_client *client) +{ + atomic_inc(&client->users); + query->client = client; +} + +static inline void deref_client(struct ib_sa_client *client) +{ + if (atomic_dec_and_test(&client->users)) + 
complete(&client->comp); +} + +static void ib_sa_client_put(struct ib_sa_query *query) +{ + deref_client(query->client); +} + +void ib_sa_unregister_client(struct ib_sa_client *client) +{ + deref_client(client); + wait_for_completion(&client->comp); + kfree(client); +} +EXPORT_SYMBOL(ib_sa_unregister_client); + /** * ib_sa_cancel_query - try to cancel an SA query * @id:ID of query to cancel @@ -636,7 +685,8 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. */ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -677,6 +727,7 @@ int ib_sa_path_rec_get(struct ib_device mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; query->sa_query.port = port; @@ -696,6 +747,7 @@ int ib_sa_path_rec_get(struct ib_device err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -753,7 +805,8 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. */ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -799,6 +852,7 @@ int ib_sa_service_rec_query(struct ib_de mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; query->sa_query.port = port; @@ -819,6 +873,7 @@ int ib_sa_service_rec_query(struct ib_de err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -849,7 +904,8 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -891,6 +947,7 @@ int ib_sa_mcmember_rec_query(struct ib_d mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; query->sa_query.port = port; @@ -911,6 +968,7 @@ int ib_sa_mcmember_rec_query(struct ib_d err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -947,6 +1005,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(mad_send_wc->send_buf); kref_put(&query->sm_ah->ref, free_sm_ah); + ib_sa_client_put(query); query->release(query); } Index: core/cma.c =================================================================== --- core/cma.c (revision 8843) +++ core/cma.c (working copy) @@ -61,6 +61,7 @@ static struct ib_client cma_client = { .remove = cma_remove_one }; +struct ib_sa_client *sa_client; static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); @@ -1272,7 +1273,7 @@ static int cma_query_ib_route(struct rdm path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; - id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->query_id = ib_sa_path_rec_get(sa_client, id_priv->id.device, id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, @@ -2367,18 +2368,27 @@ static int cma_init(void) if (!cma_wq) return -ENOMEM; + sa_client = ib_sa_register_client(); + if (IS_ERR(sa_client)) { + ret = PTR_ERR(sa_client); + goto err1; + } + ret = ib_register_client(&cma_client); if (ret) - goto err; + goto err2; return 0; -err: +err2: + ib_sa_unregister_client(sa_client); +err1: destroy_workqueue(cma_wq); return ret; } static void cma_cleanup(void) { + ib_sa_unregister_client(sa_client); ib_unregister_client(&cma_client); destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); From mst at mellanox.co.il Mon Aug 7 15:55:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 01:55:07 +0300 Subject: [openib-general] [PATCH] sa_query: require SA query registration In-Reply-To: <000101c6ba71$646ccac0$e598070a@amr.corp.intel.com> References: <000101c6ba71$646ccac0$e598070a@amr.corp.intel.com> Message-ID: <20060807225507.GA10166@mellanox.co.il> Quoting r. Sean Hefty : > @@ -1202,6 +1213,7 @@ err_fs: > > static void __exit ipoib_cleanup_module(void) > { > + ib_sa_unregister_client(ipoib_sa_client); > ib_unregister_client(&ipoib_client); > ipoib_unregister_debugfs(); > destroy_workqueue(ipoib_workqueue); I think you must call ib_unregister_client first, before ib_sa_unregister_client. This is because ib_unregister_client triggers hotplug event which cancels all queries. If you don't do this, you'll get a deadlock as ipoib might retry the queries forever. Same probably applies to other modules. -- MST From sean.hefty at intel.com Mon Aug 7 16:09:30 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 Aug 2006 16:09:30 -0700 Subject: [openib-general] [PATCH v2] sa_query: require SA query registration In-Reply-To: <20060807225507.GA10166@mellanox.co.il> Message-ID: <000201c6ba76$89c7fc90$e598070a@amr.corp.intel.com> Fixed to call ib_unregister_client() before ib_sa_unregister_client(). 
Thanks --- Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 8843) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -91,6 +91,8 @@ static struct ib_client ipoib_client = { .remove = ipoib_remove_one }; +static struct ib_sa_client *ipoib_sa_client; + int ipoib_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -460,7 +462,7 @@ static int path_rec_start(struct net_dev init_completion(&path->done); path->query_id = - ib_sa_path_rec_get(priv->ca, priv->port, + ib_sa_path_rec_get(ipoib_sa_client, priv->ca, priv->port, &path->pathrec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | @@ -1185,12 +1187,21 @@ static int __init ipoib_init_module(void goto err_fs; } + ipoib_sa_client = ib_sa_register_client(); + if (IS_ERR(ipoib_sa_client)) { + ret = PTR_ERR(ipoib_sa_client); + goto err_wq; + } + ret = ib_register_client(&ipoib_client); if (ret) - goto err_wq; + goto err_sa; return 0; +err_sa: + ib_sa_unregister_client(ipoib_sa_client); + err_wq: destroy_workqueue(ipoib_workqueue); @@ -1203,6 +1214,7 @@ err_fs: static void __exit ipoib_cleanup_module(void) { ib_unregister_client(&ipoib_client); + ib_sa_unregister_client(ipoib_sa_client); ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); } Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 8843) +++ ulp/srp/ib_srp.c (working copy) @@ -103,6 +103,8 @@ static struct ib_client srp_client = { .remove = srp_remove_one }; +static struct ib_sa_client *srp_sa_client; + static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) { return (struct srp_target_port *) host->hostdata; @@ -2002,9 +2004,16 @@ static int __init srp_init_module(void) return ret; } + srp_sa_client = ib_sa_register_client(); + if (IS_ERR(srp_sa_client)) { + class_unregister(&srp_class); + return PTR_ERR(srp_sa_client); + } + ret = ib_register_client(&srp_client); if (ret) { printk(KERN_ERR PFX "couldn't register IB client\n"); + ib_sa_unregister_client(srp_sa_client); class_unregister(&srp_class); return ret; } @@ -2015,6 +2024,7 @@ static int __init srp_init_module(void) static void __exit srp_cleanup_module(void) { ib_unregister_client(&srp_client); + ib_sa_unregister_client(srp_sa_client); class_unregister(&srp_class); } Index: include/rdma/ib_sa.h =================================================================== --- include/rdma/ib_sa.h (revision 8843) +++ include/rdma/ib_sa.h (working copy) @@ -250,11 +250,25 @@ struct ib_sa_service_rec { u64 data64[2]; }; +struct ib_sa_client; + +/** + * ib_sa_register_client - Register an SA client. + */ +struct ib_sa_client *ib_sa_register_client(void); + +/** + * ib_sa_unregister_client - Deregister an SA client. + * @client: Client object to deregister. 
+ */ +void ib_sa_unregister_client(struct ib_sa_client *client); + struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -264,7 +278,8 @@ int ib_sa_path_rec_get(struct ib_device void *context, struct ib_sa_query **query); -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -275,7 +290,8 @@ int ib_sa_mcmember_rec_query(struct ib_d void *context, struct ib_sa_query **query); -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, @@ -288,6 +304,7 @@ int ib_sa_service_rec_query(struct ib_de /** * ib_sa_mcmember_rec_set - Start an MCMember set query + * @client:SA client * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -312,7 +329,8 @@ int ib_sa_service_rec_query(struct ib_de * cancel the query. */ static inline int -ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_set(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -322,7 +340,7 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, timeout_ms, retries, gfp_mask, callback, @@ -331,6 +349,7 @@ ib_sa_mcmember_rec_set(struct ib_device /** * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @client:SA client * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -355,7 +374,8 @@ ib_sa_mcmember_rec_set(struct ib_device * cancel the query. 
*/ static inline int -ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_delete(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -365,7 +385,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, timeout_ms, retries, gfp_mask, callback, Index: core/multicast.c =================================================================== --- core/multicast.c (revision 8843) +++ core/multicast.c (working copy) @@ -63,6 +63,7 @@ static struct ib_client mcast_client = { .remove = mcast_remove_one }; +static struct ib_sa_client *sa_client; static struct ib_event_handler event_handler; static struct workqueue_struct *mcast_wq; static union ib_gid mgid0; @@ -305,8 +306,8 @@ static int send_join(struct mcast_group int ret; group->last_join = member; - ret = ib_sa_mcmember_rec_set(port->dev->device, port->port_num, - &member->multicast.rec, + ret = ib_sa_mcmember_rec_set(sa_client, port->dev->device, + port->port_num, &member->multicast.rec, member->multicast.comp_mask, retry_timer, retries, GFP_KERNEL, join_handler, group, &group->query); @@ -326,7 +327,8 @@ static int send_leave(struct mcast_group rec = group->rec; rec.join_state = leave_state; - ret = ib_sa_mcmember_rec_delete(port->dev->device, port->port_num, &rec, + ret = ib_sa_mcmember_rec_delete(sa_client, port->dev->device, + port->port_num, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_JOIN_STATE, @@ -770,12 +772,20 @@ static int __init mcast_init(void) if (!mcast_wq) return -ENOMEM; + sa_client = ib_sa_register_client(); + if (IS_ERR(sa_client)) { + ret = PTR_ERR(sa_client); + goto err1; + } + ret = ib_register_client(&mcast_client); if (ret) - goto err; + goto err2; return 0; -err: +err2: + ib_sa_unregister_client(sa_client); +err1: destroy_workqueue(mcast_wq); return ret; } @@ -783,6 +793,7 @@ err: static void __exit mcast_cleanup(void) { ib_unregister_client(&mcast_client); + ib_sa_unregister_client(sa_client); destroy_workqueue(mcast_wq); } Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 8843) +++ core/sa_query.c (working copy) @@ -33,6 +33,9 @@ * $Id$ */ +#include + +#include #include #include #include @@ -71,9 +74,15 @@ struct ib_sa_device { struct ib_sa_port port[0]; }; +struct ib_sa_client { + atomic_t users; + struct completion comp; +}; + struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); + struct ib_sa_client *client; struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; @@ -413,6 +422,46 @@ static void ib_sa_event(struct ib_event_ } } +struct ib_sa_client *ib_sa_register_client() +{ + struct ib_sa_client *client; + + client = kmalloc(sizeof *client, GFP_KERNEL); + if (!client) + return ERR_PTR(-ENOMEM); + + atomic_set(&client->users, 1); + init_completion(&client->comp); + return client; +} +EXPORT_SYMBOL(ib_sa_register_client); + +static void ib_sa_client_get(struct ib_sa_query *query, + struct ib_sa_client *client) +{ + atomic_inc(&client->users); + query->client = client; +} + +static inline void deref_client(struct ib_sa_client *client) +{ + if (atomic_dec_and_test(&client->users)) 
+ complete(&client->comp); +} + +static void ib_sa_client_put(struct ib_sa_query *query) +{ + deref_client(query->client); +} + +void ib_sa_unregister_client(struct ib_sa_client *client) +{ + deref_client(client); + wait_for_completion(&client->comp); + kfree(client); +} +EXPORT_SYMBOL(ib_sa_unregister_client); + /** * ib_sa_cancel_query - try to cancel an SA query * @id:ID of query to cancel @@ -636,7 +685,8 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. */ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -677,6 +727,7 @@ int ib_sa_path_rec_get(struct ib_device mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; query->sa_query.port = port; @@ -696,6 +747,7 @@ int ib_sa_path_rec_get(struct ib_device err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -753,7 +805,8 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. */ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -799,6 +852,7 @@ int ib_sa_service_rec_query(struct ib_de mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; query->sa_query.port = port; @@ -819,6 +873,7 @@ int ib_sa_service_rec_query(struct ib_de err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -849,7 +904,8 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -891,6 +947,7 @@ int ib_sa_mcmember_rec_query(struct ib_d mad = query->sa_query.mad_buf->mad; init_mad(mad, agent); + ib_sa_client_get(&query->sa_query, client); query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; query->sa_query.port = port; @@ -911,6 +968,7 @@ int ib_sa_mcmember_rec_query(struct ib_d err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -947,6 +1005,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(mad_send_wc->send_buf); kref_put(&query->sm_ah->ref, free_sm_ah); + ib_sa_client_put(query); query->release(query); } Index: core/cma.c =================================================================== --- core/cma.c (revision 8843) +++ core/cma.c (working copy) @@ -61,6 +61,7 @@ static struct ib_client cma_client = { .remove = cma_remove_one }; +struct ib_sa_client *sa_client; static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); @@ -1272,7 +1273,7 @@ static int cma_query_ib_route(struct rdm path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; - id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->query_id = ib_sa_path_rec_get(sa_client, id_priv->id.device, id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, @@ -2367,12 +2368,20 @@ static int cma_init(void) if (!cma_wq) return -ENOMEM; + sa_client = ib_sa_register_client(); + if (IS_ERR(sa_client)) { + ret = PTR_ERR(sa_client); + goto err1; + } + ret = ib_register_client(&cma_client); if (ret) - goto err; + goto err2; return 0; -err: +err2: + ib_sa_unregister_client(sa_client); +err1: destroy_workqueue(cma_wq); return ret; } @@ -2380,6 +2389,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + ib_sa_unregister_client(sa_client); destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); idr_destroy(&tcp_ps); From mst at mellanox.co.il Mon Aug 7 16:15:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 02:15:36 +0300 Subject: [openib-general] [PATCH] sa_query: require SA query registration In-Reply-To: <000101c6ba71$646ccac0$e598070a@amr.corp.intel.com> References: <000101c6ba71$646ccac0$e598070a@amr.corp.intel.com> Message-ID: <20060807231536.GB10166@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH] sa_query: require SA query registration > > Require registration with SA module, to prevent module text > from going away while sa query callback is still running, > and update all users. > > Signed-off-by: Michael S. Tsirkin > Signed-off-by: Sean Hefty > --- > Changes to the previous post include: > > * Move struct ib_sa_client definition internal to SA module to > better encapsulate future extensions. Documentation/stable_api_nonsense.txt says: "You think you want a stable kernel interface, but you really do not" > We can debate whether this is good or bad. I liked the init/destroy API better. It seems your API has a (small) extra cost, as ib_sa_register_client will now remain in-memory even if it is only used at module load time, plus all calls to SA routines have to load the pointer to structure from memory instead of the address being computed by linker. Error handling also seems to be trickier with your patch. I also liked the fact that the original patch put the name into the structure, I think ability to figure out who's using the query will come handy for debugging. Some more questions on the matter: We allocate stuff in ib_verbs mainly because there are multiple providers inheriting structs like ib_qp. But why would you want different types of clients? 
Could you explain how allocating the structure from the heap better encapsulates future extensions, please? Why is ipoib_sa_client = ib_sa_register_client ret = PTR_ERR(ipoib_sa_client) a better encapsulation than ret = ib_sa_register_client(&ipoib_sa_client) Further, why should this API be different from ib_register_client? Isn't this confusing? Isn't it better to let the user control where the memory comes from, rather than forcing GFP_KERNEL? Take care, -- MST From mst at mellanox.co.il Mon Aug 7 16:20:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 02:20:31 +0300 Subject: [openib-general] [PATCH v2] sa_query: require SA query registration In-Reply-To: <000201c6ba76$89c7fc90$e598070a@amr.corp.intel.com> References: <000201c6ba76$89c7fc90$e598070a@amr.corp.intel.com> Message-ID: <20060807232031.GC10166@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH v2] sa_query: require SA query registration > > Fixed to call ib_unregister_client() before ib_sa_unregister_client(). So, it seems tracking queries inside a client won't be useful now, will it? -- MST From sean.hefty at intel.com Mon Aug 7 16:28:56 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 Aug 2006 16:28:56 -0700 Subject: [openib-general] [PATCH v2] sa_query: require SA query registration In-Reply-To: <20060807232031.GC10166@mellanox.co.il> Message-ID: <000301c6ba79$40f22ce0$e598070a@amr.corp.intel.com> >> Fixed to call ib_unregister_client() before ib_sa_unregister_client(). > >So, it seems tracking queries inside a client won't be useful now, will it? Actually, what I think may be useful are calls similar to: ib_sa_cancel_queries(struct ib_sa_client *, struct ib_device *); ib_sa_cancel_all(struct ib_sa_client *); However, until I can finish the userspace SA client support, I won't know what the best API would be. Tracking queries by every user seems redundant. - Sean From mst at mellanox.co.il Mon Aug 7 16:34:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 02:34:55 +0300 Subject: [openib-general] [patch 02/45] IB/mthca: restore missing PCI registers after reset In-Reply-To: <20060807170451.GH3691@stusta.de> References: <20060807170451.GH3691@stusta.de> Message-ID: <20060807233455.GA25326@mellanox.co.il> Quoting r. Adrian Bunk : > Thanks for this information, I've applied it. BTW, is there a git tree to see what you are cooking? -- MST From mst at mellanox.co.il Mon Aug 7 16:39:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 02:39:37 +0300 Subject: [openib-general] [patch 02/45] IB/mthca: restore missing PCI registers after reset In-Reply-To: <20060807233455.GA25326@mellanox.co.il> References: <20060807170451.GH3691@stusta.de> <20060807233455.GA25326@mellanox.co.il> Message-ID: <20060807233937.GB25326@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: [patch 02/45] IB/mthca: restore missing PCI registers after reset > > Quoting r. Adrian Bunk : > > Thanks for this information, I've applied it. > > BTW, is there a git tree to see what you are cooking? Never mind, I found linux/kernel/git/stable/linux-2.6.16.y.git. It says "owner Greg Kroah-Hartman" which is what confused me.
-- MST From sean.hefty at intel.com Mon Aug 7 16:49:29 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 Aug 2006 16:49:29 -0700 Subject: [openib-general] [PATCH] sa_query: require SA query registration In-Reply-To: <20060807231536.GB10166@mellanox.co.il> Message-ID: <000401c6ba7c$1f9928c0$e598070a@amr.corp.intel.com> >> * Move struct ib_sa_client definition internal to SA module to >> better encapsulate future extensions. > >Documentation/stable_api_nonsense.txt says: >"You think you want a stable kernel interface, but you really do not" Encapsulation is typically a good idea. The user cannot modify anything inside of struct ib_sa_client without breaking it. The only benefit we gain by exposing it is saving a memory allocation. The verb interfaces expose the structures to allow for direct calls from a ULP to a provider for speed path operations. >Why is > ipoib_sa_client = ib_sa_register_client > ret = PTR_ERR(ipoib_sa_client) >a better encapsulation than > ret = ib_sa_register_client(&ipoib_sa_client) The former completely hides the internal definition of the structure from the user. The latter does not, providing no encapsulation. >Further, why should this API be different from ib_register_client? >Isn't this confusing? It's different from ib_register_client, but more similar to ib_register_mad_agent. If anything it would make more sense to be closer to ib_register_mad_agent. For example, a callback function makes more sense passed into a register call than every query call. >Isn't it better to let user control where does the memory come from, >rather than forcing GFP_KERNEL? The sa_query module _is_ the user of the memory, but can add a gfp_mask if we think that's useful. - Sean From mst at mellanox.co.il Mon Aug 7 16:59:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 02:59:07 +0300 Subject: [openib-general] [PATCH] sa_query: require SA query registration In-Reply-To: <000401c6ba7c$1f9928c0$e598070a@amr.corp.intel.com> References: <000401c6ba7c$1f9928c0$e598070a@amr.corp.intel.com> Message-ID: <20060807235906.GA13181@mellanox.co.il> Quoting r. Sean Hefty : > >Further, why should this API be different from ib_register_client? > >Isn't this confusing? > > It's different from ib_register_client, but more similar to > ib_register_mad_agent. If anything it would make more sense to be closer to > ib_register_mad_agent. Actually, different from both - we don't get events in the sa_client. We just need a cookie to track queries to synchronise for module unloading. Maybe rename to ib_sa_query_cookie or something? > For example, a callback function makes more sense passed > into a register call than every query call. So ULPs will need multiple clients for multiple types of queries? Ugh. -- MST From dotanb at mellanox.co.il Mon Aug 7 23:11:34 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 8 Aug 2006 09:11:34 +0300 Subject: [openib-general] [libibcm] does the libibcm support multithreaded applications? In-Reply-To: <44D77A58.9020904@ichips.intel.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302A3C659@mtlexch01.mtl.com> <200608061650.52846.dotanb@mellanox.co.il> <44D77A58.9020904@ichips.intel.com> Message-ID: <200608080911.34796.dotanb@mellanox.co.il> On Monday 07 August 2006 20:37, Sean Hefty wrote: > > How should a multithreaded application need to work with the libibcm? > > I understand what the problem is, and I think you're right. If > ib_cm_get_device() returned a new ib_cm_device, you could more easily control > event processing. 
I will fix this up when I remove the dependency on libsysfs > from the libibcm. I am probably at least 2 weeks away from starting on this though. Thank you. When you'll have the new code, it will have at least one customer .... Dotan From dotanb at mellanox.co.il Mon Aug 7 23:17:03 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 8 Aug 2006 09:17:03 +0300 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <44D78052.7010600@ichips.intel.com> References: <200608061544.11205.dotanb@mellanox.co.il> <44D78052.7010600@ichips.intel.com> Message-ID: <200608080917.03248.dotanb@mellanox.co.il> On Monday 07 August 2006 21:02, Sean Hefty wrote: > Dotan Barak wrote: > > enum ib_cm_sidr_status { 0, 1, 2, 3, 4, IB_SIDR_UNSUPPORTED_VERSION }; > > > > it seems that the enumerations values were replaced with integers. > > > > when i searched for the values that were enumerated in the headre files i > > found the following defines in ib_types.h: > > > > #define IB_SIDR_SUCCESS 0 #define > > IB_SIDR_UNSUPPORTED 1 #define > > IB_SIDR_REJECT 2 #define > > IB_SIDR_NO_QP 3 #define > > IB_SIDR_REDIRECT 4 > > > > > > I think that the problem was that ib_types.h was included in a file that > > includes the cm.h and the preprocessor replaced the enumeration names with > > the integer values. > > > > who can check this issue? > > I think the solution is to remove CM definitions out of ib_types.h. What is the > reason for including ib_types.h and cm.h? ib_types looks like an internal > opensm include file. As much as i know, ib_types.h is the only header that have all of the MADs definitions, so i need to include it in several tests that sends MADs. The solution may be that one of the files (cm.h or ib_types.h) will rename those names to a different names. thanks Dotan From dotanb at mellanox.co.il Tue Aug 8 00:06:21 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 8 Aug 2006 10:06:21 +0300 Subject: [openib-general] [librdmacm] a question about the cma code Message-ID: <200608081006.22223.dotanb@mellanox.co.il> Hi. I noticed the following code in the librdmacm (in gen2_linux-20060807-1730 (REV=8840)): static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, struct rdma_conn_param *src, uint32_t qp_num, enum ibv_qp_type qp_type, uint8_t srq) { dst->qp_num = qp_num; dst->qp_type = qp_type; dst->srq = srq; dst->responder_resources = src->responder_resources; dst->initiator_depth = src->initiator_depth; dst->flow_control = src->flow_control; dst->retry_count = src->retry_count; dst->rnr_retry_count = src->rnr_retry_count; dst->valid = 1; if (src->private_data && src->private_data_len) { memcpy(dst->private_data, src->private_data, src->private_data_len); dst->private_data_len = src->private_data_len; } else src->private_data_len = 0; } What is the purpose of the following code line: src->private_data_len = 0; Thanks Dotan From tziporet at mellanox.co.il Tue Aug 8 01:32:21 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 08 Aug 2006 11:32:21 +0300 Subject: [openib-general] OFED 1.0 kernel sources In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7624@mtlexch01.mtl.com> Message-ID: <44D84C15.6010204@mellanox.co.il> James Lentini wrote: > On Mon, 7 Aug 2006, Tziporet Koren wrote: > > >> best to take the sources from the release tarball >> > > That is what I was doing. The problem is that the source file > for the ib_ping module is in the release, but not the matching > kernel sources. 
> > Due to the way we build OFED 1.0, some kernel sources were taken from the git tree, and some from svn. This can cause this problem since ib_ping was taken from svn (not par of git) and since we never used it it may use the wrong interface to the kernel sources coming from git. This is the reason we changed to have all kernel sources from git in coming OFED 1.1. Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: From ogerlitz at voltaire.com Tue Aug 8 01:37:14 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 08 Aug 2006 11:37:14 +0300 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D772DC.7010101@ichips.intel.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> <44D049B5.6000505@voltaire.com> <44D0D808.1070003@ichips.intel.com> <44D1986B.6070302@voltaire.com> <44D22218.7000005@ichips.intel.com> <44D70E2B.60205@voltaire.com> <44D772DC.7010101@ichips.intel.com> Message-ID: <44D84D3A.9050502@voltaire.com> Sean Hefty wrote: > Or Gerlitz wrote: >> 0) ones that are of no interest to the CMA nor to the ULP above it but >> rather only to the local CM (are there any?) >> >> 1) ones that *must* be handled internally by the CMA (are there any?) >> >> 2) ones that *can* be handled internally by the CMA (eg stale-conn) >> >> 3) ones that can be translated to errno value (eg invalid-sid (8) to >> econnrefused) and set the cma event status field to the errno value >> >> 4) ones that do not fall into none of 0-3 above, for those set the >> status field to ENOPROTO and once a CMA app that gets ENOPROTO and has >> an issue with it would show up, we will see what can we do. >> >> Please let me know what you think > > I think we'd have to take each reject code and see if it makes sense to > change what's done with it. I'd rather expose the underlying reject > reason to the user, than convert it to some other status code that may > mask the real reason for the reject. Conceptually, do we agree that it would be better not to expose IB reject code to the CMA consumers? that is in the spirit of the CMA being a framework for doing connection management in RDMA transport independent fashion, etc. The CMA does return **errno** values on the status field for some events (eg with UNREACHABLE event as of REQ/REP timeout, as in the case that started this thread...), so we need to decide a clearer approach here. Or. From ogerlitz at voltaire.com Tue Aug 8 01:54:20 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 08 Aug 2006 11:54:20 +0300 Subject: [openib-general] [PATCH] RDMA / IB CM: support immediately sending replies to received messages In-Reply-To: <44D776AD.3080606@ichips.intel.com> References: <000101c6b821$6f783be0$8698070a@amr.corp.intel.com> <44D71B6B.3000007@voltaire.com> <44D776AD.3080606@ichips.intel.com> Message-ID: <44D8513C.8000801@voltaire.com> Sean Hefty wrote: > Or Gerlitz wrote: > Note that since the QP is in the RTS state, the IB COMM_EST event will > _not_ be generated. This is the key difference between these > approaches. The user would need to call rdma_establish when polling a > RX from the CQ. Failure to do so would result in a connection failure > if the RTU were not received. Calling rdma_establish would result in an > RDMA CM ESTABLISH event. OK, thanks for further clarifying it. >> OK, fair enough. 
Personally i preferred the patch set that implemented >> everything within the ib stack and just had this little requirement on >> the ULP to hold on with doing TX before getting the ESTABLISHED event, >> but this one makes sense as well. > My preference is to provide whichever solution makes it easier on the > majority of the ULPs (ignoring spec changes at the moment). I don't > know which fix ULPs will find easier to work with. Maybe we can come up > with a compromise where we expose rdma_establish, but transitioning to > RTS before the REP is sent requires that the ULP do the transition...? Indeed, let's see if we can get some input from the ULP people working on passive side / targets (eg NFS/Lustre/iSER/SDP). If indeed there's a spec violation here, I think it makes much sense to have the ULP and not the CMA violate the spec, so as you suggest, this can be done by CMA consumers doing the QP transitions themselves. Or. From ogerlitz at voltaire.com Tue Aug 8 03:54:30 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 8 Aug 2006 13:54:30 +0300 Subject: [openib-general] re the core network changes to support network event notification Message-ID: Steve, Tracking in the netdev archive the threads related to the netevents patch, I noted that you were attempting to add support for two "netlink update events" (RTM_NEIGHUPD RTM_ROUTEUPD) but eventually the patch that was merged by Dave (round 5) did not include these events. Did you have any special reason to drop them? Is it correct that the merged code implements a framework for adding them later? Can you spare a few words about the differences, from a **kernel** consumer's point of view, between being a direct netevent consumer and a netlink consumer? I guess IPoIB code would be able to take advantage of the RTM_NEIGHUPD netlink event or a netevents based filter to remove the data path memcmp on the neigh HA address. thanks, Or. List: linux-netdev Subject: Re: [PATCH Round 4 2/3] Core network changes to support network event notification From: David Miller Date: 2006-07-26 20:56:42 From: Steve Wise Date: Wed, 26 Jul 2006 11:15:43 -0500 > Dave, what do you think about removing the user-space stuff for the > first round of integration? IE: Just add netevents and kernel hooks to > generate them. Sure. From jlentini at netapp.com Tue Aug 8 06:26:17 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 8 Aug 2006 09:26:17 -0400 (EDT) Subject: [openib-general] OFED 1.0 kernel sources In-Reply-To: <44D84C15.6010204@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7624@mtlexch01.mtl.com> <44D84C15.6010204@mellanox.co.il> Message-ID: On Tue, 8 Aug 2006, Tziporet Koren wrote: > This is the reason we changed to have all kernel sources from git in > coming OFED 1.1. Are the OFED 1.1 sources available somewhere?
From tziporet at mellanox.co.il Tue Aug 8 06:40:15 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 08 Aug 2006 16:40:15 +0300 Subject: [openib-general] OFED 1.0 kernel sources In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7624@mtlexch01.mtl.com> <44D84C15.6010204@mellanox.co.il> Message-ID: <44D8943F.7000503@mellanox.co.il> James Lentini wrote: > > Are the OFED 1.1 sources available somewhere? > User space https://openib.org/svn/gen2/branches/1.1/src/userspace Kernel Git: git://www.mellanox.co.il/~git/infiniband ofed_1_1 Tziporet From swise at opengridcomputing.com Tue Aug 8 06:56:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 Aug 2006 08:56:34 -0500 Subject: [openib-general] re the core network changes to support network event notification In-Reply-To: References: Message-ID: <1155045394.14112.14.camel@stevo-desktop> On Tue, 2006-08-08 at 13:54 +0300, Or Gerlitz wrote: > Steve, > > Tracking in the netdev archive the threads related to the netevents > patch, i have noted that you were attempting to add support for two > "netlink update events" (RTM_NEIGHUPD RTM_ROUTEUPD) but eventually the > patch that was merged by Dave (round 5) did not include these events. > > Did you had any spcial reason to drop them? is it correct that the > merged code does implement a framework for adding them later? > I dropped them at the request of the reviewers because they weren't correct and they aren't needed for kernel netevent subscribers. I also dropped ROUTE add/delete events because they weren't correct either. And I don't think they are needed for RDMA drivers. The framework is there to add all this later if needed... > Can you spare few words about the differences from a **kernel** consumer > point of view between being a direct netevent vs netlink consumer? > netlink is for user space consumers. netlink is a socket family and allows creating a socket with one end hooked to various kernel subsystems and the other end to a user process. Then the user and kernel can do IPC and exchange messages. rtnetlink is a routing netlink socket that is used to pass routing events up to user processes (like routing daemons) as well as for these same daemons to send commands to the kernel to add/del routes. If you're a unix guy rtnetlink implements BSD AF_ROUTE sockets. netlink/rtnetlink was never intended to pass messages between kernel subsystems, although it could be done by using kernel sockets, but that is kind of overkill in that the information is marshalled into a message, and then unmarshalled by the other end. For user<->kernel thats fine. But it is a waste for intra-kernel event notifications. Netevents use the linux notifier block mechanism to allow consumers to register a direct function callback to receive netevents. For neighbour update events, the callback is passed the neigh ptr that was updated. For ICMP redirects, the callback is passed the old and new dst_entry ptrs. For PMTU changes, the dst_entry ptr is passed. > I guess IPoIB code would be able to take advantage of RTM_NEIGHUPD netlink event > or a netevents based filter to remove the data path memcmp on the neigh HA address. > I haven't looked at the IPoIB code. What is it doing exactly (I am somewhat familiar with the IPoIB specification and how it works)? Steve. 
From ogerlitz at voltaire.com Tue Aug 8 07:15:02 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 08 Aug 2006 17:15:02 +0300 Subject: [openib-general] re the core network changes to support network event notification In-Reply-To: <1155045394.14112.14.camel@stevo-desktop> References: <1155045394.14112.14.camel@stevo-desktop> Message-ID: <44D89C66.7060307@voltaire.com> Steve Wise wrote: > On Tue, 2006-08-08 at 13:54 +0300, Or Gerlitz wrote: Steve, OK, thanks for all the clarifications and information re the rtnetlink and why its an overkill to use it (vs netevents) within the kernel. >> I guess IPoIB code would be able to take advantage of RTM_NEIGHUPD netlink event >> or a netevents based filter to remove the data path memcmp on the neigh HA address. > I haven't looked at the IPoIB code. What is it doing exactly (I am > somewhat familiar with the IPoIB specification and how it works)? http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8a7f752125a930a83f4d8dfe37fa5a081ab19d31 Take a look on this "IB/ipoib: Fix packet loss after hardware address update" patch, it overcomes a situation where the neighbor structure is **updated** (for example the HA address is changed, as of gratuitous ARP) but as the neighbor destructor is not called, IPoIB is not aware to that and hence does not update the AH (IB Address-Handle) associated with the HA. So the current solution was to memcmp the neigh info with what IPoIB knows about this nieghbor on the data path. With your patch, i guess this memcmp can be eliminated. Or. Or. From swise at opengridcomputing.com Tue Aug 8 07:31:28 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 Aug 2006 09:31:28 -0500 Subject: [openib-general] re the core network changes to support network event notification In-Reply-To: <44D89C66.7060307@voltaire.com> References: <1155045394.14112.14.camel@stevo-desktop> <44D89C66.7060307@voltaire.com> Message-ID: <1155047488.14112.38.camel@stevo-desktop> On Tue, 2006-08-08 at 17:15 +0300, Or Gerlitz wrote: > Steve Wise wrote: > > On Tue, 2006-08-08 at 13:54 +0300, Or Gerlitz wrote: > > Steve, > > OK, thanks for all the clarifications and information re the rtnetlink > and why its an overkill to use it (vs netevents) within the kernel. > > >> I guess IPoIB code would be able to take advantage of RTM_NEIGHUPD netlink event > >> or a netevents based filter to remove the data path memcmp on the neigh HA address. > > > I haven't looked at the IPoIB code. What is it doing exactly (I am > > somewhat familiar with the IPoIB specification and how it works)? > > http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8a7f752125a930a83f4d8dfe37fa5a081ab19d31 > > Take a look on this "IB/ipoib: Fix packet loss after hardware address > update" patch, it overcomes a situation where the neighbor structure is > **updated** (for example the HA address is changed, as of gratuitous > ARP) but as the neighbor destructor is not called, IPoIB is not aware to > that and hence does not update the AH (IB Address-Handle) associated > with the HA. > > So the current solution was to memcmp the neigh info with what IPoIB > knows about this nieghbor on the data path. > > With your patch, i guess this memcmp can be eliminated. > I think so. With netevents, I think you'll update your AH entries when the HA changes as opposed to doing the memcmp() on every xmit. 
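To make the mechanism discussed above concrete, here is a minimal, illustrative sketch (not from any posted patch) of how a kernel ULP might register a netevent callback and refresh its cached per-neighbour state on NETEVENT_NEIGH_UPDATE instead of comparing the hardware address on every transmit. The refresh_cached_ah() helper is hypothetical and stands in for whatever the ULP keeps per neighbour; register_netevent_notifier() and NETEVENT_NEIGH_UPDATE are from the merged netevent patch. The netevent chain is an atomic notifier chain, so the callback must not sleep; heavier work would have to be deferred, e.g. to a workqueue.

#include <linux/module.h>
#include <linux/notifier.h>
#include <net/netevent.h>
#include <net/neighbour.h>

/* Hypothetical helper: rebuild whatever the ULP caches for this
 * neighbour (e.g. an IB address handle derived from neigh->ha). */
static void refresh_cached_ah(struct neighbour *neigh)
{
	/* A real ULP would look up its private per-neighbour state here
	 * and update it from neigh->ha. */
}

static int ulp_netevent_cb(struct notifier_block *self,
			   unsigned long event, void *ctx)
{
	switch (event) {
	case NETEVENT_NEIGH_UPDATE:
		/* ctx is the struct neighbour that was updated */
		refresh_cached_ah((struct neighbour *) ctx);
		break;
	default:
		break;
	}
	return NOTIFY_DONE;
}

static struct notifier_block ulp_netevent_nb = {
	.notifier_call = ulp_netevent_cb,
};

static int __init ulp_netevent_init(void)
{
	return register_netevent_notifier(&ulp_netevent_nb);
}

static void __exit ulp_netevent_exit(void)
{
	unregister_netevent_notifier(&ulp_netevent_nb);
}

module_init(ulp_netevent_init);
module_exit(ulp_netevent_exit);
MODULE_LICENSE("GPL");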
From tziporet at mellanox.co.il Tue Aug 8 07:48:13 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 8 Aug 2006 17:48:13 +0300 Subject: [openib-general] OFED 1.1-rc1 is available Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA763D@mtlexch01.mtl.com> Hi, In two week delay we publish OFED 1.1-RC1 on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ File: OFED-1.1-rc1.tgz Build_id: OFED-1.1-rc1   openib-1.1 (REV=8849) # User space https://openib.org/svn/gen2/branches/1.1/src/userspace Git: git://www.mellanox.co.il/~git/infiniband ofed_1_1 ref: refs/heads/ofed_1_1 commit df6aabce49695368fd004e6505102a1519b266a4   # MPI mpi_osu-0.9.7-mlx2.2.0.tgz openmpi-1.1-1.src.rpm mpitests-2.0-0.src.rpm OS support: =========== Novell: - SLES 9.0 SP3* - SLES10 (official release)* Redhat: - Redhat EL4 up3 - Redhat EL4 up4 (was not tested yet) kernel.org: - Kernel 2.6.17* * Changed from 1.0 release Note: Redhat EL4 up2, Fedora C4 and SuSE Pro 10 were dropped from the list. We keep the backport patches for these OSes and make sure OFED compile and loaded properly but will not do full QA cycle. Systems: ======== * x86_64     * x86     * ia64     * ppc64 Main changes from OFED-1.0: =========================== o Bug fixes o Enabled building 32-bit libraries on x86_64 and ppc64.   Note: sysfsutils and sysfsutils-devel 32-bit required. o Kernel code based on 2.6.18 o Kernel hot-plug support in uverbs - removing module wait till all user applications release their HCA resources. o Package sources (user space and kernel modules) are places under: /src/ Original kernel sources are not replaced o RDS was removed from the OFED package o Set options in CMA & uCMA o OSM new features: - Partition Manager (Pkey) - Pre-computed routing load from file - Primitive QoS - As technology preview o SDP: - Improved latency (13 usec with netperf tcprr) - Implemented Naggle algorithm - Memory leaks fixes - Error handling added o MPI: - OSU - MVAPICH: Message coalescing to improve message rate - Open MPI 1.1-1: see changes: http://svn.open-mpi.org/svn/ompi/trunk/NEWS - MPI tests: Replace to the new test versions from LLNL, Intel, OSU o SRP: - Stability o iSER: - Stability - Testing more platforms (e.g. ppc64 and ia64) - Performance improvements o Management: - Add saquery tool - Enhancement to ibnetdiscover tool with grouping function - New ibutils package: o Port error counter check o Port performance counters dump o Link width and Link Speed check by flag o uDAPL: - Scalability features needed for Intel MPI - Code was updated from trunk Limitations and known issues: ============================= 1. ipath driver compilation fails on all systems 2. iSER support in install script for SLES 10 is missing 3. SDP: - 32 bit systems might run out of low memory when opening hundreds of sockets. - For Mellanox Sinai HCAs one must use latest FW version (1.1.000). Missing features that should be completed for RC2: ================================================== 1. SRP: - Complete testing with DM (Device Mapper) - for high availability - New daemon 2. IPoIB: High availability support using a daemon in user level 3. SDP: support sending/receiving out of band data 4. Add Madeye utility 5. 
Fatal error support in mthca Please report any issues in bugzilla http://openib.org/bugzilla/ Tziporet & Vlad From Brian.Cain at ge.com Tue Aug 8 07:48:18 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Tue, 8 Aug 2006 10:48:18 -0400 Subject: [openib-general] HCA not recognized by OFED Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB0339B1120@CINMLVEM11.e2k.ad.ge.com> I installed OFED 1.0.1 on an Intel Alcolu-based system and wasn't able to get any of the tools to recognize the HCA. It's a Mellanox (or maybe Intel?) "InfiniHost III Lx" (PSID INT_0010000001). lspci indicates that its PCI ID is 15b3:538d. Using Mellanox's firmware tool, I was able to detect the card, dump its firmware image and upgrade to a newer one. The card is definitely present, but when I do a ibv_devices, I get "Fatal: no infiniband class devices found." `ls /sys/class/infiniband` returns no results. Grepping for "ib_" in /var/log/* and `dmesg` returned nothing, too. Running a FC based distro, kernel 2.6.15-2.4 (SMP, x86_64). What other sources of info can I use to debug this problem? -- -Brian From sweitzen at cisco.com Tue Aug 8 09:05:11 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 8 Aug 2006 09:05:11 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available Message-ID: Bryan, can you please add a "1.1rc1" version to bugzilla? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of > Tziporet Koren > Sent: Tuesday, August 08, 2006 7:48 AM > To: openfabrics-ewg at openib.org > Cc: openib > Subject: [openfabrics-ewg] OFED 1.1-rc1 is available > > Hi, > > In two week delay we publish OFED 1.1-RC1 on > https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > File: OFED-1.1-rc1.tgz > > Build_id: > OFED-1.1-rc1 >   > openib-1.1 (REV=8849) > # User space > https://openib.org/svn/gen2/branches/1.1/src/userspace > Git: > git://www.mellanox.co.il/~git/infiniband ofed_1_1 > ref: refs/heads/ofed_1_1 > commit df6aabce49695368fd004e6505102a1519b266a4 >   > # MPI > mpi_osu-0.9.7-mlx2.2.0.tgz > openmpi-1.1-1.src.rpm > mpitests-2.0-0.src.rpm > > > OS support: > =========== > Novell: > - SLES 9.0 SP3* > - SLES10 (official release)* > Redhat: > - Redhat EL4 up3 > - Redhat EL4 up4 (was not tested yet) > kernel.org: > - Kernel 2.6.17* > * Changed from 1.0 release > > Note: Redhat EL4 up2, Fedora C4 and SuSE Pro 10 were dropped > from the list. > We keep the backport patches for these OSes and make sure > OFED compile and > loaded properly but will not do full QA cycle. > > Systems: > ======== > * x86_64 >     * x86 >     * ia64 >     * ppc64 > > Main changes from OFED-1.0: > =========================== > o Bug fixes > o Enabled building 32-bit libraries on x86_64 and ppc64. >   Note: sysfsutils and sysfsutils-devel 32-bit required. > o Kernel code based on 2.6.18 > o Kernel hot-plug support in uverbs - removing module wait > till all user > applications release their HCA resources. 
> o Package sources (user space and kernel modules) are places > under: /src/ > Original kernel sources are not replaced > o RDS was removed from the OFED package > o Set options in CMA & uCMA > o OSM new features: > - Partition Manager (Pkey) > - Pre-computed routing load from file > - Primitive QoS - As technology preview > o SDP: > - Improved latency (13 usec with netperf tcprr) > - Implemented Naggle algorithm > - Memory leaks fixes > - Error handling added > o MPI: > - OSU - MVAPICH: Message coalescing to improve message rate > - Open MPI 1.1-1: see changes: > http://svn.open-mpi.org/svn/ompi/trunk/NEWS > - MPI tests: Replace to the new test versions from > LLNL, Intel, OSU > o SRP: > - Stability > o iSER: > - Stability > - Testing more platforms (e.g. ppc64 and ia64) > - Performance improvements > o Management: > - Add saquery tool > - Enhancement to ibnetdiscover tool with grouping function > - New ibutils package: > o Port error counter check > o Port performance counters dump > o Link width and Link Speed check by flag > o uDAPL: > - Scalability features needed for Intel MPI > - Code was updated from trunk > > > Limitations and known issues: > ============================= > 1. ipath driver compilation fails on all systems > 2. iSER support in install script for SLES 10 is missing > 3. SDP: > - 32 bit systems might run out of low memory when > opening hundreds of sockets. > - For Mellanox Sinai HCAs one must use latest FW > version (1.1.000). > > Missing features that should be completed for RC2: > ================================================== > 1. SRP: > - Complete testing with DM (Device Mapper) - for high > availability > - New daemon > 2. IPoIB: High availability support using a daemon in user level > 3. SDP: support sending/receiving out of band data > 4. Add Madeye utility > 5. Fatal error support in mthca > > > Please report any issues in bugzilla http://openib.org/bugzilla/ > > Tziporet & Vlad > > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > From mshefty at ichips.intel.com Tue Aug 8 09:08:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Aug 2006 09:08:17 -0700 Subject: [openib-general] [librdmacm] a question about the cma code In-Reply-To: <200608081006.22223.dotanb@mellanox.co.il> References: <200608081006.22223.dotanb@mellanox.co.il> Message-ID: <44D8B6F1.9090501@ichips.intel.com> Dotan Barak wrote: > if (src->private_data && src->private_data_len) { > memcpy(dst->private_data, src->private_data, > src->private_data_len); > dst->private_data_len = src->private_data_len; > } else > src->private_data_len = 0; > } > > What is the purpose of the following code line: > src->private_data_len = 0; That's a bug. I just committed a fix to set dst->private_data_len = 0 instead. Thanks, Sean From robert.j.woodruff at intel.com Tue Aug 8 09:11:09 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 8 Aug 2006 09:11:09 -0700 Subject: [openib-general] HCA not recognized by OFED Message-ID: Brian wrote, >What other sources of info can I use to debug this problem? >-- >-Brian Are you sure that the drivers are loaded ? 
lsmod should show something like this, [root at rkl-13 linpack]# /sbin/lsmod | grep mthca ib_mthca 139184 0 ib_mad 43176 5 ib_local_sa,ib_mthca,ib_umad,ib_sa,ib_cm ib_core 59520 14 ib_rds,ib_srp,ib_sdp,rdma_cm,ib_local_sa,ib_ipath,ib_mthca,ib_ipoib,ib_u verbs,ib_umad,ib_ucm,ib_sa,ib_cm,ib_mad From Brian.Cain at ge.com Tue Aug 8 09:16:35 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Tue, 8 Aug 2006 12:16:35 -0400 Subject: [openib-general] HCA not recognized by OFED In-Reply-To: Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB0339B128E@CINMLVEM11.e2k.ad.ge.com> > -----Original Message----- > From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] > Sent: Tuesday, August 08, 2006 11:11 AM > To: Cain, Brian (GE Healthcare); openib-general at openib.org > Subject: RE: [openib-general] HCA not recognized by OFED > > Brian wrote, > > >What other sources of info can I use to debug this problem? > > >-- > >-Brian > > Are you sure that the drivers are loaded ? > lsmod should show something like this, > > [root at rkl-13 linpack]# /sbin/lsmod | grep mthca > ib_mthca 139184 0 > ib_mad 43176 5 > ib_local_sa,ib_mthca,ib_umad,ib_sa,ib_cm > ib_core 59520 14 > ib_rds,ib_srp,ib_sdp,rdma_cm,ib_local_sa,ib_ipath,ib_mthca,ib_ > ipoib,ib_u > verbs,ib_umad,ib_ucm,ib_sa,ib_cm,ib_mad Yes, they're loaded. lsmod indicates something very similar to the above. In my earlier message, I wrote the PCI ID wrong, it's not "15b3:538d", it's "15b3:5e8d". I saw references to 5e8c and 5e8d sprinkled throughout the mthca code, but there were far more 5e8c's than 5e8d's. `modinfo ib_mthca | grep -i 15b3` gives the following: alias: pci:v000015B3d00005A44sv*sd*bc*sc*i* alias: pci:v000015B3d00006278sv*sd*bc*sc*i* alias: pci:v000015B3d00006282sv*sd*bc*sc*i* alias: pci:v000015B3d00006274sv*sd*bc*sc*i* alias: pci:v000015B3d00005E8Csv*sd*bc*sc*i* ...does that mean that 5e8d is not supported? -Brian From mshefty at ichips.intel.com Tue Aug 8 09:34:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Aug 2006 09:34:31 -0700 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D84D3A.9050502@voltaire.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> <44D049B5.6000505@voltaire.com> <44D0D808.1070003@ichips.intel.com> <44D1986B.6070302@voltaire.com> <44D22218.7000005@ichips.intel.com> <44D70E2B.60205@voltaire.com> <44D772DC.7010101@ichips.intel.com> <44D84D3A.9050502@voltaire.com> Message-ID: <44D8BD17.3040804@ichips.intel.com> Or Gerlitz wrote: > Conceptually, do we agree that it would be better not to expose IB > reject code to the CMA consumers? that is in the spirit of the CMA being > a framework for doing connection management in RDMA transport > independent fashion, etc. My concern is that I do not want to mask the real reason for the reject in a way that prevents the user from understanding what's needed to establish the connection. A different way to view this is that the event provides the generic information, and the status detailed. > The CMA does return **errno** values on the status field for some events > (eg with UNREACHABLE event as of REQ/REP timeout, as in the case that > started this thread...), so we need to decide a clearer approach here. We can provide two status values with an event, one that maps to an errno, and another that maps to a transport specific reason. 
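Purely as an illustration of the "two status values" idea (the type and field names below are invented, not from any posted patch), an event could carry both forms side by side: a generic errno-style status for transport-independent consumers, plus the raw transport-specific reason for those that want the detail.

#include <linux/types.h>

/* Hypothetical shape only; not an actual rdma_cm definition. */
struct rdma_event_status {
	int status;                  /* generic errno-style value, e.g. -ECONNREFUSED */
	union {
		u16 ib_rej_reason;   /* raw IB CM REJ code, e.g. invalid service ID */
		u32 iw_reason;       /* iWARP-specific detail */
	} detail;
};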
- Sean From pauln at psc.edu Tue Aug 8 09:30:31 2006 From: pauln at psc.edu (pauln) Date: Tue, 08 Aug 2006 12:30:31 -0400 Subject: [openib-general] SRP IO Size Message-ID: <44D8BC27.7010504@psc.edu> Hi, I was running some performance tests to an srp target and noticed that the largest io sent over srp was 128k. When using a direct-attached scsi device I see io's up to 4m. I'm running the ibgd 1.8.2 stack. Can someone tell me if this issue has been addressed in a more recent version? Thanks, Paul From mst at mellanox.co.il Tue Aug 8 09:41:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 19:41:03 +0300 Subject: [openib-general] HCA not recognized by OFED In-Reply-To: <2376B63A5AF8564F8A2A2D76BC6DB0339B128E@CINMLVEM11.e2k.ad.ge.com> References: <2376B63A5AF8564F8A2A2D76BC6DB0339B128E@CINMLVEM11.e2k.ad.ge.com> Message-ID: <20060808164103.GA30817@mellanox.co.il> Quoting r. Cain, Brian (GE Healthcare) : > In my earlier message, I wrote the PCI ID wrong, it's not "15b3:538d", > it's "15b3:5e8d". I saw references to 5e8c and 5e8d sprinkled > throughout the mthca code, but there were far more 5e8c's than 5e8d's. > `modinfo ib_mthca | grep -i 15b3` gives the following: > alias: pci:v000015B3d00005A44sv*sd*bc*sc*i* > alias: pci:v000015B3d00006278sv*sd*bc*sc*i* > alias: pci:v000015B3d00006282sv*sd*bc*sc*i* > alias: pci:v000015B3d00006274sv*sd*bc*sc*i* > alias: pci:v000015B3d00005E8Csv*sd*bc*sc*i* > > ...does that mean that 5e8d is not supported? > > -Brian A modern system should show: 5e8d MT25204 [InfiniHost III Lx HCA Flash Recovery] so either the flash is corrupted, or you set a jumper to disable flash. -- MST From Brian.Cain at ge.com Tue Aug 8 09:48:09 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Tue, 8 Aug 2006 12:48:09 -0400 Subject: [openib-general] HCA not recognized by OFED In-Reply-To: <20060808164103.GA30817@mellanox.co.il> Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB0339B12F6@CINMLVEM11.e2k.ad.ge.com> > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > Sent: Tuesday, August 08, 2006 11:41 AM > To: Cain, Brian (GE Healthcare) > Cc: openib-general at openib.org > Subject: Re: HCA not recognized by OFED > > Quoting r. Cain, Brian (GE Healthcare) : > > In my earlier message, I wrote the PCI ID wrong, it's not > "15b3:538d", > > it's "15b3:5e8d". I saw references to 5e8c and 5e8d sprinkled > > throughout the mthca code, but there were far more 5e8c's > than 5e8d's. > > `modinfo ib_mthca | grep -i 15b3` gives the following: > > alias: pci:v000015B3d00005A44sv*sd*bc*sc*i* > > alias: pci:v000015B3d00006278sv*sd*bc*sc*i* > > alias: pci:v000015B3d00006282sv*sd*bc*sc*i* > > alias: pci:v000015B3d00006274sv*sd*bc*sc*i* > > alias: pci:v000015B3d00005E8Csv*sd*bc*sc*i* > > > > ...does that mean that 5e8d is not supported? > > > > -Brian > > A modern system should show: > 5e8d MT25204 [InfiniHost III Lx HCA Flash Recovery] > > so either the flash is corrupted, or you set a jumper > to disable flash. I suppose I snipped a little too much when I posted the output of lspci. It does look just as you indicate: "[InfiniHost III Lx HCA Flash Recovery]". 
-Brian From mshefty at ichips.intel.com Tue Aug 8 09:59:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Aug 2006 09:59:44 -0700 Subject: [openib-general] [PATCH] sa_query: require SA query registration In-Reply-To: <20060807235906.GA13181@mellanox.co.il> References: <000401c6ba7c$1f9928c0$e598070a@amr.corp.intel.com> <20060807235906.GA13181@mellanox.co.il> Message-ID: <44D8C300.4010105@ichips.intel.com> Michael S. Tsirkin wrote: >>For example, a callback function makes more sense passed >>into a register call than every query call. > > So ULPs will need multiple clients for multiple types of queries? Ugh. I was thinking more of having the user pass in one or more callbacks into register. If you look at the patch that adds ib_sa_send_mad(), there's very little difference between the implementation of any of the queries. The differences result from the query callback prototype; otherwise their code could be exactly the same. I think this can come later, but is a worthwhile change. - Sean From mst at mellanox.co.il Tue Aug 8 10:08:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 20:08:02 +0300 Subject: [openib-general] re the core network changes to support network event notification In-Reply-To: <1155047488.14112.38.camel@stevo-desktop> References: <1155047488.14112.38.camel@stevo-desktop> Message-ID: <20060808170802.GC30817@mellanox.co.il> Quoting r. Steve Wise : > > So the current solution was to memcmp the neigh info with what IPoIB > > knows about this nieghbor on the data path. > > > > With your patch, i guess this memcmp can be eliminated. > > > > I think so. With netevents, I think you'll update your AH entries when > the HA changes as opposed to doing the memcmp() on every xmit. > Need to be careful not to deadlock however. Is it OK for network device to use netevents? What locks are held when netevent is generated? -- MST From mst at mellanox.co.il Tue Aug 8 10:05:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 20:05:28 +0300 Subject: [openib-general] [librdmacm] a question about the cma code In-Reply-To: <44D8B6F1.9090501@ichips.intel.com> References: <44D8B6F1.9090501@ichips.intel.com> Message-ID: <20060808170528.GB30817@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [librdmacm] a question about the cma code > > Dotan Barak wrote: > > if (src->private_data && src->private_data_len) { > > memcpy(dst->private_data, src->private_data, > > src->private_data_len); > > dst->private_data_len = src->private_data_len; > > } else > > src->private_data_len = 0; > > } > > > > What is the purpose of the following code line: > > src->private_data_len = 0; > > That's a bug. I just committed a fix to set dst->private_data_len = 0 instead. Isn't the whole command memset to 0 before hand? Maybe just remove the line? -- MST From mshefty at ichips.intel.com Tue Aug 8 10:14:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Aug 2006 10:14:36 -0700 Subject: [openib-general] [librdmacm] a question about the cma code In-Reply-To: <20060808170528.GB30817@mellanox.co.il> References: <44D8B6F1.9090501@ichips.intel.com> <20060808170528.GB30817@mellanox.co.il> Message-ID: <44D8C67C.5070803@ichips.intel.com> Michael S. Tsirkin wrote: > Isn't the whole command memset to 0 before hand? > Maybe just remove the line? You're right. The memset is hidden in the CMA_CREATE_MSG_CMD* macro, so I overlooked that it was already initialized to 0. 
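Putting the two observations together, the fix reduces to deleting the bogus assignment altogether; roughly, the librdmacm function would end up as below (a sketch only, since the actually committed change is not shown in this thread).

static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst,
					 struct rdma_conn_param *src,
					 uint32_t qp_num,
					 enum ibv_qp_type qp_type, uint8_t srq)
{
	dst->qp_num = qp_num;
	dst->qp_type = qp_type;
	dst->srq = srq;
	dst->responder_resources = src->responder_resources;
	dst->initiator_depth = src->initiator_depth;
	dst->flow_control = src->flow_control;
	dst->retry_count = src->retry_count;
	dst->rnr_retry_count = src->rnr_retry_count;
	dst->valid = 1;

	/* dst was zeroed by the CMA_CREATE_MSG_CMD* macro, so
	 * private_data_len is already 0; no else branch, and no
	 * write into the caller's src structure. */
	if (src->private_data && src->private_data_len) {
		memcpy(dst->private_data, src->private_data,
		       src->private_data_len);
		dst->private_data_len = src->private_data_len;
	}
}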
- Sean From sean.hefty at intel.com Tue Aug 8 10:22:24 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 8 Aug 2006 10:22:24 -0700 Subject: [openib-general] [PATCH] osm ib_types: remove CM defines In-Reply-To: <200608080917.03248.dotanb@mellanox.co.il> Message-ID: <000001c6bb0f$36b912c0$e598070a@amr.corp.intel.com> Here's an completely untested patch to remove the CM definitions from ib_types.h. Can you see if this works for you? (I'm attaching the patch as well, since ib_types has weird formatting.) Signed-off-by: Sean Hefty --- Index: ib_types.h =================================================================== --- ib_types.h (revision 8215) +++ ib_types.h (working copy) @@ -7422,128 +7422,6 @@ typedef struct _ib_ioc_info } PACK_SUFFIX ib_ioc_info_t; #include -/* - * Defines known Communication management class versions - */ -#define IB_MCLASS_CM_VER_2 2 -#define IB_MCLASS_CM_VER_1 1 - -/* - * Defines the size of user available data in communication management MADs - */ -#define IB_REQ_PDATA_SIZE_VER2 92 -#define IB_MRA_PDATA_SIZE_VER2 222 -#define IB_REJ_PDATA_SIZE_VER2 148 -#define IB_REP_PDATA_SIZE_VER2 196 -#define IB_RTU_PDATA_SIZE_VER2 224 -#define IB_LAP_PDATA_SIZE_VER2 168 -#define IB_APR_PDATA_SIZE_VER2 148 -#define IB_DREQ_PDATA_SIZE_VER2 220 -#define IB_DREP_PDATA_SIZE_VER2 224 -#define IB_SIDR_REQ_PDATA_SIZE_VER2 216 -#define IB_SIDR_REP_PDATA_SIZE_VER2 136 - -#define IB_REQ_PDATA_SIZE_VER1 92 -#define IB_MRA_PDATA_SIZE_VER1 222 -#define IB_REJ_PDATA_SIZE_VER1 148 -#define IB_REP_PDATA_SIZE_VER1 204 -#define IB_RTU_PDATA_SIZE_VER1 224 -#define IB_LAP_PDATA_SIZE_VER1 168 -#define IB_APR_PDATA_SIZE_VER1 151 -#define IB_DREQ_PDATA_SIZE_VER1 220 -#define IB_DREP_PDATA_SIZE_VER1 224 -#define IB_SIDR_REQ_PDATA_SIZE_VER1 216 -#define IB_SIDR_REP_PDATA_SIZE_VER1 140 - -#define IB_ARI_SIZE 72 // redefine -#define IB_APR_INFO_SIZE 72 - - -/****d* Access Layer/ib_rej_status_t -* NAME -* ib_rej_status_t -* -* DESCRIPTION -* Rejection reasons. 
-* -* SYNOPSIS -*/ -typedef ib_net16_t ib_rej_status_t; -/* -* SEE ALSO -* ib_cm_rej, ib_cm_rej_rec_t -* -* SOURCE - */ -#define IB_REJ_INSUF_QP CL_HTON16(1) -#define IB_REJ_INSUF_EEC CL_HTON16(2) -#define IB_REJ_INSUF_RESOURCES CL_HTON16(3) -#define IB_REJ_TIMEOUT CL_HTON16(4) -#define IB_REJ_UNSUPPORTED CL_HTON16(5) -#define IB_REJ_INVALID_COMM_ID CL_HTON16(6) -#define IB_REJ_INVALID_COMM_INSTANCE CL_HTON16(7) -#define IB_REJ_INVALID_SID CL_HTON16(8) -#define IB_REJ_INVALID_XPORT CL_HTON16(9) -#define IB_REJ_STALE_CONN CL_HTON16(10) -#define IB_REJ_RDC_NOT_EXIST CL_HTON16(11) -#define IB_REJ_INVALID_GID CL_HTON16(12) -#define IB_REJ_INVALID_LID CL_HTON16(13) -#define IB_REJ_INVALID_SL CL_HTON16(14) -#define IB_REJ_INVALID_TRAFFIC_CLASS CL_HTON16(15) -#define IB_REJ_INVALID_HOP_LIMIT CL_HTON16(16) -#define IB_REJ_INVALID_PKT_RATE CL_HTON16(17) -#define IB_REJ_INVALID_ALT_GID CL_HTON16(18) -#define IB_REJ_INVALID_ALT_LID CL_HTON16(19) -#define IB_REJ_INVALID_ALT_SL CL_HTON16(20) -#define IB_REJ_INVALID_ALT_TRAFFIC_CLASS CL_HTON16(21) -#define IB_REJ_INVALID_ALT_HOP_LIMIT CL_HTON16(22) -#define IB_REJ_INVALID_ALT_PKT_RATE CL_HTON16(23) -#define IB_REJ_PORT_REDIRECT CL_HTON16(24) -#define IB_REJ_INVALID_MTU CL_HTON16(26) -#define IB_REJ_INSUFFICIENT_RESP_RES CL_HTON16(27) -#define IB_REJ_USER_DEFINED CL_HTON16(28) -#define IB_REJ_INVALID_RNR_RETRY CL_HTON16(29) -#define IB_REJ_DUPLICATE_LOCAL_COMM_ID CL_HTON16(30) -#define IB_REJ_INVALID_CLASS_VER CL_HTON16(31) -#define IB_REJ_INVALID_FLOW_LBL CL_HTON16(32) -#define IB_REJ_INVALID_ALT_FLOW_LBL CL_HTON16(33) - -#define IB_REJ_SERVICE_HANDOFF CL_HTON16(65535) -/******/ - - -/****d* Access Layer/ib_apr_status_t -* NAME -* ib_apr_status_t -* -* DESCRIPTION -* Automatic path migration status information. -* -* SYNOPSIS -*/ -typedef uint8_t ib_apr_status_t; -/* -* SEE ALSO -* ib_cm_apr, ib_cm_apr_rec_t -* -* SOURCE - */ -#define IB_AP_SUCCESS 0 -#define IB_AP_INVALID_COMM_ID 1 -#define IB_AP_UNSUPPORTED 2 -#define IB_AP_REJECT 3 -#define IB_AP_REDIRECT 4 -#define IB_AP_IS_CURRENT 5 -#define IB_AP_INVALID_QPN_EECN 6 -#define IB_AP_INVALID_LID 7 -#define IB_AP_INVALID_GID 8 -#define IB_AP_INVALID_FLOW_LBL 9 -#define IB_AP_INVALID_TCLASS 10 -#define IB_AP_INVALID_HOP_LIMIT 11 -#define IB_AP_INVALID_PKT_RATE 12 -#define IB_AP_INVALID_SL 13 -/******/ - /****d* Access Layer/ib_cm_cap_mask_t * NAME * ib_cm_cap_mask_t @@ -7568,18 +7446,6 @@ typedef uint8_t ib_apr_status_t; /* - * Service ID resolution status - */ -typedef uint16_t ib_sidr_status_t; -#define IB_SIDR_SUCCESS 0 -#define IB_SIDR_UNSUPPORTED 1 -#define IB_SIDR_REJECT 2 -#define IB_SIDR_NO_QP 3 -#define IB_SIDR_REDIRECT 4 -#define IB_SIDR_UNSUPPORTED_VER 5 - - -/* * The following definitions are shared between the Access Layer and VPD */ -------------- next part -------------- A non-text attachment was scrubbed... Name: diffs.ib_types Type: application/octet-stream Size: 4548 bytes Desc: not available URL: From mdidomenico at silverstorm.com Tue Aug 8 10:52:30 2006 From: mdidomenico at silverstorm.com (Di Domenico, Michael) Date: Tue, 8 Aug 2006 13:52:30 -0400 Subject: [openib-general] ofed-1.0-rc6 - dmesg errors In-Reply-To: <44D8C300.4010105@ichips.intel.com> Message-ID: I'm getting these errors when I run jobs over OFED-1.0-rc6. Can anyone explain to me what they are, and if they are a bug? 
Thanks - Michael xhpl(4178): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4179): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4179): floating-point assist fault at ip 400000000001ec91, isr 0000020000000008 xhpl(4179): floating-point assist fault at ip 400000000001f001, isr 0000020000000008 xhpl(4181): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4184): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4182): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4183): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4256): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4255): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4257): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4258): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4255): floating-point assist fault at ip 400000000001ec91, isr 0000020000000008 xhpl(4282): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4282): floating-point assist fault at ip 400000000001ec91, isr 0000020000000008 xhpl(4282): floating-point assist fault at ip 400000000001f001, isr 0000020000000008 xhpl(4282): floating-point assist fault at ip 400000000001f0d1, isr 0000020000000008 xhpl(4311): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4312): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4312): floating-point assist fault at ip 400000000001ec91, isr 0000020000000008 xhpl(4311): floating-point assist fault at ip 400000000001ec91, isr 0000020000000008 xhpl(4340): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4341): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4342): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 xhpl(4341): floating-point assist fault at ip 400000000001ec91, isr 0000020000000008 xhpl(4343): floating-point assist fault at ip 20000000004fec12, isr 0000020000001001 cpi(14227): unaligned access to 0x600000000028c02c, ip=0x4000000000059281 cpi(14227): unaligned access to 0x600000000028c054, ip=0x4000000000059290 cpi(14227): unaligned access to 0x600000000028c024, ip=0x40000000000592a0 cpi(14227): unaligned access to 0x600000000028c034, ip=0x40000000000592a1 pi3(14271): unaligned access to 0x600000000028c02c, ip=0x4000000000059fa1 pi3(14271): unaligned access to 0x600000000028c054, ip=0x4000000000059fb0 pi3(14271): unaligned access to 0x600000000028c024, ip=0x4000000000059fc0 pi3(14271): unaligned access to 0x600000000028c034, ip=0x4000000000059fc1 xhpl(16370): unaligned access to 0x600000000041c02c, ip=0x400000000009e9f1 xhpl(16370): unaligned access to 0x600000000041c054, ip=0x400000000009ea00 xhpl(16370): unaligned access to 0x600000000041c024, ip=0x400000000009ea10 xhpl(16373): unaligned access to 0x600000000041c02c, ip=0x400000000009e9f1 xhpl(16369): floating-point assist fault at ip 2000000000186c12, isr 0000020000001001 xhpl(16363): floating-point assist fault at ip 2000000000186c12, isr 0000020000001001 xhpl(16370): floating-point assist fault at ip 2000000000186c12, isr 0000020000001001 xhpl(16373): floating-point assist fault at ip 2000000000186c12, isr 0000020000001001 From halr at voltaire.com Tue Aug 8 11:10:04 2006 From: halr at voltaire.com (Hal 
Rosenstock) Date: 08 Aug 2006 14:10:04 -0400 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <200608061544.11205.dotanb@mellanox.co.il> References: <200608061544.11205.dotanb@mellanox.co.il> Message-ID: <1155060596.17511.260112.camel@hal.voltaire.com> On Sun, 2006-08-06 at 08:44, Dotan Barak wrote: > Hi. > > I compiled code of a test that we wrote and i get compilaton error, here is the compilation error: > > cc -c -g -O2 -Wall -W -Werror -I/usr/mst/include -I/usr/local/include/infiniband async_event_test.c > In file included from /usr/include/vl_gen2u_str.h:40, > from /usr/include/vl.h:49, > from async_event_test.c:38: > /usr/local/include/infiniband/cm.h:209: error: syntax error before numeric constant > make: *** [async_event_test.o] Error 1 > > > when i looked at a preprocessed code of the test i noticed the following code: > > enum ib_cm_sidr_status { > 0, > 1, > 2, > 3, > 4, > IB_SIDR_UNSUPPORTED_VERSION > }; > > it seems that the enumerations values were replaced with integers. > > when i searched for the values that were enumerated in the headre files i found the following defines in ib_types.h: > > #define IB_SIDR_SUCCESS 0 > #define IB_SIDR_UNSUPPORTED 1 > #define IB_SIDR_REJECT 2 > #define IB_SIDR_NO_QP 3 > #define IB_SIDR_REDIRECT 4 > > > I think that the problem was that ib_types.h was included in a file that includes the cm.h and the preprocessor replaced the > enumeration names with the integer values. > > who can check this issue? There is a slight naming inconsistency here. The ib_types.h definition name should be IB_SIDR_UNSUPPORTED_VER. Do you have an old version of this file ? -- Hal > > thanks > Dotan > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From pw at osc.edu Tue Aug 8 11:19:28 2006 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 8 Aug 2006 14:19:28 -0400 Subject: [openib-general] return error when rdma_listen fails Message-ID: <20060808181928.GA15075@osc.edu> Calling rdma_listen() on a cm_id bound to INADDR_ANY can fail, e.g. with EADDRINUSE, but report no error back to the user. This patch fixes that by propagating the error. Success occurs only if at least one of the possibly multiple devices in the system was able to listen. In the case of multiple devices reporting errors on listen, only the first error value is returned. iwarp branch. 
Signed-off-by: Pete Wyckoff Index: infiniband/core/cma.c =================================================================== --- infiniband/core/cma.c (revision 8688) +++ infiniband/core/cma.c (working copy) @@ -1189,7 +1189,7 @@ static int cma_listen_handler(struct rdm return id_priv->id.event_handler(id, event); } -static void cma_listen_on_dev(struct rdma_id_private *id_priv, +static int cma_listen_on_dev(struct rdma_id_private *id_priv, struct cma_device *cma_dev) { struct rdma_id_private *dev_id_priv; @@ -1198,7 +1198,7 @@ static void cma_listen_on_dev(struct rdm id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps); if (IS_ERR(id)) - return; + return PTR_ERR(id); dev_id_priv = container_of(id, struct rdma_id_private, id); @@ -1213,20 +1213,34 @@ static void cma_listen_on_dev(struct rdm if (ret) goto err; - return; + return 0; err: cma_destroy_listen(dev_id_priv); + return ret; } -static void cma_listen_on_all(struct rdma_id_private *id_priv) +/* + * Try to listen on all devices. Return 0 if success on any device, + * else return the first error reported. + */ +static int cma_listen_on_all(struct rdma_id_private *id_priv) { struct cma_device *cma_dev; + int any_success = 0; + int ret, ret_first = 0; mutex_lock(&lock); list_add_tail(&id_priv->list, &listen_any_list); - list_for_each_entry(cma_dev, &dev_list, list) - cma_listen_on_dev(id_priv, cma_dev); + list_for_each_entry(cma_dev, &dev_list, list) { + ret = cma_listen_on_dev(id_priv, cma_dev); + if (ret) { + if (!ret_first) + ret_first = ret; + } else + any_success = 1; + } mutex_unlock(&lock); + return any_success ? 0 : ret_first; } static int cma_bind_any(struct rdma_cm_id *id, sa_family_t af) @@ -1269,8 +1283,11 @@ int rdma_listen(struct rdma_cm_id *id, i ret = -ENOSYS; goto err; } - } else - cma_listen_on_all(id_priv); + } else { + ret = cma_listen_on_all(id_priv); + if (ret) + goto err; + } id_priv->backlog = backlog; return 0; From mshefty at ichips.intel.com Tue Aug 8 11:20:52 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Aug 2006 11:20:52 -0700 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <1155060596.17511.260112.camel@hal.voltaire.com> References: <200608061544.11205.dotanb@mellanox.co.il> <1155060596.17511.260112.camel@hal.voltaire.com> Message-ID: <44D8D604.60703@ichips.intel.com> Hal Rosenstock wrote: >>enum ib_cm_sidr_status { >> 0, >> 1, >> 2, >> 3, >> 4, >> IB_SIDR_UNSUPPORTED_VERSION >>}; ... >> >>#define IB_SIDR_SUCCESS 0 >>#define IB_SIDR_UNSUPPORTED 1 >>#define IB_SIDR_REJECT 2 >>#define IB_SIDR_NO_QP 3 >>#define IB_SIDR_REDIRECT 4 ... > There is a slight naming inconsistency here. The ib_types.h definition > name should be IB_SIDR_UNSUPPORTED_VER. Do you have an old version of > this file ? IB_SIDR_UNSUPPORT_VERSION was the name in cm.h, which is why it was left unchanged. The other 5 names were the same. - Sean From mshefty at ichips.intel.com Tue Aug 8 11:30:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Aug 2006 11:30:28 -0700 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <1155061337.17511.260337.camel@hal.voltaire.com> References: <200608061544.11205.dotanb@mellanox.co.il> <44D78052.7010600@ichips.intel.com> <1155061337.17511.260337.camel@hal.voltaire.com> Message-ID: <44D8D844.3090600@ichips.intel.com> Hal Rosenstock wrote: > There are many userspace applications which use ib_types.h so it is not > just a opensm internal file. 
Anything using the SA client API also uses > this. There has been discussion of this on the list before. I tend to skip over most of the opensm related discussions, so could have easily missed this. Can you direct me to the SA client API? - Sean From halr at voltaire.com Tue Aug 8 11:29:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2006 14:29:25 -0400 Subject: [openib-general] [PATCH] osm ib_types: remove CM defines In-Reply-To: <000001c6bb0f$36b912c0$e598070a@amr.corp.intel.com> References: <000001c6bb0f$36b912c0$e598070a@amr.corp.intel.com> Message-ID: <1155061749.17511.260486.camel@hal.voltaire.com> On Tue, 2006-08-08 at 13:22, Sean Hefty wrote: > Here's an completely untested patch to remove the CM definitions > from ib_types.h. Can you see if this works for you? (I'm attaching > the patch as well, since ib_types has weird formatting.) > > Signed-off-by: Sean Hefty > --- > Index: ib_types.h > =================================================================== > --- ib_types.h (revision 8215) > +++ ib_types.h (working copy) > @@ -7422,128 +7422,6 @@ typedef struct _ib_ioc_info > } PACK_SUFFIX ib_ioc_info_t; > #include > > -/* > - * Defines known Communication management class versions > - */ > -#define IB_MCLASS_CM_VER_2 2 > -#define IB_MCLASS_CM_VER_1 1 > - > -/* > - * Defines the size of user available data in communication management MADs > - */ > -#define IB_REQ_PDATA_SIZE_VER2 92 > -#define IB_MRA_PDATA_SIZE_VER2 222 > -#define IB_REJ_PDATA_SIZE_VER2 148 > -#define IB_REP_PDATA_SIZE_VER2 196 > -#define IB_RTU_PDATA_SIZE_VER2 224 > -#define IB_LAP_PDATA_SIZE_VER2 168 > -#define IB_APR_PDATA_SIZE_VER2 148 > -#define IB_DREQ_PDATA_SIZE_VER2 220 > -#define IB_DREP_PDATA_SIZE_VER2 224 > -#define IB_SIDR_REQ_PDATA_SIZE_VER2 216 > -#define IB_SIDR_REP_PDATA_SIZE_VER2 136 > - > -#define IB_REQ_PDATA_SIZE_VER1 92 > -#define IB_MRA_PDATA_SIZE_VER1 222 > -#define IB_REJ_PDATA_SIZE_VER1 148 > -#define IB_REP_PDATA_SIZE_VER1 204 > -#define IB_RTU_PDATA_SIZE_VER1 224 > -#define IB_LAP_PDATA_SIZE_VER1 168 > -#define IB_APR_PDATA_SIZE_VER1 151 > -#define IB_DREQ_PDATA_SIZE_VER1 220 > -#define IB_DREP_PDATA_SIZE_VER1 224 > -#define IB_SIDR_REQ_PDATA_SIZE_VER1 216 > -#define IB_SIDR_REP_PDATA_SIZE_VER1 140 > - > -#define IB_ARI_SIZE 72 // redefine > -#define IB_APR_INFO_SIZE 72 > - > - > -/****d* Access Layer/ib_rej_status_t > -* NAME > -* ib_rej_status_t > -* > -* DESCRIPTION > -* Rejection reasons. 
> -* > -* SYNOPSIS > -*/ > -typedef ib_net16_t ib_rej_status_t; > -/* > -* SEE ALSO > -* ib_cm_rej, ib_cm_rej_rec_t > -* > -* SOURCE > - */ > -#define IB_REJ_INSUF_QP CL_HTON16(1) > -#define IB_REJ_INSUF_EEC CL_HTON16(2) > -#define IB_REJ_INSUF_RESOURCES CL_HTON16(3) > -#define IB_REJ_TIMEOUT CL_HTON16(4) > -#define IB_REJ_UNSUPPORTED CL_HTON16(5) > -#define IB_REJ_INVALID_COMM_ID CL_HTON16(6) > -#define IB_REJ_INVALID_COMM_INSTANCE CL_HTON16(7) > -#define IB_REJ_INVALID_SID CL_HTON16(8) > -#define IB_REJ_INVALID_XPORT CL_HTON16(9) > -#define IB_REJ_STALE_CONN CL_HTON16(10) > -#define IB_REJ_RDC_NOT_EXIST CL_HTON16(11) > -#define IB_REJ_INVALID_GID CL_HTON16(12) > -#define IB_REJ_INVALID_LID CL_HTON16(13) > -#define IB_REJ_INVALID_SL CL_HTON16(14) > -#define IB_REJ_INVALID_TRAFFIC_CLASS CL_HTON16(15) > -#define IB_REJ_INVALID_HOP_LIMIT CL_HTON16(16) > -#define IB_REJ_INVALID_PKT_RATE CL_HTON16(17) > -#define IB_REJ_INVALID_ALT_GID CL_HTON16(18) > -#define IB_REJ_INVALID_ALT_LID CL_HTON16(19) > -#define IB_REJ_INVALID_ALT_SL CL_HTON16(20) > -#define IB_REJ_INVALID_ALT_TRAFFIC_CLASS CL_HTON16(21) > -#define IB_REJ_INVALID_ALT_HOP_LIMIT CL_HTON16(22) > -#define IB_REJ_INVALID_ALT_PKT_RATE CL_HTON16(23) > -#define IB_REJ_PORT_REDIRECT CL_HTON16(24) > -#define IB_REJ_INVALID_MTU CL_HTON16(26) > -#define IB_REJ_INSUFFICIENT_RESP_RES CL_HTON16(27) > -#define IB_REJ_USER_DEFINED CL_HTON16(28) > -#define IB_REJ_INVALID_RNR_RETRY CL_HTON16(29) > -#define IB_REJ_DUPLICATE_LOCAL_COMM_ID CL_HTON16(30) > -#define IB_REJ_INVALID_CLASS_VER CL_HTON16(31) > -#define IB_REJ_INVALID_FLOW_LBL CL_HTON16(32) > -#define IB_REJ_INVALID_ALT_FLOW_LBL CL_HTON16(33) > - > -#define IB_REJ_SERVICE_HANDOFF CL_HTON16(65535) > -/******/ > - > - > -/****d* Access Layer/ib_apr_status_t > -* NAME > -* ib_apr_status_t > -* > -* DESCRIPTION > -* Automatic path migration status information. > -* > -* SYNOPSIS > -*/ > -typedef uint8_t ib_apr_status_t; > -/* > -* SEE ALSO > -* ib_cm_apr, ib_cm_apr_rec_t > -* > -* SOURCE > - */ > -#define IB_AP_SUCCESS 0 > -#define IB_AP_INVALID_COMM_ID 1 > -#define IB_AP_UNSUPPORTED 2 > -#define IB_AP_REJECT 3 > -#define IB_AP_REDIRECT 4 > -#define IB_AP_IS_CURRENT 5 > -#define IB_AP_INVALID_QPN_EECN 6 > -#define IB_AP_INVALID_LID 7 > -#define IB_AP_INVALID_GID 8 > -#define IB_AP_INVALID_FLOW_LBL 9 > -#define IB_AP_INVALID_TCLASS 10 > -#define IB_AP_INVALID_HOP_LIMIT 11 > -#define IB_AP_INVALID_PKT_RATE 12 > -#define IB_AP_INVALID_SL 13 > -/******/ > - > /****d* Access Layer/ib_cm_cap_mask_t > * NAME > * ib_cm_cap_mask_t > @@ -7568,18 +7446,6 @@ typedef uint8_t > ib_apr_status_t; > > > /* > - * Service ID resolution status > - */ > -typedef uint16_t ib_sidr_status_t; > -#define IB_SIDR_SUCCESS 0 > -#define IB_SIDR_UNSUPPORTED 1 > -#define IB_SIDR_REJECT 2 > -#define IB_SIDR_NO_QP 3 > -#define IB_SIDR_REDIRECT 4 > -#define IB_SIDR_UNSUPPORTED_VER 5 > - > - > -/* > * The following definitions are shared between the Access Layer and VPD > */ > > > As there may be other (unknown to me) applications using this, I do not feel comfortable with this. What might work is the following (untested) patch. Another approach might be to guard those defines in ib_types.h with CM_H. 
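To make the guard idea concrete, a rough, untested sketch of what it might look like in ib_types.h, assuming CM_H is the include guard used by cm.h; note that it only takes effect when cm.h happens to be included first:

#ifndef CM_H    /* skip these when the libibcm definitions are already present */
#define IB_SIDR_SUCCESS         0
#define IB_SIDR_UNSUPPORTED     1
#define IB_SIDR_REJECT          2
#define IB_SIDR_NO_QP           3
#define IB_SIDR_REDIRECT        4
#define IB_SIDR_UNSUPPORTED_VER 5
#endif  /* CM_H */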
-- Hal Index: include/iba/ib_types.h =================================================================== --- include/iba/ib_types.h (revision 8747) +++ include/iba/ib_types.h (working copy) @@ -7743,7 +7743,7 @@ typedef uint16_t ib_sidr_status_t; #define IB_SIDR_NO_QP 3 #define IB_SIDR_REDIRECT 4 #define IB_SIDR_UNSUPPORTED_VER 5 - +#define IB_SIDR_UNSUPPORTED_VERSION IB_SIDR_UNSUPPORTED_VER /* * The following definitions are shared between the Access Layer and VPD From halr at voltaire.com Tue Aug 8 11:32:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2006 14:32:08 -0400 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <200608080917.03248.dotanb@mellanox.co.il> References: <200608061544.11205.dotanb@mellanox.co.il> <44D78052.7010600@ichips.intel.com> <200608080917.03248.dotanb@mellanox.co.il> Message-ID: <1155061926.17511.260530.camel@hal.voltaire.com> On Tue, 2006-08-08 at 02:17, Dotan Barak wrote: > On Monday 07 August 2006 21:02, Sean Hefty wrote: > > Dotan Barak wrote: > > > enum ib_cm_sidr_status { 0, 1, 2, 3, 4, IB_SIDR_UNSUPPORTED_VERSION }; > > > > > > it seems that the enumerations values were replaced with integers. > > > > > > when i searched for the values that were enumerated in the headre files i > > > found the following defines in ib_types.h: > > > > > > #define IB_SIDR_SUCCESS 0 #define > > > IB_SIDR_UNSUPPORTED 1 #define > > > IB_SIDR_REJECT 2 #define > > > IB_SIDR_NO_QP 3 #define > > > IB_SIDR_REDIRECT 4 > > > > > > > > > I think that the problem was that ib_types.h was included in a file that > > > includes the cm.h and the preprocessor replaced the enumeration names with > > > the integer values. > > > > > > who can check this issue? > > > > I think the solution is to remove CM definitions out of ib_types.h. What is the > > reason for including ib_types.h and cm.h? ib_types looks like an internal > > opensm include file. > > As much as i know, ib_types.h is the only header that have all of the MADs definitions, > so i need to include it in several tests that sends MADs. > > The solution may be that one of the files (cm.h or ib_types.h) will rename those names to > a different names. That's another approach to the ones I mentioned in my previous response. -- Hal > thanks > Dotan From sean.hefty at intel.com Tue Aug 8 11:38:36 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 8 Aug 2006 11:38:36 -0700 Subject: [openib-general] [PATCH] osm ib_types: remove CM defines In-Reply-To: <1155061749.17511.260486.camel@hal.voltaire.com> Message-ID: <000401c6bb19$dbda9670$e598070a@amr.corp.intel.com> >As there may be other (unknown to me) applications using this, I do not >feel comfortable with this. What might work is the following (untested) >patch. Another approach might be to guard those defines in ib_types.h >with CM_H. The problem is that ib_types.h #define's these values before cm.h is included. The defined values result in the enum in cm.h appearing as: enum { 0, 1, 2, 3, ... }; which doesn't compile. If we need to rename the #define's in ib_types.h, then it makes more sense to me to remove them completely. Both require changes to users, but the latter is limited to a new include. A different approach would be to change ib_types.h to use an identical enum. 
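The clash is easy to reproduce in isolation; a minimal, self-contained example (not the real headers, names trimmed) fails in the same way:

/* stand-in for ib_types.h */
#define IB_SIDR_SUCCESS 0

/* stand-in for cm.h, included afterwards */
enum ib_cm_sidr_status {
        IB_SIDR_SUCCESS,                 /* expands to the literal 0 */
        IB_SIDR_UNSUPPORTED_VERSION = 5
};

/* the compiler effectively sees "enum ib_cm_sidr_status { 0, ... };"
 * and reports something like "syntax error before numeric constant" */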
- Sean From halr at voltaire.com Tue Aug 8 11:22:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2006 14:22:24 -0400 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <44D78052.7010600@ichips.intel.com> References: <200608061544.11205.dotanb@mellanox.co.il> <44D78052.7010600@ichips.intel.com> Message-ID: <1155061337.17511.260337.camel@hal.voltaire.com> On Mon, 2006-08-07 at 14:02, Sean Hefty wrote: > Dotan Barak wrote: > > enum ib_cm_sidr_status { 0, 1, 2, 3, 4, IB_SIDR_UNSUPPORTED_VERSION }; > > > > it seems that the enumerations values were replaced with integers. > > when i searched for the values that were enumerated in the headre files i > > found the following defines in ib_types.h: > > > > #define IB_SIDR_SUCCESS 0 #define > > IB_SIDR_UNSUPPORTED 1 #define > > IB_SIDR_REJECT 2 #define > > IB_SIDR_NO_QP 3 #define > > IB_SIDR_REDIRECT 4 > > > > > > I think that the problem was that ib_types.h was included in a file that > > includes the cm.h and the preprocessor replaced the enumeration names with > > the integer values. > > > > who can check this issue? > > I think the solution is to remove CM definitions out of ib_types.h. Perhaps. > What is the > reason for including ib_types.h and cm.h? ib_types looks like an internal > opensm include file. There are many userspace applications which use ib_types.h so it is not just a opensm internal file. Anything using the SA client API also uses this. There has been discussion of this on the list before. It does appear there is some overlap here though. -- Hal > - Sean From halr at voltaire.com Tue Aug 8 11:45:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2006 14:45:40 -0400 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <44D8D844.3090600@ichips.intel.com> References: <200608061544.11205.dotanb@mellanox.co.il> <44D78052.7010600@ichips.intel.com> <1155061337.17511.260337.camel@hal.voltaire.com> <44D8D844.3090600@ichips.intel.com> Message-ID: <1155062732.17511.260749.camel@hal.voltaire.com> On Tue, 2006-08-08 at 14:30, Sean Hefty wrote: > Hal Rosenstock wrote: > > There are many userspace applications which use ib_types.h so it is not > > just a opensm internal file. Anything using the SA client API also uses > > this. There has been discussion of this on the list before. > > I tend to skip over most of the opensm related discussions, so could have easily > missed this. I thought you commented on the changes proposed to ib_types.h at the time but my memory could be faulty... > Can you direct me to the SA client API? include/vendor/osm_vendor_sa_api.h -- Hal > - Sean From halr at voltaire.com Tue Aug 8 11:49:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2006 14:49:55 -0400 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <44D8D604.60703@ichips.intel.com> References: <200608061544.11205.dotanb@mellanox.co.il> <1155060596.17511.260112.camel@hal.voltaire.com> <44D8D604.60703@ichips.intel.com> Message-ID: <1155062990.17511.260842.camel@hal.voltaire.com> On Tue, 2006-08-08 at 14:20, Sean Hefty wrote: > Hal Rosenstock wrote: > >>enum ib_cm_sidr_status { > >> 0, > >> 1, > >> 2, > >> 3, > >> 4, > >> IB_SIDR_UNSUPPORTED_VERSION > >>}; > ... > >> > >>#define IB_SIDR_SUCCESS 0 > >>#define IB_SIDR_UNSUPPORTED 1 > >>#define IB_SIDR_REJECT 2 > >>#define IB_SIDR_NO_QP 3 > >>#define IB_SIDR_REDIRECT 4 > ... > > There is a slight naming inconsistency here. 
The ib_types.h definition > > name should be IB_SIDR_UNSUPPORTED_VER. Do you have an old version of > > this file ? > > IB_SIDR_UNSUPPORT_VERSION was the name in cm.h, I thought it was IB_SIDR_UNSUPPORTED_VERSION. ib_types.h uses IB_SIDR_UNSUPPORTED_VER. -- Hal > which is why it was left > unchanged. The other 5 names were the same. > > - Sean From mst at mellanox.co.il Tue Aug 8 12:30:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Aug 2006 22:30:08 +0300 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <1155061337.17511.260337.camel@hal.voltaire.com> References: <1155061337.17511.260337.camel@hal.voltaire.com> Message-ID: <20060808193008.GB24416@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: [libibcm] compilation of code that uses libibcm may fail > > On Mon, 2006-08-07 at 14:02, Sean Hefty wrote: > > Dotan Barak wrote: > > > enum ib_cm_sidr_status { 0, 1, 2, 3, 4, IB_SIDR_UNSUPPORTED_VERSION }; > > > > > > it seems that the enumerations values were replaced with integers. > > > when i searched for the values that were enumerated in the headre files i > > > found the following defines in ib_types.h: > > > > > > #define IB_SIDR_SUCCESS 0 #define > > > IB_SIDR_UNSUPPORTED 1 #define > > > IB_SIDR_REJECT 2 #define > > > IB_SIDR_NO_QP 3 #define > > > IB_SIDR_REDIRECT 4 > > > > > > > > > I think that the problem was that ib_types.h was included in a file that > > > includes the cm.h and the preprocessor replaced the enumeration names with > > > the integer values. > > > > > > who can check this issue? > > > > I think the solution is to remove CM definitions out of ib_types.h. > > Perhaps. Ugh. Guys, how about we start adding proper library prefixes to names? IBA really should name things IBA_..., CM should be CM_.. or IB_CM_... If everyone insists on polluting the IB_ namespace conflicts are unavoidable. With respect to API - this should not be a problem: if there are some legacy applications it is easy to make ib_types.h a wrapper header for these. Mark this stuff deprecated and make sure no internal code uses them. -- MST From sean.hefty at intel.com Tue Aug 8 12:47:19 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 8 Aug 2006 12:47:19 -0700 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <20060808193008.GB24416@mellanox.co.il> Message-ID: <000501c6bb23$75bd0850$e598070a@amr.corp.intel.com> >Guys, how about we start adding proper library prefixes to names? >IBA really should name things IBA_..., CM should be CM_.. or IB_CM_... Libibcm uses IB_CM_ in most places. The exception is SIDR, which is IB_SIDR_. IB_SIDR probably should have been IB_CM_SIDR (and is likely just an oversight), but I don't see that IB_SIDR is that bad. >If everyone insists on polluting the IB_ namespace conflicts are unavoidable. I'm not trying to sound harsh here, but the IB CM definitions in ib_types.h seem useless. There's nothing that a user can do with them without an interface to the CM, and I would need to be convinced that there's an application that uses those definitions. 
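Purely as an illustration of the IB_CM_SIDR prefixing plus legacy-wrapper idea floated above (these names are hypothetical, not an agreed API), old spellings could be kept only as aliases of a namespaced enum:

/* namespaced definitions */
enum ib_cm_sidr_status {
        IB_CM_SIDR_SUCCESS,
        IB_CM_SIDR_UNSUPPORTED,
        IB_CM_SIDR_REJECT,
        IB_CM_SIDR_NO_QP,
        IB_CM_SIDR_REDIRECT,
        IB_CM_SIDR_UNSUPPORTED_VERSION
};

/* legacy spellings, kept only for existing applications */
#define IB_SIDR_SUCCESS         IB_CM_SIDR_SUCCESS
#define IB_SIDR_UNSUPPORTED     IB_CM_SIDR_UNSUPPORTED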
- Sean From mshefty at ichips.intel.com Tue Aug 8 12:56:35 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Aug 2006 12:56:35 -0700 Subject: [openib-general] [libibcm] compilation of code that uses libibcm may fail In-Reply-To: <1155062990.17511.260842.camel@hal.voltaire.com> References: <200608061544.11205.dotanb@mellanox.co.il> <1155060596.17511.260112.camel@hal.voltaire.com> <44D8D604.60703@ichips.intel.com> <1155062990.17511.260842.camel@hal.voltaire.com> Message-ID: <44D8EC73.1080300@ichips.intel.com> Hal Rosenstock wrote: >>IB_SIDR_UNSUPPORT_VERSION was the name in cm.h, > > > I thought it was IB_SIDR_UNSUPPORTED_VERSION. ib_types.h uses > IB_SIDR_UNSUPPORTED_VER. Sorry, typo on my part - left off the 'ED'. - Sean From halr at voltaire.com Tue Aug 8 13:50:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2006 16:50:29 -0400 Subject: [openib-general] [PATCH] osm: Dynamic verbosity control per file In-Reply-To: <842b8cdf0608020816y3fdfa145nea876171f58650d9@mail.gmail.com> References: <842b8cdf0608020816y3fdfa145nea876171f58650d9@mail.gmail.com> Message-ID: <1155070228.17511.263098.camel@hal.voltaire.com> Hi Yevgeny, On Wed, 2006-08-02 at 11:16, Yevgeny Kliteynik wrote: > Hi Hal Just got back from vacation and am in the process of catching up. > This patch adds new verbosity functionality. > 1. Verbosity configuration file > ------------------------------- > > The user is able to set verbosity level per source code file > by supplying verbosity configuration file using the following > command line arguments: > > -b filename > --verbosity_file filename > > By default, the OSM will use the following file: /etc/opensmlog.conf > Verbosity configuration file should contain zero or more lines of > the following pattern: > > filename verbosity_level > > where 'filename' is the name of the source code file that the > 'verbosity_level' refers to, and the 'verbosity_level' itself > should be specified as an integer number (decimal or hexadecimal). > > One reserved filename is 'all' - it represents general verbosity > level, that is used for all the files that are not specified in > the verbosity configuration file. > If 'all' is not specified, the verbosity level set in the > command line will be used instead. > Note: The 'all' file verbosity level will override any other > general level that was specified by the command line arguments. > > Sending a SIGHUP signal to the OSM will cause it to reload > the verbosity configuration file. > > > 2. Logging source code filename and line number > ----------------------------------------------- > > If command line option -S or --log_source_info is specified, > OSM will add source code filename and line number to every > log message that is written to the log file. > By default, the OSM will not log this additional info. > > > Yevgeny Is it hard to find which file and line an opensm log message comes from ? Is this functionality really needed ? 
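For reference, a hedged sketch of reading the "filename verbosity_level" format described above (decimal or hexadecimal levels, with "all" as the catch-all); the helper name and everything else below are illustrative only, not OpenSM code:

#include <stdio.h>
#include <string.h>

static void load_verbosity_conf(const char *path)
{
        char name[256];
        int level;
        FILE *f = fopen(path, "r");

        if (!f)
                return;                 /* no file: keep the command-line level */
        /* "%i" accepts both decimal and hexadecimal (0x...) values */
        while (fscanf(f, "%255s %i", name, &level) == 2) {
                if (!strcmp(name, "all"))
                        printf("general level -> 0x%x\n", (unsigned) level);
                else
                        printf("%s -> 0x%x\n", name, (unsigned) level);
        }
        fclose(f);
}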
-- Hal From halr at voltaire.com Tue Aug 8 13:51:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2006 16:51:31 -0400 Subject: [openib-general] APM: QP migration state change when failover triggered by hw In-Reply-To: <44D0EA7A.1020503@3leafnetworks.com> References: <000201c6b59c$17ef1d30$29cc180a@amr.corp.intel.com> <200608021043.17017.jackm@mellanox.co.il> <44D0EA7A.1020503@3leafnetworks.com> Message-ID: <1155070286.17511.263118.camel@hal.voltaire.com> On Wed, 2006-08-02 at 14:10, Venkatesh Babu wrote: > >Babu, regarding the migration event that you are seeing, are you sure that it > >is from the migration transition that does not occur? Possibly, the > >problematic transition is the second one, which occurs after specifying a new > >alternate path and rearming APM? > > > > > I am sure that it is when the cable is disconnected for the first > time, and not by the second transition. I will reload the alternate path > with LAP messages only when the port's state changes to IB_PORT_ACTIVE. > If the remote passive node's port has disconnected then I am expecting > for the notice event saying that remote port transitioned to > IB_PORT_ACTIVE. In gen1 I was using tsIbSetInServiceNoticeHandler() for > this. In OFED we don't have these interfaces yet. > > >It seems more likely to me that the first transition does occur, since you > >receive a MIG event on both sides, and since the alt path data is loaded by > >you during the initial bringup of the RC QP pair(either at init->rtr, or at > >rtr->rts). If you are receiving the MIGRATED event, the qp is already in the > >migrated state. > > > >However, after the first migration occurs, you need to do the following: > >1. send a LAP packet to the remote node, containing the new alt path info. > >2. load NEW alt path information (ib_modify_qp, rts->rts), including remote > >LID received in LAP packet. > >3. Rearm path migration (ib_modify_qp, rts->rts) > > > >Are you certain that the above 3 steps have taken place? > > > > > Yes I am doing all these steps only when I get the event > IB_PORT_ACTIVE or InServiceNotice event is received for the remote port. How do you get the InServiceNotice event ? -- Hal > > >Note that 1. and 2. above are a separate phase from 3., since the IB Spec > >allows changing the alternate path while the QP is still armed, not just when > >it has migrated. > > > >- Jack > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From venkatesh.babu at 3leafnetworks.com Tue Aug 8 15:45:02 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Tue, 08 Aug 2006 15:45:02 -0700 Subject: [openib-general] APM: QP migration state change when failover triggered by hw In-Reply-To: <1155070286.17511.263118.camel@hal.voltaire.com> References: <000201c6b59c$17ef1d30$29cc180a@amr.corp.intel.com> <200608021043.17017.jackm@mellanox.co.il> <44D0EA7A.1020503@3leafnetworks.com> <1155070286.17511.263118.camel@hal.voltaire.com> Message-ID: <44D913EE.2050907@3leafnetworks.com> Hal Rosenstock wrote: > >> Yes I am doing all these steps only when I get the event >>IB_PORT_ACTIVE or InServiceNotice event is received for the remote port. >> >> > >How do you get the InServiceNotice event ? 
> >-- Hal > > In Gen1 implementation tsIbSetInServiceNoticeHandler() and tsIbSetOutofServiceNoticeHandler() interfaces are there for this purpose. I am trying to port these interfaces to OFED stack, because these pieces are missing here. I have the code for sending inform info record to OpenSM and registering with it for notification. But I have not yet implemented Notice handlers when the OpenSM sends the notice trap TS_IB_GENERIC_TRAP_NUM_IN_SVC(64) or TS_IB_GENERIC_TRAP_NUM_OUT_OF_SVC(65). I will appriciate any information on implementing this mechanism. VBabu From halr at voltaire.com Tue Aug 8 15:51:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2006 18:51:24 -0400 Subject: [openib-general] APM: QP migration state change when failover triggered by hw In-Reply-To: <44D913EE.2050907@3leafnetworks.com> References: <000201c6b59c$17ef1d30$29cc180a@amr.corp.intel.com> <200608021043.17017.jackm@mellanox.co.il> <44D0EA7A.1020503@3leafnetworks.com> <1155070286.17511.263118.camel@hal.voltaire.com> <44D913EE.2050907@3leafnetworks.com> Message-ID: <1155077482.17511.265267.camel@hal.voltaire.com> On Tue, 2006-08-08 at 18:45, Venkatesh Babu wrote: > Hal Rosenstock wrote: > > > > >> Yes I am doing all these steps only when I get the event > >>IB_PORT_ACTIVE or InServiceNotice event is received for the remote port. > >> > >> > > > >How do you get the InServiceNotice event ? > > > >-- Hal > > > > > In Gen1 implementation tsIbSetInServiceNoticeHandler() and > tsIbSetOutofServiceNoticeHandler() interfaces are there for this > purpose. I am trying to port these interfaces to OFED stack, because > these pieces are missing here. I have the code for sending inform info > record to OpenSM and registering with it for notification. But I have > not yet implemented Notice handlers when the OpenSM sends the notice > trap TS_IB_GENERIC_TRAP_NUM_IN_SVC(64) or > TS_IB_GENERIC_TRAP_NUM_OUT_OF_SVC(65). I will appriciate any information > on implementing this mechanism. I thought that (InServiceNotice) was a reference to some gen1 code. That's what I wanted to clarify. There are no "official" gen2 interfaces for this as yet. What's your timeframe ? -- Hal > VBabu > > From vuhuong at mellanox.com Tue Aug 8 17:57:53 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 08 Aug 2006 17:57:53 -0700 Subject: [openib-general] SRP IO Size In-Reply-To: <44D8BC27.7010504@psc.edu> References: <44D8BC27.7010504@psc.edu> Message-ID: <44D93311.8030207@mellanox.com> Hi Paul, You can load ib_srp module with module param max_xfer_sectors_per_io=<512, 1024, 2048> to support 256KB, 512KB and 1M direct IOs Adding the following line "options ib_srp max_xfer_sectors_per_io=1024 to /etc/modprobe.conf Vu > Hi, > I was running some performance tests to an srp target and noticed > that the largest io sent over srp was 128k. When using a direct-attached > scsi device I see io's up to 4m. I'm running the ibgd 1.8.2 stack. Can > someone tell me if this issue has been addressed in a more recent version? 
> Thanks, > Paul > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ogerlitz at voltaire.com Tue Aug 8 22:59:12 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 09 Aug 2006 08:59:12 +0300 Subject: [openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT) In-Reply-To: <44D8BD17.3040804@ichips.intel.com> References: <000101c6b1a2$11269f80$c5c8180a@amr.corp.intel.com> <44D03AF1.8080300@voltaire.com> <33890.81.10.199.193.1154498687.squirrel@81.10.199.193> <44D049B5.6000505@voltaire.com> <44D0D808.1070003@ichips.intel.com> <44D1986B.6070302@voltaire.com> <44D22218.7000005@ichips.intel.com> <44D70E2B.60205@voltaire.com> <44D772DC.7010101@ichips.intel.com> <44D84D3A.9050502@voltaire.com> <44D8BD17.3040804@ichips.intel.com> Message-ID: <44D979B0.5040102@voltaire.com> Sean Hefty wrote: > Or Gerlitz wrote: >> Conceptually, do we agree that it would be better not to expose IB >> reject code to the CMA consumers? that is in the spirit of the CMA >> being a framework for doing connection management in RDMA transport >> independent fashion, etc. > My concern is that I do not want to mask the real reason for the reject > in a way that prevents the user from understanding what's needed to > establish the connection. A different way to view this is that the > event provides the generic information, and the status detailed. So you are fine with the CMA consumer being aware to the RDMA transport up to the extent of having a per transport reject codes handler? >> The CMA does return **errno** values on the status field for some >> events (eg with UNREACHABLE event as of REQ/REP timeout, as in the >> case that started this thread...), so we need to decide a clearer >> approach here. > We can provide two status values with an event, one that maps to an > errno, and another that maps to a transport specific reason. This sounds much better. Or. From sweitzen at cisco.com Tue Aug 8 23:58:49 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 8 Aug 2006 23:58:49 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1 planning meeting - summary Message-ID: We received our DDN equipment today and have started setting it up. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Shawn > Hansen (shahanse) > Sent: Tuesday, July 25, 2006 1:39 PM > To: Tziporet Koren; Matt Leininger > Cc: openfabrics-ewg at openib.org; openib > Subject: Re: [openib-general] [openfabrics-ewg] OFED 1.1 > planning meeting - summary > > Yes, Cisco plans to test OFED on a DDN SRP target. > > --Shawn > > -----Original Message----- > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of > Tziporet Koren > Sent: Tuesday, July 25, 2006 8:40 AM > To: Matt Leininger > Cc: openfabrics-ewg at openib.org; openib > Subject: Re: [openfabrics-ewg] OFED 1.1 planning meeting - summary > > Matt Leininger wrote: > >> 5. SRP: > >> > >> - GA quality > >> > >> - DM (Device Mapper) - for high availability > >> > >> - Basic failover/failback testing with daemon+srp+XVM/MPP and > >> Engenio target > >> > >> > > Tziporet, > > > > Are there any plans to test with the DDN SRP target? 
Several DoE > > sites are testing/using the DDN IB based storage. > > > > > > > Mellanox does not have DDN SRP target. We will be happy to test it of > DDN will loan us a system. > > Another option is that DDN will take OFED 1.1 RCs and test it in their > labs. > Can you approach them and ask this. If yes then I can cc them > on the RCs > mails so they can do it. > > Is there any other vendor who has DDN SRP target, and going > to test OFED > with it? > > Tziporet > > > > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Wed Aug 9 00:04:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Aug 2006 10:04:35 +0300 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work Message-ID: <20060809070435.GN20848@mellanox.co.il> IB/mthca: make IB_SEND_FENCE work Fence bit must be set in the doorbell, not only in WQE Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/hw/mthca/mthca_doorbell.h b/drivers/infiniband/hw/mthca/mthca_doorbell.h index dd9a44d..e5f0ad6 100644 --- a/drivers/infiniband/hw/mthca/mthca_doorbell.h +++ b/drivers/infiniband/hw/mthca/mthca_doorbell.h @@ -42,6 +42,10 @@ #define MTHCA_RECEIVE_DOORBELL 0x18 #define MTHCA_CQ_DOORBELL 0x20 #define MTHCA_EQ_DOORBELL 0x28 +enum { + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 +}; + #if BITS_PER_LONG == 64 /* * Assume that we can just write a 64-bit doorbell atomically. s390 diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index cd8b672..6efba4a 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1502,7 +1502,7 @@ int mthca_tavor_post_send(struct ib_qp * int i; int size; int size0 = 0; - u32 f0 = 0; + u32 f0 = unlikely(wr->send_flags & IB_SEND_FENCE) ? MTHCA_SEND_DOORBELL_FENCE : 0; int ind; u8 op0 = 0; @@ -1843,7 +1843,7 @@ int mthca_arbel_post_send(struct ib_qp * int i; int size; int size0 = 0; - u32 f0 = 0; + u32 f0 = unlikely(wr->send_flags & IB_SEND_FENCE) ? MTHCA_SEND_DOORBELL_FENCE : 0; int ind; u8 op0 = 0; @@ -1864,6 +1864,7 @@ int mthca_arbel_post_send(struct ib_qp * qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB; size0 = 0; + f0 = unlikely(wr->send_flags & IB_SEND_FENCE) ? MTHCA_SEND_DOORBELL_FENCE : 0; /* * Make sure that descriptors are written before -- MST From mst at mellanox.co.il Wed Aug 9 00:06:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Aug 2006 10:06:05 +0300 Subject: [openib-general] libmthca: make fence work Message-ID: <20060809070605.GO20848@mellanox.co.il> libmthca: fence bit must be set in doorbell, not only in WQE Signed-off-by: Michael S. Tsirkin Index: openib/src/userspace/libmthca/src/qp.c =================================================================== --- openib/src/userspace/libmthca/src/qp.c (revision 8841) +++ openib/src/userspace/libmthca/src/qp.c (working copy) @@ -106,7 +106,8 @@ int mthca_tavor_post_send(struct ibv_qp int ret = 0; int size, size0 = 0; int i; - uint32_t f0 = 0, op0 = 0; + uint32_t f0 = (wr->send_flags & IBV_SEND_FENCE) ? 
MTHCA_SEND_DOORBELL_FENCE : 0; + uint32_t op0 = 0; pthread_spin_lock(&qp->sq.lock); @@ -436,7 +437,8 @@ int mthca_arbel_post_send(struct ibv_qp int ret = 0; int size, size0 = 0; int i; - uint32_t f0 = 0, op0 = 0; + uint32_t f0 = (wr->send_flags & IBV_SEND_FENCE) ? MTHCA_SEND_DOORBELL_FENCE : 0; + uint32_t op0 = 0; pthread_spin_lock(&qp->sq.lock); @@ -469,6 +471,7 @@ int mthca_arbel_post_send(struct ibv_qp mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL); size0 = 0; + f0 = (wr->send_flags & IBV_SEND_FENCE) ? MTHCA_SEND_DOORBELL_FENCE : 0; } if (wq_overflow(&qp->sq, nreq, to_mcq(qp->ibv_qp.send_cq))) { Index: openib/src/userspace/libmthca/src/wqe.h =================================================================== --- openib/src/userspace/libmthca/src/wqe.h (revision 8841) +++ openib/src/userspace/libmthca/src/wqe.h (working copy) @@ -42,6 +42,10 @@ enum { }; enum { + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 +}; + +enum { MTHCA_NEXT_DBD = 1 << 7, MTHCA_NEXT_FENCE = 1 << 6, MTHCA_NEXT_CQ_UPDATE = 1 << 3, -- MST From muli at il.ibm.com Wed Aug 9 00:29:52 2006 From: muli at il.ibm.com (Muli Ben-Yehuda) Date: Wed, 9 Aug 2006 10:29:52 +0300 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <20060809070435.GN20848@mellanox.co.il> References: <20060809070435.GN20848@mellanox.co.il> Message-ID: <20060809072952.GB3725@rhun.haifa.ibm.com> On Wed, Aug 09, 2006 at 10:04:35AM +0300, Michael S. Tsirkin wrote: > diff --git a/drivers/infiniband/hw/mthca/mthca_doorbell.h b/drivers/infiniband/hw/mthca/mthca_doorbell.h > index dd9a44d..e5f0ad6 100644 > --- a/drivers/infiniband/hw/mthca/mthca_doorbell.h > +++ b/drivers/infiniband/hw/mthca/mthca_doorbell.h > @@ -42,6 +42,10 @@ #define MTHCA_RECEIVE_DOORBELL 0x18 > #define MTHCA_CQ_DOORBELL 0x20 > #define MTHCA_EQ_DOORBELL 0x28 > > +enum { > + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 > +}; Does anonymous enum have any benefit over define for this? Cheers, Muli From mst at mellanox.co.il Wed Aug 9 00:37:20 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Aug 2006 10:37:20 +0300 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <20060809072952.GB3725@rhun.haifa.ibm.com> References: <20060809072952.GB3725@rhun.haifa.ibm.com> Message-ID: <20060809073720.GP20848@mellanox.co.il> Quoting r.
Muli Ben-Yehuda : > > +enum { > > + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 > > +}; > > Does anonymous enum have any benefit over define for this? Not really, no. -- MST From muli at il.ibm.com Wed Aug 9 00:38:05 2006 From: muli at il.ibm.com (Muli Ben-Yehuda) Date: Wed, 9 Aug 2006 10:38:05 +0300 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <20060809073720.GP20848@mellanox.co.il> References: <20060809072952.GB3725@rhun.haifa.ibm.com> <20060809073720.GP20848@mellanox.co.il> Message-ID: <20060809073805.GD3725@rhun.haifa.ibm.com> On Wed, Aug 09, 2006 at 10:37:20AM +0300, Michael S. Tsirkin wrote: > Quoting r. Muli Ben-Yehuda : > > > +enum { > > > + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 > > > +}; > > > > Does anonymous enum have any benefit over define for this? > > Not really, no. Good, because I think #define is the idiomatic way to do this... Cheers, Muli From mst at mellanox.co.il Wed Aug 9 00:50:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Aug 2006 10:50:36 +0300 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <20060809073805.GD3725@rhun.haifa.ibm.com> References: <20060809073805.GD3725@rhun.haifa.ibm.com> Message-ID: <20060809075036.GQ20848@mellanox.co.il> Quoting r. Muli Ben-Yehuda : > Subject: Re: [PATCH] mthca: make IB_SEND_FENCE work > > On Wed, Aug 09, 2006 at 10:37:20AM +0300, Michael S. Tsirkin wrote: > > Quoting r. Muli Ben-Yehuda : > > > > +enum { > > > > + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 > > > > +}; > > > > > > Does anonymous enum have any benefit over define for this? > > > > Not really, no. > > Good, because I think #define is the idiomatic way to do this... Fine with me too. -- IB/mthca: make IB_SEND_FENCE work Fence bit must be set in the doorbell, not only in WQE Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/hw/mthca/mthca_doorbell.h b/drivers/infiniband/hw/mthca/mthca_doorbell.h index dd9a44d..e5f0ad6 100644 --- a/drivers/infiniband/hw/mthca/mthca_doorbell.h +++ b/drivers/infiniband/hw/mthca/mthca_doorbell.h @@ -42,6 +42,8 @@ #define MTHCA_RECEIVE_DOORBELL 0x18 #define MTHCA_CQ_DOORBELL 0x20 #define MTHCA_EQ_DOORBELL 0x28 +#define MTHCA_SEND_DOORBELL_FENCE (1 << 5) + #if BITS_PER_LONG == 64 /* * Assume that we can just write a 64-bit doorbell atomically. s390 diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index cd8b672..6efba4a 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1502,7 +1502,7 @@ int mthca_tavor_post_send(struct ib_qp * int i; int size; int size0 = 0; - u32 f0 = 0; + u32 f0 = unlikely(wr->send_flags & IB_SEND_FENCE) ? MTHCA_SEND_DOORBELL_FENCE : 0; int ind; u8 op0 = 0; @@ -1843,7 +1843,7 @@ int mthca_arbel_post_send(struct ib_qp * int i; int size; int size0 = 0; - u32 f0 = 0; + u32 f0 = unlikely(wr->send_flags & IB_SEND_FENCE) ? MTHCA_SEND_DOORBELL_FENCE : 0; int ind; u8 op0 = 0; @@ -1864,6 +1864,7 @@ int mthca_arbel_post_send(struct ib_qp * qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB; size0 = 0; + f0 = unlikely(wr->send_flags & IB_SEND_FENCE) ? MTHCA_SEND_DOORBELL_FENCE : 0; /* * Make sure that descriptors are written before -- MST From krkumar2 at in.ibm.com Wed Aug 9 01:37:33 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Wed, 9 Aug 2006 14:07:33 +0530 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: Message-ID: Hi James, Sorry for the late response, my system was down and I just got it fixed. 
> Is there a benefit to having rdmav_create_qp() take generic > parameters if the application needs to understand the type of QP (IB, > iWARP, etc.) created and the transport specific communication manager > calls that are needed to manipulate it? > > Would it make more sense if the QP create command was also transport > specific? My opinion is that the create_qp taking generic parameters is correct, only subsequent calls may need to use transport specific calls/arguments. Infact rdma_create_qp uses the ibv_create_qp (now changed to rdmav_create_qp) call internally. PS : What is the opinion on this patchset ? Thanks, - KK From monil at voltaire.com Wed Aug 9 01:55:07 2006 From: monil at voltaire.com (Moni Levy) Date: Wed, 9 Aug 2006 11:55:07 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA763D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA763D@mtlexch01.mtl.com> Message-ID: <6a122cc00608090155n4a47f5d6w5d30f4c5bdc1db79@mail.gmail.com> Hi, Tziporet, On 8/8/06, Tziporet Koren wrote: > o iSER: > - Stability > - Testing more platforms (e.g. ppc64 and ia64) > - Performance improvements Only number two above is in the scope of OFED from our perspective, so we prefer to have it listed alone. > 2. iSER support in install script for SLES 10 is missing We have a fix for that and it will be part of RC2 -- Moni From zhushisongzhu at yahoo.com Wed Aug 9 02:28:27 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Wed, 9 Aug 2006 02:28:27 -0700 (PDT) Subject: [openib-general] OFED 1.0 - Official Release (Tziporet Koren) In-Reply-To: <44CC79D7.3030805@mellanox.co.il> Message-ID: <20060809092827.88383.qmail@web36902.mail.mud.yahoo.com> Dear Sir, Can I test the new release fixed for large SDP connections? tks zhu --- Tziporet Koren wrote: > > > zhu shi song wrote: > > Dear Sir, > > what's your progress on sdp connection? I'm > waiting > > for the new release to test. > > tks > > zhu > > > > Progress is very good, and we succeeded to run > Polygraph with 800 > connections for few days. > > RC1 is expected this week so you will be able to > test it yourself > > Tziporet > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Wed Aug 9 03:34:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Aug 2006 13:34:23 +0300 Subject: [openib-general] OFED 1.0 - Official Release (Tziporet Koren) In-Reply-To: <20060809092827.88383.qmail@web36902.mail.mud.yahoo.com> References: <20060809092827.88383.qmail@web36902.mail.mud.yahoo.com> Message-ID: <20060809103423.GV20848@mellanox.co.il> Things work well on 64 bit systems. Quoting r. zhu shi song : Subject: Re: OFED 1.0 - Official Release (Tziporet Koren) Dear Sir, Can I test the new release fixed for large SDP connections? tks zhu --- Tziporet Koren wrote: > > > zhu shi song wrote: > > Dear Sir, > > what's your progress on sdp connection? I'm > waiting > > for the new release to test. > > tks > > zhu > > > > Progress is very good, and we succeeded to run > Polygraph with 800 > connections for few days. > > RC1 is expected this week so you will be able to > test it yourself > > Tziporet > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From halr at voltaire.com Wed Aug 9 11:03:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2006 14:03:54 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA763D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA763D@mtlexch01.mtl.com> Message-ID: <1155146629.17511.287731.camel@hal.voltaire.com> On Tue, 2006-08-08 at 10:48, Tziporet Koren wrote: > Hi, > > In two week delay we publish OFED 1.1-RC1 on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > File: OFED-1.1-rc1.tgz Is there an update to the OFED 1.1 schedule going forward ? -- Hal 1. Schedule: ============ Target release date: 31-Aug Intermediate milestones: 1. Create 1.1 branch of user level code and rc1: 27-Jul 2. Feature freeze : 3-Aug 3. Code freeze (rc-x): 25-Aug 4. Final release: 31-Aug From bunk at stusta.de Wed Aug 9 11:50:48 2006 From: bunk at stusta.de (Adrian Bunk) Date: Wed, 9 Aug 2006 20:50:48 +0200 Subject: [openib-general] [patch 02/45] IB/mthca: restore missing PCI registers after reset In-Reply-To: <20060807233937.GB25326@mellanox.co.il> References: <20060807170451.GH3691@stusta.de> <20060807233455.GA25326@mellanox.co.il> <20060807233937.GB25326@mellanox.co.il> Message-ID: <20060809185048.GY3691@stusta.de> On Tue, Aug 08, 2006 at 02:39:37AM +0300, Michael S. Tsirkin wrote: > Quoting r. Michael S. Tsirkin : > > Subject: Re: [patch 02/45] IB/mthca: restore missing PCI registers after reset > > > > Quoting r. Adrian Bunk : > > > Thanks for this information, I've applied it. > > > > BTW, is there a git tree to see what you are cooking? > > Never mind, I found linux/kernel/git/stable/linux-2.6.16.y.git. > It says "owner Greg Kroah-Hartman" which is what confused me. Thanks for the note, I've fixed this. > MST cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed From halr at voltaire.com Wed Aug 9 13:04:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2006 16:04:41 -0400 Subject: [openib-general] [PATCH] libibumad: nit on short mad read In-Reply-To: <86psfk4une.fsf@mtl066.yok.mtl.com> References: <86psfk4une.fsf@mtl066.yok.mtl.com> Message-ID: <1155153833.17511.289621.camel@hal.voltaire.com> Hi Eitan, On Tue, 2006-08-01 at 06:28, Eitan Zahavi wrote: > Hi Hal > > This was reported to me by Ishai R. > > Consider function umad_recv line 810: > if ((n = read(port->dev_fd, umad, sizeof *mad + *length)) <= > sizeof *mad + *length) { > DEBUG("mad received by agent %d length %d", mad->agent_id, n); > *length = n - sizeof *mad; > return mad->agent_id; > } > > if (n == -EWOULDBLOCK) { > if (!errno) > errno = EWOULDBLOCK; > return n; > } > > Seems that umad.c umad_recv would never go through the second "if" > as if the read return n < 0 it will be cought by the first "if". > > Then I have noticed that a wrap around of the returned length is also > possible. > > The patch fixes these issue. > > Eitan Good catches. Thanks. Applied to both trunk and 1.1 branch. 
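A hedged sketch of the pattern the report above calls for (the helper below is hypothetical and not the umad.c code itself): check for a read error before any length comparison, and never let the reported payload length wrap below zero.

#include <unistd.h>
#include <errno.h>

static int read_one_mad(int fd, void *buf, size_t hdr_size, size_t *length)
{
        ssize_t n = read(fd, buf, hdr_size + *length);

        if (n < 0)
                return -errno;          /* handle the error path first */
        *length = ((size_t) n > hdr_size) ? (size_t) n - hdr_size : 0;
        return 0;                       /* caller picks the agent id out of buf */
}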
-- Hal From sashak at voltaire.com Wed Aug 9 13:41:00 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 9 Aug 2006 23:41:00 +0300 Subject: [openib-general] howto get copy of https://openib.org/svn repository Message-ID: <20060809204100.GX24920@sashak.voltaire.com> Hi, Maybe somebody know. How is it possible to download exact copy (not SVK mirrors) of https://openib.org/svn repository? I like to test some svn to git converter, but would prefer to not overload the network. Thanks, Sasha From mshefty at ichips.intel.com Wed Aug 9 14:07:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Aug 2006 14:07:45 -0700 Subject: [openib-general] return error when rdma_listen fails In-Reply-To: <20060808181928.GA15075@osc.edu> References: <20060808181928.GA15075@osc.edu> Message-ID: <44DA4EA1.7000902@ichips.intel.com> Pete Wyckoff wrote: > Calling rdma_listen() on a cm_id bound to INADDR_ANY can fail, e.g. > with EADDRINUSE, but report no error back to the user. This patch > fixes that by propagating the error. Success occurs only if at > least one of the possibly multiple devices in the system was able to > listen. In the case of multiple devices reporting errors on listen, > only the first error value is returned. iwarp branch. There's a problem if the listen is done before any devices have been added to the system. In this case, the listen should succeed. - Sean From ardavis at ichips.intel.com Wed Aug 9 14:26:10 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 09 Aug 2006 14:26:10 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1 planning meeting - summary In-Reply-To: <44CC78A5.3030104@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA74FE@mtlexch01.mtl.com> <44CA91BE.5050301@ichips.intel.com> <44CC78A5.3030104@mellanox.co.il> Message-ID: <44DA52F2.8080703@ichips.intel.com> >> Can we include librdmacm and dapl in the basic installation option? >> Also, it would be nice to have rdma_ucm and rdma_cm load on boot by >> default. >> >> Thanks, >> >> -arlin > > > This is a small change in the OFED scripts. > I suggest that if we go for this change we will do it for the HPC > install and not for the basic install (which includes only the verbs > and IPoIB). > > If there is no objection from anyone we will go for this change. > > Tziporet > I don't see this change in OFED 1.1 RC1. Please add librdmacm and dapl into the HPC install and make sure rdma_ucma and rdma_cma gets loaded during boot by default in RC2. Thanks, -arlin From rdreier at cisco.com Wed Aug 9 15:05:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 09 Aug 2006 15:05:14 -0700 Subject: [openib-general] howto get copy of https://openib.org/svn repository In-Reply-To: <20060809204100.GX24920@sashak.voltaire.com> (Sasha Khapyorsky's message of "Wed, 9 Aug 2006 23:41:00 +0300") References: <20060809204100.GX24920@sashak.voltaire.com> Message-ID: search for svn-mirror From manpreet at gmail.com Wed Aug 9 20:05:04 2006 From: manpreet at gmail.com (Manpreet Singh) Date: Wed, 9 Aug 2006 20:05:04 -0700 Subject: [openib-general] Outstanding RDMA operations Message-ID: <67897d690608092005i1e45bb8wa23b3f0105103cca@mail.gmail.com> Hi, Some time ago, I thought there were some thoughts about making the outstanding RDMA count to be configurable via a patch to the mthca driver (the 'rdb_per_qp' parameter in mthca_main.c). Just curious if the patch was going to go in at some point. Thanks, Manpreet. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tziporet at mellanox.co.il Thu Aug 10 00:50:21 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 10 Aug 2006 10:50:21 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1 planning meeting - summary Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7670@mtlexch01.mtl.com> You are correct - we forgot about it. Will be fixed in rc2 Can you open a bug in bugzilla for the installer package so we will not miss it this time? Thanks, Tziporet -----Original Message----- From: Arlin Davis [mailto:ardavis at ichips.intel.com] Sent: Thursday, August 10, 2006 12:26 AM To: Tziporet Koren Cc: openfabrics-ewg at openib.org; openib Subject: Re: [openib-general] [openfabrics-ewg] OFED 1.1 planning meeting - summary >> Can we include librdmacm and dapl in the basic installation option? >> Also, it would be nice to have rdma_ucm and rdma_cm load on boot by >> default. >> >> Thanks, >> >> -arlin > > > This is a small change in the OFED scripts. > I suggest that if we go for this change we will do it for the HPC > install and not for the basic install (which includes only the verbs > and IPoIB). > > If there is no objection from anyone we will go for this change. > > Tziporet > I don't see this change in OFED 1.1 RC1. Please add librdmacm and dapl into the HPC install and make sure rdma_ucma and rdma_cma gets loaded during boot by default in RC2. Thanks, -arlin From Abhijit.Gadgil at pantasys.com Thu Aug 10 04:21:21 2006 From: Abhijit.Gadgil at pantasys.com (Abhijit Gadgil) Date: Thu, 10 Aug 2006 04:21:21 -0700 (PDT) Subject: [openib-general] umad_recv won't block after first read... Message-ID: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> Hi All, I am trying to write a simple program using libibumad to 'subscribe' for traps and then receive traps from the SA. Most of the things seem to work fine, however I am facing a small problem where, after first read for the trap, all subsequent reads are not blocking (and return some incorrect length). Attached is the simple code, can someone tell, what exactly is wrong out here? 
Thanks -abhijit -------------- next part -------------- #include #include #include #include #define MAX_HCAS (4) #define MAX_PORTS_PER_CA (2) #define SA_INFORMINFO_LEN (60) #define SA_INFORMINFO_OFFSET (56) #define SA_HEADER_LENGTH (20) #define SA_CLASS_VERSION (2) char ca_names[MAX_HCAS][UMAD_CA_NAME_LEN]; umad_ca_t cas[MAX_HCAS]; uint64_t query_tid = 0; static int set_bit(int nr, void *method_mask) { int mask, retval; long *addr = method_mask; addr += nr >> 5; mask = 1 << (nr & 0x1f); retval = (mask & *addr) != 0; *addr |= mask; return retval; } static void init_sa_headers(void *mad) { mad_set_field(mad, 0, IB_MAD_RESPONSE_F, 0); mad_set_field(mad, 0, IB_MAD_CLASSVER_F, SA_CLASS_VERSION); mad_set_field(mad, 0, IB_MAD_MGMTCLASS_F, IB_SA_CLASS); mad_set_field(mad, 0, IB_MAD_BASEVER_F, 1); mad_set_field(mad, 0, IB_MAD_STATUS_F, 0); mad_set_field64(mad, 0, IB_MAD_TRID_F, query_tid++); mad_set_field(mad, 0, IB_MAD_ATTRID_F, IB_SA_ATTR_INFORMINFO); mad_set_field(mad, 0, IB_MAD_ATTRMOD_F, 0); mad_set_field(mad, 0, IB_SA_RMPP_VERS_F, 1); mad_set_field(mad, 0, IB_SA_RMPP_TYPE_F, IB_RMPP_TYPE_DATA); mad_set_field(mad, 0, IB_SA_RMPP_RESP_F, 0x1f); mad_set_field(mad, 0, IB_SA_RMPP_FLAGS_F, IB_RMPP_FLAG_ACTIVE | IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST ); mad_set_field(mad, 0, IB_SA_RMPP_STATUS_F, 0); mad_set_field(mad, 0, IB_SA_RMPP_D1_F, 1); /* First packet */ mad_set_field(mad, 0, IB_SA_RMPP_D2_F, SA_INFORMINFO_LEN + SA_HEADER_LENGTH); /* Size of Informinfo */ } void init_informinfo_set(void *umad_in) { union ibv_gid gid; union ibv_gid mcast_gid; void *infinfo_set_mad = umad_get_mad(umad_in); uint8_t char_id = 0x01; uint8_t char_void = 0x00; uint32_t vendorid = 0x00000004; mcast_gid.global.subnet_prefix = 0; mcast_gid.global.interface_id = 0; /* hard coded for the time being */ gid.global.subnet_prefix = cas[0].ports[1]->gid_prefix; gid.global.interface_id = cas[0].ports[1]->port_guid; init_sa_headers(infinfo_set_mad); mad_set_field(infinfo_set_mad, 0, IB_MAD_METHOD_F, IB_MAD_METHOD_SET); /* mad_set_array(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_SUBGID_F, &gid); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_ENUM_F, 0x0000); mad_set_field64(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_RESV0_F, 0x00UL); */ mad_set_array(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_GID_F, &mcast_gid); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_LID_BEGIN_F, 0xFFFF); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_LID_END_F, 0xFFFF); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_RESV1_F, (uint16_t)0x0000); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_ISGENERIC_F, char_id); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_SUBSCRIBE_F, char_id); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_TYPE_F, 0xFFFF); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_TRAP_DEVID_F, 0xFFFF); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_QPN_F, 0x123456); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_RES2_F, 0x00); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_RESPTIME_F, 0x12); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_RESV3_F, char_void); mad_set_field(infinfo_set_mad, IB_SA_DATA_OFFS, IB_SA_INFINFO_VENDORID_F, 0x00000004); } int main() { int num, err = 0, i,j, result ; struct ibv_device *iter; int mad_port_hndl, mad_port_found, sa_agent_hndl, trap_umad_hndl, trap_umad_sa_agent; umad_port_t tmp; void *umadbuf; void *recv_umad; int 
umad_recv_size; uint8_t *response; uint32_t method_mask[4]; err = umad_init(); if(err) { printf("Error in umad_init %d\n", err); return -1; } umadbuf = umad_alloc(1, umad_size() + IB_MAD_SIZE); if(!umadbuf) { printf("Error allocating umad buffer\n"); return umad_done(); } recv_umad = umad_alloc(1, umad_size() + IB_MAD_SIZE); if(!recv_umad) { printf("Error allocating umad buffer\n"); return umad_done(); } num = umad_get_cas_names(ca_names, MAX_HCAS); if(num < 0) { printf("Error in umad_get_cas_names %d\n", err); return umad_done(); } /* ** Get the first 'ACTIVE' port of the first 'CA'. If found ** nothing, bailout. ** FIXME: This we'll do later. Apparantly, the libibumad allows to ** do umad_open_port even if port state is not ACTIVE. This is pretty useless. */ for(i = 0; i < num; i++) { err = umad_get_ca(ca_names[i], &cas[i]); if(err < 0) { printf("Error in umad_get_ca %d\n", err); return umad_done(); } /* LATER : :-P for(j = 0; j < cas[i].numports; j++) { printf("ok!\n"); memset(&tmp, 0, sizeof(tmp)); umad_get_port(cas[i].ca_name, (j+1), &tmp); printf("port %d of HCA %s is in the state %d\n", (j+1), cas[i].ca_name, tmp.state); mad_port_hndl = umad_open_port(cas[i].ca_name, (j+1), ); if(mad_port_hndl < 0) { continue; } else { printf("opened port %d of HCA %s\n", (j+1), cas[i].ca_name); mad_port_found = 1; break; } } if(mad_port_found) { break; } */ /* * following code works only when there is one HCA in the system */ mad_port_hndl = umad_open_port(cas[i].ca_name, 2); if(mad_port_hndl < 0) { continue; } else { printf("opened port %d of HCA %s: port_handle %d\n", 2, cas[i].ca_name, mad_port_hndl); mad_port_found = 1; break; } } err = umad_register(mad_port_hndl, IB_SA_CLASS, SA_CLASS_VERSION, 1, 0); if(err < 0) { printf("Error in umad_register %d\n", err); } else { sa_agent_hndl = err; } printf("sa_agent_hndl %d\n", sa_agent_hndl); memset(umadbuf, 0, umad_size() + IB_MAD_SIZE); init_informinfo_set(umadbuf); /* * (0x12) happens to be the SM_LID on the subnet */ umad_set_addr(umadbuf, (0x12), 0x1, 0x0, IB_DEFAULT_QP1_QKEY); result = umad_send(mad_port_hndl, sa_agent_hndl, umadbuf, (IB_MAD_SIZE), 100, 5); if(result < 0) { printf("Error %d in umad_send\n", result); return -1; } umad_recv_size = IB_MAD_SIZE; result = umad_recv(mad_port_hndl, umadbuf, &umad_recv_size, -1); if(result >= 0) { uint32_t val; printf("umad_status is %x\n", umad_status(umadbuf)); response = umad_get_mad(umadbuf); sleep(1); } else { printf("Error in subscription\n"); return -1; } err = umad_unregister(mad_port_hndl, sa_agent_hndl); if(err < 0) { printf("Error in unregistering...\n"); } err = umad_close_port(mad_port_hndl); if(err) { printf("Error in umad_close_port %d\n", err); } memset(method_mask, 0, sizeof(method_mask)); set_bit(IB_MAD_METHOD_REPORT, &method_mask); set_bit(IB_MAD_METHOD_TRAP, &method_mask); set_bit(IB_MAD_METHOD_TRAP_REPRESS, &method_mask); /* * now prepare to receive the traps */ trap_umad_hndl = umad_open_port(cas[0].ca_name, 2); if(trap_umad_hndl < 0) { printf("Error opening port to receive forwarded traps\n"); return -1; } err = umad_register(trap_umad_hndl, IB_SA_CLASS, SA_CLASS_VERSION, 0, method_mask); if(err < 0) { printf("Error in registering for receiving Traps without RMPP\n"); return -1; } else { trap_umad_sa_agent = err; } /* * Wait forever for TRAPS */ while(1) { memset(recv_umad, 0, umad_size() + IB_MAD_SIZE); umad_recv_size = IB_MAD_SIZE; result = umad_recv(trap_umad_hndl, recv_umad, &umad_recv_size, -1); if(result >= 0) { uint32_t val; uint64_t val64; printf("\n\n********** umad dump 
************\n"); umad_dump(recv_umad); printf("********** umad dump end ************\n\n"); printf("result of umad_recv = %d\t", result); printf("umad_recv_size = %d\t", umad_recv_size); /* * dump some fields to indicate that we've received right things */ response = umad_get_mad(recv_umad); val = mad_get_field(response, 0, IB_MAD_METHOD_F); printf("method %x\t", val); val = mad_get_field(response, 0, IB_MAD_ATTRID_F); printf("attribute %x\t", val); val = mad_get_field(response, IB_SA_DATA_OFFS, IB_NOTICE_TYPE_F); printf("notice type %x\t", val); val = mad_get_field(response, IB_SA_DATA_OFFS, IB_NOTICE_TRAP_NUMBER_F); printf("trap number %x\t", val); val = mad_get_field(response, IB_SA_DATA_OFFS, IB_NOTICE_ISSUER_LID_F); printf("from LID %x\t", val); val64 = mad_get_field64(response, 0, IB_MAD_TRID_F); printf("Transaction id %lx\n", val64); mad_set_field(response, 0, IB_MAD_METHOD_F, IB_MAD_METHOD_REPORT_RESPONSE); mad_set_field(response, 0, IB_MAD_RESPONSE_F, 0x1); result = umad_send(trap_umad_hndl, trap_umad_sa_agent, recv_umad, IB_MAD_SIZE, 100, 0); printf("result of umad_send %d\n", result); } else { printf("Error %d in umad_recv \n", result); sleep(1); } } err = umad_close_port(trap_umad_hndl); if(err) { printf("Error in umad_close_port %d\n", err); } err = umad_release_ca(&cas[0]); if(err) { printf("Error in umad_release_ca %d\n", err); } err = umad_done(); if(err) { printf("Error in umad_done %d\n", err); } return err; } From yipeeyipeeyipeeyipee at yahoo.com Thu Aug 10 05:06:54 2006 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Thu, 10 Aug 2006 12:06:54 +0000 (UTC) Subject: [openib-general] HCA not recognized by OFED References: <20060808164103.GA30817@mellanox.co.il> <2376B63A5AF8564F8A2A2D76BC6DB0339B12F6@CINMLVEM11.e2k.ad.ge.com> Message-ID: Cain, Brian (GE Healthcare ge.com> writes: [snip] > I suppose I snipped a little too much when I posted the output of lspci. > It does look just as you indicate: "[InfiniHost III Lx HCA Flash > Recovery]". What's this "Flash Recovery" in your lspci output? sounds like you should reburn the hca with the latest firmware. Are you using Arbel firmware or the "Tavor compatability" firmware? What kind of motherboard are you using? >From my experience you should DISABLE the "MMIO above 4GB" in your BIOS settings (under 'PCI configuration') in order to make ARBEL-mode hca's work. Bye, y From halr at voltaire.com Thu Aug 10 06:31:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Aug 2006 09:31:42 -0400 Subject: [openib-general] umad_recv won't block after first read... In-Reply-To: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> Message-ID: <1155216690.17511.312291.camel@hal.voltaire.com> Hi Abhijit, On Thu, 2006-08-10 at 07:21, Abhijit Gadgil wrote: > Hi All, > > I am trying to write a simple program using libibumad to 'subscribe' for traps and then receive traps from the SA. Most of the things seem to work fine, however I am facing a small problem where, after first read for the trap, all subsequent reads are not blocking (and return some incorrect length). What do those calls return ? What version of management are you using ? > Attached is the simple code, can someone tell, what exactly is wrong out here? I didn't build and run this so my comments are based on just looking at the code. I don't think it would build as there are other changes needed to support this (e.g. IB_SA_INFINFO_XXX in libibmad at a minimum). 
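For orientation, the receive side of the attached test program boils down to the sketch below. This is not the attached code verbatim: the InformInfo subscription, trap handling and most error reporting are left out, and the CA index (0), port number (1) and SA class version (2) are placeholders rather than values taken from the original post. It only shows the public libibumad calls involved and the two details this thread turns on: the length variable is reset to IB_MAD_SIZE before every receive, and the -1 timeout asks umad_recv() to wait indefinitely.

#include <stdio.h>
#include <infiniband/umad.h>
#include <infiniband/mad.h>   /* IB_MAD_SIZE, IB_SA_CLASS */

int main(void)
{
	char cas[4][UMAD_CA_NAME_LEN];
	void *umad;
	int portid, agent, len, rc;

	if (umad_init() < 0 || umad_get_cas_names(cas, 4) <= 0)
		return 1;

	portid = umad_open_port(cas[0], 1);   /* placeholder: first CA, port 1 */
	if (portid < 0)
		return 1;

	/* SA class, class version 2, no RMPP, no method mask */
	agent = umad_register(portid, IB_SA_CLASS, 2, 0, NULL);
	if (agent < 0)
		return 1;

	umad = umad_alloc(1, umad_size() + IB_MAD_SIZE);

	for (;;) {
		len = IB_MAD_SIZE;                       /* reset before every receive */
		rc = umad_recv(portid, umad, &len, -1);  /* -1 requests a blocking read */
		if (rc < 0) {
			fprintf(stderr, "umad_recv failed: %d\n", rc);
			break;
		}
		printf("agent %d, length %d, status 0x%x\n",
		       rc, len, umad_status(umad));
	}

	umad_free(umad);
	umad_unregister(portid, agent);
	umad_close_port(portid);
	umad_done();
	return 0;
}

If a loop of this shape blocks on every iteration, the misbehavior being reported most likely comes from the pieces the sketch leaves out rather than from the basic umad_recv() path.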
Is the main loop based on some operational program ? If so, which one ? A couple of specific comments: init_sa_headers: InformInfo does not actually use RMPP so the initialization here needs to change. Not sure what doing this would cause without actually building and running this. Based on this, what is the result of the subscription ? Does it really succeed ? main: Rather than hard coding SM LID to 0x12, there are ways to get this dynamically. There are examples of how to do this. -- Hal > Thanks > > -abhijit > > ______________________________________________________________________ > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tziporet at mellanox.co.il Thu Aug 10 06:42:03 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 10 Aug 2006 16:42:03 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> The schedule is sleeps in 2 weeks meaning: Target release date: 12-Sep Intermediate milestones: 1. Create 1.1 branch of user level: 27-Jul - done 2. RC1: 8-Aug - done 3. Feature freeze (RC2): 17-Aug 4. Code freeze (rc-x): 6-Sep 5. Final release: 12-Sep Tziporet -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, August 09, 2006 9:04 PM To: Tziporet Koren Cc: OpenFabricsEWG; openib Subject: Re: [openfabrics-ewg] OFED 1.1-rc1 is available On Tue, 2006-08-08 at 10:48, Tziporet Koren wrote: > Hi, > > In two week delay we publish OFED 1.1-RC1 on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > File: OFED-1.1-rc1.tgz Is there an update to the OFED 1.1 schedule going forward ? -- Hal 1. Schedule: ============ Target release date: 31-Aug Intermediate milestones: 1. Create 1.1 branch of user level code and rc1: 27-Jul 2. Feature freeze : 3-Aug 3. Code freeze (rc-x): 25-Aug 4. Final release: 31-Aug From Brian.Cain at ge.com Thu Aug 10 06:56:09 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Thu, 10 Aug 2006 09:56:09 -0400 Subject: [openib-general] HCA not recognized by OFED In-Reply-To: Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB033A05CB0@CINMLVEM11.e2k.ad.ge.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of yipee > Sent: Thursday, August 10, 2006 7:07 AM > To: openib-general at openib.org > Subject: Re: [openib-general] HCA not recognized by OFED > > Cain, Brian (GE Healthcare ge.com> writes: > [snip] > > > I suppose I snipped a little too much when I posted the > output of lspci. > > It does look just as you indicate: "[InfiniHost III Lx HCA Flash > > Recovery]". > > What's this "Flash Recovery" in your lspci output? sounds > like you should reburn > the hca with the latest firmware. ... Yeah, apparently the flash recovery jumper was shorted against a nut on the motherboard. We clipped the leads, reflashed the card and everything works well now. Thanks for the help, everyone. -Brian From Abhijit.Gadgil at pantasys.com Thu Aug 10 06:46:46 2006 From: Abhijit.Gadgil at pantasys.com (Abhijit Gadgil) Date: Thu, 10 Aug 2006 06:46:46 -0700 (PDT) Subject: [openib-general] umad_recv won't block after first read... 
References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com><1155216690.17511.312291.camel@hal.voltaire.com> Message-ID: <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> Hi Hal, Please see below. On Aug 10, 2006 07:01 PM, Hal Rosenstock wrote: > Hi Abhijit, > > On Thu, 2006-08-10 at 07:21, Abhijit Gadgil wrote: > > Hi All, > > > > I am trying to write a simple program using libibumad to 'subscribe' for traps and then receive traps from the SA. Most of the things seem to work fine, however I am facing a small problem where, after first read for the trap, all subsequent reads are not blocking (and return some incorrect length). > > What do those calls return ? What version of management are you using ? > I am running the management code from the SVN (svn release 8781, it may be slightly outdated!) > > Attached is the simple code, can someone tell, what exactly is wrong out here? > > I didn't build and run this so my comments are based on just looking at > the code. I don't think it would build as there are other changes needed > to support this (e.g. IB_SA_INFINFO_XXX in libibmad at a minimum). > Oh I am sorry, I didn't mention this before, I modified the libibmad sources (specifically src/fields.c and include/infiniband/mad.h) files to accomplish this. Once I get it right, I will submit a patch. (It's too hacky right now) > Is the main loop based on some operational program ? If so, which one ? > > A couple of specific comments: > > init_sa_headers: InformInfo does not actually use RMPP so the > initialization here needs to change. Not sure what doing this would > cause without actually building and running this. > This was my first try of trying to use umad, hence for simplicity I copied from some reference code that was having RMPP enabled. I think I should get rid of this as well. > Based on this, what is the result of the subscription ? Does it really > succeed ? Well the subscriptions in-deed succeeded and I was able to receive IPoIB broadcast multicast group creation/deletion traps as well, but the problem mentioned below (ie. non-blocking reads) started appearing. > main: Rather than hard coding SM LID to 0x12, there are ways to get this > dynamically. There are examples of how to do this. Sorry about this again. I realized it later that it is stupid to hard code it (eg. I could have got it from the ca[].port->sm_lid), will fix that eventually. Thanks. -abhijit > -- Hal > > > Thanks > > > > -abhijit > > > > ______________________________________________________________________ > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tziporet at mellanox.co.il Thu Aug 10 07:22:07 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 10 Aug 2006 17:22:07 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> of course I meant slips (we are all awake here :-) -----Original Message----- From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Tziporet Koren Sent: Thursday, August 10, 2006 4:42 PM To: Hal Rosenstock Cc: OpenFabricsEWG; openib Subject: Re: [openfabrics-ewg] OFED 1.1-rc1 is available The schedule is sleeps in 2 weeks meaning: Target release date: 12-Sep Intermediate milestones: 1. 
Create 1.1 branch of user level: 27-Jul - done 2. RC1: 8-Aug - done 3. Feature freeze (RC2): 17-Aug 4. Code freeze (rc-x): 6-Sep 5. Final release: 12-Sep Tziporet -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, August 09, 2006 9:04 PM To: Tziporet Koren Cc: OpenFabricsEWG; openib Subject: Re: [openfabrics-ewg] OFED 1.1-rc1 is available On Tue, 2006-08-08 at 10:48, Tziporet Koren wrote: > Hi, > > In two week delay we publish OFED 1.1-RC1 on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > File: OFED-1.1-rc1.tgz Is there an update to the OFED 1.1 schedule going forward ? -- Hal 1. Schedule: ============ Target release date: 31-Aug Intermediate milestones: 1. Create 1.1 branch of user level code and rc1: 27-Jul 2. Feature freeze : 3-Aug 3. Code freeze (rc-x): 25-Aug 4. Final release: 31-Aug _______________________________________________ openfabrics-ewg mailing list openfabrics-ewg at openib.org http://openib.org/mailman/listinfo/openfabrics-ewg From halr at voltaire.com Thu Aug 10 07:32:35 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Aug 2006 10:32:35 -0400 Subject: [openib-general] umad_recv won't block after first read... In-Reply-To: <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> Message-ID: <1155220349.17511.313498.camel@hal.voltaire.com> Hi again Abhijit, On Thu, 2006-08-10 at 09:46, Abhijit Gadgil wrote: > Hi Hal, > > Please see below. > > On Aug 10, 2006 07:01 PM, Hal Rosenstock wrote: > > > Hi Abhijit, > > > > On Thu, 2006-08-10 at 07:21, Abhijit Gadgil wrote: > > > Hi All, > > > > > > I am trying to write a simple program using libibumad to 'subscribe' for traps and then receive traps from the SA. Most of the things seem to work fine, however I am facing a small problem where, after first read for the trap, all subsequent reads are not blocking (and return some incorrect length). > > > > What do those calls return ? What version of management are you using ? > > > > I am running the management code from the SVN (svn release 8781, it may be slightly outdated!) A fix just went in to libibumad:umad_recv which may impact your results. Can you update this and retry ? What do the reads return other than incorrect length ? -- Hal > > > Attached is the simple code, can someone tell, what exactly is wrong out here? > > > > I didn't build and run this so my comments are based on just looking at > > the code. I don't think it would build as there are other changes needed > > to support this (e.g. IB_SA_INFINFO_XXX in libibmad at a minimum). > > > > Oh I am sorry, I didn't mention this before, I modified the libibmad sources (specifically src/fields.c and include/infiniband/mad.h) files to accomplish this. Once I get it right, I will submit a patch. (It's too hacky right now) > > > Is the main loop based on some operational program ? If so, which one ? > > > > A couple of specific comments: > > > > init_sa_headers: InformInfo does not actually use RMPP so the > > initialization here needs to change. Not sure what doing this would > > cause without actually building and running this. > > > > This was my first try of trying to use umad, hence for simplicity I copied from some reference code that was having RMPP enabled. I think I should get rid of this as well. > > > > Based on this, what is the result of the subscription ? 
Does it really > > succeed ? > > Well the subscriptions in-deed succeeded and I was able to receive IPoIB broadcast multicast group creation/deletion traps as well, but the problem mentioned below (ie. non-blocking reads) started appearing. > > > main: Rather than hard coding SM LID to 0x12, there are ways to get this > > dynamically. There are examples of how to do this. > > Sorry about this again. I realized it later that it is stupid to hard code it (eg. I could have got it from the ca[].port->sm_lid), will fix that eventually. > > Thanks. > > -abhijit > > > -- Hal > > > > > Thanks > > > > > > -abhijit > > > > > > ______________________________________________________________________ > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From Abhijit.Gadgil at pantasys.com Thu Aug 10 07:55:52 2006 From: Abhijit.Gadgil at pantasys.com (Abhijit Gadgil) Date: Thu, 10 Aug 2006 07:55:52 -0700 (PDT) Subject: [openib-general] umad_recv won't block after first read... References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com><1155220349.17511.313498.camel@hal.voltaire.com> Message-ID: <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> Hi Hal, I tried using the umad code as per the latest repository. (The latest fix is on libibumad/umad.c Line # 806 right?) I manually applied that patch. It doesn't seem to work yet. Infact, what I figured out was that the 'poll' on the umad->fd isn't blocking either. The read returns the correct 'mad_agent' ie. 0 in this case and some length which is usually 24 for the specific code. I am attaching the local copy of infiniband/include/mad.h and src/fields.c, so that you may be able to try this code. (There may be stray printf's in those files!). Also, since I was not quite clear about whether the subscriptions should include the RID information (as per section 15.2.5), so I tried including it first, which the SA doesn't seem to like, but the subscriptions work after I get rid of the RID header. This particular aspect is not quite clear to me yet. Please let me know what you find. Regards. -abhijit On Aug 10, 2006 08:02 PM, Hal Rosenstock wrote: > Hi again Abhijit, > > On Thu, 2006-08-10 at 09:46, Abhijit Gadgil wrote: > > Hi Hal, > > > > Please see below. > > > > On Aug 10, 2006 07:01 PM, Hal Rosenstock wrote: > > > > > Hi Abhijit, > > > > > > On Thu, 2006-08-10 at 07:21, Abhijit Gadgil wrote: > > > > Hi All, > > > > > > > > I am trying to write a simple program using libibumad to 'subscribe' for traps and then receive traps from the SA. Most of the things seem to work fine, however I am facing a small problem where, after first read for the trap, all subsequent reads are not blocking (and return some incorrect length). > > > > > > What do those calls return ? What version of management are you using ? > > > > > > > I am running the management code from the SVN (svn release 8781, it may be slightly outdated!) > > A fix just went in to libibumad:umad_recv which may impact your results. > Can you update this and retry ? > > What do the reads return other than incorrect length ? > > -- Hal > > > > > Attached is the simple code, can someone tell, what exactly is wrong out here? 
> > > > > > I didn't build and run this so my comments are based on just looking at > > > the code. I don't think it would build as there are other changes needed > > > to support this (e.g. IB_SA_INFINFO_XXX in libibmad at a minimum). > > > > > > > Oh I am sorry, I didn't mention this before, I modified the libibmad sources (specifically src/fields.c and include/infiniband/mad.h) files to accomplish this. Once I get it right, I will submit a patch. (It's too hacky right now) > > > > > Is the main loop based on some operational program ? If so, which one ? > > > > > > A couple of specific comments: > > > > > > init_sa_headers: InformInfo does not actually use RMPP so the > > > initialization here needs to change. Not sure what doing this would > > > cause without actually building and running this. > > > > > > > This was my first try of trying to use umad, hence for simplicity I copied from some reference code that was having RMPP enabled. I think I should get rid of this as well. > > > > > > > Based on this, what is the result of the subscription ? Does it really > > > succeed ? > > > > Well the subscriptions in-deed succeeded and I was able to receive IPoIB broadcast multicast group creation/deletion traps as well, but the problem mentioned below (ie. non-blocking reads) started appearing. > > > > > main: Rather than hard coding SM LID to 0x12, there are ways to get this > > > dynamically. There are examples of how to do this. > > > > Sorry about this again. I realized it later that it is stupid to hard code it (eg. I could have got it from the ca[].port->sm_lid), will fix that eventually. > > > > Thanks. > > > > -abhijit > > > > > -- Hal > > > > > > > Thanks > > > > > > > > -abhijit -------------- next part -------------- /* * Copyright (c) 2004,2005 Voltaire Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* * $Id: fields.c 8484 2006-07-10 23:13:53Z halr $ */ #if HAVE_CONFIG_H # include #endif /* HAVE_CONFIG_H */ #include #include #include #include #include #include /* * BITSOFFS and BE_OFFS are required due the fact that the bit offsets are inconsistently * encoded in the IB spec - IB headers are encoded such that the bit offsets * are in big endian convention (BE_OFFS), while the SMI/GSI queries data fields bit * offsets are specified using real bit offset (?!) * The following macros normalize everything to big endian offsets. */ #define BITSOFFS(o, w) (((o) & ~31) | ((32 - ((o) & 31) - (w)))), (w) #define BE_OFFS(o, w) (o), (w) #define BE_TO_BITSOFFS(o, w) (((o) & ~31) | ((32 - ((o) & 31) - (w)))) ib_field_t ib_mad_f [] = { [0] {0, 0}, /* IB_NO_FIELD - reserved as invalid */ [IB_GID_PREFIX_F] {0, 64, "GidPrefix", mad_dump_rhex}, [IB_GID_GUID_F] {64, 64, "GidGuid", mad_dump_rhex}, /* * MAD: common MAD fields (IB spec 13.4.2) * SMP: Subnet Management packets - lid routed (IB spec 14.2.1.1) * DSMP: Subnet Management packets - direct route (IB spec 14.2.1.2) * SA: Subnet Administration packets (IB spec 15.2.1.1) */ /* first MAD word (0-3 bytes) */ [IB_MAD_METHOD_F] {BE_OFFS(0, 7), "MadMethod", mad_dump_hex}, /* TODO: add dumper */ [IB_MAD_RESPONSE_F] {BE_OFFS(7, 1), "MadIsResponse", mad_dump_uint}, /* TODO: add dumper */ [IB_MAD_CLASSVER_F] {BE_OFFS(8, 8), "MadClassVersion", mad_dump_uint}, [IB_MAD_MGMTCLASS_F] {BE_OFFS(16, 8), "MadMgmtClass", mad_dump_uint}, /* TODO: add dumper */ [IB_MAD_BASEVER_F] {BE_OFFS(24, 8), "MadBaseVersion", mad_dump_uint}, /* second MAD word (4-7 bytes) */ [IB_MAD_STATUS_F] {BE_OFFS(48, 16), "MadStatus", mad_dump_hex}, /* TODO: add dumper */ /* DR SMP only */ [IB_DRSMP_HOPCNT_F] {BE_OFFS(32, 8), "DrSmpHopCnt", mad_dump_uint}, [IB_DRSMP_HOPPTR_F] {BE_OFFS(40, 8), "DrSmpHopPtr", mad_dump_uint}, [IB_DRSMP_STATUS_F] {BE_OFFS(48, 15), "DrSmpStatus", mad_dump_hex}, /* TODO: add dumper */ [IB_DRSMP_DIRECTION_F] {BE_OFFS(63, 1), "DrSmpDirection", mad_dump_uint}, /* TODO: add dumper */ /* words 3,4,5,6 (8-23 bytes) */ [IB_MAD_TRID_F] {64, 64, "MadTRID", mad_dump_hex}, [IB_MAD_ATTRID_F] {BE_OFFS(144, 16), "MadAttr", mad_dump_hex}, /* TODO: add dumper */ [IB_MAD_ATTRMOD_F] {160, 32, "MadModifier", mad_dump_hex}, /* TODO: add dumper */ /* word 7,8 (24-31 bytes) */ [IB_MAD_MKEY_F] {196, 64, "MadMkey", mad_dump_hex}, /* word 9 (32-37 bytes) */ [IB_DRSMP_DRDLID_F] {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex}, [IB_DRSMP_DRSLID_F] {BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex}, /* word 12 (44-47 bytes) */ [IB_SA_ATTROFFS_F] {BE_OFFS(46*8, 16), "SaAttrOffs", mad_dump_uint}, /* word 13,14 (48-55 bytes) */ [IB_SA_COMPMASK_F] {48*8, 64, "SaCompMask", mad_dump_hex}, /* word 13,14 (56-255 bytes) */ [IB_SA_DATA_F] {56*8, (256-56)*8, "SaData", mad_dump_hex}, [IB_DRSMP_PATH_F] {1024, 512, "DrSmpPath", mad_dump_hex}, [IB_DRSMP_RPATH_F] {1536, 512, "DrSmpRetPath", mad_dump_hex}, [IB_GS_DATA_F] {64*8, (256-64) * 8, "GsData", mad_dump_hex}, /* * PortInfo fields: */ [IB_PORT_MKEY_F] {0, 64, "Mkey", mad_dump_hex}, [IB_PORT_GID_PREFIX_F] {64, 64, "GidPrefix", mad_dump_hex}, [IB_PORT_LID_F] {BITSOFFS(128, 16), "Lid", mad_dump_hex}, [IB_PORT_SMLID_F] {BITSOFFS(144, 16), "SMLid", mad_dump_hex}, [IB_PORT_CAPMASK_F] {160, 32, "CapMask", mad_dump_portcapmask}, [IB_PORT_DIAG_F] {BITSOFFS(192, 16), "DiagCode", mad_dump_hex}, [IB_PORT_MKEY_LEASE_F] {BITSOFFS(208, 16), "MkeyLeasePeriod", mad_dump_uint}, [IB_PORT_LOCAL_PORT_F] {BITSOFFS(224, 8), "LocalPort", mad_dump_uint}, [IB_PORT_LINK_WIDTH_ENABLED_F] 
{BITSOFFS(232, 8), "LinkWidthEnabled", mad_dump_linkwidthen}, [IB_PORT_LINK_WIDTH_SUPPORTED_F] {BITSOFFS(240, 8), "LinkWidthSupported", mad_dump_linkwidthsup}, [IB_PORT_LINK_WIDTH_ACTIVE_F] {BITSOFFS(248, 8), "LinkWidthActive", mad_dump_linkwidth}, [IB_PORT_LINK_SPEED_SUPPORTED_F] {BITSOFFS(256, 4), "LinkSpeedSupported", mad_dump_linkspeedsup}, [IB_PORT_STATE_F] {BITSOFFS(260, 4), "LinkState", mad_dump_portstate}, [IB_PORT_PHYS_STATE_F] {BITSOFFS(264, 4), "PhysLinkState", mad_dump_physportstate}, [IB_PORT_LINK_DOWN_DEF_F] {BITSOFFS(268, 4), "LinkDownDefState", mad_dump_linkdowndefstate}, [IB_PORT_MKEY_PROT_BITS_F] {BITSOFFS(272, 2), "ProtectBits", mad_dump_uint}, [IB_PORT_LMC_F] {BITSOFFS(277, 3), "LMC", mad_dump_uint}, [IB_PORT_LINK_SPEED_ACTIVE_F] {BITSOFFS(280, 4), "LinkSpeedActive", mad_dump_linkspeed}, [IB_PORT_LINK_SPEED_ENABLED_F] {BITSOFFS(284, 4), "LinkSpeedEnabled", mad_dump_linkspeeden}, [IB_PORT_NEIGHBOR_MTU_F] {BITSOFFS(288, 4), "NeighborMTU", mad_dump_mtu}, [IB_PORT_SMSL_F] {BITSOFFS(292, 4), "SMSL", mad_dump_uint}, [IB_PORT_VL_CAP_F] {BITSOFFS(296, 4), "VLCap", mad_dump_vlcap}, [IB_PORT_INIT_TYPE_F] {BITSOFFS(300, 4), "InitType", mad_dump_hex}, [IB_PORT_VL_HIGH_LIMIT_F] {BITSOFFS(304, 8), "VLHighLimit", mad_dump_uint}, [IB_PORT_VL_ARBITRATION_HIGH_CAP_F] {BITSOFFS(312, 8), "VLArbHighCap", mad_dump_uint}, [IB_PORT_VL_ARBITRATION_LOW_CAP_F] {BITSOFFS(320, 8), "VLArbLowCap", mad_dump_uint}, [IB_PORT_INIT_TYPE_REPLY_F] {BITSOFFS(328, 4), "InitReply", mad_dump_hex}, [IB_PORT_MTRU_CAP_F] {BITSOFFS(332, 4), "MtuCap", mad_dump_mtu}, [IB_PORT_VL_STALL_COUNT_F] {BITSOFFS(336, 3), "VLStallCount", mad_dump_uint}, [IB_PORT_HOQ_LIFE_F] {BITSOFFS(339, 5), "HoqLife", mad_dump_uint}, [IB_PORT_OPER_VLS_F] {BITSOFFS(344, 4), "OperVLs", mad_dump_opervls}, [IB_PORT_PART_EN_INB_F] {BITSOFFS(348, 1), "PartEnforceInb", mad_dump_uint}, [IB_PORT_PART_EN_OUTB_F] {BITSOFFS(349, 1), "PartEnforceOutb", mad_dump_uint}, [IB_PORT_FILTER_RAW_INB_F] {BITSOFFS(350, 1), "FilterRawInb", mad_dump_uint}, [IB_PORT_FILTER_RAW_OUTB_F] {BITSOFFS(351, 1), "FilterRawOutb", mad_dump_uint}, [IB_PORT_MKEY_VIOL_F] {BITSOFFS(352, 16), "MkeyViolations", mad_dump_uint}, [IB_PORT_PKEY_VIOL_F] {BITSOFFS(368, 16), "PkeyViolations", mad_dump_uint}, [IB_PORT_QKEY_VIOL_F] {BITSOFFS(384, 16), "QkeyViolations", mad_dump_uint}, [IB_PORT_GUID_CAP_F] {BITSOFFS(400, 8), "GuidCap", mad_dump_uint}, [IB_PORT_CLIENT_REREG_F] {BITSOFFS(408, 1), "ClientReregister", mad_dump_uint}, [IB_PORT_SUBN_TIMEOUT_F] {BITSOFFS(411, 5), "SubnetTimeout", mad_dump_uint}, [IB_PORT_RESP_TIME_VAL_F] {BITSOFFS(419, 5), "RespTimeVal", mad_dump_uint}, [IB_PORT_LOCAL_PHYS_ERR_F] {BITSOFFS(424, 4), "LocalPhysErr", mad_dump_uint}, [IB_PORT_OVERRUN_ERR_F] {BITSOFFS(428, 4), "OverrunErr", mad_dump_uint}, [IB_PORT_MAX_CREDIT_HINT_F] {BITSOFFS(432, 16), "MaxCreditHint", mad_dump_uint}, [IB_PORT_LINK_ROUND_TRIP_F] {BITSOFFS(456, 24), "RoundTrip", mad_dump_uint}, /* * NodeInfo fields: */ [IB_NODE_BASE_VERS_F] {BITSOFFS(0,8), "BaseVers", mad_dump_uint}, [IB_NODE_CLASS_VERS_F] {BITSOFFS(8,8), "ClassVers", mad_dump_uint}, [IB_NODE_TYPE_F] {BITSOFFS(16,8), "NodeType", mad_dump_node_type}, [IB_NODE_NPORTS_F] {BITSOFFS(24,8), "NumPorts", mad_dump_uint}, [IB_NODE_SYSTEM_GUID_F] {32, 64, "SystemGuid", mad_dump_hex}, [IB_NODE_GUID_F] {96, 64, "Guid", mad_dump_hex}, [IB_NODE_PORT_GUID_F] {160, 64, "PortGuid", mad_dump_hex}, [IB_NODE_PARTITION_CAP_F] {BITSOFFS(224,16), "PartCap", mad_dump_uint}, [IB_NODE_DEVID_F] {BITSOFFS(240,16), "DevId", mad_dump_hex}, [IB_NODE_REVISION_F] {256, 
32, "Revision", mad_dump_hex}, [IB_NODE_LOCAL_PORT_F] {BITSOFFS(288,8), "LocalPort", mad_dump_uint}, [IB_NODE_VENDORID_F] {BITSOFFS(296,24), "VendorId", mad_dump_hex}, /* * SwitchInfo fields: */ [IB_SW_LINEAR_FDB_CAP_F] {BITSOFFS(0, 16), "LinearFdbCap", mad_dump_uint}, [IB_SW_RANDOM_FDB_CAP_F] {BITSOFFS(16, 16), "RandomFdbCap", mad_dump_uint}, [IB_SW_MCAST_FDB_CAP_F] {BITSOFFS(32, 16), "McastFdbCap", mad_dump_uint}, [IB_SW_LINEAR_FDB_TOP_F] {BITSOFFS(48, 16), "LinearFdbTop", mad_dump_uint}, [IB_SW_DEF_PORT_F] {BITSOFFS(64, 8), "DefPort", mad_dump_uint}, [IB_SW_DEF_MCAST_PRIM_F] {BITSOFFS(72, 8), "DefMcastPrimPort", mad_dump_uint}, [IB_SW_DEF_MCAST_NOT_PRIM_F] {BITSOFFS(80, 8), "DefMcastNotPrimPort", mad_dump_uint}, [IB_SW_LIFE_TIME_F] {BITSOFFS(88, 5), "LifeTime", mad_dump_uint}, [IB_SW_STATE_CHANGE_F] {BITSOFFS(93, 1), "StateChange", mad_dump_uint}, [IB_SW_LIDS_PER_PORT_F] {BITSOFFS(96,16), "LidsPerPort", mad_dump_uint}, [IB_SW_PARTITION_ENFORCE_CAP_F] {BITSOFFS(112, 16), "PartEnforceCap", mad_dump_uint}, [IB_SW_PARTITION_ENF_INB_F] {BITSOFFS(128, 1), "InboundPartEnf", mad_dump_uint}, [IB_SW_PARTITION_ENF_OUTB_F] {BITSOFFS(129, 1), "OutboundPartEnf", mad_dump_uint}, [IB_SW_FILTER_RAW_INB_F] {BITSOFFS(130, 1), "FilterRawInbound", mad_dump_uint}, [IB_SW_FILTER_RAW_OUTB_F] {BITSOFFS(131, 1), "FilterRawInbound", mad_dump_uint}, [IB_SW_ENHANCED_PORT0_F] {BITSOFFS(132, 1), "EnhancedPort0", mad_dump_uint}, /* * SwitchLinearForwardingTable fields: */ [IB_LINEAR_FORW_TBL_F] {0, 512, "LinearForwTbl", mad_dump_array}, /* * SwitchMulticastForwardingTable fields: */ [IB_MULTICAST_FORW_TBL_F] {0, 512, "MulticastForwTbl", mad_dump_array}, /* * Notice/Trap fields */ [IB_NOTICE_IS_GENERIC_F] {BITSOFFS(0, 1), "NoticeIsGeneric", mad_dump_uint}, [IB_NOTICE_TYPE_F] {BITSOFFS(1, 7), "NoticeType", mad_dump_uint}, [IB_NOTICE_PRODUCER_F] {BITSOFFS(8, 24), "NoticeProducerType", mad_dump_node_type}, [IB_NOTICE_TRAP_NUMBER_F] {BITSOFFS(32, 16), "NoticeTrapNumber", mad_dump_uint}, [IB_NOTICE_ISSUER_LID_F] {BITSOFFS(48, 16), "NoticeIssuerLID", mad_dump_uint}, [IB_NOTICE_TOGGLE_F] {BITSOFFS(64, 1), "NoticeToggle", mad_dump_uint}, [IB_NOTICE_COUNT_F] {BITSOFFS(65, 15), "NoticeCount", mad_dump_uint}, [IB_NOTICE_DATA_LID_F] {BITSOFFS(80, 16), "NoticeDataLID", mad_dump_uint}, /* * NodeDescription fields: */ [IB_NODE_DESC_F] {0, 64*8, "NodeDesc", mad_dump_string}, /* * Port counters */ [IB_PC_PORT_SELECT_F] {BITSOFFS(8, 8), "PortSelect", mad_dump_uint}, [IB_PC_COUNTER_SELECT_F] {BITSOFFS(16, 16), "CounterSelect", mad_dump_hex}, [IB_PC_ERR_SYM_F] {BITSOFFS(32, 16), "SymbolErrors", mad_dump_uint}, [IB_PC_LINK_RECOVERS_F] {BITSOFFS(48, 8), "LinkRecovers", mad_dump_uint}, [IB_PC_LINK_DOWNED_F] {BITSOFFS(56, 8), "LinkDowned", mad_dump_uint}, [IB_PC_ERR_RCV_F] {BITSOFFS(64, 16), "RcvErrors", mad_dump_uint}, [IB_PC_ERR_PHYSRCV_F] {BITSOFFS(80, 16), "RcvRemotePhysErrors", mad_dump_uint}, [IB_PC_ERR_SWITCH_REL_F] {BITSOFFS(96, 16), "RcvSwRelayErrors", mad_dump_uint}, [IB_PC_XMT_DISCARDS_F] {BITSOFFS(112, 16), "XmtDiscards", mad_dump_uint}, [IB_PC_ERR_XMTCONSTR_F] {BITSOFFS(128, 8), "XmtConstraintErrors", mad_dump_uint}, [IB_PC_ERR_RCVCONSTR_F] {BITSOFFS(136, 8), "RcvConstraintErrors", mad_dump_uint}, [IB_PC_ERR_LOCALINTEG_F] {BITSOFFS(152, 4), "LinkIntegrityErrors", mad_dump_uint}, [IB_PC_ERR_EXCESS_OVR_F] {BITSOFFS(156, 4), "ExcBufOverrunErrors", mad_dump_uint}, [IB_PC_VL15_DROPPED_F] {BITSOFFS(176, 16), "VL15Dropped", mad_dump_uint}, [IB_PC_XMT_BYTES_F] {192, 32, "XmtBytes", mad_dump_uint}, [IB_PC_RCV_BYTES_F] {224, 32, 
"RcvBytes", mad_dump_uint}, [IB_PC_XMT_PKTS_F] {256, 32, "XmtPkts", mad_dump_uint}, [IB_PC_RCV_PKTS_F] {288, 32, "RcvPkts", mad_dump_uint}, /* * SMInfo */ [IB_SMINFO_GUID_F] {0, 64, "SmInfoGuid", mad_dump_hex}, [IB_SMINFO_KEY_F] {64, 64, "SmInfoKey", mad_dump_hex}, [IB_SMINFO_ACT_F] {128, 32, "SmActivity", mad_dump_uint}, [IB_SMINFO_PRIO_F] {BITSOFFS(160, 4), "SmPriority", mad_dump_uint}, [IB_SMINFO_STATE_F] {BITSOFFS(164, 4), "SmState", mad_dump_uint}, /* * SA RMPP */ [IB_SA_RMPP_VERS_F] {BE_OFFS(24*8+24, 8), "RmppVers", mad_dump_uint}, [IB_SA_RMPP_TYPE_F] {BE_OFFS(24*8+16, 8), "RmppType", mad_dump_uint}, [IB_SA_RMPP_RESP_F] {BE_OFFS(24*8+11, 5), "RmppResp", mad_dump_uint}, [IB_SA_RMPP_FLAGS_F] {BE_OFFS(24*8+8, 3), "RmppFlags", mad_dump_hex}, [IB_SA_RMPP_STATUS_F] {BE_OFFS(24*8+0, 8), "RmppStatus", mad_dump_hex}, /* data1 */ [IB_SA_RMPP_D1_F] {28*8, 32, "RmppData1", mad_dump_hex}, [IB_SA_RMPP_SEGNUM_F] {28*8, 32, "RmppSegNum", mad_dump_uint}, /* data2 */ [IB_SA_RMPP_D2_F] {32*8, 32, "RmppData2", mad_dump_hex}, [IB_SA_RMPP_LEN_F] {32*8, 32, "RmppPayload", mad_dump_uint}, [IB_SA_RMPP_NEWWIN_F] {32*8, 32, "RmppNewWin", mad_dump_uint}, /* * SA Path rec */ [IB_SA_PR_DGID_F] {64,128, "PathRecDGid", mad_dump_array}, [IB_SA_PR_SGID_F] {192,128, "PathRecSGid", mad_dump_array}, [IB_SA_PR_DLID_F] {BITSOFFS(320,16), "PathRecDLid", mad_dump_hex}, [IB_SA_PR_SLID_F] {BITSOFFS(336,16), "PathRecSLid", mad_dump_hex}, [IB_SA_PR_NPATH_F] {BITSOFFS(393,7), "PathRecNumPath", mad_dump_uint}, /* * SA Get Multi Path */ [IB_SA_MP_NPATH_F] {BITSOFFS(41,7), "MultiPathNumPath", mad_dump_uint}, [IB_SA_MP_NSRC_F] {BITSOFFS(120,8), "MultiPathNumSrc", mad_dump_uint}, [IB_SA_MP_NDEST_F] {BITSOFFS(128,8), "MultiPathNumDest", mad_dump_uint}, [IB_SA_MP_GID0_F] {192,128, "MultiPathGid", mad_dump_array}, /* * MC Member rec */ [IB_SA_MCM_MGID_F] {0, 128, "McastMemMGid", mad_dump_array}, [IB_SA_MCM_PORTGID_F] {128, 128, "McastMemPortGid", mad_dump_array}, [IB_SA_MCM_QKEY_F] {256, 32, "McastMemQkey", mad_dump_hex}, [IB_SA_MCM_MLID_F] {BITSOFFS(288, 16), "McastMemMLid", mad_dump_hex}, [IB_SA_MCM_MTU_F] {BITSOFFS(306, 6), "McastMemMTU", mad_dump_uint}, [IB_SA_MCM_TCLASS_F] {BITSOFFS(312, 8), "McastMemTClass", mad_dump_uint}, [IB_SA_MCM_PKEY_F] {BITSOFFS(320, 16), "McastMemPkey", mad_dump_uint}, [IB_SA_MCM_RATE_F] {BITSOFFS(338, 6), "McastMemRate", mad_dump_uint}, [IB_SA_MCM_SL_F] {BITSOFFS(352, 4), "McastMemSL", mad_dump_uint}, [IB_SA_MCM_FLOW_LABEL_F] {BITSOFFS(356, 20), "McastMemFlowLbl", mad_dump_uint}, [IB_SA_MCM_JOIN_STATE_F] {BITSOFFS(388, 4), "McastMemJoinState", mad_dump_uint}, [IB_SA_MCM_PROXY_JOIN_F] {BITSOFFS(392, 1), "McastMemProxyJoin", mad_dump_uint}, /* * Service record */ [IB_SA_SR_ID_F] {0, 64, "ServRecID", mad_dump_hex}, [IB_SA_SR_GID_F] {64, 128, "ServRecGid", mad_dump_array}, [IB_SA_SR_PKEY_F] {BITSOFFS(192, 16), "ServRecPkey", mad_dump_hex}, [IB_SA_SR_LEASE_F] {224, 32, "ServRecLease", mad_dump_hex}, [IB_SA_SR_KEY_F] {256, 128, "ServRecKey", mad_dump_hex}, [IB_SA_SR_NAME_F] {384, 512, "ServRecName", mad_dump_string}, [IB_SA_SR_DATA_F] {896, 512, "ServRecData", mad_dump_array}, /* ATS for example */ /* * ATS SM record - within SA_SR_DATA */ [IB_ATS_SM_NODE_ADDR_F] {12*8, 32, "ATSNodeAddr", mad_dump_hex}, [IB_ATS_SM_MAGIC_KEY_F] {BITSOFFS(16*8, 16), "ATSMagicKey", mad_dump_hex}, [IB_ATS_SM_NODE_TYPE_F] {BITSOFFS(18*8, 16), "ATSNodeType", mad_dump_hex}, [IB_ATS_SM_NODE_NAME_F] {32*8, 32*8, "ATSNodeName", mad_dump_string}, /* * SLTOVL MAPPING TABLE */ [IB_SLTOVL_MAPPING_TABLE_F] {0, 64, "SLToVLMap", mad_dump_hex}, 
/* * VL ARBITRATION TABLE */ [IB_VL_ARBITRATION_TABLE_F] {0, 512, "VLArbTbl", mad_dump_array}, /* * IB vendor classes range 2 */ [IB_VEND2_OUI_F] {BE_OFFS(36*8, 24), "OUI", mad_dump_array}, [IB_VEND2_DATA_F] {40*8, (256-40)*8, "Vendor2Data", mad_dump_array}, /* * IB InformInfo rec * Trying two types of InformInfo Rec, one with RID and one without RID * [IB_SA_INFINFO_SUBGID_F] {0, 128, "InformInfoSubscriberGID", mad_dump_array}, [IB_SA_INFINFO_ENUM_F] {128, 16, "InformInfoEnum", mad_dump_hex}, [IB_SA_INFINFO_RESV0_F] {144, 48, "InformInfoReservd", mad_dump_uint}, [IB_SA_INFINFO_GID_F] {192,128, "InformInfoGID", mad_dump_array}, [IB_SA_INFINFO_LID_BEGIN_F] {320, 16, "InformInfoLIDBegin", mad_dump_hex}, [IB_SA_INFINFO_LID_END_F] {336, 16, "InformInfoEnd", mad_dump_hex}, [IB_SA_INFINFO_RESV1_F] {352, 16, "InformInfoReserved", mad_dump_hex}, [IB_SA_INFINFO_ISGENERIC_F] {368, 8, "InformInfoIsGeneric", mad_dump_uint}, [IB_SA_INFINFO_SUBSCRIBE_F] {376, 8, "InformInfoSubsribe", mad_dump_uint}, [IB_SA_INFINFO_TYPE_F] {384, 16, "InformInfoType", mad_dump_uint}, [IB_SA_INFINFO_TRAP_DEVID_F] {400, 16, "InformInfoTrapNumber/DeviceID", mad_dump_uint}, [IB_SA_INFINFO_QPN_F] {BITSOFFS(416, 24), "InformInfoQPN", mad_dump_hex}, [IB_SA_INFINFO_RES2_F] {BITSOFFS(440, 3), "InformInfoResrve", mad_dump_hex}, [IB_SA_INFINFO_RESPTIME_F] {BITSOFFS(443, 5), "InformInfoRespTimeValue", mad_dump_hex}, [IB_SA_INFINFO_RESV3_F] {448, 8, "InformInfoReserved", mad_dump_hex}, [IB_SA_INFINFO_VENDORID_F] {456, 24, "InformInfoProducerType/DeviceID", mad_dump_hex}, */ [IB_SA_INFINFO_GID_F] {0,128, "InformInfoGID", mad_dump_array}, [IB_SA_INFINFO_LID_BEGIN_F] {BITSOFFS(128, 16), "InformInfoLIDBegin", mad_dump_hex}, [IB_SA_INFINFO_LID_END_F] {BITSOFFS(144, 16), "InformInfoEnd", mad_dump_hex}, [IB_SA_INFINFO_RESV1_F] {BITSOFFS(160, 16), "InformInfoReserved", mad_dump_hex}, [IB_SA_INFINFO_ISGENERIC_F] {BITSOFFS(176, 8), "InformInfoIsGeneric", mad_dump_uint}, [IB_SA_INFINFO_SUBSCRIBE_F] {BITSOFFS(184, 8), "InformInfoSubsribe", mad_dump_uint}, [IB_SA_INFINFO_TYPE_F] {BITSOFFS(192, 16), "InformInfoType", mad_dump_uint}, [IB_SA_INFINFO_TRAP_DEVID_F] {BITSOFFS(208, 16), "InformInfoTrapNumber/DeviceID", mad_dump_uint}, [IB_SA_INFINFO_QPN_F] {BITSOFFS(224, 24), "InformInfoQPN", mad_dump_hex}, [IB_SA_INFINFO_RES2_F] {BITSOFFS(248, 3), "InformInfoResrve", mad_dump_hex}, [IB_SA_INFINFO_RESPTIME_F] {BITSOFFS(251, 5), "InformInfoRespTimeValue", mad_dump_hex}, [IB_SA_INFINFO_RESV3_F] {BITSOFFS(256, 8), "InformInfoReserved", mad_dump_hex}, [IB_SA_INFINFO_VENDORID_F] {BITSOFFS(264, 24), "InformInfoProducerType/DeviceID", mad_dump_hex}, }; void _set_field64(void *buf, int base_offs, ib_field_t *f, uint64_t val) { *(uint64_t *)((char *)buf + base_offs + f->bitoffs / 8) = htonll(val); } uint64_t _get_field64(void *buf, int base_offs, ib_field_t *f) { uint64_t val = *(uint64_t *)((char *)buf + base_offs + f->bitoffs / 8); return ntohll(val); } void _set_field(void *buf, int base_offs, ib_field_t *f, uint32_t val) { int prebits = (8 - (f->bitoffs & 7)) & 7; int postbits = (f->bitoffs + f->bitlen) & 7; int bytelen = f->bitlen / 8; uint idx = base_offs + f->bitoffs / 8; char *p = (char *)buf; if (!bytelen && (f->bitoffs & 7) + f->bitlen < 8) { p[3^idx] &= ~((((1 << f->bitlen) - 1)) << (f->bitoffs & 7)); p[3^idx] |= (val & ((1 << f->bitlen) - 1)) << (f->bitoffs & 7); return; } if (prebits) { /* val lsb in byte msb */ p[3^idx] &= (1 << (8 - prebits)) - 1; p[3^idx++] |= (val & ((1 << prebits) - 1)) << (8 - prebits); val >>= prebits; } /* BIG endian byte 
order */ for (; bytelen--; val >>= 8) p[3^idx++] = val & 0xff; if (postbits) { /* val msb in byte lsb */ p[3^idx] &= ~((1 << postbits) - 1); p[3^idx] |= val; } } uint32_t _get_field(void *buf, int base_offs, ib_field_t *f) { int prebits = (8 - (f->bitoffs & 7)) & 7; int postbits = (f->bitoffs + f->bitlen) & 7; int bytelen = f->bitlen / 8; uint idx = base_offs + f->bitoffs / 8; uint8_t *p = (uint8_t *)buf; uint32_t val = 0, v = 0, i; if (!bytelen && (f->bitoffs & 7) + f->bitlen < 8) return (p[3^idx] >> (f->bitoffs & 7)) & ((1 << f->bitlen) - 1); if (prebits) /* val lsb from byte msb */ v = p[3^idx++] >> (8 - prebits); if (postbits) { /* val msb from byte lsb */ i = base_offs + (f->bitoffs + f->bitlen) / 8; val = (p[3^i] & ((1 << postbits) - 1)); } /* BIG endian byte order */ for (idx += bytelen - 1; bytelen--; idx--) val = (val << 8) | p[3^idx]; return (val << prebits) | v; } /* field must be byte aligned */ void _set_array(void *buf, int base_offs, ib_field_t *f, void *val) { int bitoffs = f->bitoffs;; if (f->bitlen < 32) bitoffs = BE_TO_BITSOFFS(bitoffs, f->bitlen); memcpy((uint8_t *)buf + base_offs + bitoffs / 8, val, f->bitlen / 8); } void _get_array(void *buf, int base_offs, ib_field_t *f, void *val) { int bitoffs = f->bitoffs;; if (f->bitlen < 32) bitoffs = BE_TO_BITSOFFS(bitoffs, f->bitlen); memcpy(val, (uint8_t *)buf + base_offs + bitoffs / 8, f->bitlen / 8); } -------------- next part -------------- /* * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* * $Id: mad.h 8630 2006-07-22 10:41:25Z halr $ */ #ifndef _MAD_H_ #define _MAD_H_ #include #include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } #else /* !__cplusplus */ # define BEGIN_C_DECLS # define END_C_DECLS #endif /* __cplusplus */ BEGIN_C_DECLS #define IB_SUBNET_PATH_HOPS_MAX 64 #define IB_DEFAULT_SUBN_PREFIX 0xfe80000000000000llu #define IB_DEFAULT_QP1_QKEY 0x80010000 #define IB_MAD_SIZE 256 #define IB_SMP_DATA_OFFS 64 #define IB_SMP_DATA_SIZE 64 #define IB_VENDOR_RANGE1_DATA_OFFS 24 #define IB_VENDOR_RANGE1_DATA_SIZE (IB_MAD_SIZE - IB_VENDOR_RANGE1_DATA_OFFS) #define IB_VENDOR_RANGE2_DATA_OFFS 40 #define IB_VENDOR_RANGE2_DATA_SIZE (IB_MAD_SIZE - IB_VENDOR_RANGE2_DATA_OFFS) #define IB_SA_DATA_SIZE 200 #define IB_SA_DATA_OFFS 56 #define IB_PC_DATA_OFFS 64 #define IB_PC_DATA_SZ (IB_MAD_SIZE - IB_PC_DATA_OFFS) #define IB_SA_MCM_RECSZ 53 #define IB_SA_PR_RECSZ 64 enum MAD_CLASSES { IB_SMI_CLASS = 0x1, IB_SMI_DIRECT_CLASS = 0x81, IB_SA_CLASS = 0x3, IB_PERFORMANCE_CLASS = 0x4, IB_BOARD_MGMT_CLASS = 0x5, IB_DEVICE_MGMT_CLASS = 0x6, IB_CM_CLASS = 0x7, IB_SNMP_CLASS = 0x8, IB_VENDOR_RANGE1_START_CLASS = 0x9, IB_VENDOR_RANGE1_END_CLASS = 0x0f, IB_VENDOR_RANGE2_START_CLASS = 0x30, IB_VENDOR_RANGE2_END_CLASS = 0x4f, }; enum MAD_METHODS { IB_MAD_METHOD_GET = 0x1, IB_MAD_METHOD_SET = 0x2, IB_MAD_METHOD_GET_RESPONSE = 0x81, IB_MAD_METHOD_SEND = 0x3, IB_MAD_METHOD_TRAP = 0x5, IB_MAD_METHOD_TRAP_REPRESS = 0x7, IB_MAD_METHOD_REPORT = 0x6, IB_MAD_METHOD_REPORT_RESPONSE = 0x86, IB_MAD_METHOD_GET_TABLE = 0x12, IB_MAD_METHOD_GET_TABLE_RESPONSE = 0x92, IB_MAD_METHOD_GET_TRACE_TABLE = 0x13, IB_MAD_METHOD_GET_TRACE_TABLE_RESPONSE = 0x93, IB_MAD_METHOD_GETMULTI = 0x14, IB_MAD_METHOD_GETMULTI_RESPONSE = 0x94, IB_MAD_METHOD_DELETE = 0x15, IB_MAD_METHOD_DELETE_RESPONSE = 0x95, IB_MAD_RESPONSE = 0x80, }; enum MAD_ATTR_ID { CLASS_PORT_INFO = 0x1, NOTICE = 0x2, INFORM_INFO = 0x3, }; enum SMI_ATTR_ID { IB_ATTR_NODE_DESC = 0x10, IB_ATTR_NODE_INFO = 0x11, IB_ATTR_SWITCH_INFO = 0x12, IB_ATTR_PORT_INFO = 0x15, IB_ATTR_PKEY_TBL = 0x16, IB_ATTR_SLVL_TABLE = 0x17, IB_ATTR_VL_ARBITRATION = 0x18, IB_ATTR_LINEARFORWTBL = 0x19, IB_ATTR_MULTICASTFORWTBL = 0x1b, IB_ATTR_SMINFO = 0x20, IB_ATTR_LAST }; enum SA_ATTR_ID { IB_SA_ATTR_NOTICE = 0x02, IB_SA_ATTR_INFORMINFO = 0x03, IB_SA_ATTR_PORTINFORECORD = 0x12, IB_SA_ATTR_LINKRECORD = 0x20, IB_SA_ATTR_SERVICERECORD = 0x31, IB_SA_ATTR_PATHRECORD = 0x35, IB_SA_ATTR_MCRECORD = 0x38, IB_SA_ATTR_MULTIPATH = 0x3a, IB_SA_ATTR_LAST }; enum GSI_ATTR_ID { IB_GSI_PORT_SAMPLES_CONTROL = 0x10, IB_GSI_PORT_SAMPLES_RESULT = 0x11, IB_GSI_PORT_COUNTERS = 0x12, IB_GSI_PORT_COUNTERS_EXT = 0x1D, IB_GSI_ATTR_LAST }; #define IB_VENDOR_OPENIB_PING_CLASS (IB_VENDOR_RANGE2_START_CLASS + 2) #define IB_VENDOR_OPENIB_SYSSTAT_CLASS (IB_VENDOR_RANGE2_START_CLASS + 3) #define IB_OPENIB_OUI (0x001405) typedef uint8_t ib_gid_t[16]; typedef struct { int cnt; uint8_t p[IB_SUBNET_PATH_HOPS_MAX]; uint16_t drslid; uint16_t drdlid; } ib_dr_path_t; typedef struct { uint id; uint mod; } ib_attr_t; typedef struct { int mgtclass; int method; ib_attr_t attr; uint32_t rstatus; /* return status */ int dataoffs; int datasz; uint64_t mkey; uint64_t trid; /* used for out mad if nonzero, return real val */ uint64_t mask; /* for sa mads */ uint recsz; /* for sa mads (attribute offset) */ int timeout; uint32_t oui; /* for vendor mads range 2 */ } ib_rpc_t; typedef struct portid { int lid; /* lid or 0 if directed route */ ib_dr_path_t drpath; int grh; /* flag */ ib_gid_t gid; uint32_t qp; uint32_t 
qkey; uint8_t sl; uint pkey_idx; } ib_portid_t; typedef void (ib_mad_dump_fn)(char *buf, int bufsz, void *val, int valsz); #define IB_FIELD_NAME_LEN 32 typedef struct ib_field { int bitoffs; int bitlen; char name[IB_FIELD_NAME_LEN]; ib_mad_dump_fn *def_dump_fn; } ib_field_t; enum MAD_FIELDS { IB_NO_FIELD, IB_GID_PREFIX_F, IB_GID_GUID_F, /* first MAD word (0-3 bytes) */ IB_MAD_METHOD_F, IB_MAD_RESPONSE_F, IB_MAD_CLASSVER_F, IB_MAD_MGMTCLASS_F, IB_MAD_BASEVER_F, /* second MAD word (4-7 bytes) */ IB_MAD_STATUS_F, /* DRSMP only */ IB_DRSMP_HOPCNT_F, IB_DRSMP_HOPPTR_F, IB_DRSMP_STATUS_F, IB_DRSMP_DIRECTION_F, /* words 3,4,5,6 (8-23 bytes) */ IB_MAD_TRID_F, IB_MAD_ATTRID_F, IB_MAD_ATTRMOD_F, /* word 7,8 (24-31 bytes) */ IB_MAD_MKEY_F, /* word 9 (32-37 bytes) */ IB_DRSMP_DRSLID_F, IB_DRSMP_DRDLID_F, /* word 10,11 (36-43 bytes) */ IB_SA_MKEY_F, /* word 12 (44-47 bytes) */ IB_SA_ATTROFFS_F, /* word 13,14 (48-55 bytes) */ IB_SA_COMPMASK_F, /* word 13,14 (56-255 bytes) */ IB_SA_DATA_F, /* bytes 64 - 127 */ IB_SM_DATA_F, /* bytes 64 - 256 */ IB_GS_DATA_F, /* bytes 128 - 191 */ IB_DRSMP_PATH_F, /* bytes 192 - 255 */ IB_DRSMP_RPATH_F, /* * PortInfo fields: */ IB_PORT_FIRST_F, IB_PORT_MKEY_F = IB_PORT_FIRST_F, IB_PORT_GID_PREFIX_F, IB_PORT_LID_F, IB_PORT_SMLID_F, IB_PORT_CAPMASK_F, IB_PORT_DIAG_F, IB_PORT_MKEY_LEASE_F, IB_PORT_LOCAL_PORT_F, IB_PORT_LINK_WIDTH_ENABLED_F, IB_PORT_LINK_WIDTH_SUPPORTED_F, IB_PORT_LINK_WIDTH_ACTIVE_F, IB_PORT_LINK_SPEED_SUPPORTED_F, IB_PORT_STATE_F, IB_PORT_PHYS_STATE_F, IB_PORT_LINK_DOWN_DEF_F, IB_PORT_MKEY_PROT_BITS_F, IB_PORT_LMC_F, IB_PORT_LINK_SPEED_ACTIVE_F, IB_PORT_LINK_SPEED_ENABLED_F, IB_PORT_NEIGHBOR_MTU_F, IB_PORT_SMSL_F, IB_PORT_VL_CAP_F, IB_PORT_INIT_TYPE_F, IB_PORT_VL_HIGH_LIMIT_F, IB_PORT_VL_ARBITRATION_HIGH_CAP_F, IB_PORT_VL_ARBITRATION_LOW_CAP_F, IB_PORT_INIT_TYPE_REPLY_F, IB_PORT_MTRU_CAP_F, IB_PORT_VL_STALL_COUNT_F, IB_PORT_HOQ_LIFE_F, IB_PORT_OPER_VLS_F, IB_PORT_PART_EN_INB_F, IB_PORT_PART_EN_OUTB_F, IB_PORT_FILTER_RAW_INB_F, IB_PORT_FILTER_RAW_OUTB_F, IB_PORT_MKEY_VIOL_F, IB_PORT_PKEY_VIOL_F, IB_PORT_QKEY_VIOL_F, IB_PORT_GUID_CAP_F, IB_PORT_CLIENT_REREG_F, IB_PORT_SUBN_TIMEOUT_F, IB_PORT_RESP_TIME_VAL_F, IB_PORT_LOCAL_PHYS_ERR_F, IB_PORT_OVERRUN_ERR_F, IB_PORT_MAX_CREDIT_HINT_F, IB_PORT_LINK_ROUND_TRIP_F, IB_PORT_LAST_F, /* * NodeInfo fields: */ IB_NODE_FIRST_F, IB_NODE_BASE_VERS_F = IB_NODE_FIRST_F, IB_NODE_CLASS_VERS_F, IB_NODE_TYPE_F, IB_NODE_NPORTS_F, IB_NODE_SYSTEM_GUID_F, IB_NODE_GUID_F, IB_NODE_PORT_GUID_F, IB_NODE_PARTITION_CAP_F, IB_NODE_DEVID_F, IB_NODE_REVISION_F, IB_NODE_LOCAL_PORT_F, IB_NODE_VENDORID_F, IB_NODE_LAST_F, /* * SwitchInfo fields: */ IB_SW_FIRST_F, IB_SW_LINEAR_FDB_CAP_F = IB_SW_FIRST_F, IB_SW_RANDOM_FDB_CAP_F, IB_SW_MCAST_FDB_CAP_F, IB_SW_LINEAR_FDB_TOP_F, IB_SW_DEF_PORT_F, IB_SW_DEF_MCAST_PRIM_F, IB_SW_DEF_MCAST_NOT_PRIM_F, IB_SW_LIFE_TIME_F, IB_SW_STATE_CHANGE_F, IB_SW_LIDS_PER_PORT_F, IB_SW_PARTITION_ENFORCE_CAP_F, IB_SW_PARTITION_ENF_INB_F, IB_SW_PARTITION_ENF_OUTB_F, IB_SW_FILTER_RAW_INB_F, IB_SW_FILTER_RAW_OUTB_F, IB_SW_ENHANCED_PORT0_F, IB_SW_LAST_F, /* * SwitchLinearForwardingTable fields: */ IB_LINEAR_FORW_TBL_F, /* * SwitchMulticastForwardingTable fields: */ IB_MULTICAST_FORW_TBL_F, /* * NodeDescription fields: */ IB_NODE_DESC_F, /* * Notice/Trap fields */ IB_NOTICE_IS_GENERIC_F, IB_NOTICE_TYPE_F, IB_NOTICE_PRODUCER_F, IB_NOTICE_TRAP_NUMBER_F, IB_NOTICE_ISSUER_LID_F, IB_NOTICE_TOGGLE_F, IB_NOTICE_COUNT_F, IB_NOTICE_DATA_LID_F, /* * GS Performance */ IB_PC_FIRST_F, IB_PC_PORT_SELECT_F = IB_PC_FIRST_F, 
IB_PC_COUNTER_SELECT_F, IB_PC_ERR_SYM_F, IB_PC_LINK_RECOVERS_F, IB_PC_LINK_DOWNED_F, IB_PC_ERR_RCV_F, IB_PC_ERR_PHYSRCV_F, IB_PC_ERR_SWITCH_REL_F, IB_PC_XMT_DISCARDS_F, IB_PC_ERR_XMTCONSTR_F, IB_PC_ERR_RCVCONSTR_F, IB_PC_ERR_LOCALINTEG_F, IB_PC_ERR_EXCESS_OVR_F, IB_PC_VL15_DROPPED_F, IB_PC_XMT_BYTES_F, IB_PC_RCV_BYTES_F, IB_PC_XMT_PKTS_F, IB_PC_RCV_PKTS_F, IB_PC_LAST_F, /* * SMInfo */ IB_SMINFO_GUID_F, IB_SMINFO_KEY_F, IB_SMINFO_ACT_F, IB_SMINFO_PRIO_F, IB_SMINFO_STATE_F, /* * SA RMPP */ IB_SA_RMPP_VERS_F, IB_SA_RMPP_TYPE_F, IB_SA_RMPP_RESP_F, IB_SA_RMPP_FLAGS_F, IB_SA_RMPP_STATUS_F, /* data1 */ IB_SA_RMPP_D1_F, IB_SA_RMPP_SEGNUM_F, /* data2 */ IB_SA_RMPP_D2_F, IB_SA_RMPP_LEN_F, /* DATA: Payload len */ IB_SA_RMPP_NEWWIN_F, /* ACK: new window last */ /* * SA Get Multi Path */ IB_SA_MP_NPATH_F, IB_SA_MP_NSRC_F, IB_SA_MP_NDEST_F, IB_SA_MP_GID0_F, /* * SA Path rec */ IB_SA_PR_DGID_F, IB_SA_PR_SGID_F, IB_SA_PR_DLID_F, IB_SA_PR_SLID_F, IB_SA_PR_NPATH_F, /* * MC Member rec */ IB_SA_MCM_MGID_F, IB_SA_MCM_PORTGID_F, IB_SA_MCM_QKEY_F, IB_SA_MCM_MLID_F, IB_SA_MCM_SL_F, IB_SA_MCM_MTU_F, IB_SA_MCM_RATE_F, IB_SA_MCM_TCLASS_F, IB_SA_MCM_PKEY_F, IB_SA_MCM_FLOW_LABEL_F, IB_SA_MCM_JOIN_STATE_F, IB_SA_MCM_PROXY_JOIN_F, /* * Service record */ IB_SA_SR_ID_F, IB_SA_SR_GID_F, IB_SA_SR_PKEY_F, IB_SA_SR_LEASE_F, IB_SA_SR_KEY_F, IB_SA_SR_NAME_F, IB_SA_SR_DATA_F, /* * ATS SM record - within SA_SR_DATA */ IB_ATS_SM_NODE_ADDR_F, IB_ATS_SM_MAGIC_KEY_F, IB_ATS_SM_NODE_TYPE_F, IB_ATS_SM_NODE_NAME_F, /* * SLTOVL MAPPING TABLE */ IB_SLTOVL_MAPPING_TABLE_F, /* * VL ARBITRATION TABLE */ IB_VL_ARBITRATION_TABLE_F, /* * IB vendor class range 2 */ IB_VEND2_OUI_F, IB_VEND2_DATA_F, /* * InformInfo record IB_SA_INFINFO_SUBGID_F, IB_SA_INFINFO_ENUM_F, IB_SA_INFINFO_RESV0_F, */ IB_SA_INFINFO_GID_F, IB_SA_INFINFO_LID_BEGIN_F, IB_SA_INFINFO_LID_END_F, IB_SA_INFINFO_RESV1_F, IB_SA_INFINFO_ISGENERIC_F, IB_SA_INFINFO_SUBSCRIBE_F, IB_SA_INFINFO_TYPE_F, IB_SA_INFINFO_TRAP_DEVID_F, IB_SA_INFINFO_QPN_F, IB_SA_INFINFO_RES2_F, IB_SA_INFINFO_RESPTIME_F, IB_SA_INFINFO_RESV3_F, IB_SA_INFINFO_VENDORID_F, IB_FIELD_LAST_ /* must be last */ }; /* * SA RMPP section */ enum RMPP_TYPE_ENUM { IB_RMPP_TYPE_NONE, IB_RMPP_TYPE_DATA, IB_RMPP_TYPE_ACK, IB_RMPP_TYPE_STOP, IB_RMPP_TYPE_ABORT, }; enum RMPP_FLAGS_ENUM { IB_RMPP_FLAG_ACTIVE = 1 << 0, IB_RMPP_FLAG_FIRST = 1 << 1, IB_RMPP_FLAG_LAST = 1 << 2, }; typedef struct { int type; int flags; int status; union { uint32_t u; uint32_t segnum; } d1; union { uint32_t u; uint32_t len; uint32_t newwin; } d2; } ib_rmpp_hdr_t; enum SA_SIZES_ENUM { SA_HEADER_SZ = 20, }; typedef struct ib_sa_call { uint attrid; uint mod; uint64_t mask; uint method; uint64_t trid; /* used for out mad if nonzero, return real val */ uint recsz; /* return field */ ib_rmpp_hdr_t rmpp; } ib_sa_call_t; typedef struct ib_vendor_call { uint method; uint mgmt_class; uint attrid; uint mod; uint32_t oui; uint timeout; ib_rmpp_hdr_t rmpp; } ib_vendor_call_t; #define IB_MIN_UCAST_LID 1 #define IB_MAX_UCAST_LID (0xc000-1) #define IB_MIN_MCAST_LID 0xc000 #define IB_MAX_MCAST_LID (0xffff-1) #define IB_LID_VALID(lid) ((lid) >= IB_MIN_UCAST_LID && lid <= IB_MAX_UCAST_LID) #define IB_MLID_VALID(lid) ((lid) >= IB_MIN_MCAST_LID && lid <= IB_MAX_MCAST_LID) #define MAD_DEF_RETRIES 3 #define MAD_DEF_TIMEOUT_MS 1000 enum { IB_DEST_LID, IB_DEST_DRPATH, IB_DEST_GUID, }; enum { IB_NODE_CA = 1, IB_NODE_SWITCH, IB_NODE_ROUTER, NODE_RNIC, IB_NODE_MAX = NODE_RNIC }; /******************************************************************************/ /* portid.c */ char * 
portid2str(ib_portid_t *portid); int portid2portnum(ib_portid_t *portid); int str2drpath(ib_dr_path_t *path, char *routepath, int drslid, int drdlid); static inline int ib_portid_set(ib_portid_t *portid, int lid, int qp, int qkey) { portid->lid = lid; portid->qp = qp; portid->qkey = qkey; return 0; } /* fields.c */ extern ib_field_t ib_mad_f[]; void _set_field(void *buf, int base_offs, ib_field_t *f, uint32_t val); uint32_t _get_field(void *buf, int base_offs, ib_field_t *f); void _set_array(void *buf, int base_offs, ib_field_t *f, void *val); void _get_array(void *buf, int base_offs, ib_field_t *f, void *val); void _set_field64(void *buf, int base_offs, ib_field_t *f, uint64_t val); uint64_t _get_field64(void *buf, int base_offs, ib_field_t *f); /* mad.c */ static inline uint32_t mad_get_field(void *buf, int base_offs, int field) { return _get_field(buf, base_offs, ib_mad_f + field); } static inline void mad_set_field(void *buf, int base_offs, int field, uint32_t val) { _set_field(buf, base_offs, ib_mad_f + field, val); } /* field must be byte aligned */ static inline uint64_t mad_get_field64(void *buf, int base_offs, int field) { return _get_field64(buf, base_offs, ib_mad_f + field); } static inline void mad_set_field64(void *buf, int base_offs, int field, uint64_t val) { _set_field64(buf, base_offs, ib_mad_f + field, val); } static inline void mad_set_array(void *buf, int base_offs, int field, void *val) { _set_array(buf, base_offs, ib_mad_f + field, val); } static inline void mad_get_array(void *buf, int base_offs, int field, void *val) { _get_array(buf, base_offs, ib_mad_f + field, val); } void mad_decode_field(uint8_t *buf, int field, void *val); void mad_encode_field(uint8_t *buf, int field, void *val); void * mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, void *data); uint64_t mad_trid(void); int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); /* register.c */ int mad_register_client(int mgmt, uint8_t rmpp_version); int mad_register_server(int mgmt, uint8_t rmpp_version, uint32_t method_mask[4], uint32_t class_oui); int mad_class_agent(int mgmt); int mad_agent_class(int agent); /* serv.c */ int mad_send(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); void * mad_receive(void *umad, int timeout); int mad_respond(void *umad, ib_portid_t *portid, uint32_t rstatus); void * mad_alloc(void); void mad_free(void *umad); /* vendor */ uint8_t *ib_vendor_call(void *data, ib_portid_t *portid, ib_vendor_call_t *call); static inline int mad_is_vendor_range1(int mgmt) { return mgmt >= 0x9 && mgmt <= 0xf; } static inline int mad_is_vendor_range2(int mgmt) { return mgmt >= 0x30 && mgmt <= 0x4f; } /* rpc.c */ int madrpc_portid(void); int madrpc_set_retries(int retries); int madrpc_set_timeout(int timeout); void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata); void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes); void madrpc_save_mad(void *madbuf, int len); void madrpc_lock(void); void madrpc_unlock(void); void madrpc_show_errors(int set); /* smp.c */ uint8_t * smp_query(void *buf, ib_portid_t *id, uint attrid, uint mod, uint timeout); uint8_t * smp_set(void *buf, ib_portid_t *id, uint attrid, uint mod, uint timeout); inline static uint8_t * safe_smp_query(void *rcvbuf, ib_portid_t *portid, uint attrid, uint mod, uint timeout) { uint8_t *p; madrpc_lock(); p = smp_query(rcvbuf, 
portid, attrid, mod, timeout); madrpc_unlock(); return p; } inline static uint8_t * safe_smp_set(void *rcvbuf, ib_portid_t *portid, uint attrid, uint mod, uint timeout) { uint8_t *p; madrpc_lock(); p = smp_set(rcvbuf, portid, attrid, mod, timeout); madrpc_unlock(); return p; } /* sa.c */ uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, uint timeout); int ib_path_query(ib_gid_t srcgid, ib_gid_t destgid, ib_portid_t *sm_id, void *buf); /* returns lid */ inline static uint8_t * safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, uint timeout) { uint8_t *p; madrpc_lock(); p = sa_call(rcvbuf, portid, sa, timeout); madrpc_unlock(); return p; } /* resolve.c */ int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, ib_portid_t *sm_id, int timeout); int ib_resolve_portid_str(ib_portid_t *portid, char *addr_str, int dest_type, ib_portid_t *sm_id); int ib_resolve_self(ib_portid_t *portid, int *portnum, ib_gid_t *gid); /* gs.c */ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t *dest, int port, uint timeout); uint8_t *port_performance_query(void *rcvbuf, ib_portid_t *dest, int port, uint timeout); uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t *dest, int port, uint mask, uint timeout); uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t *dest, int port, uint timeout); uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t *dest, int port, uint timeout); /* dump.c */ ib_mad_dump_fn mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, mad_dump_bitfield, mad_dump_array, mad_dump_string, mad_dump_linkwidth, mad_dump_linkwidthsup, mad_dump_linkwidthen, mad_dump_linkdowndefstate, mad_dump_linkspeed, mad_dump_linkspeedsup, mad_dump_linkspeeden, mad_dump_portstate, mad_dump_portstates, mad_dump_physportstate, mad_dump_portcapmask, mad_dump_mtu, mad_dump_vlcap, mad_dump_opervls, mad_dump_node_type, mad_dump_sltovl, mad_dump_vlarbitration, mad_dump_nodedesc, mad_dump_nodeinfo, mad_dump_portinfo, mad_dump_switchinfo, mad_dump_perfcounters; int _mad_dump(ib_mad_dump_fn *fn, char *name, void *val, int valsz); char * _mad_dump_field(ib_field_t *f, char *name, char *buf, int bufsz, void *val); int _mad_print_field(ib_field_t *f, char *name, void *val, int valsz); char * _mad_dump_val(ib_field_t *f, char *buf, int bufsz, void *val); static inline int mad_print_field(int field, char *name, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return -1; return _mad_print_field(ib_mad_f + field, name, val, 0); } static inline char * mad_dump_field(int field, char *buf, int bufsz, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val); } static inline char * mad_dump_val(int field, char *buf, int bufsz, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; return _mad_dump_val(ib_mad_f + field, buf, bufsz, val); } extern int ibdebug; END_C_DECLS #endif /* _MAD_H_ */ From sweitzen at cisco.com Thu Aug 10 08:57:47 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 10 Aug 2006 08:57:47 -0700 Subject: [openib-general] does Oracle still support AIO SDP? Message-ID: I know at one point Oracle supported AIO SDP on Linux, is still still supported? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Thu Aug 10 09:38:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 09:38:03 -0700 Subject: [openib-general] Outstanding RDMA operations In-Reply-To: <67897d690608092005i1e45bb8wa23b3f0105103cca@mail.gmail.com> (Manpreet Singh's message of "Wed, 9 Aug 2006 20:05:04 -0700") References: <67897d690608092005i1e45bb8wa23b3f0105103cca@mail.gmail.com> Message-ID: Manpreet> Hi, Some time ago, I thought there were some thoughts Manpreet> about making the outstanding RDMA count to be Manpreet> configurable via a patch to the mthca driver (the Manpreet> 'rdb_per_qp' parameter in mthca_main.c). Manpreet> Just curious if the patch was going to go in at some Manpreet> point. I've lost the patch. Someone will need to resubmit it and work on cleaning it up. - R. From fbcn1_g at dch.com Thu Aug 10 10:51:41 2006 From: fbcn1_g at dch.com (=?windows-1255?Q?=F4=E9=E6=E9=F7=E4_=EE=F9=F4=E7=FA=E9=FA?=) Date: Thu, 10 Aug 2006 19:51:41 +0200 Subject: [openib-general] =?windows-1255?b?5eng4vjkIO7w6ODs6fo=?= Message-ID: An HTML attachment was scrubbed... URL: From ardavis at ichips.intel.com Thu Aug 10 10:06:42 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 10 Aug 2006 10:06:42 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1 planning meeting - summary In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7670@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7670@mtlexch01.mtl.com> Message-ID: <44DB67A2.7020705@ichips.intel.com> Tziporet Koren wrote: >You are correct - we forgot about it. >Will be fixed in rc2 >Can you open a bug in bugzilla for the installer package so we will not >miss it this time? > > > Done. Bug 195. From tom at opengridcomputing.com Thu Aug 10 10:14:59 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 10 Aug 2006 12:14:59 -0500 Subject: [openib-general] RDMA_READ SGE Message-ID: <1155230099.15374.38.camel@trinity.ogc.int> Roland: iWARP RNIC's have a different SGE limit for RDMA_READ response then they do for other SGE. To support iWARP, we need to add a max_read_sge attribute to the ib_device structure. We had originally mapped this to max_sge_rd, but someone pointed out that this is not for RDMA_READ in general, but for RDMA_READ_ATOMIC in particular. In any event, we need to add this value to the ib_verbs.h structure so that ULP know how to limit sge for the RDMA_READ sink. Do you want me to submit a patch on top of Steve's, or do you want me to wait until you've got the git tree built? Thanks, Tom From weiny2 at llnl.gov Thu Aug 10 10:26:25 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 10 Aug 2006 10:26:25 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> Message-ID: <20060810102625.2f6eefc0.weiny2@llnl.gov> Is there a reason ib_ping.ko is not in this release? Ira On Thu, 10 Aug 2006 17:22:07 +0300 "Tziporet Koren" wrote: > of course I meant slips (we are all awake here :-) > > > -----Original Message----- > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Tziporet Koren > Sent: Thursday, August 10, 2006 4:42 PM > To: Hal Rosenstock > Cc: OpenFabricsEWG; openib > Subject: Re: [openfabrics-ewg] OFED 1.1-rc1 is available > > The schedule is sleeps in 2 weeks meaning: > > Target release date: 12-Sep > > Intermediate milestones: > 1. 
Create 1.1 branch of user level: 27-Jul - done > 2. RC1: 8-Aug - done > 3. Feature freeze (RC2): 17-Aug > 4. Code freeze (rc-x): 6-Sep > 5. Final release: 12-Sep > > Tziporet > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, August 09, 2006 9:04 PM > To: Tziporet Koren > Cc: OpenFabricsEWG; openib > Subject: Re: [openfabrics-ewg] OFED 1.1-rc1 is available > > > On Tue, 2006-08-08 at 10:48, Tziporet Koren wrote: > > Hi, > > > > In two week delay we publish OFED 1.1-RC1 on > https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > > File: OFED-1.1-rc1.tgz > > Is there an update to the OFED 1.1 schedule going forward ? > > -- Hal > > > 1. Schedule: > > ============ > > Target release date: 31-Aug > > Intermediate milestones: > > 1. Create 1.1 branch of user level code and rc1: 27-Jul > > 2. Feature freeze : 3-Aug > > 3. Code freeze (rc-x): 25-Aug > > 4. Final release: 31-Aug > > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu Aug 10 10:44:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 10:44:12 -0700 Subject: [openib-general] libmthca: make fence work In-Reply-To: <20060809070605.GO20848@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Aug 2006 10:06:05 +0300") References: <20060809070605.GO20848@mellanox.co.il> Message-ID: I fixed this in a slightly different way that seemed cleaner to me (see below). Please verify that my fix works for you. Thanks, Roland Index: libmthca/src/qp.c =================================================================== --- libmthca/src/qp.c (revision 8875) +++ libmthca/src/qp.c (working copy) @@ -46,6 +46,10 @@ #include "doorbell.h" #include "wqe.h" +enum { + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 +}; + static const uint8_t mthca_opcode[] = { [IBV_WR_SEND] = MTHCA_OPCODE_SEND, [IBV_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, @@ -104,9 +108,18 @@ int mthca_tavor_post_send(struct ibv_qp int ind; int nreq; int ret = 0; - int size, size0 = 0; + int size; + int size0 = 0; int i; - uint32_t f0 = 0, op0 = 0; + /* + * f0 and op0 cannot be used unless nreq > 0, which means this + * function makes it through the loop at least once. So the + * code inside the if (!size0) will be executed, and f0 and + * op0 will be initialized. So any gcc warning about "may be + * used unitialized" is bogus. + */ + uint32_t f0; + uint32_t op0; pthread_spin_lock(&qp->sq.lock); @@ -290,6 +303,8 @@ int mthca_tavor_post_send(struct ibv_qp if (!size0) { size0 = size; op0 = mthca_opcode[wr->opcode]; + f0 = wr->send_flags & IBV_SEND_FENCE ? + MTHCA_SEND_DOORBELL_FENCE : 0; } ++ind; @@ -434,9 +449,18 @@ int mthca_arbel_post_send(struct ibv_qp int ind; int nreq; int ret = 0; - int size, size0 = 0; + int size; + int size0 = 0; int i; - uint32_t f0 = 0, op0 = 0; + /* + * f0 and op0 cannot be used unless nreq > 0, which means this + * function makes it through the loop at least once. So the + * code inside the if (!size0) will be executed, and f0 and + * op0 will be initialized. So any gcc warning about "may be + * used unitialized" is bogus. 
+ */ + uint32_t f0; + uint32_t op0; pthread_spin_lock(&qp->sq.lock); @@ -644,6 +668,8 @@ int mthca_arbel_post_send(struct ibv_qp if (!size0) { size0 = size; op0 = mthca_opcode[wr->opcode]; + f0 = wr->send_flags & IBV_SEND_FENCE ? + MTHCA_SEND_DOORBELL_FENCE : 0; } ++ind; Index: libmthca/ChangeLog =================================================================== --- libmthca/ChangeLog (revision 8875) +++ libmthca/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2006-08-09 Michael S. Tsirkin + + * src/qp.c (mthca_tavor_post_send, mthca_arbel_post_send): Fence + bit must be set in both doorbell and WQE. + 2006-08-03 Jack Morgenstein * src/mthca.h: Include to get definition of offsetof(). From rdreier at cisco.com Thu Aug 10 10:53:01 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 10:53:01 -0700 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <20060809070435.GN20848@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Aug 2006 10:04:35 +0300") References: <20060809070435.GN20848@mellanox.co.il> Message-ID: Similarly I just checked this in: (BTW Muli -- I think anonymous enum is better than define because it lets the compiler have human-readable names for values, rather than throwing the info away at the preprocessor stage -- so error messages can be better, etc) commit e54b82d739d4a2ef992976c8c0692cdf89286420 Author: Michael S. Tsirkin Date: Thu Aug 10 10:46:56 2006 -0700 IB/mthca: Make fence flag work for send work requests The fence bit needs to be set in the doorbell too, not just the WQE. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index cd8b672..157b4f8 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -99,6 +99,10 @@ enum { MTHCA_QP_BIT_RSC = 1 << 3 }; +enum { + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 +}; + struct mthca_qp_path { __be32 port_pkey; u8 rnr_retry; @@ -1502,7 +1506,7 @@ int mthca_tavor_post_send(struct ib_qp * int i; int size; int size0 = 0; - u32 f0 = 0; + u32 f0; int ind; u8 op0 = 0; @@ -1686,6 +1690,8 @@ int mthca_tavor_post_send(struct ib_qp * if (!size0) { size0 = size; op0 = mthca_opcode[wr->opcode]; + f0 = wr->send_flags & IB_SEND_FENCE ? + MTHCA_SEND_DOORBELL_FENCE : 0; } ++ind; @@ -1843,7 +1849,7 @@ int mthca_arbel_post_send(struct ib_qp * int i; int size; int size0 = 0; - u32 f0 = 0; + u32 f0; int ind; u8 op0 = 0; @@ -2051,6 +2057,8 @@ int mthca_arbel_post_send(struct ib_qp * if (!size0) { size0 = size; op0 = mthca_opcode[wr->opcode]; + f0 = wr->send_flags & IB_SEND_FENCE ? + MTHCA_SEND_DOORBELL_FENCE : 0; } ++ind; From rdreier at cisco.com Thu Aug 10 10:56:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 10:56:29 -0700 Subject: [openib-general] [PATCH v2] sa_query: require SA query registration In-Reply-To: <000201c6ba76$89c7fc90$e598070a@amr.corp.intel.com> (Sean Hefty's message of "Mon, 7 Aug 2006 16:09:30 -0700") References: <000201c6ba76$89c7fc90$e598070a@amr.corp.intel.com> Message-ID: I think I agree with Michael that having ib_sa_register_client() allocate the structure is going to far in the name of encapsulation. How about making the structure public, but providing a macro DEFINE_IB_SA_CLIENT() to wrap up the initialization? (Similar to DEFINE_MUTEX, DEFINE_IDR, etc) For debugging, if we want, we can do some macro magic to get the __FILE__ and variable name stuffed into a char[] member. 
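[To make the suggestion concrete, one possible shape for such a macro is sketched below. This is not the code that went into sa_query.c: it assumes the two-field ib_sa_client layout Sean quotes later in this thread, and starting the refcount at 1 and the debug-name idea are guesses.]

#include <linux/completion.h>
#include <asm/atomic.h>

/* Sketch of a DEFINE_IB_SA_CLIENT() initializer in the style of
 * DEFINE_MUTEX()/DEFINE_IDR(). */
struct ib_sa_client {
    atomic_t          users;
    struct completion comp;
};

#define IB_SA_CLIENT_INIT(name) {                       \
    .users = ATOMIC_INIT(1),                            \
    .comp  = COMPLETION_INITIALIZER(name.comp),         \
}

#define DEFINE_IB_SA_CLIENT(name)                       \
    struct ib_sa_client name = IB_SA_CLIENT_INIT(name)

/* A debug build could additionally stuff __FILE__ ":" #name into a
 * char * member, as suggested above. */

/* ULP usage, analogous to DEFINE_MUTEX(): */
static DEFINE_IB_SA_CLIENT(ipoib_sa_client);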
If we want to allow for dynamic allocation of clients, then we can have a ib_sa_init_client(struct ib_sa_client *client) function, but I don't see a need for that right now. - R. From rdreier at cisco.com Thu Aug 10 10:56:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 10:56:59 -0700 Subject: [openib-general] RDMA_READ SGE In-Reply-To: <1155230099.15374.38.camel@trinity.ogc.int> (Tom Tucker's message of "Thu, 10 Aug 2006 12:14:59 -0500") References: <1155230099.15374.38.camel@trinity.ogc.int> Message-ID: Tom> Do you want me to submit a patch on top of Steve's, or do you Tom> want me to wait until you've got the git tree built? Go ahead and send it, which the caveat that you should keep track of it and make sure I get it into my git tree... - R. From rdreier at cisco.com Thu Aug 10 11:00:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 11:00:53 -0700 Subject: [openib-general] [PATCH report] IB/ipoib fix flush/start xmit race (from code review) In-Reply-To: <20060803191606.GB26623@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 Aug 2006 22:16:06 +0300") References: <20060803191606.GB26623@mellanox.co.il> Message-ID: Looks safe enough but I'd rather not tinker with locking so late in the 2.6.18 cycle. The race is purely theoretical at this point, and these sorts of changes have a tendency to introduce unintended consequences. So I queued this for 2.6.19 and checke it into svn. - R. From sean.hefty at intel.com Thu Aug 10 11:10:57 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 10 Aug 2006 11:10:57 -0700 Subject: [openib-general] [PATCH v2] sa_query: require SA query registration In-Reply-To: Message-ID: <000101c6bca8$53c20670$ff0da8c0@amr.corp.intel.com> >I think I agree with Michael that having ib_sa_register_client() >allocate the structure is going to far in the name of encapsulation. >How about making the structure public, but providing a macro >DEFINE_IB_SA_CLIENT() to wrap up the initialization? (Similar to >DEFINE_MUTEX, DEFINE_IDR, etc) Currently what I have is: struct ib_sa_client { atomic_t users; struct completion comp; }; I tried not to change it too much from what Michael had, but I haven't added canceling of requests yet either. Canceling requests isn't strictly needed; we just end up waiting longer for deregister to complete. But doing so should only add a list_head to the structure. So, yes, I can move the definition back to being public. >If we want to allow for dynamic allocation of clients, then we can >have a ib_sa_init_client(struct ib_sa_client *client) function, but I >don't see a need for that right now. I do use dynamic allocation for userspace support. - Sean From sashak at voltaire.com Thu Aug 10 11:26:26 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 10 Aug 2006 21:26:26 +0300 Subject: [openib-general] howto get copy of https://openib.org/svn repository In-Reply-To: References: <20060809204100.GX24920@sashak.voltaire.com> Message-ID: <20060810182626.GA24920@sashak.voltaire.com> On 15:05 Wed 09 Aug , Roland Dreier wrote: > search for svn-mirror This works like SVK mirroring (I guess it based on the same SVN::Mirror). It is able preserve history, but may change revision numbers - mirrored tree is similar but not equivalent. Will try to play anyway... 
Thanks, Sasha From rdreier at cisco.com Thu Aug 10 11:33:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 11:33:21 -0700 Subject: [openib-general] howto get copy of https://openib.org/svn repository In-Reply-To: <20060810182626.GA24920@sashak.voltaire.com> (Sasha Khapyorsky's message of "Thu, 10 Aug 2006 21:26:26 +0300") References: <20060809204100.GX24920@sashak.voltaire.com> <20060810182626.GA24920@sashak.voltaire.com> Message-ID: Sasha> This works like SVK mirroring (I guess it based on the same Sasha> SVN::Mirror). It is able preserve history, but may change Sasha> revision numbers - mirrored tree is similar but not Sasha> equivalent. Will try to play anyway... If you really care about revision numbers I guess you could mirror the tree, dump it, trim the initial revisions you don't want, and undump it into a new repo. - R. From sashak at voltaire.com Thu Aug 10 11:51:53 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 10 Aug 2006 21:51:53 +0300 Subject: [openib-general] howto get copy of https://openib.org/svn repository In-Reply-To: References: <20060809204100.GX24920@sashak.voltaire.com> <20060810182626.GA24920@sashak.voltaire.com> Message-ID: <20060810185153.GB24920@sashak.voltaire.com> On 11:33 Thu 10 Aug , Roland Dreier wrote: > Sasha> This works like SVK mirroring (I guess it based on the same > Sasha> SVN::Mirror). It is able preserve history, but may change > Sasha> revision numbers - mirrored tree is similar but not > Sasha> equivalent. Will try to play anyway... > > If you really care about revision numbers I guess you could mirror the > tree, dump it, trim the initial revisions you don't want, and undump > it into a new repo. I suspect things like: ------------------------------------------------------------------------ r8674 | tziporet | 2006-07-26 15:51:07 +0300 (Wed, 26 Jul 2006) | 3 lines Changed paths: A /gen2/branches/1.1/src/userspace (from /gen2/trunk/src/userspace:8673) may be broken then. Ok, will see how to workaround it... Sasha From mst at mellanox.co.il Thu Aug 10 11:54:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Aug 2006 21:54:50 +0300 Subject: [openib-general] RDMA_READ SGE In-Reply-To: <1155230099.15374.38.camel@trinity.ogc.int> References: <1155230099.15374.38.camel@trinity.ogc.int> Message-ID: <20060810185450.GA20994@mellanox.co.il> Quoting r. Tom Tucker : > Subject: RDMA_READ SGE > > Roland: > > iWARP RNIC's have a different SGE limit for RDMA_READ response then they > do for other SGE. To support iWARP, we need to add a max_read_sge > attribute to the ib_device structure. We had originally mapped this to > max_sge_rd, but someone pointed out that this is not for RDMA_READ in > general, but for RDMA_READ_ATOMIC in particular. > > In any event, we need to add this value to the ib_verbs.h structure so > that ULP know how to limit sge for the RDMA_READ sink. > > Do you want me to submit a patch on top of Steve's, or do you want me to > wait until you've got the git tree built? > > Thanks, > Tom Actually, Mellanox devices also have different SGE limits depending on the operation kinds and QP types. So far the approach there has been that the device should report one value for the best case result. Practically I don't see reporting the exact values as a priority - I think applications really can figure this out easier by attempting operating with relevant parameters and fallback to smaller values on failure. 
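[For what it's worth, the fall-back approach described above is easy to express with the existing verbs: request the SGE depth you want and halve it until QP creation succeeds. A minimal sketch assuming plain libibverbs and an already-created PD and CQ; the WR depths are arbitrary placeholders.]

#include <string.h>
#include <infiniband/verbs.h>

/* Probe-and-fall-back: return a QP created with the largest SGE depth
 * the device accepts, starting from wanted_sge and halving on failure. */
static struct ibv_qp *create_qp_with_fallback(struct ibv_pd *pd,
                                              struct ibv_cq *cq,
                                              int wanted_sge)
{
    struct ibv_qp_init_attr attr;
    struct ibv_qp *qp;
    int sge;

    for (sge = wanted_sge; sge >= 1; sge >>= 1) {
        memset(&attr, 0, sizeof attr);
        attr.send_cq          = cq;
        attr.recv_cq          = cq;
        attr.qp_type          = IBV_QPT_RC;
        attr.cap.max_send_wr  = 64;
        attr.cap.max_recv_wr  = 64;
        attr.cap.max_send_sge = sge;
        attr.cap.max_recv_sge = sge;

        qp = ibv_create_qp(pd, &attr);
        if (qp)
            return qp;      /* this SGE depth is supported */
    }
    return NULL;
}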
But assuming that applications really need this information - it seems we really should generalize this - maybe make the device provide a function mapping QP attributes and operation kinds to the max set of values allowed? -- MST From caitlinb at broadcom.com Thu Aug 10 12:21:50 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 10 Aug 2006 12:21:50 -0700 Subject: [openib-general] RDMA_READ SGE In-Reply-To: <20060810185450.GA20994@mellanox.co.il> Message-ID: <54AD0F12E08D1541B826BE97C98F99F17DD234@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Quoting r. Tom Tucker : >> Subject: RDMA_READ SGE >> >> Roland: >> >> iWARP RNIC's have a different SGE limit for RDMA_READ response then >> they do for other SGE. To support iWARP, we need to add a >> max_read_sge attribute to the ib_device structure. We had originally >> mapped this to max_sge_rd, but someone pointed out that this is not >> for RDMA_READ in general, but for RDMA_READ_ATOMIC in particular. >> >> In any event, we need to add this value to the ib_verbs.h structure >> so that ULP know how to limit sge for the RDMA_READ sink. >> >> Do you want me to submit a patch on top of Steve's, or do you want me >> to wait until you've got the git tree built? >> >> Thanks, >> Tom > > Actually, Mellanox devices also have different SGE limits > depending on the operation kinds and QP types. > So far the approach there has been that the device should > report one value for the best case result. > > Practically I don't see reporting the exact values as a > priority - I think applications really can figure this out > easier by attempting operating with relevant parameters and > fallback to smaller values on failure. > > But assuming that applications really need this information - > it seems we really should generalize this - maybe make the > device provide a function mapping QP attributes and operation > kinds to the max set of values allowed? That was the consensus in RNIC-PI, report as many distinct values as required so that the chances that a device would end up under- reporting any capability would be minimal. For iWARP, the RDMA Read SGE length is fairly critical, though. It is almost always exactly one. Having to report that you only support a single SGE when in fact *only* RDMA Reads have that restriction would be misleading the application a lot. The other distinctions don't tend to be a severe. From sweitzen at cisco.com Thu Aug 10 12:35:49 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 10 Aug 2006 12:35:49 -0700 Subject: [openib-general] OF bugzilla: have added 1.1rc1 version, and "RHEL 4" and "SLES 10" op_sys Message-ID: I have been given OF bugzilla admin privs, and have added more values. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlentini at netapp.com Thu Aug 10 13:36:51 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 10 Aug 2006 16:36:51 -0400 (EDT) Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <20060810102625.2f6eefc0.weiny2@llnl.gov> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> <20060810102625.2f6eefc0.weiny2@llnl.gov> Message-ID: On Thu, 10 Aug 2006, Ira Weiny wrote: > Is there a reason ib_ping.ko is not in this release? The OFED release team decided to only pull kernel sources from the upstream kernel. 
I understand why they did this, but I too had hoped that ib_ping would be in the next OFED release. Roland, will ib_ping be moving upstream? From rdreier at cisco.com Thu Aug 10 13:38:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 13:38:53 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: (James Lentini's message of "Thu, 10 Aug 2006 16:36:51 -0400 (EDT)") References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> <20060810102625.2f6eefc0.weiny2@llnl.gov> Message-ID: James> Roland, will ib_ping be moving upstream? No one has submitted it. And I really question what the point of it is anyway. - R. From jlentini at netapp.com Thu Aug 10 13:50:42 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 10 Aug 2006 16:50:42 -0400 (EDT) Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> <20060810102625.2f6eefc0.weiny2@llnl.gov> Message-ID: On Thu, 10 Aug 2006, Roland Dreier wrote: > James> Roland, will ib_ping be moving upstream? > > No one has submitted it. And I really question what the point of it > is anyway. Why don't you find it useful as an InfiniBand diagnosis tool? In the absence of IPoIB (and therefore ICMP pings), it can be used to test connectivity. From rdreier at cisco.com Thu Aug 10 13:59:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 Aug 2006 13:59:08 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: (James Lentini's message of "Thu, 10 Aug 2006 16:50:42 -0400 (EDT)") References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> <20060810102625.2f6eefc0.weiny2@llnl.gov> Message-ID: James> Why don't you find it useful as an InfiniBand diagnosis James> tool? James> In the absence of IPoIB (and therefore ICMP pings), it can James> be used to test connectivity. Sure, but AFAIK the ibping userspace app (which is required to use it anyway) has a server mode, so you don't need the kernel module. And I'm hard pressed to imagine a situation where the SM could bring a port up but ibping would show a problem. Putting some non-standard network server into the kernel just for some oddball debugging scenario doesn't seem worth it to me. - R. From halr at voltaire.com Thu Aug 10 14:00:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Aug 2006 17:00:14 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> <20060810102625.2f6eefc0.weiny2@llnl.gov> Message-ID: <1155243579.17511.321172.camel@hal.voltaire.com> On Thu, 2006-08-10 at 16:36, James Lentini wrote: > On Thu, 10 Aug 2006, Ira Weiny wrote: > > > Is there a reason ib_ping.ko is not in this release? > > The OFED release team decided to only pull kernel sources from the > upstream kernel. There are some things which have not yet gone upstream (e.g. SDP) and other things (like madeye) which will not go upstream. > I understand why they did this, but I too had hoped that ib_ping would > be in the next OFED release. If it's any help, the userspace command ibping can be used in server mode (-S) and provide the same functionality. -- Hal > Roland, will ib_ping be moving upstream? 
> > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > From halr at voltaire.com Thu Aug 10 14:28:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Aug 2006 17:28:04 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7682@mtlexch01.mtl.com> <20060810102625.2f6eefc0.weiny2@llnl.gov> Message-ID: <1155245242.17511.321593.camel@hal.voltaire.com> On Thu, 2006-08-10 at 16:59, Roland Dreier wrote: > James> Why don't you find it useful as an InfiniBand diagnosis > James> tool? > > James> In the absence of IPoIB (and therefore ICMP pings), it can > James> be used to test connectivity. > > Sure, but AFAIK the ibping userspace app (which is required to use it > anyway) has a server mode, Yes that is accurate. > so you don't need the kernel module. > And > I'm hard pressed to imagine a situation where the SM could bring a > port up but ibping would show a problem. I think it depends on the architecture and is true right now but not in the (perhaps near term) future. In the current architectures (mthca, ipath), perfquery accomplishes almost the same thing as ibping. I'm not so sure that is the case for ehca. I can envision component failures where something like this is useful. > Putting some non-standard network server into the kernel just for some > oddball debugging scenario doesn't seem worth it to me. It may not be an oddball debugging scenario. One man's oddball is perhaps another's elegant solution. -- Hal > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From boris at mellanox.com Thu Aug 10 16:47:30 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Thu, 10 Aug 2006 16:47:30 -0700 Subject: [openib-general] dapl_rmr_create in uDAPL gen-2 Message-ID: <1E3DCD1C63492545881FACB6063A57C13244DF@mtiexch01.mti.com> Hi, I have found out that dapl_rmr_create() function will never succeed in current gen-2 uDAPL implementation since dapls_ib_mw_alloc() function is not implemented. What is the reason for this ? Does it mean that DAPL doesn't support RDMA operations ? Thanks, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Aug 10 20:52:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Aug 2006 23:52:43 -0400 Subject: [openib-general] umad_recv won't block after first read... In-Reply-To: <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155220349.17511.313498.camel@hal.voltaire.com> <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> Message-ID: <1155268363.4507.577.camel@hal.voltaire.com> Hi Abhijit, On Thu, 2006-08-10 at 10:55, Abhijit Gadgil wrote: > Hi Hal, > I tried using the umad code as per the latest repository. > (The latest fix is on libibumad/umad.c Line # 806 right?) Yes. > I manually applied that patch. 
OK but not sure why you did this "manually". > It doesn't seem to work yet. What do you mean ? Do you mean that change makes no difference for this and you still have the same problem ? > Infact, what I figured out was that the 'poll' on the umad->fd isn't > blocking either. What do you mean by either ? A poll with an negative timeout should be infinite which means blocking so something is happening on the fd but perhaps is not reported correctly. This particular usage has not been tried to my knowledge although it is used in a similar manner for some other things (by OpenSM). What kernel version are you using ? Are you using OpenIB from svn or OFED or something else ? What version is this up to ? > The read returns the correct 'mad_agent' ie. 0 in this case and some length which is usually 24 for the specific code. That shows the breakage. Not sure why. > I am attaching the local copy of infiniband/include/mad.h and src/fields.c, so that you may be able to try this code. (There may be stray printf's in those files!). Also, since I was not quite clear about whether the subscriptions should include the RID information (as per section 15.2.5), so I tried including it first, which the SA doesn't seem to like, but the subscriptions work after I get rid of the RID header. This particular aspect is not quite clear to me yet. > > Please let me know what you find. I'll try to look at this more tomorrow. I have some other nits on the test code you sent. I'll comment on these later as well although I don't think they are the crux of the issue. -- Hal > Regards. > > -abhijit > > > On Aug 10, 2006 08:02 PM, Hal Rosenstock wrote: > > > Hi again Abhijit, > > > > On Thu, 2006-08-10 at 09:46, Abhijit Gadgil wrote: > > > Hi Hal, > > > > > > Please see below. > > > > > > On Aug 10, 2006 07:01 PM, Hal Rosenstock wrote: > > > > > > > Hi Abhijit, > > > > > > > > On Thu, 2006-08-10 at 07:21, Abhijit Gadgil wrote: > > > > > Hi All, > > > > > > > > > > I am trying to write a simple program using libibumad to 'subscribe' for traps and then receive traps from the SA. Most of the things seem to work fine, however I am facing a small problem where, after first read for the trap, all subsequent reads are not blocking (and return some incorrect length). > > > > > > > > What do those calls return ? What version of management are you using ? > > > > > > > > > > I am running the management code from the SVN (svn release 8781, it may be slightly outdated!) > > > > A fix just went in to libibumad:umad_recv which may impact your results. > > Can you update this and retry ? > > > > What do the reads return other than incorrect length ? > > > > -- Hal > > > > > > > Attached is the simple code, can someone tell, what exactly is wrong out here? > > > > > > > > I didn't build and run this so my comments are based on just looking at > > > > the code. I don't think it would build as there are other changes needed > > > > to support this (e.g. IB_SA_INFINFO_XXX in libibmad at a minimum). > > > > > > > > > > Oh I am sorry, I didn't mention this before, I modified the libibmad sources (specifically src/fields.c and include/infiniband/mad.h) files to accomplish this. Once I get it right, I will submit a patch. (It's too hacky right now) > > > > > > > Is the main loop based on some operational program ? If so, which one ? > > > > > > > > A couple of specific comments: > > > > > > > > init_sa_headers: InformInfo does not actually use RMPP so the > > > > initialization here needs to change. 
Not sure what doing this would > > > > cause without actually building and running this. > > > > > > > > > > This was my first try of trying to use umad, hence for simplicity I copied from some reference code that was having RMPP enabled. I think I should get rid of this as well. > > > > > > > > > > Based on this, what is the result of the subscription ? Does it really > > > > succeed ? > > > > > > Well the subscriptions in-deed succeeded and I was able to receive IPoIB broadcast multicast group creation/deletion traps as well, but the problem mentioned below (ie. non-blocking reads) started appearing. > > > > > > > main: Rather than hard coding SM LID to 0x12, there are ways to get this > > > > dynamically. There are examples of how to do this. > > > > > > Sorry about this again. I realized it later that it is stupid to hard code it (eg. I could have got it from the ca[].port->sm_lid), will fix that eventually. > > > > > > Thanks. > > > > > > -abhijit > > > > > > > -- Hal > > > > > > > > > Thanks > > > > > > > > > > -abhijit > > > From Abhijit.Gadgil at pantasys.com Thu Aug 10 21:44:25 2006 From: Abhijit.Gadgil at pantasys.com (Abhijit Gadgil) Date: Thu, 10 Aug 2006 21:44:25 -0700 (PDT) Subject: [openib-general] umad_recv won't block after first read... References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155220349.17511.313498.camel@hal.voltaire.com> <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com><1155268363.4507.577.camel@hal.voltaire.com> Message-ID: <4053879.1155271465930.SLOX.WebMail.wwwrun@ox.pantasys.com> Hi Hal, Sorry for being ambiguous on the answers below. However, I figured out what the problem was (while not looking at the code and thinking over it offline.) The main mistake was the umad_send part in the while(1) loop. Where I have specified the timeout value greater than '0' which means the mads were solicited. The SubnAdmResponse should not be sent as solicited and that was the main problem. So if I set the timeout value to '0' and the retries count to '0', there is no data available for subsequent reads and the 'read' blocks as expected. Thanks for the help. Some of the clarifications for previous questions are below. Please see inline. On Aug 11, 2006 09:22 AM, Hal Rosenstock wrote: > Hi Abhijit, > > On Thu, 2006-08-10 at 10:55, Abhijit Gadgil wrote: > > Hi Hal, > > > I tried using the umad code as per the latest repository. > > (The latest fix is on libibumad/umad.c Line # 806 right?) > > Yes. > > > I manually applied that patch. > > OK but not sure why you did this "manually". > Sorry about this, the machine where I am testing this code does not grab code from the svn repository directly, hence I just edited the file with hand. > > It doesn't seem to work yet. > > What do you mean ? Do you mean that change makes no difference for this > and you still have the same problem ? > > > Infact, what I figured out was that the 'poll' on the umad->fd isn't > > blocking either. > > What do you mean by either ? > Well both 'read' and 'poll' were returning immediately because of the 'timeout' parameter specified in the umad_send. So even if I specify the timeout to be a negative value (in umad_poll), there was a data available always. :-( > A poll with an negative timeout should be infinite which means blocking > so something is happening on the fd but perhaps is not reported > correctly. 
This particular usage has not been tried to my knowledge > although it is used in a similar manner for some other things (by > OpenSM). > > What kernel version are you using ? Are you using OpenIB from svn or > OFED or something else ? What version is this up to ? > I am using the latest kernel version 2.6.17 and openIB from svn as well. (same revision ie. 8781). > > The read returns the correct 'mad_agent' ie. 0 in this case and some length which is usually 24 for the specific code. > > That shows the breakage. Not sure why. > > > I am attaching the local copy of infiniband/include/mad.h and src/fields.c, so that you may be able to try this code. (There may be stray printf's in those files!). Also, since I was not quite clear about whether the subscriptions should include the RID information (as per section 15.2.5), so I tried including it first, which the SA doesn't seem to like, but the subscriptions work after I get rid of the RID header. This particular aspect is not quite clear to me yet. > > > > Please let me know what you find. > > I'll try to look at this more tomorrow. I have some other nits on the > test code you sent. I'll comment on these later as well although I don't > think they are the crux of the issue. Please let me know additional comments that you have. Further, it is not quite clear from the specification that whether one should include the RIDs in the InformInfo records during subscription. What is the correct intended behavior? Regards -abhijit > > -- Hal > > > Regards. > > > > -abhijit > > > > > > On Aug 10, 2006 08:02 PM, Hal Rosenstock wrote: > > > > > Hi again Abhijit, > > > > > > On Thu, 2006-08-10 at 09:46, Abhijit Gadgil wrote: > > > > Hi Hal, > > > > > > > > Please see below. > > > > > > > > On Aug 10, 2006 07:01 PM, Hal Rosenstock wrote: > > > > > > > > > Hi Abhijit, > > > > > > > > > > On Thu, 2006-08-10 at 07:21, Abhijit Gadgil wrote: > > > > > > Hi All, > > > > > > > > > > > > I am trying to write a simple program using libibumad to 'subscribe' for traps and then receive traps from the SA. Most of the things seem to work fine, however I am facing a small problem where, after first read for the trap, all subsequent reads are not blocking (and return some incorrect length). > > > > > > > > > > What do those calls return ? What version of management are you using ? > > > > > > > > > > > > > I am running the management code from the SVN (svn release 8781, it may be slightly outdated!) > > > > > > A fix just went in to libibumad:umad_recv which may impact your results. > > > Can you update this and retry ? > > > > > > What do the reads return other than incorrect length ? > > > > > > -- Hal > > > > > > > > > Attached is the simple code, can someone tell, what exactly is wrong out here? > > > > > > > > > > I didn't build and run this so my comments are based on just looking at > > > > > the code. I don't think it would build as there are other changes needed > > > > > to support this (e.g. IB_SA_INFINFO_XXX in libibmad at a minimum). > > > > > > > > > > > > > Oh I am sorry, I didn't mention this before, I modified the libibmad sources (specifically src/fields.c and include/infiniband/mad.h) files to accomplish this. Once I get it right, I will submit a patch. (It's too hacky right now) > > > > > > > > > Is the main loop based on some operational program ? If so, which one ? > > > > > > > > > > A couple of specific comments: > > > > > > > > > > init_sa_headers: InformInfo does not actually use RMPP so the > > > > > initialization here needs to change. 
Not sure what doing this would > > > > > cause without actually building and running this. > > > > > > > > > > > > > This was my first try of trying to use umad, hence for simplicity I copied from some reference code that was having RMPP enabled. I think I should get rid of this as well. > > > > > > > > > > > > > Based on this, what is the result of the subscription ? Does it really > > > > > succeed ? > > > > > > > > Well the subscriptions in-deed succeeded and I was able to receive IPoIB broadcast multicast group creation/deletion traps as well, but the problem mentioned below (ie. non-blocking reads) started appearing. > > > > > > > > > main: Rather than hard coding SM LID to 0x12, there are ways to get this > > > > > dynamically. There are examples of how to do this. > > > > > > > > Sorry about this again. I realized it later that it is stupid to hard code it (eg. I could have got it from the ca[].port->sm_lid), will fix that eventually. > > > > > > > > Thanks. > > > > > > > > -abhijit > > > > > > > > > -- Hal > > > > > > > > > > > Thanks > > > > > > > > > > > > -abhijit > > > > > > > From halr at voltaire.com Fri Aug 11 03:53:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2006 06:53:30 -0400 Subject: [openib-general] umad_recv won't block after first read... In-Reply-To: <4053879.1155271465930.SLOX.WebMail.wwwrun@ox.pantasys.com> References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155220349.17511.313498.camel@hal.voltaire.com> <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155268363.4507.577.camel@hal.voltaire.com> <4053879.1155271465930.SLOX.WebMail.wwwrun@ox.pantasys.com> Message-ID: <1155293609.4507.10852.camel@hal.voltaire.com> Hi Abhijit, On Fri, 2006-08-11 at 00:44, Abhijit Gadgil wrote: > Hi Hal, > > Sorry for being ambiguous on the answers below. However, I figured out what > the problem was (while not looking at the code and thinking over it offline.) > The main mistake was the umad_send part in the while(1) loop. Where I have > specified the timeout value greater than '0' which means the mads were > solicited. The SubnAdmResponse should not be sent as solicited and that > was the main problem. So if I set the timeout value to '0' and the > retries count to '0', there is no data available for subsequent reads > and the 'read' blocks as expected. I missed that. Glad you found it. > Thanks for the help. Some of the clarifications for previous questions > are below. Please see inline. > > On Aug 11, 2006 09:22 AM, Hal Rosenstock wrote: > > > Hi Abhijit, > > > > On Thu, 2006-08-10 at 10:55, Abhijit Gadgil wrote: > > > Hi Hal, > > > > > I tried using the umad code as per the latest repository. > > > (The latest fix is on libibumad/umad.c Line # 806 right?) > > > > Yes. > > > > > I manually applied that patch. > > > > OK but not sure why you did this "manually". > > > > Sorry about this, the machine where I am testing this code does not grab code from the svn repository directly, hence I just edited the file with hand. > > > > It doesn't seem to work yet. > > > > What do you mean ? Do you mean that change makes no difference for this > > and you still have the same problem ? > > > > > Infact, what I figured out was that the 'poll' on the umad->fd isn't > > > blocking either. > > > > What do you mean by either ? 
> > > > Well both 'read' and 'poll' were returning immediately because of the 'timeout' parameter specified in the umad_send. So even if I specify the timeout to be a negative value (in umad_poll), there was a data available always. :-( > > > A poll with an negative timeout should be infinite which means blocking > > so something is happening on the fd but perhaps is not reported > > correctly. This particular usage has not been tried to my knowledge > > although it is used in a similar manner for some other things (by > > OpenSM). > > > > What kernel version are you using ? Are you using OpenIB from svn or > > OFED or something else ? What version is this up to ? > > > > I am using the latest kernel version 2.6.17 and openIB from svn as well. (same revision ie. 8781). > > > > The read returns the correct 'mad_agent' ie. 0 in this case and some length which is usually 24 for the specific code. > > > > That shows the breakage. Not sure why. > > > > > I am attaching the local copy of infiniband/include/mad.h and src/fields.c, so that you may be able to try this code. (There may be stray printf's in those files!). Also, since I was not quite clear about whether the subscriptions should include the RID information (as per section 15.2.5), so I tried including it first, which the SA doesn't seem to like, but the subscriptions work after I get rid of the RID header. This particular aspect is not quite clear to me yet. > > > > > > Please let me know what you find. > > > > I'll try to look at this more tomorrow. I have some other nits on the > > test code you sent. I'll comment on these later as well although I don't > > think they are the crux of the issue. > > Please let me know additional comments that you have. Some more quick comments on test-umad.c: In init_informinfo_set, some settings could be changed (more to make sure the SM side is doing the right thing): LIDRangeEnd is ignored when LIDRangeBegin is 0xFFFF which it is so it shouldn't matter what this field is set to (should work when this is set to 0). TrapNumber/DeviceID is ignored when IsGeneric so this could be set to 0. QPN is ignored when subscribe is 0. In main.c, you should only need to set the REPORT method mask bit not the following ones: set_bit(IB_MAD_METHOD_TRAP, &method_mask); set_bit(IB_MAD_METHOD_TRAP_REPRESS, &method_mask); > Further, it is not quite clear from the specification that whether one > should include the RIDs in the InformInfo records during subscription. > What is the correct intended behavior? Sorry, I forgot to answer this but I think you already knew the answer empirically: There is no RID for InformInfo. Not all SA records have RIDs (e.g. PathRecords, MultiPathRecords, TraceRecords, ServiceAssociationRecords). -- Hal > Regards > > -abhijit > > > > > -- Hal > > > > > Regards. > > > > > > -abhijit > > > > > > > > > On Aug 10, 2006 08:02 PM, Hal Rosenstock wrote: > > > > > > > Hi again Abhijit, > > > > > > > > On Thu, 2006-08-10 at 09:46, Abhijit Gadgil wrote: > > > > > Hi Hal, > > > > > > > > > > Please see below. > > > > > > > > > > On Aug 10, 2006 07:01 PM, Hal Rosenstock wrote: > > > > > > > > > > > Hi Abhijit, > > > > > > > > > > > > On Thu, 2006-08-10 at 07:21, Abhijit Gadgil wrote: > > > > > > > Hi All, > > > > > > > > > > > > > > I am trying to write a simple program using libibumad to 'subscribe' for traps and then receive traps from the SA. 
Most of the things seem to work fine, however I am facing a small problem where, after first read for the trap, all subsequent reads are not blocking (and return some incorrect length). > > > > > > > > > > > > What do those calls return ? What version of management are you using ? > > > > > > > > > > > > > > > > I am running the management code from the SVN (svn release 8781, it may be slightly outdated!) > > > > > > > > A fix just went in to libibumad:umad_recv which may impact your results. > > > > Can you update this and retry ? > > > > > > > > What do the reads return other than incorrect length ? > > > > > > > > -- Hal > > > > > > > > > > > Attached is the simple code, can someone tell, what exactly is wrong out here? > > > > > > > > > > > > I didn't build and run this so my comments are based on just looking at > > > > > > the code. I don't think it would build as there are other changes needed > > > > > > to support this (e.g. IB_SA_INFINFO_XXX in libibmad at a minimum). > > > > > > > > > > > > > > > > Oh I am sorry, I didn't mention this before, I modified the libibmad sources (specifically src/fields.c and include/infiniband/mad.h) files to accomplish this. Once I get it right, I will submit a patch. (It's too hacky right now) > > > > > > > > > > > Is the main loop based on some operational program ? If so, which one ? > > > > > > > > > > > > A couple of specific comments: > > > > > > > > > > > > init_sa_headers: InformInfo does not actually use RMPP so the > > > > > > initialization here needs to change. Not sure what doing this would > > > > > > cause without actually building and running this. > > > > > > > > > > > > > > > > This was my first try of trying to use umad, hence for simplicity I copied from some reference code that was having RMPP enabled. I think I should get rid of this as well. > > > > > > > > > > > > > > > > Based on this, what is the result of the subscription ? Does it really > > > > > > succeed ? > > > > > > > > > > Well the subscriptions in-deed succeeded and I was able to receive IPoIB broadcast multicast group creation/deletion traps as well, but the problem mentioned below (ie. non-blocking reads) started appearing. > > > > > > > > > > > main: Rather than hard coding SM LID to 0x12, there are ways to get this > > > > > > dynamically. There are examples of how to do this. > > > > > > > > > > Sorry about this again. I realized it later that it is stupid to hard code it (eg. I could have got it from the ca[].port->sm_lid), will fix that eventually. > > > > > > > > > > Thanks. > > > > > > > > > > -abhijit > > > > > > > > > > > -- Hal > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > -abhijit > > > > > > > > > > > > > > From halr at voltaire.com Fri Aug 11 04:00:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2006 07:00:29 -0400 Subject: [openib-general] umad_recv won't block after first read... In-Reply-To: <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155220349.17511.313498.camel@hal.voltaire.com> <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> Message-ID: <1155294029.4507.11039.camel@hal.voltaire.com> Hi Abhijit, On Thu, 2006-08-10 at 10:55, Abhijit Gadgil wrote: > I am attaching the local copy of infiniband/include/mad.h and src/fields.c, > so that you may be able to try this code. 
(There may be stray > printf's in those files!). It would be easier to see what changed if diffs (patches) for these were supplied. That is the accepted practice. > Also, since I was not quite clear about > whether the subscriptions should include the RID information > (as per section 15.2.5), so I tried including it first, which the > SA doesn't seem to like, but the subscriptions work after I get > rid of the RID header. This particular aspect is not quite clear to me yet. There is no RID on SA InformInfo. -- Hal From Abhijit.Gadgil at pantasys.com Fri Aug 11 04:24:44 2006 From: Abhijit.Gadgil at pantasys.com (Abhijit Gadgil) Date: Fri, 11 Aug 2006 04:24:44 -0700 (PDT) Subject: [openib-general] umad_recv won't block after first read... References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155220349.17511.313498.camel@hal.voltaire.com> <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com><1155294029.4507.11039.camel@hal.voltaire.com> Message-ID: <4955656.1155295484537.SLOX.WebMail.wwwrun@ox.pantasys.com> Hi Hal, Please see below. On Aug 11, 2006 04:30 PM, Hal Rosenstock wrote: > Hi Abhijit, > > On Thu, 2006-08-10 at 10:55, Abhijit Gadgil wrote: > > > I am attaching the local copy of infiniband/include/mad.h and src/fields.c, > > so that you may be able to try this code. (There may be stray > > printf's in those files!). > > It would be easier to see what changed if diffs (patches) for these were > supplied. That is the accepted practice. > Sure, I will do that. I have also added constants in mad.h and fields.c for decoding the received generic notice types as per section 14.2.5.1. I will include that as well. > > Also, since I was not quite clear about > > whether the subscriptions should include the RID information > > (as per section 15.2.5), so I tried including it first, which the > > SA doesn't seem to like, but the subscriptions work after I get > > rid of the RID header. This particular aspect is not quite clear to me yet. > > There is no RID on SA InformInfo. Section 15.2.5.12 on page 894 mentions this. Is this an old reference that should be deleted? Regards -abhijit > -- Hal > > > From halr at voltaire.com Fri Aug 11 05:35:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2006 08:35:54 -0400 Subject: [openib-general] umad_recv won't block after first read... In-Reply-To: <4955656.1155295484537.SLOX.WebMail.wwwrun@ox.pantasys.com> References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155220349.17511.313498.camel@hal.voltaire.com> <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155294029.4507.11039.camel@hal.voltaire.com> <4955656.1155295484537.SLOX.WebMail.wwwrun@ox.pantasys.com> Message-ID: <1155299754.4507.13514.camel@hal.voltaire.com> On Fri, 2006-08-11 at 07:24, Abhijit Gadgil wrote: > > There is no RID on SA InformInfo. > > Section 15.2.5.12 on page 894 mentions this. > Is this an old reference that should be deleted? That's InformInfoRecord which is different than InformInfo. 
-- Hal > Regards > > -abhijit > > > > > > -- Hal > > > > > > > > > From jlentini at netapp.com Fri Aug 11 06:39:19 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 11 Aug 2006 09:39:19 -0400 (EDT) Subject: [openib-general] dapl_rmr_create in uDAPL gen-2 In-Reply-To: <1E3DCD1C63492545881FACB6063A57C13244DF@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C13244DF@mtiexch01.mti.com> Message-ID: On Thu, 10 Aug 2006, Boris Shpolyansky wrote: > I have found out that dapl_rmr_create() function will never succeed > in current gen-2 uDAPL implementation since dapls_ib_mw_alloc() > function is not implemented. What is the reason for this ? Does it > mean that DAPL doesn't support RDMA operations ? DAPL supports RDMA operations. DAPL RMRs (aka IB MWs) are not necessary for RDMA. Memory windows are are not supported in the Mellanox OFED driver. Once they are implemented, support for them can be added to DAPL. Currently I don't know of anyone working on adding MW support to the Mellanox driver. From halr at voltaire.com Fri Aug 11 06:36:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2006 09:36:57 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> Message-ID: <1155303417.4507.15009.camel@hal.voltaire.com> On Thu, 2006-08-10 at 09:42, Tziporet Koren wrote: > The schedule is sleeps in 2 weeks meaning: Thanks. > Target release date: 12-Sep > > Intermediate milestones: > 1. Create 1.1 branch of user level: 27-Jul - done > 2. RC1: 8-Aug - done > 3. Feature freeze (RC2): 17-Aug What is the start build date for RC2 ? When do developers need to have their code in by to make RC2 ? > 4. Code freeze (rc-x): 6-Sep Is this 1 or 2 RCs beyond RC2 in order to make this ? -- Hal > 5. Final release: 12-Sep > > Tziporet > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, August 09, 2006 9:04 PM > To: Tziporet Koren > Cc: OpenFabricsEWG; openib > Subject: Re: [openfabrics-ewg] OFED 1.1-rc1 is available > > > On Tue, 2006-08-08 at 10:48, Tziporet Koren wrote: > > Hi, > > > > In two week delay we publish OFED 1.1-RC1 on > https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > > File: OFED-1.1-rc1.tgz > > Is there an update to the OFED 1.1 schedule going forward ? > > -- Hal > > > 1. Schedule: > > ============ > > Target release date: 31-Aug > > Intermediate milestones: > > 1. Create 1.1 branch of user level code and rc1: 27-Jul > > 2. Feature freeze : 3-Aug > > 3. Code freeze (rc-x): 25-Aug > > 4. Final release: 31-Aug > From Abhijit.Gadgil at pantasys.com Fri Aug 11 06:52:38 2006 From: Abhijit.Gadgil at pantasys.com (Abhijit Gadgil) Date: Fri, 11 Aug 2006 06:52:38 -0700 (PDT) Subject: [openib-general] umad_recv won't block after first read... 
References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155220349.17511.313498.camel@hal.voltaire.com> <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155294029.4507.11039.camel@hal.voltaire.com> <4955656.1155295484537.SLOX.WebMail.wwwrun@ox.pantasys.com><1155299754.4507.13514.camel@hal.voltaire.com> Message-ID: <5496280.1155304358400.SLOX.WebMail.wwwrun@ox.pantasys.com> On Aug 11, 2006 06:05 PM, Hal Rosenstock wrote: > On Fri, 2006-08-11 at 07:24, Abhijit Gadgil wrote: > > > There is no RID on SA InformInfo. > > > > Section 15.2.5.12 on page 894 mentions this. > > Is this an old reference that should be deleted? > > That's InformInfoRecord which is different than InformInfo. Thanks, but it is not clear to me when the InformInfoRecord is used instead of InformInfo? Can you please elaborate. Regards. -abhijit > -- Hal > > > Regards > > > > -abhijit > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > From jlentini at netapp.com Fri Aug 11 07:20:53 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 11 Aug 2006 10:20:53 -0400 (EDT) Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: References: Message-ID: On Wed, 9 Aug 2006, Krishna Kumar2 wrote: > Hi James, > > Sorry for the late response, my system was down and I just got it fixed. > > > Is there a benefit to having rdmav_create_qp() take generic > > parameters if the application needs to understand the type of QP (IB, > > iWARP, etc.) created and the transport specific communication manager > > calls that are needed to manipulate it? > > > > Would it make more sense if the QP create command was also transport > > specific? > > My opinion is that the create_qp taking generic parameters is > correct, only subsequent calls may need to use transport specific > calls/arguments. Infact rdma_create_qp uses the ibv_create_qp (now > changed to rdmav_create_qp) call internally. If you want to have a generic rdmav_create_qp() call, there needs to be programmatic way for the API consumer to determine what type of QP (iWARP vs. IB) was created. I don't see any way to do that in your patch: http://openib.org/pipermail/openib-general/2006-August/024605.html > PS : What is the opinion on this patchset ? I like the new approach you are taking (keeping 1 verbs library and adding rdmav_ symbol names). This change to transport neutral names is long overdue. When you finish with the userspace APIs, I hope you will update the kernel APIs as well. From halr at voltaire.com Fri Aug 11 07:19:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2006 10:19:48 -0400 Subject: [openib-general] umad_recv won't block after first read... 
In-Reply-To: <5496280.1155304358400.SLOX.WebMail.wwwrun@ox.pantasys.com> References: <5744945.1155208881347.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155216690.17511.312291.camel@hal.voltaire.com> <3826137.1155217606588.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155220349.17511.313498.camel@hal.voltaire.com> <6837019.1155221752641.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155294029.4507.11039.camel@hal.voltaire.com> <4955656.1155295484537.SLOX.WebMail.wwwrun@ox.pantasys.com> <1155299754.4507.13514.camel@hal.voltaire.com> <5496280.1155304358400.SLOX.WebMail.wwwrun@ox.pantasys.com> Message-ID: <1155305987.4507.16179.camel@hal.voltaire.com> On Fri, 2006-08-11 at 09:52, Abhijit Gadgil wrote: > On Aug 11, 2006 06:05 PM, Hal Rosenstock wrote: > > > On Fri, 2006-08-11 at 07:24, Abhijit Gadgil wrote: > > > > There is no RID on SA InformInfo. > > > > > > Section 15.2.5.12 on page 894 mentions this. > > > Is this an old reference that should be deleted? > > > > That's InformInfoRecord which is different than InformInfo. > > Thanks, but it is not clear to me when the InformInfoRecord is used > instead of InformInfo? Can you please elaborate. It's not "instead" of; it's in addition to. If you look at the SA method/attribute table 190 on p. 890, you will see InformInfo is only Set and InformInfoRecord is Get or GetTable. InformInfo is the actual act of subscribing/unsubscribing. InformInfoRecord is used to see the current subscriptions (most applications don't need this functionality). They needed to be separate unlike other things since InformInfo was defined in chapter 13 as a common attribute and reused by the SA. Make sense ? -- Hal > Regards. > > -abhijit > > > > > -- Hal > > > > > Regards > > > > > > -abhijit > > > > > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > > > > > > > > > > > From tom at opengridcomputing.com Fri Aug 11 07:46:40 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 11 Aug 2006 09:46:40 -0500 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: References: Message-ID: <1155307600.15374.72.camel@trinity.ogc.int> On Fri, 2006-08-11 at 10:20 -0400, James Lentini wrote: > > On Wed, 9 Aug 2006, Krishna Kumar2 wrote: > > > Hi James, > > > > Sorry for the late response, my system was down and I just got it fixed. > > > > > Is there a benefit to having rdmav_create_qp() take generic > > > parameters if the application needs to understand the type of QP (IB, > > > iWARP, etc.) created and the transport specific communication manager > > > calls that are needed to manipulate it? > > > > > > Would it make more sense if the QP create command was also transport > > > specific? > > > > My opinion is that the create_qp taking generic parameters is > > correct, only subsequent calls may need to use transport specific > > calls/arguments. Infact rdma_create_qp uses the ibv_create_qp (now > > changed to rdmav_create_qp) call internally. > > If you want to have a generic rdmav_create_qp() call, there needs to > be programmatic way for the API consumer to determine what type of QP > (iWARP vs. IB) was created. > > I don't see any way to do that in your patch: I think the QP is associated with the transport type indirectly through the context. It can be queried with ibv_get_transport_type verb. A renamed rdma_get_transport type would probably suffice. > > http://openib.org/pipermail/openib-general/2006-August/024605.html > > > PS : What is the opinion on this patchset ? 
> > I like the new approach you are taking (keeping 1 verbs library and > adding rdmav_ symbol names). This change to transport neutral names is > long overdue. > > When you finish with the userspace APIs, I hope you will update the > kernel APIs as well. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jlentini at netapp.com Fri Aug 11 08:25:33 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 11 Aug 2006 11:25:33 -0400 (EDT) Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: <1155307600.15374.72.camel@trinity.ogc.int> References: <1155307600.15374.72.camel@trinity.ogc.int> Message-ID: On Fri, 11 Aug 2006, Tom Tucker wrote: > On Fri, 2006-08-11 at 10:20 -0400, James Lentini wrote: > > > > On Wed, 9 Aug 2006, Krishna Kumar2 wrote: > > > > > Hi James, > > > > > > Sorry for the late response, my system was down and I just got it fixed. > > > > > > > Is there a benefit to having rdmav_create_qp() take generic > > > > parameters if the application needs to understand the type of QP (IB, > > > > iWARP, etc.) created and the transport specific communication manager > > > > calls that are needed to manipulate it? > > > > > > > > Would it make more sense if the QP create command was also transport > > > > specific? > > > > > > My opinion is that the create_qp taking generic parameters is > > > correct, only subsequent calls may need to use transport specific > > > calls/arguments. Infact rdma_create_qp uses the ibv_create_qp (now > > > changed to rdmav_create_qp) call internally. > > > > If you want to have a generic rdmav_create_qp() call, there needs to > > be programmatic way for the API consumer to determine what type of QP > > (iWARP vs. IB) was created. > > > > I don't see any way to do that in your patch: > > I think the QP is associated with the transport type indirectly through > the context. It can be queried with ibv_get_transport_type verb. A > renamed rdma_get_transport type would probably suffice. We don't have a userspace ibv_get_transport_type() verb. There is a kernel verb, but no userspace version. > > http://openib.org/pipermail/openib-general/2006-August/024605.html > > > > > PS : What is the opinion on this patchset ? > > > > I like the new approach you are taking (keeping 1 verbs library and > > adding rdmav_ symbol names). This change to transport neutral names is > > long overdue. > > > > When you finish with the userspace APIs, I hope you will update the > > kernel APIs as well. From tom at opengridcomputing.com Fri Aug 11 08:33:21 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 11 Aug 2006 10:33:21 -0500 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: References: <1155307600.15374.72.camel@trinity.ogc.int> Message-ID: <1155310401.15374.87.camel@trinity.ogc.int> [...snip...] > > I think the QP is associated with the transport type indirectly through > > the context. It can be queried with ibv_get_transport_type verb. A > > renamed rdma_get_transport type would probably suffice. > > We don't have a userspace ibv_get_transport_type() verb. There is a > kernel verb, but no userspace version. It's in the iwarp branch in verbs.h. Sorry, I should have pointed that out. 
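To make the consumer-side pattern being discussed here concrete, a rough sketch follows. The rdmav_create_qp(), rdmav_get_transport_type() and RDMAV_TRANSPORT_* names are assumptions based on the rdmav_ prefix proposed in this thread and on the ibv_get_transport_type() verb in the iwarp branch; none of them exist in mainline libibverbs, so treat this as an illustration of the idea rather than working code.

#include <infiniband/verbs.h>

/*
 * Sketch only: rdmav_create_qp(), rdmav_get_transport_type() and the
 * RDMAV_TRANSPORT_* values are hypothetical names following the proposed
 * rdmav_ prefix; they are not part of mainline libibverbs.
 */
enum rdmav_transport_type {
	RDMAV_TRANSPORT_IB,
	RDMAV_TRANSPORT_IWARP,
};

static struct ibv_qp *create_qp_any_transport(struct ibv_pd *pd,
					      struct ibv_qp_init_attr *attr)
{
	struct ibv_qp *qp = rdmav_create_qp(pd, attr);	/* generic create */

	if (!qp)
		return NULL;

	/* The consumer still has to pick the right CM for connection setup. */
	switch (rdmav_get_transport_type(pd->context)) {
	case RDMAV_TRANSPORT_IB:
		/* resolve a path and connect via the IB CM */
		break;
	case RDMAV_TRANSPORT_IWARP:
		/* connect via the iWARP (TCP-based) CM */
		break;
	}
	return qp;
}

The point is that a generic create call only helps if the application can later ask which transport, and therefore which connection manager, applies to the QP it got back.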
> > > > http://openib.org/pipermail/openib-general/2006-August/024605.html > > > > > > > PS : What is the opinion on this patchset ? > > > > > > I like the new approach you are taking (keeping 1 verbs library and > > > adding rdmav_ symbol names). This change to transport neutral names is > > > long overdue. > > > > > > When you finish with the userspace APIs, I hope you will update the > > > kernel APIs as well. From zach.brown at oracle.com Fri Aug 11 08:40:36 2006 From: zach.brown at oracle.com (Zach Brown) Date: Fri, 11 Aug 2006 08:40:36 -0700 Subject: [openib-general] does Oracle still support AIO SDP? In-Reply-To: References: Message-ID: <44DCA4F4.5060007@oracle.com> Scott Weitzenkamp (sweitzen) wrote: > I know at one point Oracle supported AIO SDP on Linux, is still still > supported? Having asked around I believe that it still is. I don't know the details, though. - z From rdreier at cisco.com Fri Aug 11 09:00:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 11 Aug 2006 09:00:22 -0700 Subject: [openib-general] [PATCH] Fix potential deadlock in mthca Message-ID: Here's a long-standing bug that lockdep found very nicely. Ingo/Arjan, can you confirm that the fix looks OK and I am using spin_lock_nested() properly? I couldn't find much documentation or many examples of it, so I'm not positive this is the right way to handle this fix. I'm inclined to put this fix in 2.6.18, since this is a kernel deadlock that is triggerable from userspace via uverbs. Comments? Thanks, Roland commit a19aa5c5fdda8b556ab238177ee27c5ef7873c94 Author: Roland Dreier Date: Fri Aug 11 08:56:57 2006 -0700 IB/mthca: Fix potential AB-BA deadlock with CQ locks When destroying a QP, mthca locks both the QP's send CQ and receive CQ. However, the following scenario is perfectly valid: QP_a: send_cq == CQ_x, recv_cq == CQ_y QP_b: send_cq == CQ_y, recv_cq == CQ_x The old mthca code simply locked send_cq and then recv_cq, which in this case could lead to an AB-BA deadlock if QP_a and QP_b were destroyed simultaneously. We can fix this by changing the locking code to lock the CQ with the lower CQ number first, which will create a consistent lock ordering. Also, the second CQ is locked with spin_lock_nested() to tell lockdep that we know what we're doing with the lock nesting. This bug was found by lockdep. Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 8de2887..9a5bece 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -136,8 +136,8 @@ struct mthca_ah { * We have one global lock that protects dev->cq/qp_table. Each * struct mthca_cq/qp also has its own lock. An individual qp lock * may be taken inside of an individual cq lock. Both cqs attached to - * a qp may be locked, with the send cq locked first. No other - * nesting should be done. + * a qp may be locked, with the cq with the lower cqn locked first. + * No other nesting should be done. * * Each struct mthca_cq/qp also has an ref count, protected by the * corresponding table lock. 
The pointer from the cq/qp_table to the diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 157b4f8..2e8f6f3 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1263,6 +1263,32 @@ int mthca_alloc_qp(struct mthca_dev *dev return 0; } +static void mthca_lock_cqs(struct mthca_cq *send_cq, struct mthca_cq *recv_cq) +{ + if (send_cq == recv_cq) + spin_lock_irq(&send_cq->lock); + else if (send_cq->cqn < recv_cq->cqn) { + spin_lock_irq(&send_cq->lock); + spin_lock_nested(&recv_cq->lock, SINGLE_DEPTH_NESTING); + } else { + spin_lock_irq(&recv_cq->lock); + spin_lock_nested(&send_cq->lock, SINGLE_DEPTH_NESTING); + } +} + +static void mthca_unlock_cqs(struct mthca_cq *send_cq, struct mthca_cq *recv_cq) +{ + if (send_cq == recv_cq) + spin_unlock_irq(&send_cq->lock); + else if (send_cq->cqn < recv_cq->cqn) { + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + } else { + spin_unlock(&send_cq->lock); + spin_unlock_irq(&recv_cq->lock); + } +} + int mthca_alloc_sqp(struct mthca_dev *dev, struct mthca_pd *pd, struct mthca_cq *send_cq, @@ -1315,17 +1341,13 @@ int mthca_alloc_sqp(struct mthca_dev *de * Lock CQs here, so that CQ polling code can do QP lookup * without taking a lock. */ - spin_lock_irq(&send_cq->lock); - if (send_cq != recv_cq) - spin_lock(&recv_cq->lock); + mthca_lock_cqs(send_cq, recv_cq); spin_lock(&dev->qp_table.lock); mthca_array_clear(&dev->qp_table.qp, mqpn); spin_unlock(&dev->qp_table.lock); - if (send_cq != recv_cq) - spin_unlock(&recv_cq->lock); - spin_unlock_irq(&send_cq->lock); + mthca_unlock_cqs(send_cq, recv_cq); err_out: dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, @@ -1359,9 +1381,7 @@ void mthca_free_qp(struct mthca_dev *dev * Lock CQs here, so that CQ polling code can do QP lookup * without taking a lock. */ - spin_lock_irq(&send_cq->lock); - if (send_cq != recv_cq) - spin_lock(&recv_cq->lock); + mthca_lock_cqs(send_cq, recv_cq); spin_lock(&dev->qp_table.lock); mthca_array_clear(&dev->qp_table.qp, @@ -1369,9 +1389,7 @@ void mthca_free_qp(struct mthca_dev *dev --qp->refcount; spin_unlock(&dev->qp_table.lock); - if (send_cq != recv_cq) - spin_unlock(&recv_cq->lock); - spin_unlock_irq(&send_cq->lock); + mthca_unlock_cqs(send_cq, recv_cq); wait_event(qp->wait, !get_qp_refcount(dev, qp)); From ryszard.jurga at cern.ch Fri Aug 11 09:05:29 2006 From: ryszard.jurga at cern.ch (Ryszard Jurga) Date: Fri, 11 Aug 2006 18:05:29 +0200 Subject: [openib-general] DAPL and local_iov in RDMA RR/RW mode Message-ID: <025801c6bd5f$f77b8d10$3b388d80@cern.ch> Hi everybody, I have one question about a number of segments in local_iov when using RDMA Write and Read mode. Is it possible to have num_segments>1? I am asking, because when I try to set up num_segments to a value > 1, then I can still only read/write one segment, even though I have an appropriate remote buffer already reserved. The size of transfered buffer is 10bytes, num_segs=2. The information, which is printed below, was obrained from network devices with one remark - I have set up manualy max_rdma_read_iov=10 and max_rdma_write_iov=10. Thank you in advance for your help. Best regards, Ryszard. 
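For reference, posting a single RDMA Write that gathers from two 5-byte local segments looks roughly like the sketch below. This is not taken from the original mail: ep, lmr_ctx, buf, peer_rmr_ctx and peer_addr are placeholders, and the triplet field names assume the DAT 1.2 dat.h layout, so check the installed headers before copying anything.

#include <stdint.h>
#include <dat/udat.h>

/* Sketch: one RDMA Write gathering from a two-element local_iov. */
static DAT_RETURN write_two_segments(DAT_EP_HANDLE ep, DAT_LMR_CONTEXT lmr_ctx,
				     char *buf, DAT_RMR_CONTEXT peer_rmr_ctx,
				     DAT_VADDR peer_addr)
{
	DAT_LMR_TRIPLET l_iov[2];
	DAT_RMR_TRIPLET r_iov;
	DAT_DTO_COOKIE  cookie;

	/* Two 5-byte local segments, transferred as one 10-byte RDMA Write. */
	l_iov[0].lmr_context     = lmr_ctx;		/* from dat_lmr_create() */
	l_iov[0].virtual_address = (DAT_VADDR) (uintptr_t) buf;
	l_iov[0].segment_length  = 5;
	l_iov[1].lmr_context     = lmr_ctx;
	l_iov[1].virtual_address = (DAT_VADDR) (uintptr_t) (buf + 5);
	l_iov[1].segment_length  = 5;

	/* Remote buffer advertised by the peer (its LMR's rmr_context). */
	r_iov.rmr_context    = peer_rmr_ctx;
	r_iov.target_address = peer_addr;
	r_iov.segment_length = 10;

	cookie.as_64 = 1;
	return dat_ep_post_rdma_write(ep, 2 /* num_segments */, l_iov, cookie,
				      &r_iov, DAT_COMPLETION_DEFAULT_FLAG);
}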
EP_ATTR: the same for both nodes: ---------------------------------- max_message_size=2147483648 max_rdma_size=2147483648 max_recv_dtos=16 max_request_dtos=16 max_recv_iov=4 max_request_iov=4 max_rdma_read_in=4 max_rdma_read_out=4 srq_soft_hw=0 max_rdma_read_iov=10 max_rdma_write_iov=10 ep_transport_specific_count=0 ep_provider_specific_count=0 ---------------------------------- IA_ATTR: different for nodes ---------------------------------- IA Info: max_eps=64512 max_dto_per_ep=65535 max_rdma_read_per_ep_in=4 max_rdma_read_per_ep_out=1610616831 max_evds=65408 max_evd_qlen=131071 max_iov_segments_per_dto=28 max_lmrs=131056 max_lmr_block_size=18446744073709551615 max_pzs=32768 max_message_size=2147483648 max_rdma_size=2147483648 max_rmrs=0 max_srqs=0 max_ep_per_srq=0 max_recv_per_srq=143263 max_iov_segments_per_rdma_read=1073741824 max_iov_segments_per_rdma_write=0 max_rdma_read_in=0 max_rdma_read_out=65535 max_rdma_read_per_ep_in_guaranteed=7286 max_rdma_read_per_ep_out_guaranteed=0 IA Info: max_eps=64512 max_dto_per_ep=65535 max_rdma_read_per_ep_in=4 max_rdma_read_per_ep_out=0 max_evds=65408 max_evd_qlen=131071 max_iov_segments_per_dto=28 max_lmrs=131056 max_lmr_block_size=18446744073709551615 max_pzs=32768 max_message_size=2147483648 max_rdma_size=2147483648 max_rmrs=0 max_srqs=0 max_ep_per_srq=0 max_recv_per_srq=142247 max_iov_segments_per_rdma_read=1073741824 max_iov_segments_per_rdma_write=0 max_rdma_read_in=0 max_rdma_read_out=65535 max_rdma_read_per_ep_in_guaranteed=7286 max_rdma_read_per_ep_out_guaranteed=28 -------------- next part -------------- An HTML attachment was scrubbed... URL: From arjan at linux.intel.com Fri Aug 11 09:15:48 2006 From: arjan at linux.intel.com (Arjan van de Ven) Date: Fri, 11 Aug 2006 09:15:48 -0700 Subject: [openib-general] [PATCH] Fix potential deadlock in mthca In-Reply-To: References: Message-ID: <44DCAD34.5040502@linux.intel.com> Roland Dreier wrote: > Here's a long-standing bug that lockdep found very nicely. > > Ingo/Arjan, can you confirm that the fix looks OK and I am using > spin_lock_nested() properly? I couldn't find much documentation or > many examples of it, so I'm not positive this is the right way to > handle this fix. > looks correct to me; Acked-by: Arjan van de Ven From pi3orama at gmail.com Fri Aug 11 09:45:49 2006 From: pi3orama at gmail.com (Nan Wang) Date: Sat, 12 Aug 2006 00:45:49 +0800 Subject: [openib-general] Where is the programming reference manual of IBG2? Message-ID: <5a9c53c90608110945y58a9deady85cabcb3590ee1c1@mail.gmail.com> Hi, I'm new to IBG2, and I find there are only Release notes in the package. Is there any "programming reference manual" showing the basic programming concepts of IBG2 or OFED? Must I read the sample code and the source code to learn it? From wangnan06 at ict.ac.cn Fri Aug 11 11:09:28 2006 From: wangnan06 at ict.ac.cn (wangnan06 at ict.ac.cn) Date: Sat, 12 Aug 2006 02:09:28 +0800 (CST) Subject: [openib-general] Where is the programming reference manual of IBG2(OFED)? Message-ID: <3544.159.226.195.142.1155319768.squirrel@webmail.ict.ac.cn> Hi, I'm new to IBG2 programming. It seems that there's only "Release Notes" in the package. I need some basic documents about programming with IBG2 or OFED. The specification is too abstract to be useful for programming. I searched Google but found nothing. Can anyone help? Must I read the sample code and source code to learn it? 
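For questions like the two above, the closest things to a programming guide at this point are the example programs shipped with libibverbs (ibv_devinfo, ibv_rc_pingpong and friends) and the header <infiniband/verbs.h> itself. As a minimal sketch of the basic flow, assuming only stock libibverbs and linking with -libverbs:

/*
 * Minimal verbs example: list the RDMA devices, open the first one and
 * print a few of its limits.  Build with: gcc -o devquery devquery.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device      **dev_list;
	struct ibv_context      *ctx;
	struct ibv_device_attr   attr;

	dev_list = ibv_get_device_list(NULL);
	if (!dev_list || !dev_list[0]) {
		fprintf(stderr, "No RDMA devices found\n");
		return 1;
	}

	ctx = ibv_open_device(dev_list[0]);
	if (!ctx) {
		fprintf(stderr, "Couldn't open %s\n",
			ibv_get_device_name(dev_list[0]));
		return 1;
	}

	if (!ibv_query_device(ctx, &attr))
		printf("%s: max_qp %d, max_cqe %d, max_mr %d\n",
		       ibv_get_device_name(dev_list[0]),
		       attr.max_qp, attr.max_cqe, attr.max_mr);

	ibv_close_device(ctx);
	ibv_free_device_list(dev_list);
	return 0;
}

From there the usual sequence is ibv_alloc_pd(), ibv_reg_mr(), ibv_create_cq(), ibv_create_qp() and connection setup over the CM or an out-of-band exchange; the pingpong examples walk through all of those steps.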
From ardavis at ichips.intel.com Fri Aug 11 13:14:01 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 11 Aug 2006 13:14:01 -0700 Subject: [openib-general] DAPL and local_iov in RDMA RR/RW mode In-Reply-To: <025801c6bd5f$f77b8d10$3b388d80@cern.ch> References: <025801c6bd5f$f77b8d10$3b388d80@cern.ch> Message-ID: <44DCE509.5090005@ichips.intel.com> Ryszard Jurga wrote: > Hi everybody, > > I have one question about a number of segments in local_iov when using > RDMA Write and Read mode. Is it possible to have num_segments>1? I > am asking, because when I try to set up num_segments to a value > 1, > then I can still only read/write one segment, even though I have an > appropriate remote buffer already reserved. The size of transfered > buffer is 10bytes, num_segs=2. The information, which is printed > below, was obrained from network devices with one remark - I have set > up manualy max_rdma_read_iov=10 and max_rdma_write_iov=10. Thank you > in advance for your help. > Yes, uDAPL will support num_segments up to the max counts returned on the ep_attr. Can you be more specific? Does the post return immediate errors or are you simply missing data on the remote node? Can you turn up the uDAPL debug switch (export DAPL_DBG_TYPE=0xffff) and send output of the post call? -arlin > Best regards, > Ryszard. > > > EP_ATTR: the same for both nodes: > ---------------------------------- > max_message_size=2147483648 > max_rdma_size=2147483648 > max_recv_dtos=16 > max_request_dtos=16 > max_recv_iov=4 > max_request_iov=4 > max_rdma_read_in=4 > max_rdma_read_out=4 > srq_soft_hw=0 > max_rdma_read_iov=10 > max_rdma_write_iov=10 > ep_transport_specific_count=0 > ep_provider_specific_count=0 > ---------------------------------- > > > IA_ATTR: different for nodes > ---------------------------------- > IA Info: > max_eps=64512 > max_dto_per_ep=65535 > max_rdma_read_per_ep_in=4 > max_rdma_read_per_ep_out=1610616831 > max_evds=65408 > max_evd_qlen=131071 > max_iov_segments_per_dto=28 > max_lmrs=131056 > max_lmr_block_size=18446744073709551615 > max_pzs=32768 > max_message_size=2147483648 > max_rdma_size=2147483648 > max_rmrs=0 > max_srqs=0 > max_ep_per_srq=0 > max_recv_per_srq=143263 > max_iov_segments_per_rdma_read=1073741824 > max_iov_segments_per_rdma_write=0 > max_rdma_read_in=0 > max_rdma_read_out=65535 > max_rdma_read_per_ep_in_guaranteed=7286 > max_rdma_read_per_ep_out_guaranteed=0 > > IA Info: > max_eps=64512 > max_dto_per_ep=65535 > max_rdma_read_per_ep_in=4 > max_rdma_read_per_ep_out=0 > max_evds=65408 > max_evd_qlen=131071 > max_iov_segments_per_dto=28 > max_lmrs=131056 > max_lmr_block_size=18446744073709551615 > max_pzs=32768 > max_message_size=2147483648 > max_rdma_size=2147483648 > max_rmrs=0 > max_srqs=0 > max_ep_per_srq=0 > max_recv_per_srq=142247 > max_iov_segments_per_rdma_read=1073741824 > max_iov_segments_per_rdma_write=0 > max_rdma_read_in=0 > max_rdma_read_out=65535 > max_rdma_read_per_ep_in_guaranteed=7286 > max_rdma_read_per_ep_out_guaranteed=28 > >------------------------------------------------------------------------ > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ralphc at pathscale.com Fri Aug 11 14:54:36 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 11 Aug 2006 14:54:36 -0700 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of 
queues Message-ID: <1155333276.20325.422.camel@brick.pathscale.com> The following patches update libibverbs, libmthca, libipathverbs, and the kernel ib_core, ib_mthca, ib_ehca, and ib_ipath modules in order to improve performance on QLogic InfiniPath HCAs. The completion queue and receive queues are now mmap'ed into the user's address space. The kernel changes are compatible with either the new or old user library and the new user library is compatible with the old or new kernel driver. The patches should be applied to the SVN trunk except for the InfiniPath kernel driver patch (last one) which is applied to the kernel git tree. These patches have been posted earlier for review. This posting should be considered ready for inclusion. Allow the driver plug-in library to return additional data in the response from ibv_cmd_resize_cq(). Signed-off-by: Ralph Campbell Index: src/userspace/libibverbs/include/infiniband/driver.h =================================================================== --- src/userspace/libibverbs/include/infiniband/driver.h (revision 8843) +++ src/userspace/libibverbs/include/infiniband/driver.h (working copy) @@ -95,7 +95,8 @@ int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size); + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size); int ibv_cmd_destroy_cq(struct ibv_cq *cq); int ibv_cmd_create_srq(struct ibv_pd *pd, Index: src/userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- src/userspace/libibverbs/include/infiniband/kern-abi.h (revision 8843) +++ src/userspace/libibverbs/include/infiniband/kern-abi.h (working copy) @@ -355,6 +355,8 @@ struct ibv_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ibv_destroy_cq { Index: src/userspace/libibverbs/src/cmd.c =================================================================== --- src/userspace/libibverbs/src/cmd.c (revision 8843) +++ src/userspace/libibverbs/src/cmd.c (working copy) @@ -368,18 +368,18 @@ } int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size) + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size) { - struct ibv_resize_cq_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); + IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, resp, resp_size); cmd->cq_handle = cq->handle; cmd->cqe = cqe; if (write(cq->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; - cq->cqe = resp.cqe; + cq->cqe = resp->cqe; return 0; } From ralphc at pathscale.com Fri Aug 11 14:55:38 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 11 Aug 2006 14:55:38 -0700 Subject: [openib-general] [PATCH 2/7] IB/ipath - performance improvements via mmap of queues Message-ID: <1155333339.20325.424.camel@brick.pathscale.com> MTHCA changes to correspond to the libibverbs changes. 
Signed-off-by: Ralph Campbell Index: src/userspace/libmthca/src/verbs.c =================================================================== --- src/userspace/libmthca/src/verbs.c (revision 8843) +++ src/userspace/libmthca/src/verbs.c (working copy) @@ -262,6 +262,7 @@ { struct mthca_cq *cq = to_mcq(ibcq); struct mthca_resize_cq cmd; + struct ibv_resize_cq_resp resp; struct ibv_mr *mr; struct mthca_buf buf; int old_cqe; @@ -297,7 +298,8 @@ old_cqe = ibcq->cqe; cmd.lkey = mr->lkey; - ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd); + ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd, + &resp, sizeof resp); if (ret) { mthca_dereg_mr(mr); mthca_free_buf(&buf); From ralphc at pathscale.com Fri Aug 11 14:57:02 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 11 Aug 2006 14:57:02 -0700 Subject: [openib-general] [PATCH 3/7] IB/ipath - performance improvements via mmap of queues Message-ID: <1155333423.20325.427.camel@brick.pathscale.com> Improve the performance of the userspace verbs ibv_poll_cq(), ibv_post_recv(), and ibv_post_srq_recv(). Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- src/userspace/libipathverbs/src/verbs.c (revision 8843) +++ src/userspace/libipathverbs/src/verbs.c (working copy) @@ -1,4 +1,5 @@ /* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. * Copyright (c) 2005. PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -43,8 +44,11 @@ #include #include #include +#include +#include #include "ipathverbs.h" +#include "ipath-abi.h" int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr) @@ -54,7 +58,8 @@ unsigned major, minor, sub_minor; int ret; - ret = ibv_cmd_query_device(context, attr, &raw_fw_ver, &cmd, sizeof cmd); + ret = ibv_cmd_query_device(context, attr, &raw_fw_ver, + &cmd, sizeof cmd); if (ret) return ret; @@ -142,55 +147,147 @@ struct ibv_comp_channel *channel, int comp_vector) { - struct ibv_cq *cq; - struct ibv_create_cq cmd; - struct ibv_create_cq_resp resp; - int ret; + struct ipath_cq *cq; + struct ibv_create_cq cmd; + struct ipath_create_cq_resp resp; + int ret; + size_t size; cq = malloc(sizeof *cq); if (!cq) return NULL; - ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, cq, - &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, + &cq->ibv_cq, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(cq); return NULL; } - return cq; + size = sizeof(struct ipath_cq_wc) + sizeof(struct ipath_wc) * cqe; + cq->queue = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + context->cmd_fd, resp.offset); + if ((void *) cq->queue == MAP_FAILED) { + free(cq); + return NULL; + } + + pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE); + return &cq->ibv_cq; } -int ipath_destroy_cq(struct ibv_cq *cq) +int ipath_resize_cq(struct ibv_cq *ibcq, int cqe) { + struct ipath_cq *cq = to_icq(ibcq); + struct ibv_resize_cq cmd; + struct ipath_resize_cq_resp resp; + size_t size; + int ret; + + pthread_spin_lock(&cq->lock); + /* Save the old size so we can unmmap the queue. 
*/ + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + ret = ibv_cmd_resize_cq(ibcq, cqe, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) { + pthread_spin_unlock(&cq->lock); + return ret; + } + (void) munmap(cq->queue, size); + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + cq->queue = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + ibcq->context->cmd_fd, resp.offset); + ret = errno; + pthread_spin_unlock(&cq->lock); + if ((void *) cq->queue == MAP_FAILED) + return ret; + return 0; +} + +int ipath_destroy_cq(struct ibv_cq *ibcq) +{ + struct ipath_cq *cq = to_icq(ibcq); int ret; - ret = ibv_cmd_destroy_cq(cq); + ret = ibv_cmd_destroy_cq(ibcq); if (ret) return ret; + (void) munmap(cq->queue, sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe)); free(cq); return 0; } +int ipath_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +{ + struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *q; + int npolled; + uint32_t tail; + + pthread_spin_lock(&cq->lock); + q = cq->queue; + tail = q->tail; + for (npolled = 0; npolled < ne; ++npolled, ++wc) { + if (tail == q->head) + break; + memcpy(wc, &q->queue[tail], sizeof(*wc)); + if (tail == cq->ibv_cq.cqe) + tail = 0; + else + tail++; + } + q->tail = tail; + pthread_spin_unlock(&cq->lock); + + return npolled; +} + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) { - struct ibv_create_qp cmd; - struct ibv_create_qp_resp resp; - struct ibv_qp *qp; - int ret; + struct ibv_create_qp cmd; + struct ipath_create_qp_resp resp; + struct ipath_qp *qp; + int ret; + size_t size; qp = malloc(sizeof *qp); if (!qp) return NULL; - ret = ibv_cmd_create_qp(pd, qp, attr, &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(qp); return NULL; } - return qp; + if (attr->srq) { + qp->rq.size = 0; + qp->rq.max_sge = 0; + qp->rq.rwq = NULL; + } else { + qp->rq.size = attr->cap.max_recv_wr + 1; + qp->rq.max_sge = attr->cap.max_recv_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + qp->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) qp->rq.rwq == MAP_FAILED) { + free(qp); + return NULL; + } + } + + pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &qp->ibv_qp; } int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, @@ -211,47 +308,152 @@ return ibv_cmd_modify_qp(qp, attr, attr_mask, &cmd, sizeof cmd); } -int ipath_destroy_qp(struct ibv_qp *qp) +int ipath_destroy_qp(struct ibv_qp *ibqp) { + struct ipath_qp *qp = to_iqp(ibqp); int ret; - ret = ibv_cmd_destroy_qp(qp); + ret = ibv_cmd_destroy_qp(ibqp); if (ret) return ret; + if (qp->rq.rwq) { + size_t size; + + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + (void) munmap(qp->rq.rwq, size); + } free(qp); return 0; } +static int post_recv(struct ipath_rq *rq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ibv_recv_wr *i; + struct ipath_rwq *rwq; + struct ipath_rwqe *wqe; + uint32_t head; + int n, ret; + + pthread_spin_lock(&rq->lock); + rwq = rq->rwq; + head = rwq->head; + for (i = wr; i; i = i->next) { + if ((unsigned) i->num_sge > rq->max_sge) + goto bad; + wqe = get_rwqe_ptr(rq, head); + if (++head >= 
rq->size) + head = 0; + if (head == rwq->tail) + goto bad; + wqe->wr_id = i->wr_id; + wqe->num_sge = i->num_sge; + for (n = 0; n < wqe->num_sge; n++) + wqe->sg_list[n] = i->sg_list[n]; + rwq->head = head; + } + ret = 0; + goto done; + +bad: + ret = -ENOMEM; + if (bad_wr) + *bad_wr = i; +done: + pthread_spin_unlock(&rq->lock); + return ret; +} + +int ipath_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_qp *qp = to_iqp(ibqp); + + return post_recv(&qp->rq, wr, bad_wr); +} + struct ibv_srq *ipath_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr) { - struct ibv_srq *srq; + struct ipath_srq *srq; struct ibv_create_srq cmd; - struct ibv_create_srq_resp resp; + struct ipath_create_srq_resp resp; int ret; + size_t size; srq = malloc(sizeof *srq); if (srq == NULL) return NULL; - ret = ibv_cmd_create_srq(pd, srq, attr, &cmd, sizeof cmd, - &resp, sizeof resp); + ret = ibv_cmd_create_srq(pd, &srq->ibv_srq, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(srq); return NULL; } - return srq; + srq->rq.size = attr->attr.max_wr + 1; + srq->rq.max_sge = attr->attr.max_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + srq->rq.rwq = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) srq->rq.rwq == MAP_FAILED) { + free(srq); + return NULL; + } + + pthread_spin_init(&srq->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &srq->ibv_srq; } -int ipath_modify_srq(struct ibv_srq *srq, +int ipath_modify_srq(struct ibv_srq *ibsrq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask) { - struct ibv_modify_srq cmd; + struct ipath_srq *srq = to_isrq(ibsrq); + struct ipath_modify_srq_cmd cmd; + __u64 offset; + size_t size; + int ret; - return ibv_cmd_modify_srq(srq, attr, attr_mask, &cmd, sizeof cmd); + if (attr_mask & IBV_SRQ_MAX_WR) { + pthread_spin_lock(&srq->rq.lock); + /* Save the old size so we can unmmap the queue. */ + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + } + cmd.offset_addr = (__u64) &offset; + ret = ibv_cmd_modify_srq(ibsrq, attr, attr_mask, + &cmd.ibv_cmd, sizeof cmd); + if (ret) { + if (attr_mask & IBV_SRQ_MAX_WR) + pthread_spin_unlock(&srq->rq.lock); + return ret; + } + if (attr_mask & IBV_SRQ_MAX_WR) { + (void) munmap(srq->rq.rwq, size); + srq->rq.size = attr->max_wr + 1; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + srq->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + ibsrq->context->cmd_fd, offset); + pthread_spin_unlock(&srq->rq.lock); + /* XXX Now we have no receive queue. 
*/ + if ((void *) srq->rq.rwq == MAP_FAILED) + return errno; + } + return 0; } int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr) @@ -261,18 +463,32 @@ return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); } -int ipath_destroy_srq(struct ibv_srq *srq) +int ipath_destroy_srq(struct ibv_srq *ibsrq) { + struct ipath_srq *srq = to_isrq(ibsrq); + size_t size; int ret; - ret = ibv_cmd_destroy_srq(srq); + ret = ibv_cmd_destroy_srq(ibsrq); if (ret) return ret; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + (void) munmap(srq->rq.rwq, size); free(srq); return 0; } +int ipath_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + + return post_recv(&srq->rq, wr, bad_wr); +} + struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) { struct ibv_ah *ah; Index: src/userspace/libipathverbs/src/ipath-abi.h =================================================================== --- src/userspace/libipathverbs/src/ipath-abi.h (revision 0) +++ src/userspace/libipathverbs/src/ipath-abi.h (revision 0) @@ -0,0 +1,72 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ */ + +#ifndef IPATH_ABI_H +#define IPATH_ABI_H + +#include + +struct ipath_get_context_resp { + struct ibv_get_context_resp ibv_resp; + __u32 version; +}; + +struct ipath_create_cq_resp { + struct ibv_create_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_resize_cq_resp { + struct ibv_resize_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; + __u64 offset; +}; + +struct ipath_create_srq_resp { + struct ibv_create_srq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_modify_srq_cmd { + struct ibv_modify_srq ibv_cmd; + __u64 offset_addr; +}; + +#endif /* IPATH_ABI_H */ Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8843) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -1,4 +1,5 @@ /* + * Copyright (C) 2006 QLogic Corporation, All rights reserved. * Copyright (c) 2005. PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -43,6 +44,7 @@ #include #include "ipathverbs.h" +#include "ipath-abi.h" #ifndef PCI_VENDOR_ID_PATHSCALE #define PCI_VENDOR_ID_PATHSCALE 0x1fc1 @@ -86,22 +88,25 @@ .dereg_mr = ipath_dereg_mr, .create_cq = ipath_create_cq, - .poll_cq = ibv_cmd_poll_cq, + .poll_cq = ipath_poll_cq, .req_notify_cq = ibv_cmd_req_notify_cq, .cq_event = NULL, + .resize_cq = ipath_resize_cq, .destroy_cq = ipath_destroy_cq, .create_srq = ipath_create_srq, .modify_srq = ipath_modify_srq, + .query_srq = ipath_query_srq, .destroy_srq = ipath_destroy_srq, - .post_srq_recv = ibv_cmd_post_srq_recv, + .post_srq_recv = ipath_post_srq_recv, .create_qp = ipath_create_qp, + .query_qp = ipath_query_qp, .modify_qp = ipath_modify_qp, .destroy_qp = ipath_destroy_qp, .post_send = ibv_cmd_post_send, - .post_recv = ibv_cmd_post_recv, + .post_recv = ipath_post_recv, .create_ah = ipath_create_ah, .destroy_ah = ipath_destroy_ah, @@ -116,6 +121,7 @@ struct ipath_context *context; struct ibv_get_context cmd; struct ibv_get_context_resp resp; + struct ipath_device *dev; context = malloc(sizeof *context); if (!context) @@ -126,6 +132,12 @@ goto err_free; context->ibv_ctx.ops = ipath_ctx_ops; + dev = to_idev(ibdev); + if (dev->abi_version == 1) { + context->ibv_ctx.ops.poll_cq = ibv_cmd_poll_cq; + context->ibv_ctx.ops.post_srq_recv = ibv_cmd_post_srq_recv; + context->ibv_ctx.ops.post_recv = ibv_cmd_post_recv; + } return &context->ibv_ctx; err_free: @@ -180,6 +192,7 @@ dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = hca_table[i].type; + dev->abi_version = abi_version; return &dev->ibv_dev; } Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (revision 8843) +++ src/userspace/libipathverbs/src/ipathverbs.h (working copy) @@ -1,4 +1,5 @@ /* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. * Copyright (c) 2005. PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -39,6 +40,7 @@ #include #include +#include #include #include @@ -57,12 +59,88 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; + int abi_version; }; struct ipath_context { struct ibv_context ibv_ctx; }; +/* + * This structure needs to have the same size and offsets as + * the kernel's ib_wc structure since it is memory mapped. 
+ */ +struct ipath_wc { + uint64_t wr_id; + enum ibv_wc_status status; + enum ibv_wc_opcode opcode; + uint32_t vendor_err; + uint32_t byte_len; + uint32_t imm_data; /* in network byte order */ + uint32_t qp_num; + uint32_t src_qp; + enum ibv_wc_flags wc_flags; + uint16_t pkey_index; + uint16_t slid; + uint8_t sl; + uint8_t dlid_path_bits; + uint8_t port_num; +}; + +struct ipath_cq_wc { + uint32_t head; + uint32_t tail; + struct ipath_wc queue[1]; +}; + +struct ipath_cq { + struct ibv_cq ibv_cq; + struct ipath_cq_wc *queue; + pthread_spinlock_t lock; +}; + +/* + * Receive work request queue entry. + * The size of the sg_list is determined when the QP is created and stored + * in qp->r_max_sge. + */ +struct ipath_rwqe { + uint64_t wr_id; + uint8_t num_sge; + struct ibv_sge sg_list[0]; +}; + +/* + * This struture is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { + uint32_t head; /* new requests posted to the head */ + uint32_t tail; /* receives pull requests from here. */ + struct ipath_rwqe wq[0]; +}; + +struct ipath_rq { + struct ipath_rwq *rwq; + pthread_spinlock_t lock; + uint32_t size; + uint32_t max_sge; +}; + +struct ipath_qp { + struct ibv_qp ibv_qp; + struct ipath_rq rq; +}; + +struct ipath_srq { + struct ibv_srq ibv_srq; + struct ipath_rq rq; +}; + #define to_ixxx(xxx, type) \ ((struct ipath_##type *) \ ((void *) ib##xxx - offsetof(struct ipath_##type, ibv_##xxx))) @@ -72,6 +150,39 @@ return to_ixxx(ctx, context); } +static inline struct ipath_device *to_idev(struct ibv_device *ibdev) +{ + return to_ixxx(dev, device); +} + +static inline struct ipath_cq *to_icq(struct ibv_cq *ibcq) +{ + return to_ixxx(cq, cq); +} + +static inline struct ipath_qp *to_iqp(struct ibv_qp *ibqp) +{ + return to_ixxx(qp, qp); +} + +static inline struct ipath_srq *to_isrq(struct ibv_srq *ibsrq) +{ + return to_ixxx(srq, srq); +} + +/* + * Since struct ipath_rwqe is not a fixed size, we can't simply index into + * struct ipath_rq.wq. This function does the array index computation. 
+ */ +static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, + unsigned n) +{ + return (struct ipath_rwqe *) + ((char *) rq->rwq->wq + + (sizeof(struct ipath_rwqe) + + rq->max_sge * sizeof(struct ibv_sge)) * n); +} + extern int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr); @@ -91,8 +202,12 @@ struct ibv_comp_channel *channel, int comp_vector); +int ipath_resize_cq(struct ibv_cq *cq, int cqe); + int ipath_destroy_cq(struct ibv_cq *cq); +int ipath_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); @@ -122,6 +237,9 @@ int ipath_destroy_srq(struct ibv_srq *srq); +int ipath_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); + struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); int ipath_destroy_ah(struct ibv_ah *ah); From ralphc at pathscale.com Fri Aug 11 14:58:09 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 11 Aug 2006 14:58:09 -0700 Subject: [openib-general] [PATCH 4/7] IB/ipath - performance improvements via mmap of queues Message-ID: <1155333489.20325.429.camel@brick.pathscale.com> Allow the driver plug-in library to return additional data in the response from ibv_cmd_resize_cq(). Also, allow the user library to pass additional information to ib_modify_qp() and ib_modify_srq(). Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h =================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (revision 8843) +++ src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -275,6 +275,8 @@ struct ib_uverbs_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ib_uverbs_poll_cq { Index: src/linux-kernel/infiniband/include/rdma/ib_verbs.h =================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 8843) +++ src/linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy) @@ -911,7 +911,8 @@ struct ib_udata *udata); int (*modify_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr, - enum ib_srq_attr_mask srq_attr_mask); + enum ib_srq_attr_mask srq_attr_mask, + struct ib_udata *udata); int (*query_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int (*destroy_srq)(struct ib_srq *srq); @@ -923,7 +924,8 @@ struct ib_udata *udata); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask); + int qp_attr_mask, + struct ib_udata *udata); int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, Index: src/linux-kernel/infiniband/core/verbs.c =================================================================== --- src/linux-kernel/infiniband/core/verbs.c (revision 8843) +++ src/linux-kernel/infiniband/core/verbs.c (working copy) @@ -231,7 +231,7 @@ struct ib_srq_attr *srq_attr, enum ib_srq_attr_mask srq_attr_mask) { - return srq->device->modify_srq(srq, srq_attr, srq_attr_mask); + return srq->device->modify_srq(srq, srq_attr, srq_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_srq); @@ -547,7 +547,7 @@ struct ib_qp_attr *qp_attr, int qp_attr_mask) { - return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_qp); Index: src/linux-kernel/infiniband/core/uverbs_cmd.c =================================================================== 
--- src/linux-kernel/infiniband/core/uverbs_cmd.c (revision 8843) +++ src/linux-kernel/infiniband/core/uverbs_cmd.c (working copy) @@ -829,7 +829,6 @@ err_copy: idr_remove_uobj(&ib_uverbs_cq_idr, &obj->uobject); - err_free: ib_destroy_cq(cq); @@ -1264,6 +1263,7 @@ int out_len) { struct ib_uverbs_modify_qp cmd; + struct ib_udata udata; struct ib_qp *qp; struct ib_qp_attr *attr; int ret; @@ -1271,6 +1271,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) return -ENOMEM; @@ -1327,7 +1330,7 @@ attr->alt_ah_attr.ah_flags = cmd.alt_dest.is_global ? IB_AH_GRH : 0; attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; - ret = ib_modify_qp(qp, attr, cmd.attr_mask); + ret = qp->device->modify_qp(qp, attr, cmd.attr_mask, &udata); put_qp_read(qp); @@ -2045,6 +2048,7 @@ int out_len) { struct ib_uverbs_modify_srq cmd; + struct ib_udata udata; struct ib_srq *srq; struct ib_srq_attr attr; int ret; @@ -2052,6 +2056,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); if (!srq) return -EINVAL; @@ -2059,7 +2066,7 @@ attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; - ret = ib_modify_srq(srq, &attr, cmd.attr_mask); + ret = srq->device->modify_srq(srq, &attr, cmd.attr_mask, &udata); put_srq_read(srq); From ralphc at pathscale.com Fri Aug 11 14:59:09 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 11 Aug 2006 14:59:09 -0700 Subject: [openib-general] [PATCH 5/7] IB/ipath - performance improvements via mmap of queues Message-ID: <1155333549.20325.430.camel@brick.pathscale.com> EHCA changes to correspond to the ib_core module changes. Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/hw/ehca/ehca_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ehca/ehca_qp.c (revision 8843) +++ src/linux-kernel/infiniband/hw/ehca/ehca_qp.c (working copy) @@ -1288,7 +1288,8 @@ return ret; } -int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) { int ret = 0; struct ehca_qp *my_qp = NULL; Index: src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h (revision 8843) +++ src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h (working copy) @@ -143,7 +143,8 @@ int ehca_destroy_qp(struct ib_qp *qp); -int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata); int ehca_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); From weiny2 at llnl.gov Fri Aug 11 14:56:20 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 11 Aug 2006 14:56:20 -0700 Subject: [openib-general] [PATCH] make "prefix" the default where to find libraries. Message-ID: <20060811145620.7e9a762d.weiny2@llnl.gov> I have been using this patch to allow me to install somewhere other than /usr/local. Without this the dependancies do not work out right and here at LLNL /usr/local is a NFS mounted volume. Not good for building and testing on a single node. 
While I am not a configure expert I think this would make things easier for building the trunk. Ira -------------- next part -------------- A non-text attachment was scrubbed... Name: trunk-prefix-fix.patch Type: application/octet-stream Size: 5570 bytes Desc: not available URL: From ralphc at pathscale.com Fri Aug 11 14:59:57 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 11 Aug 2006 14:59:57 -0700 Subject: [openib-general] [PATCH 6/7] IB/ipath - performance improvements via mmap of queues Message-ID: <1155333597.20325.432.camel@brick.pathscale.com> MTHCA changes to correspond to the ib_core module changes. Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/hw/mthca/mthca_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (revision 8843) +++ src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (working copy) @@ -358,7 +358,7 @@ } int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibsrq->device); struct mthca_srq *srq = to_msrq(ibsrq); Index: src/linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 8843) +++ src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -506,7 +506,7 @@ struct ib_srq_attr *attr, struct mthca_srq *srq); void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata); int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int mthca_max_srq_sge(struct mthca_dev *dev); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, @@ -521,7 +521,8 @@ enum ib_event_type event_type); int mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata); int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, Index: src/linux-kernel/infiniband/hw/mthca/mthca_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (revision 8843) +++ src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -522,7 +522,8 @@ return 0; } -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); From ralphc at pathscale.com Fri Aug 11 15:01:16 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 11 Aug 2006 15:01:16 -0700 Subject: [openib-general] [PATCH 7/7] IB/ipath - performance improvements via mmap of queues Message-ID: <1155333676.20325.435.camel@brick.pathscale.com> Improve the performance of the userspace verbs ibv_poll_cq(), ibv_post_recv(), and ibv_post_srq_recv(). 
The driver now mmaps the completion queue and receive queues into the user's address space; the userspace libipathverbs.so library has been modified to take advantage of this. These changes are backward compatible with the old libipathverbs.so. Signed-off-by: Ralph Campbell diff -r dcc321d1340a drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/Makefile Fri Aug 11 13:14:18 2006 -0700 @@ -25,6 +25,7 @@ ib_ipath-y := \ ipath_cq.o \ ipath_keys.o \ ipath_mad.o \ + ipath_mmap.o \ ipath_mr.o \ ipath_qp.o \ ipath_rc.o \ diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_cq.c --- a/drivers/infiniband/hw/ipath/ipath_cq.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_cq.c Fri Aug 11 13:32:44 2006 -0700 @@ -42,20 +42,28 @@ * @entry: work completion entry to add * @sig: true if @entry is a solicitated entry * - * This may be called with one of the qp->s_lock or qp->r_rq.lock held. + * This may be called with qp->s_lock held. */ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited) { + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; + u32 head; u32 next; spin_lock_irqsave(&cq->lock, flags); - if (cq->head == cq->ibcq.cqe) + /* + * Note that the head pointer might be writable by user processes. + * Take care to verify it is a sane value. + */ + head = wc->head; + if (head >= (unsigned) cq->ibcq.cqe) { + head = cq->ibcq.cqe; next = 0; - else - next = cq->head + 1; - if (unlikely(next == cq->tail)) { + } else + next = head + 1; + if (unlikely(next == wc->tail)) { spin_unlock_irqrestore(&cq->lock, flags); if (cq->ibcq.event_handler) { struct ib_event ev; @@ -67,8 +75,8 @@ void ipath_cq_enter(struct ipath_cq *cq, } return; } - cq->queue[cq->head] = *entry; - cq->head = next; + wc->queue[head] = *entry; + wc->head = next; if (cq->notify == IB_CQ_NEXT_COMP || (cq->notify == IB_CQ_SOLICITED && solicited)) { @@ -101,19 +109,20 @@ int ipath_poll_cq(struct ib_cq *ibcq, in int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) { struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; int npolled; spin_lock_irqsave(&cq->lock, flags); for (npolled = 0; npolled < num_entries; ++npolled, ++entry) { - if (cq->tail == cq->head) + if (wc->tail == wc->head) break; - *entry = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + *entry = wc->queue[wc->tail]; + if (wc->tail >= cq->ibcq.cqe) + wc->tail = 0; else - cq->tail++; + wc->tail++; } spin_unlock_irqrestore(&cq->lock, flags); @@ -160,38 +169,74 @@ struct ib_cq *ipath_create_cq(struct ib_ { struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_cq *cq; - struct ib_wc *wc; + struct ipath_cq_wc *wc; struct ib_cq *ret; if (entries > ib_ipath_max_cqes) { ret = ERR_PTR(-EINVAL); - goto bail; + goto done; } if (dev->n_cqs_allocated == ib_ipath_max_cqs) { ret = ERR_PTR(-ENOMEM); - goto bail; - } - - /* - * Need to use vmalloc() if we want to support large #s of - * entries. - */ + goto done; + } + + /* Allocate the completion queue structure. */ cq = kmalloc(sizeof(*cq), GFP_KERNEL); if (!cq) { ret = ERR_PTR(-ENOMEM); - goto bail; - } - - /* - * Need to use vmalloc() if we want to support large #s of entries. - */ - wc = vmalloc(sizeof(*wc) * (entries + 1)); + goto done; + } + + /* + * Allocate the completion queue entries and head/tail pointers. + * This is allocated separately so that it can be resized and + * also mapped into user space. 
+ * We need to use vmalloc() in order to support mmap and large + * numbers of entries. + */ + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * entries); if (!wc) { - kfree(cq); ret = ERR_PTR(-ENOMEM); - goto bail; - } + goto bail_cq; + } + + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->outlen >= sizeof(__u64)) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) wc; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto bail_wc; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto bail_wc; + } + cq->ip = ip; + ip->context = context; + ip->obj = wc; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * entries); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + cq->ip = NULL; + /* * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe. * The number of entries should be >= the number requested or return @@ -202,15 +247,22 @@ struct ib_cq *ipath_create_cq(struct ib_ cq->triggered = 0; spin_lock_init(&cq->lock); tasklet_init(&cq->comptask, send_complete, (unsigned long)cq); - cq->head = 0; - cq->tail = 0; + wc->head = 0; + wc->tail = 0; cq->queue = wc; ret = &cq->ibcq; dev->n_cqs_allocated++; - -bail: + goto done; + +bail_wc: + vfree(wc); + +bail_cq: + kfree(cq); + +done: return ret; } @@ -229,7 +281,10 @@ int ipath_destroy_cq(struct ib_cq *ibcq) tasklet_kill(&cq->comptask); dev->n_cqs_allocated--; - vfree(cq->queue); + if (cq->ip) + kref_put(&cq->ip->ref, ipath_release_mmap_info); + else + vfree(cq->queue); kfree(cq); return 0; @@ -253,7 +308,7 @@ int ipath_req_notify_cq(struct ib_cq *ib spin_lock_irqsave(&cq->lock, flags); /* * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow - * any other transitions. + * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). */ if (cq->notify != IB_CQ_NEXT_COMP) cq->notify = notify; @@ -264,46 +319,81 @@ int ipath_resize_cq(struct ib_cq *ibcq, int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); - struct ib_wc *wc, *old_wc; - u32 n; + struct ipath_cq_wc *old_wc = cq->queue; + struct ipath_cq_wc *wc; + u32 head, tail, n; int ret; /* * Need to use vmalloc() if we want to support large #s of entries. */ - wc = vmalloc(sizeof(*wc) * (cqe + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * cqe); if (!wc) { ret = -ENOMEM; goto bail; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->outlen >= sizeof(__u64)) { + __u64 offset = (__u64) wc; + + ret = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (ret) + goto bail; + } + spin_lock_irq(&cq->lock); - if (cq->head < cq->tail) - n = cq->ibcq.cqe + 1 + cq->head - cq->tail; + /* + * Make sure head and tail are sane since they + * might be user writable. 
+ */ + head = old_wc->head; + if (head > (u32) cq->ibcq.cqe) + head = (u32) cq->ibcq.cqe; + tail = old_wc->tail; + if (tail > (u32) cq->ibcq.cqe) + tail = (u32) cq->ibcq.cqe; + if (head < tail) + n = cq->ibcq.cqe + 1 + head - tail; else - n = cq->head - cq->tail; + n = head - tail; if (unlikely((u32)cqe < n)) { spin_unlock_irq(&cq->lock); vfree(wc); ret = -EOVERFLOW; goto bail; } - for (n = 0; cq->tail != cq->head; n++) { - wc[n] = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + for (n = 0; tail != head; n++) { + wc->queue[n] = old_wc->queue[tail]; + if (tail == (u32) cq->ibcq.cqe) + tail = 0; else - cq->tail++; + tail++; } cq->ibcq.cqe = cqe; - cq->head = n; - cq->tail = 0; - old_wc = cq->queue; + wc->head = n; + wc->tail = 0; cq->queue = wc; spin_unlock_irq(&cq->lock); vfree(old_wc); + if (cq->ip) { + struct ipath_ibdev *dev = to_idev(ibcq->device); + struct ipath_mmap_info *ip = cq->ip; + + ip->obj = wc; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * cqe); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + ret = 0; bail: diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_mmap.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_mmap.c Fri Aug 11 13:14:18 2006 -0700 @@ -0,0 +1,149 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include + +#include "ipath_verbs.h" + +/** + * ipath_release_mmap_info - free mmap info structure + * @ref: a pointer to the kref within struct ipath_mmap_info + */ +void ipath_release_mmap_info(struct kref *ref) +{ + struct ipath_mmap_info *ip = + container_of(ref, struct ipath_mmap_info, ref); + + vfree(ip->obj); + kfree(ip); +} + +/* + * open and close keep track of how many times the CQ is mapped, + * to avoid releasing it. 
+ */ +static void ipath_vma_open(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + kref_get(&ip->ref); + ip->mmap_cnt++; +} + +static void ipath_vma_close(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + ip->mmap_cnt--; + kref_put(&ip->ref, ipath_release_mmap_info); +} + +/* + * ipath_vma_nopage - handle a VMA page fault. + */ +static struct page *ipath_vma_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + unsigned long offset = address - vma->vm_start; + struct page *page = NOPAGE_SIGBUS; + void *pageptr; + + if (offset >= ip->size) + goto out; /* out of range */ + + /* + * Convert the vmalloc address into a struct page. + */ + pageptr = (void *)(offset + (vma->vm_pgoff << PAGE_SHIFT)); + page = vmalloc_to_page(pageptr); + if (!page) + goto out; + + /* Increment the reference count. */ + get_page(page); + if (type) + *type = VM_FAULT_MINOR; +out: + return page; +} + +static struct vm_operations_struct ipath_vm_ops = { + .open = ipath_vma_open, + .close = ipath_vma_close, + .nopage = ipath_vma_nopage, +}; + +/** + * ipath_mmap - create a new mmap region + * @context: the IB user context of the process making the mmap() call + * @vma: the VMA to be initialized + * Return zero if the mmap is OK. Otherwise, return an errno. + */ +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + struct ipath_ibdev *dev = to_idev(context->device); + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + unsigned long size = vma->vm_end - vma->vm_start; + struct ipath_mmap_info *ip, **pp; + + /* + * Search the device's list of objects waiting for a mmap call. + * Normally, this list is very short since a call to create a + * CQ, QP, or SRQ is soon followed by a call to mmap(). + */ + spin_lock_irq(&dev->pending_lock); + for (pp = &dev->pending_mmaps; (ip = *pp); pp = &ip->next) { + /* Only the creator is allowed to mmap the object */ + if (context != ip->context || (void *) offset != ip->obj) + continue; + /* Don't allow a mmap larger than the object. */ + if (size > ip->size) + break; + + *pp = ip->next; + spin_unlock_irq(&dev->pending_lock); + + vma->vm_ops = &ipath_vm_ops; + vma->vm_flags |= VM_RESERVED | VM_DONTEXPAND; + vma->vm_private_data = ip; + ipath_vma_open(vma); + return 0; + } + spin_unlock_irq(&dev->pending_lock); + return -EINVAL; +} diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 11 13:14:18 2006 -0700 @@ -35,7 +35,7 @@ #include #include "ipath_verbs.h" -#include "ipath_common.h" +#include "ipath_kernel.h" #define BITS_PER_PAGE (PAGE_SIZE*BITS_PER_BYTE) #define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1) @@ -43,19 +43,6 @@ (off)) #define find_next_offset(map, off) find_next_zero_bit((map)->page, \ BITS_PER_PAGE, off) - -#define TRANS_INVALID 0 -#define TRANS_ANY2RST 1 -#define TRANS_RST2INIT 2 -#define TRANS_INIT2INIT 3 -#define TRANS_INIT2RTR 4 -#define TRANS_RTR2RTS 5 -#define TRANS_RTS2RTS 6 -#define TRANS_SQERR2RTS 7 -#define TRANS_ANY2ERR 8 -#define TRANS_RTS2SQD 9 /* XXX Wait for expected ACKs & signal event */ -#define TRANS_SQD2SQD 10 /* error if not drained & parameter change */ -#define TRANS_SQD2RTS 11 /* error if not drained */ /* * Convert the AETH credit code into the number of credits. 
@@ -355,8 +342,10 @@ static void ipath_reset_qp(struct ipath_ qp->s_last = 0; qp->s_ssn = 1; qp->s_lsn = 0; - qp->r_rq.head = 0; - qp->r_rq.tail = 0; + if (qp->r_rq.wq) { + qp->r_rq.wq->head = 0; + qp->r_rq.wq->tail = 0; + } qp->r_reuse_sge = 0; } @@ -410,15 +399,32 @@ void ipath_error_qp(struct ipath_qp *qp) qp->s_hdrwords = 0; qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - wc.opcode = IB_WC_RECV; - spin_lock(&qp->r_rq.lock); - while (qp->r_rq.tail != qp->r_rq.head) { - wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; - if (++qp->r_rq.tail >= qp->r_rq.size) - qp->r_rq.tail = 0; - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); - } - spin_unlock(&qp->r_rq.lock); + if (qp->r_rq.wq) { + struct ipath_rwq *wq; + u32 head; + u32 tail; + + spin_lock(&qp->r_rq.lock); + + /* sanity check pointers before trusting them */ + wq = qp->r_rq.wq; + head = wq->head; + if (head >= qp->r_rq.size) + head = 0; + tail = wq->tail; + if (tail >= qp->r_rq.size) + tail = 0; + wc.opcode = IB_WC_RECV; + while (tail != head) { + wc.wr_id = get_rwqe_ptr(&qp->r_rq, tail)->wr_id; + if (++tail >= qp->r_rq.size) + tail = 0; + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + } + wq->tail = tail; + + spin_unlock(&qp->r_rq.lock); + } } /** @@ -426,11 +432,12 @@ void ipath_error_qp(struct ipath_qp *qp) * @ibqp: the queue pair who's attributes we're modifying * @attr: the new attributes * @attr_mask: the mask of attributes to modify + * @udata: user data for ipathverbs.so * * Returns 0 on success, otherwise returns an errno. */ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask) + int attr_mask, struct ib_udata *udata) { struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); @@ -543,7 +550,7 @@ int ipath_query_qp(struct ib_qp *ibqp, s attr->dest_qp_num = qp->remote_qpn; attr->qp_access_flags = qp->qp_access_flags; attr->cap.max_send_wr = qp->s_size - 1; - attr->cap.max_recv_wr = qp->r_rq.size - 1; + attr->cap.max_recv_wr = qp->ibqp.srq ? 0 : qp->r_rq.size - 1; attr->cap.max_send_sge = qp->s_max_sge; attr->cap.max_recv_sge = qp->r_rq.max_sge; attr->cap.max_inline_data = 0; @@ -596,13 +603,23 @@ __be32 ipath_compute_aeth(struct ipath_q } else { u32 min, max, x; u32 credits; - + struct ipath_rwq *wq = qp->r_rq.wq; + u32 head; + u32 tail; + + /* sanity check pointers before trusting them */ + head = wq->head; + if (head >= qp->r_rq.size) + head = 0; + tail = wq->tail; + if (tail >= qp->r_rq.size) + tail = 0; /* * Compute the number of credits available (RWQEs). * XXX Not holding the r_rq.lock here so there is a small * chance that the pair of reads are not atomic. 
*/ - credits = qp->r_rq.head - qp->r_rq.tail; + credits = head - tail; if ((int)credits < 0) credits += qp->r_rq.size; /* @@ -679,27 +696,37 @@ struct ib_qp *ipath_create_qp(struct ib_ case IB_QPT_UD: case IB_QPT_SMI: case IB_QPT_GSI: - qp = kmalloc(sizeof(*qp), GFP_KERNEL); + sz = sizeof(*qp); + if (init_attr->srq) { + struct ipath_srq *srq = to_isrq(init_attr->srq); + + sz += sizeof(*qp->r_sg_list) * + srq->rq.max_sge; + } else + sz += sizeof(*qp->r_sg_list) * + init_attr->cap.max_recv_sge; + qp = kmalloc(sz, GFP_KERNEL); if (!qp) { - vfree(swq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto bail_swq; } if (init_attr->srq) { + sz = 0; qp->r_rq.size = 0; qp->r_rq.max_sge = 0; qp->r_rq.wq = NULL; + init_attr->cap.max_recv_wr = 0; + init_attr->cap.max_recv_sge = 0; } else { qp->r_rq.size = init_attr->cap.max_recv_wr + 1; qp->r_rq.max_sge = init_attr->cap.max_recv_sge; - sz = (sizeof(struct ipath_sge) * qp->r_rq.max_sge) + + sz = (sizeof(struct ib_sge) * qp->r_rq.max_sge) + sizeof(struct ipath_rwqe); - qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); + qp->r_rq.wq = vmalloc(sizeof(struct ipath_rwq) + + qp->r_rq.size * sz); if (!qp->r_rq.wq) { - kfree(qp); - vfree(swq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto bail_qp; } } @@ -725,12 +752,10 @@ struct ib_qp *ipath_create_qp(struct ib_ err = ipath_alloc_qpn(&dev->qp_table, qp, init_attr->qp_type); if (err) { - vfree(swq); - vfree(qp->r_rq.wq); - kfree(qp); ret = ERR_PTR(err); - goto bail; - } + goto bail_rwq; + } + qp->ip = NULL; ipath_reset_qp(qp); /* Tell the core driver that the kernel SMA is present. */ @@ -747,8 +772,51 @@ struct ib_qp *ipath_create_qp(struct ib_ init_attr->cap.max_inline_data = 0; + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->outlen >= sizeof(__u64)) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) qp->r_rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto bail_rwq; + } + + if (qp->r_rq.wq) { + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto bail_rwq; + } + qp->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = qp->r_rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + qp->r_rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + } + ret = &qp->ibqp; - + goto bail; + +bail_rwq: + vfree(qp->r_rq.wq); +bail_qp: + kfree(qp); +bail_swq: + vfree(swq); bail: return ret; } @@ -772,11 +840,9 @@ int ipath_destroy_qp(struct ib_qp *ibqp) if (qp->ibqp.qp_type == IB_QPT_SMI) ipath_layer_set_verbs_flags(dev->dd, 0); - spin_lock_irqsave(&qp->r_rq.lock, flags); - spin_lock(&qp->s_lock); + spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); + spin_unlock_irqrestore(&qp->s_lock, flags); /* Stop the sending tasklet. 
*/ tasklet_kill(&qp->s_task); @@ -797,8 +863,11 @@ int ipath_destroy_qp(struct ib_qp *ibqp) if (atomic_read(&qp->refcount) != 0) ipath_free_qp(&dev->qp_table, qp); + if (qp->ip) + kref_put(&qp->ip->ref, ipath_release_mmap_info); + else + vfree(qp->r_rq.wq); vfree(qp->s_wq); - vfree(qp->r_rq.wq); kfree(qp); return 0; } diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Fri Aug 11 13:14:18 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ipath_common.h" +#include "ipath_kernel.h" /* * Convert the AETH RNR timeout code into the number of milliseconds. @@ -106,6 +106,54 @@ void ipath_insert_rnr_queue(struct ipath spin_unlock_irqrestore(&dev->pending_lock, flags); } +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + qp->r_len = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + &qp->r_sg_list[j], &wqe->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + qp->r_len += wqe->sg_list[i].length; + j++; + } + qp->r_sge.sge = qp->r_sg_list[0]; + qp->r_sge.sg_list = qp->r_sg_list + 1; + qp->r_sge.num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} + /** * ipath_get_rwqe - copy the next RWQE into the QP's RWQE * @qp: the QP @@ -119,71 +167,71 @@ int ipath_get_rwqe(struct ipath_qp *qp, { unsigned long flags; struct ipath_rq *rq; + struct ipath_rwq *wq; struct ipath_srq *srq; struct ipath_rwqe *wqe; - int ret = 1; - - if (!qp->ibqp.srq) { + void (*handler)(struct ib_event *, void *); + u32 tail; + int ret; + + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; + rq = &srq->rq; + } else { + srq = NULL; + handler = NULL; rq = &qp->r_rq; - spin_lock_irqsave(&rq->lock, flags); - - if (unlikely(rq->tail == rq->head)) { + } + + spin_lock_irqsave(&rq->lock, flags); + wq = rq->wq; + tail = wq->tail; + /* Validate tail before using it since it is user writable. 
*/ + if (tail >= rq->size) + tail = 0; + do { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); ret = 0; - goto done; - } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - goto done; - } - - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - spin_lock_irqsave(&rq->lock, flags); - - if (unlikely(rq->tail == rq->head)) { - ret = 0; - goto done; - } - wqe = get_rwqe_ptr(rq, rq->tail); + goto bail; + } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + } while (!wr_id_only && !init_sge(qp, wqe)); qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq->ibsrq.event_handler) { - struct ib_event ev; + wq->tail = tail; + + ret = 1; + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. + */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { + struct ib_event ev; + srq->limit = 0; spin_unlock_irqrestore(&rq->lock, flags); ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); + handler(&ev, srq->ibsrq.srq_context); goto bail; } } - -done: spin_unlock_irqrestore(&rq->lock, flags); + bail: return ret; } diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_srq.c --- a/drivers/infiniband/hw/ipath/ipath_srq.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_srq.c Fri Aug 11 13:36:18 2006 -0700 @@ -48,66 +48,39 @@ int ipath_post_srq_receive(struct ib_srq struct ib_recv_wr **bad_wr) { struct ipath_srq *srq = to_isrq(ibsrq); - struct ipath_ibdev *dev = to_idev(ibsrq->device); + struct ipath_rwq *wq; unsigned long flags; int ret; for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; - - if (wr->num_sge > srq->rq.max_sge) { + int i; + + if ((unsigned) wr->num_sge > srq->rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&srq->rq.lock, flags); - next = srq->rq.head + 1; + wq = srq->rq.wq; + next = wq->head + 1; if (next >= srq->rq.size) next = 0; - if (next == srq->rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&srq->rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); + wqe = get_rwqe_ptr(&srq->rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(srq->ibsrq.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok(&dev->lk_table, - &wqe->sg_list[j], - &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge 
= j; - srq->rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&srq->rq.lock, flags); } ret = 0; @@ -133,53 +106,95 @@ struct ib_srq *ipath_create_srq(struct i if (dev->n_srqs_allocated == ib_ipath_max_srqs) { ret = ERR_PTR(-ENOMEM); - goto bail; + goto done; } if (srq_init_attr->attr.max_wr == 0) { ret = ERR_PTR(-EINVAL); - goto bail; + goto done; } if ((srq_init_attr->attr.max_sge > ib_ipath_max_srq_sges) || (srq_init_attr->attr.max_wr > ib_ipath_max_srq_wrs)) { ret = ERR_PTR(-EINVAL); - goto bail; + goto done; } srq = kmalloc(sizeof(*srq), GFP_KERNEL); if (!srq) { ret = ERR_PTR(-ENOMEM); - goto bail; + goto done; } /* * Need to use vmalloc() if we want to support large #s of entries. */ srq->rq.size = srq_init_attr->attr.max_wr + 1; - sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge + + srq->rq.max_sge = srq_init_attr->attr.max_sge; + sz = sizeof(struct ib_sge) * srq->rq.max_sge + sizeof(struct ipath_rwqe); - srq->rq.wq = vmalloc(srq->rq.size * sz); + srq->rq.wq = vmalloc(sizeof(struct ipath_rwq) + srq->rq.size * sz); if (!srq->rq.wq) { - kfree(srq); ret = ERR_PTR(-ENOMEM); - goto bail; - } + goto bail_srq; + } + + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->outlen >= sizeof(__u64)) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) srq->rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto bail_wq; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto bail_wq; + } + srq->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = srq->rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + srq->rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + srq->ip = NULL; /* * ib_create_srq() will initialize srq->ibsrq. */ spin_lock_init(&srq->rq.lock); - srq->rq.head = 0; - srq->rq.tail = 0; + srq->rq.wq->head = 0; + srq->rq.wq->tail = 0; srq->rq.max_sge = srq_init_attr->attr.max_sge; srq->limit = srq_init_attr->attr.srq_limit; + dev->n_srqs_allocated++; + ret = &srq->ibsrq; - - dev->n_srqs_allocated++; - -bail: + goto done; + +bail_wq: + vfree(srq->rq.wq); + +bail_srq: + kfree(srq); + +done: return ret; } @@ -188,83 +203,130 @@ bail: * @ibsrq: the SRQ to modify * @attr: the new attributes of the SRQ * @attr_mask: indicates which attributes to modify + * @udata: user data for ipathverbs.so */ int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) -{ - struct ipath_srq *srq = to_isrq(ibsrq); - unsigned long flags; - int ret; - - if (attr_mask & IB_SRQ_MAX_WR) + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + int ret = 0; + + if (attr_mask & IB_SRQ_MAX_WR) { + struct ipath_rwq *owq; + struct ipath_rwq *wq; + struct ipath_rwqe *p; + u32 sz, size, n, head, tail; + + /* Check that the requested sizes are below the limits. */ if ((attr->max_wr > ib_ipath_max_srq_wrs) || - (attr->max_sge > srq->rq.max_sge)) { + ((attr_mask & IB_SRQ_LIMIT) ? 
+ attr->srq_limit : srq->limit) > attr->max_wr) { ret = -EINVAL; goto bail; } - if (attr_mask & IB_SRQ_LIMIT) - if (attr->srq_limit >= srq->rq.size) { - ret = -EINVAL; - goto bail; - } - - if (attr_mask & IB_SRQ_MAX_WR) { - struct ipath_rwqe *wq, *p; - u32 sz, size, n; - sz = sizeof(struct ipath_rwqe) + - attr->max_sge * sizeof(struct ipath_sge); + srq->rq.max_sge * sizeof(struct ib_sge); size = attr->max_wr + 1; - wq = vmalloc(size * sz); + wq = vmalloc(sizeof(struct ipath_rwq) + size * sz); if (!wq) { ret = -ENOMEM; goto bail; } - spin_lock_irqsave(&srq->rq.lock, flags); - if (srq->rq.head < srq->rq.tail) - n = srq->rq.size + srq->rq.head - srq->rq.tail; + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->inlen >= sizeof(__u64)) { + __u64 offset_addr; + __u64 offset = (__u64) wq; + + ret = ib_copy_from_udata(&offset_addr, udata, + sizeof(offset_addr)); + if (ret) { + vfree(wq); + goto bail; + } + udata->outbuf = (void __user *) offset_addr; + ret = ib_copy_to_udata(udata, &offset, + sizeof(offset)); + if (ret) { + vfree(wq); + goto bail; + } + } + + spin_lock_irq(&srq->rq.lock); + /* + * validate head pointer value and compute + * the number of remaining WQEs. + */ + owq = srq->rq.wq; + head = owq->head; + if (head >= srq->rq.size) + head = 0; + tail = owq->tail; + if (tail >= srq->rq.size) + tail = 0; + n = head; + if (n < tail) + n += srq->rq.size - tail; else - n = srq->rq.head - srq->rq.tail; - if (size <= n || size <= srq->limit) { - spin_unlock_irqrestore(&srq->rq.lock, flags); + n -= tail; + if (size <= n) { + spin_unlock_irq(&srq->rq.lock); vfree(wq); ret = -EINVAL; goto bail; } n = 0; - p = wq; - while (srq->rq.tail != srq->rq.head) { + p = wq->wq; + while (tail != head) { struct ipath_rwqe *wqe; int i; - wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail); + wqe = get_rwqe_ptr(&srq->rq, tail); p->wr_id = wqe->wr_id; - p->length = wqe->length; p->num_sge = wqe->num_sge; for (i = 0; i < wqe->num_sge; i++) p->sg_list[i] = wqe->sg_list[i]; n++; p = (struct ipath_rwqe *)((char *) p + sz); - if (++srq->rq.tail >= srq->rq.size) - srq->rq.tail = 0; - } - vfree(srq->rq.wq); + if (++tail >= srq->rq.size) + tail = 0; + } srq->rq.wq = wq; srq->rq.size = size; - srq->rq.head = n; - srq->rq.tail = 0; - srq->rq.max_sge = attr->max_sge; - spin_unlock_irqrestore(&srq->rq.lock, flags); - } - - if (attr_mask & IB_SRQ_LIMIT) { - spin_lock_irqsave(&srq->rq.lock, flags); - srq->limit = attr->srq_limit; - spin_unlock_irqrestore(&srq->rq.lock, flags); - } - ret = 0; + wq->head = n; + wq->tail = 0; + if (attr_mask & IB_SRQ_LIMIT) + srq->limit = attr->srq_limit; + spin_unlock_irq(&srq->rq.lock); + + vfree(owq); + + if (srq->ip) { + struct ipath_mmap_info *ip = srq->ip; + struct ipath_ibdev *dev = to_idev(srq->ibsrq.device); + + ip->obj = wq; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + } else if (attr_mask & IB_SRQ_LIMIT) { + spin_lock_irq(&srq->rq.lock); + if (attr->srq_limit >= srq->rq.size) + ret = -EINVAL; + else + srq->limit = attr->srq_limit; + spin_unlock_irq(&srq->rq.lock); + } bail: return ret; diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Fri Aug 11 13:14:18 2006 -0700 @@ -34,7 +34,54 @@ #include #include "ipath_verbs.h" -#include 
"ipath_common.h" +#include "ipath_kernel.h" + +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, + u32 *lengthp, struct ipath_sge_state *ss) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + *lengthp = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + j ? &ss->sg_list[j - 1] : &ss->sge, + &wqe->sg_list[i], IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + *lengthp += wqe->sg_list[i].length; + j++; + } + ss->num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} /** * ipath_ud_loopback - handle send on loopback QPs @@ -46,6 +93,8 @@ * * This is called from ipath_post_ud_send() to forward a WQE addressed * to the same HCA. + * Note that the receive interrupt handler may be calling ipath_ud_rcv() + * while this is being called. */ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, @@ -60,7 +109,11 @@ static void ipath_ud_loopback(struct ipa struct ipath_srq *srq; struct ipath_sge_state rsge; struct ipath_sge *sge; + struct ipath_rwq *wq; struct ipath_rwqe *wqe; + void (*handler)(struct ib_event *, void *); + u32 tail; + u32 rlen; qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn); if (!qp) @@ -94,6 +147,13 @@ static void ipath_ud_loopback(struct ipa wc->imm_data = 0; } + if (wr->num_sge > 1) { + rsge.sg_list = kmalloc((wr->num_sge - 1) * + sizeof(struct ipath_sge), + GFP_ATOMIC); + } else + rsge.sg_list = NULL; + /* * Get the next work request entry to find where to put the data. * Note that it is safe to drop the lock after changing rq->tail @@ -101,37 +161,52 @@ static void ipath_ud_loopback(struct ipa */ if (qp->ibqp.srq) { srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; rq = &srq->rq; } else { srq = NULL; + handler = NULL; rq = &qp->r_rq; } + spin_lock_irqsave(&rq->lock, flags); - if (rq->tail == rq->head) { + wq = rq->wq; + tail = wq->tail; + while (1) { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto bail_sge; + } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + if (init_sge(qp, wqe, &rlen, &rsge)) + break; + wq->tail = tail; + } + /* Silently drop packets which are too big. */ + if (wc->byte_len > rlen) { spin_unlock_irqrestore(&rq->lock, flags); dev->n_pkt_drops++; - goto done; - } - /* Silently drop packets which are too big. */ - wqe = get_rwqe_ptr(rq, rq->tail); - if (wc->byte_len > wqe->length) { - spin_unlock_irqrestore(&rq->lock, flags); - dev->n_pkt_drops++; - goto done; - } + goto bail_sge; + } + wq->tail = tail; wc->wr_id = wqe->wr_id; - rsge.sge = wqe->sg_list[0]; - rsge.sg_list = wqe->sg_list + 1; - rsge.num_sge = wqe->num_sge; - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq && srq->ibsrq.event_handler) { + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { struct ib_event ev; @@ -140,12 +215,12 @@ static void ipath_ud_loopback(struct ipa ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); + handler(&ev, srq->ibsrq.srq_context); } else spin_unlock_irqrestore(&rq->lock, flags); } else spin_unlock_irqrestore(&rq->lock, flags); + ah_attr = &to_iah(wr->wr.ud.ah)->attr; if (ah_attr->ah_flags & IB_AH_GRH) { ipath_copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh)); @@ -186,7 +261,7 @@ static void ipath_ud_loopback(struct ipa wc->src_qp = sqp->ibqp.qp_num; /* XXX do we know which pkey matched? Only needed for GSI. */ wc->pkey_index = 0; - wc->slid = ipath_layer_get_lid(dev->dd) | + wc->slid = dev->dd->ipath_lid | (ah_attr->src_path_bits & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1)); wc->sl = ah_attr->sl; @@ -196,6 +271,8 @@ static void ipath_ud_loopback(struct ipa ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, wr->send_flags & IB_SEND_SOLICITED); +bail_sge: + kfree(rsge.sg_list); done: if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); @@ -433,13 +510,9 @@ void ipath_ud_rcv(struct ipath_ibdev *de int opcode; u32 hdrsize; u32 pad; - unsigned long flags; struct ib_wc wc; u32 qkey; u32 src_qp; - struct ipath_rq *rq; - struct ipath_srq *srq; - struct ipath_rwqe *wqe; u16 dlid; int header_in_data; @@ -547,19 +620,10 @@ void ipath_ud_rcv(struct ipath_ibdev *de /* * Get the next work request entry to find where to put the data. - * Note that it is safe to drop the lock after changing rq->tail - * since ipath_post_receive() won't fill the empty slot. - */ - if (qp->ibqp.srq) { - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - } else { - srq = NULL; - rq = &qp->r_rq; - } - spin_lock_irqsave(&rq->lock, flags); - if (rq->tail == rq->head) { - spin_unlock_irqrestore(&rq->lock, flags); + */ + if (qp->r_reuse_sge) + qp->r_reuse_sge = 0; + else if (!ipath_get_rwqe(qp, 0)) { /* * Count VL15 packets dropped due to no receive buffer. * Otherwise, count them as buffer overruns since usually, @@ -573,39 +637,11 @@ void ipath_ud_rcv(struct ipath_ibdev *de goto bail; } /* Silently drop packets which are too big. 
*/ - wqe = get_rwqe_ptr(rq, rq->tail); - if (wc.byte_len > wqe->length) { - spin_unlock_irqrestore(&rq->lock, flags); + if (wc.byte_len > qp->r_len) { + qp->r_reuse_sge = 1; dev->n_pkt_drops++; goto bail; } - wc.wr_id = wqe->wr_id; - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq && srq->ibsrq.event_handler) { - u32 n; - - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; - else - n = rq->head - rq->tail; - if (n < srq->limit) { - struct ib_event ev; - - srq->limit = 0; - spin_unlock_irqrestore(&rq->lock, flags); - ev.device = qp->ibqp.device; - ev.element.srq = qp->ibqp.srq; - ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); - } else - spin_unlock_irqrestore(&rq->lock, flags); - } else - spin_unlock_irqrestore(&rq->lock, flags); if (has_grh) { ipath_copy_sge(&qp->r_sge, &hdr->u.l.grh, sizeof(struct ib_grh)); @@ -614,6 +650,7 @@ void ipath_ud_rcv(struct ipath_ibdev *de ipath_skip_sge(&qp->r_sge, sizeof(struct ib_grh)); ipath_copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); + wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; wc.vendor_err = 0; diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 11 13:14:18 2006 -0700 @@ -277,11 +277,12 @@ static int ipath_post_receive(struct ib_ struct ib_recv_wr **bad_wr) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_rwq *wq = qp->r_rq.wq; unsigned long flags; int ret; /* Check that state is OK to post receive. */ - if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK) || !wq) { *bad_wr = wr; ret = -EINVAL; goto bail; @@ -290,59 +291,31 @@ static int ipath_post_receive(struct ib_ for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; - - if (wr->num_sge > qp->r_rq.max_sge) { + int i; + + if ((unsigned) wr->num_sge > qp->r_rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&qp->r_rq.lock, flags); - next = qp->r_rq.head + 1; + next = wq->head + 1; if (next >= qp->r_rq.size) next = 0; - if (next == qp->r_rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&qp->r_rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head); + wqe = get_rwqe_ptr(&qp->r_rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(qp->ibqp.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok( - &to_idev(qp->ibqp.device)->lk_table, - &wqe->sg_list[j], &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - qp->r_rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&qp->r_rq.lock, flags); } ret = 0; @@ -1137,6 +1110,7 @@ static void 
*ipath_register_ib_device(in dev->attach_mcast = ipath_multicast_attach; dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; + dev->mmap = ipath_mmap; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s kernel_SMA", system_utsname.nodename); diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 11 13:14:18 2006 -0700 @@ -38,6 +38,7 @@ #include #include #include +#include #include #include "ipath_layer.h" @@ -50,7 +51,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define IPATH_UVERBS_ABI_VERSION 1 +#define IPATH_UVERBS_ABI_VERSION 2 /* * Define an ib_cq_notify value that is not valid so we know when CQ @@ -178,58 +179,41 @@ struct ipath_ah { }; /* - * Quick description of our CQ/QP locking scheme: - * - * We have one global lock that protects dev->cq/qp_table. Each - * struct ipath_cq/qp also has its own lock. An individual qp lock - * may be taken inside of an individual cq lock. Both cqs attached to - * a qp may be locked, with the send cq locked first. No other - * nesting should be done. - * - * Each struct ipath_cq/qp also has an atomic_t ref count. The - * pointer from the cq/qp_table to the struct counts as one reference. - * This reference also is good for access through the consumer API, so - * modifying the CQ/QP etc doesn't need to take another reference. - * Access because of a completion being polled does need a reference. - * - * Finally, each struct ipath_cq/qp has a wait_queue_head_t for the - * destroy function to sleep on. - * - * This means that access from the consumer API requires nothing but - * taking the struct's lock. - * - * Access because of a completion event should go as follows: - * - lock cq/qp_table and look up struct - * - increment ref count in struct - * - drop cq/qp_table lock - * - lock struct, do your thing, and unlock struct - * - decrement ref count; if zero, wake up waiters - * - * To destroy a CQ/QP, we can do the following: - * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock - * - decrement ref count - * - wait_event until ref count is zero - * - * It is the consumer's responsibilty to make sure that no QP - * operations (WQE posting or state modification) are pending when the - * QP is destroyed. Also, the consumer must make sure that calls to - * qp_modify are serialized. - * - * Possible optimizations (wait for profile data to see if/where we - * have locks bouncing between CPUs): - * - split cq/qp table lock into n separate (cache-aligned) locks, - * indexed (say) by the page in the table - */ - + * This structure is used by ipath_mmap() to validate an offset + * when an mmap() request is made. The vm_area_struct then uses + * this as its vm_private_data. + */ +struct ipath_mmap_info { + struct ipath_mmap_info *next; + struct ib_ucontext *context; + void *obj; + struct kref ref; + unsigned size; + unsigned mmap_cnt; +}; + +/* + * This structure is used to contain the head pointer, tail pointer, + * and completion queue entries as a single memory allocation so + * it can be mmap'ed into user space. + */ +struct ipath_cq_wc { + u32 head; /* index of next entry to fill */ + u32 tail; /* index of next ib_poll_cq() entry */ + struct ib_wc queue[1]; /* this is actually size ibcq.cqe + 1 */ +}; + +/* + * The completion queue structure. 
+ */ struct ipath_cq { struct ib_cq ibcq; struct tasklet_struct comptask; spinlock_t lock; u8 notify; u8 triggered; - u32 head; /* new records added to the head */ - u32 tail; /* poll_cq() reads from here. */ - struct ib_wc *queue; /* this is actually ibcq.cqe + 1 */ + struct ipath_cq_wc *queue; + struct ipath_mmap_info *ip; }; /* @@ -248,28 +232,40 @@ struct ipath_swqe { /* * Receive work request queue entry. - * The size of the sg_list is determined when the QP is created and stored - * in qp->r_max_sge. + * The size of the sg_list is determined when the QP (or SRQ) is created + * and stored in qp->r_rq.max_sge (or srq->rq.max_sge). */ struct ipath_rwqe { u64 wr_id; - u32 length; /* total length of data in sg_list */ u8 num_sge; - struct ipath_sge sg_list[0]; -}; - -struct ipath_rq { - spinlock_t lock; + struct ib_sge sg_list[0]; +}; + +/* + * This structure is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { u32 head; /* new work requests posted to the head */ u32 tail; /* receives pull requests from here. */ + struct ipath_rwqe wq[0]; +}; + +struct ipath_rq { + struct ipath_rwq *wq; + spinlock_t lock; u32 size; /* size of RWQE array */ u8 max_sge; - struct ipath_rwqe *wq; /* RWQE array */ }; struct ipath_srq { struct ib_srq ibsrq; struct ipath_rq rq; + struct ipath_mmap_info *ip; /* send signal when number of RWQEs < limit */ u32 limit; }; @@ -293,6 +289,7 @@ struct ipath_qp { atomic_t refcount; wait_queue_head_t wait; struct tasklet_struct s_task; + struct ipath_mmap_info *ip; struct ipath_sge_state *s_cur_sge; struct ipath_sge_state s_sge; /* current send request data */ /* current RDMA read send data */ @@ -345,7 +342,8 @@ struct ipath_qp { u32 s_ssn; /* SSN of tail entry */ u32 s_lsn; /* limit sequence number (credit) */ struct ipath_swqe *s_wq; /* send work queue */ - struct ipath_rq r_rq; /* receive work queue */ + struct ipath_rq r_rq; /* receive work queue */ + struct ipath_sge r_sg_list[0]; /* verified SGEs */ }; /* @@ -369,15 +367,15 @@ static inline struct ipath_swqe *get_swq /* * Since struct ipath_rwqe is not a fixed size, we can't simply index into - * struct ipath_rq.wq. This function does the array index computation. + * struct ipath_rwq.wq. This function does the array index computation. 
*/ static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, unsigned n) { return (struct ipath_rwqe *) - ((char *) rq->wq + + ((char *) rq->wq->wq + (sizeof(struct ipath_rwqe) + - rq->max_sge * sizeof(struct ipath_sge)) * n); + rq->max_sge * sizeof(struct ib_sge)) * n); } /* @@ -417,6 +415,7 @@ struct ipath_ibdev { struct ib_device ibdev; struct list_head dev_list; struct ipath_devdata *dd; + struct ipath_mmap_info *pending_mmaps; int ib_unit; /* This is the device number */ u16 sm_lid; /* in host order */ u8 sm_sl; @@ -579,7 +578,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp) int ipath_destroy_qp(struct ib_qp *ibqp); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask); + int attr_mask, struct ib_udata *udata); int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); @@ -638,7 +637,8 @@ struct ib_srq *ipath_create_srq(struct i struct ib_udata *udata); int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); int ipath_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr); @@ -680,6 +680,10 @@ int ipath_unmap_fmr(struct list_head *fm int ipath_dealloc_fmr(struct ib_fmr *ibfmr); +void ipath_release_mmap_info(struct kref *ref); + +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); + void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev); void ipath_insert_rnr_queue(struct ipath_qp *qp); From ianjiang.ict at gmail.com Sat Aug 12 00:07:42 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Sat, 12 Aug 2006 15:07:42 +0800 Subject: [openib-general] where are the userspace sources after installing the ofed-1.0? Message-ID: <7b2fa1820608120007y433ed30ek282d765861291e2d@mail.gmail.com> I know the userspace sources could be got from OFED-1.0.tgz\SOURCES\openib-1.0\src\userspace. But I am wondering wheather these source will be installed when installing the OFED-1.0 RPMs. I cannot find them under /usr/local/ofed/ Thanks! -- Ian Jiang From ogerlitz at voltaire.com Sun Aug 13 00:42:56 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 13 Aug 2006 10:42:56 +0300 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155333276.20325.422.camel@brick.pathscale.com> References: <1155333276.20325.422.camel@brick.pathscale.com> Message-ID: <44DED800.7070703@voltaire.com> Ralph Campbell wrote: > The following patches update libibverbs, libmthca, libipathverbs, > and the kernel ib_core, ib_mthca, ib_ehca, and ib_ipath modules in > order to improve performance on QLogic InfiniPath HCAs. (With probably not being enough into the details of what you are changing) I find it somehow hard to review your patch set as of two reasons: (a) all the patches having the same subject line and (b) except for the seventh in the series, the patches were generated without the modified/new function/structure name (ie without the -p flag to diff). Or. From mst at mellanox.co.il Sun Aug 13 04:55:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 13 Aug 2006 14:55:04 +0300 Subject: [openib-general] [PATCH] IB/srp: add port/device attributes Message-ID: <20060813115504.GA21712@mellanox.co.il> Hi, Roland! There does not, at the moment, seem to exist a way to find out which HCA port the specific SRP host is connected through. 
This is needed to for things like availability or balancing where we want to connect to the same target through multiple distinct ports. While not really a bugfix, maybe the following is small enough for 2.6.18? We will use it in srptools that will ship with OFED. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. Tsirkin Index: source/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- source.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-08-13 13:34:23.000000000 +0300 +++ source/drivers/infiniband/ulp/srp/ib_srp.c 2006-08-13 14:02:56.000000000 +0300 @@ -1461,12 +1461,29 @@ static ssize_t show_zero_req_lim(struct return sprintf(buf, "%d\n", target->zero_req_lim); } +static ssize_t show_srp_port(struct class_device *cdev, char *buf) +{ + struct srp_target_port *target = host_to_target(class_to_shost(cdev)); + + return sprintf(buf, "%d\n", target->srp_host->port); +} + +static ssize_t show_srp_device(struct class_device *cdev, char *buf) +{ + struct srp_target_port *target = host_to_target(class_to_shost(cdev)); + + return sprintf(buf, "%s\n", target->srp_host->dev->dev->name); +} + + static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); static CLASS_DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); +static CLASS_DEVICE_ATTR(srp_port, S_IRUGO, show_srp_port, NULL); +static CLASS_DEVICE_ATTR(srp_device, S_IRUGO, show_srp_device, NULL); static struct class_device_attribute *srp_host_attrs[] = { &class_device_attr_id_ext, @@ -1475,6 +1492,8 @@ static struct class_device_attribute *sr &class_device_attr_pkey, &class_device_attr_dgid, &class_device_attr_zero_req_lim, + &class_device_attr_srp_port, + &class_device_attr_srp_device, NULL }; -- MST From tziporet at mellanox.co.il Sun Aug 13 06:14:10 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 13 Aug 2006 16:14:10 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <1155303417.4507.15009.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> <1155303417.4507.15009.camel@hal.voltaire.com> Message-ID: <44DF25A2.30405@mellanox.co.il> Hal Rosenstock wrote: >> Target release date: 12-Sep >> >> Intermediate milestones: >> 1. Create 1.1 branch of user level: 27-Jul - done >> 2. RC1: 8-Aug - done >> 3. Feature freeze (RC2): 17-Aug >> > > What is the start build date for RC2 ? When do developers need to have > their code in by to make RC2 ? > We will start on Tue 15-Aug. Is this OK with you? > > >> 4. Code freeze (rc-x): 6-Sep >> > > Is this 1 or 2 RCs beyond RC2 in order to make this ? > > I hope one but I guess it will be two more RCs. Tziporet From tziporet at mellanox.co.il Sun Aug 13 06:54:58 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 13 Aug 2006 16:54:58 +0300 Subject: [openib-general] Where is the programming reference manual of IBG2(OFED)? In-Reply-To: <3544.159.226.195.142.1155319768.squirrel@webmail.ict.ac.cn> References: <3544.159.226.195.142.1155319768.squirrel@webmail.ict.ac.cn> Message-ID: <44DF2F32.3050706@mellanox.co.il> wangnan06 at ict.ac.cn wrote: > Hi, > I'm new to IBG2 programming, It's seems that there's only "Release Notes" > in the package. 
I need some basic documents about progrmming with IBG2 or > OFED. The specification is too abstract and useless for programming. I > search google but find nothing. Anyone can help? Must I read the sample > code and source code to learn it? > There is no such a document. You need to look at examples. Tziporet From tziporet at mellanox.co.il Sun Aug 13 08:02:19 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 13 Aug 2006 18:02:19 +0300 Subject: [openib-general] where are the userspace sources after installing the ofed-1.0? In-Reply-To: <7b2fa1820608120007y433ed30ek282d765861291e2d@mail.gmail.com> References: <7b2fa1820608120007y433ed30ek282d765861291e2d@mail.gmail.com> Message-ID: <44DF3EFB.5060700@mellanox.co.il> Ian Jiang wrote: > I know the userspace sources could be got from > OFED-1.0.tgz\SOURCES\openib-1.0\src\userspace. But I am wondering > wheather these source will be installed when installing the OFED-1.0 > RPMs. I cannot find them under /usr/local/ofed/ > > Thanks! > > In OFED 1.0 sources are not installed at all after the installation. In OFED 1.1 after installation the sources are located on: /src/ Tziporet From kliteyn at gmail.com Sun Aug 13 08:17:49 2006 From: kliteyn at gmail.com (Yevgeny Kliteynik) Date: Sun, 13 Aug 2006 18:17:49 +0300 Subject: [openib-general] [PATCH] osm: OSM crash when working with Cisco's TopSpin stack Message-ID: <842b8cdf0608130817h3660faa6i1d23d75f3d7aca92@mail.gmail.com> Hi Hal. This patch fixes an OSM crash when working with Cisco's TS stack. Cisco's TopSpin doesn't follow the same rules when generating transaction id. When looking up the transaction in the table, SM applies a mask that is supposed to mask out the umad id and leave only the transaction id itself. When mask was applied to the transaction id that was created by TS stack, the result was 0, because it actually masked out the transaction id itself. Yevgeny Signed-off-by: Yevgeny Kliteynik Index: osm/libvendor/osm_vendor_ibumad.c =================================================================== --- osm/libvendor/osm_vendor_ibumad.c (revision 8614) +++ osm/libvendor/osm_vendor_ibumad.c (working copy) @@ -141,12 +141,25 @@ get_madw(osm_vendor_t *p_vend, ib_net64_ ib_net64_t mtid = (*tid & cl_ntoh64(0x00000000ffffffffllu)); osm_madw_t *res; + /* + * Some vendors (such as Cisco's TopSpin) may not follow + * the same rules when generating transaction id. + * If the resuls of applying a mask (which is supposed to + * mask out the umad id and leave only the transaction id + * itself) on a transaction id is 0, it means that the + * creator of the transaction is not SM, hence we don't + * have this transaction in the table anyway. + */ + if (mtid == 0) + return 0; + cl_spinlock_acquire( &p_vend->match_tbl_lock ); for (m = p_vend->mtbl.tbl, e = m + p_vend->mtbl.max; m < e; m++) { if (m->tid == mtid) { m->tid = 0; *tid = mtid; res = m->v; + m->v = NULL; cl_spinlock_release( &p_vend->match_tbl_lock ); return res; } @@ -1148,7 +1161,7 @@ Resp: osm_log(p_vend->p_log, OSM_LOG_DEBUG, "osm_vendor_send: " "Completed sending %s p_madw = %p\n", - resp_expected ? "response" : "request", p_madw); + resp_expected ? "request" : "response", p_madw); Exit: OSM_LOG_EXIT( p_vend->p_log ); return( ret ); -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralphc at pathscale.com Sun Aug 13 10:31:56 2006 From: ralphc at pathscale.com (ralphc at pathscale.com) Date: Sun, 13 Aug 2006 10:31:56 -0700 (PDT) Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <44DED800.7070703@voltaire.com> References: <1155333276.20325.422.camel@brick.pathscale.com> <44DED800.7070703@voltaire.com> Message-ID: <52217.71.131.37.202.1155490316.squirrel@rocky.pathscale.com> > Ralph Campbell wrote: >> The following patches update libibverbs, libmthca, libipathverbs, >> and the kernel ib_core, ib_mthca, ib_ehca, and ib_ipath modules in >> order to improve performance on QLogic InfiniPath HCAs. > > (With probably not being enough into the details of what you are > changing) I find it somehow hard to review your patch set as of two > reasons: (a) all the patches having the same subject line and (b) except > for the seventh in the series, the patches were generated without the > modified/new function/structure name (ie without the -p flag to diff). > > Or. a) I thought using the same subject line was the convention since it is essentially one patch. I split it due to size and the fact that each patch has a different owner. b) This is the format "svn diff" produces. I haven't had any complaints with it before. I can resend them if you want. The bulk of the changes are to the InfiniPath kernel driver (ib_ipath) to support mmap'ing the CQ and receive queues (QP, SRQ) into the user level verbs library. The changes to the core IB were necessary in order to allow additional information (i.e., the mmap offset) to be returned from the kernel driver to the user level verbs driver plugin. From mst at mellanox.co.il Sun Aug 13 11:29:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 13 Aug 2006 21:29:55 +0300 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <52217.71.131.37.202.1155490316.squirrel@rocky.pathscale.com> References: <52217.71.131.37.202.1155490316.squirrel@rocky.pathscale.com> Message-ID: <20060813182955.GC23466@mellanox.co.il> Quoting r. ralphc at pathscale.com : > b) This is the format "svn diff" produces. I haven't had any > complaints with it before. It's in the FAQ. Look here: https://openib.org/tiki/tiki-index.php?page=OpenIBFAQ anyway, git will generate the proper format for you. -- MST From kliteyn at mellanox.co.il Sun Aug 13 12:16:34 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 13 Aug 2006 22:16:34 +0300 Subject: [openib-general] [PATCH] osm: Dynamic verbosity control per file Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2ABC@mtlexch01.mtl.com> Hi Hal. > Is it hard to find which file and line an opensm > log message comes from > ? Is this functionality really needed ? I'm guessing that your question refers to the second bullet - logging source code filename and line number. IMHO, this is a nice-to-have functionality - when debugging the SM that runs with a high verbosity, the log has a lot of information, and it's much easier to follow the SM flow when each message tells where exactly it came from. Regards, Yevgeny Kliteynik Mellanox Technologies LTD Tel: +972-4-909-7200 ext: 394 Fax: +972-4-959-3245 P.O. 
Box 586 Yokneam 20692 ISRAEL -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock Sent: Tuesday, August 08, 2006 11:50 PM To: Yevgeny Kliteynik Cc: OPENIB Subject: Re: [openib-general] [PATCH] osm: Dynamic verbosity control per file Hi Yevgeny, On Wed, 2006-08-02 at 11:16, Yevgeny Kliteynik wrote: > Hi Hal Just got back from vacation and am in the process of catching up. > This patch adds new verbosity functionality. > 1. Verbosity configuration file > ------------------------------- > > The user is able to set verbosity level per source code file > by supplying verbosity configuration file using the following > command line arguments: > > -b filename > --verbosity_file filename > > By default, the OSM will use the following file: /etc/opensmlog.conf > Verbosity configuration file should contain zero or more lines of > the following pattern: > > filename verbosity_level > > where 'filename' is the name of the source code file that the > 'verbosity_level' refers to, and the 'verbosity_level' itself > should be specified as an integer number (decimal or hexadecimal). > > One reserved filename is 'all' - it represents general verbosity > level, that is used for all the files that are not specified in > the verbosity configuration file. > If 'all' is not specified, the verbosity level set in the > command line will be used instead. > Note: The 'all' file verbosity level will override any other > general level that was specified by the command line arguments. > > Sending a SIGHUP signal to the OSM will cause it to reload > the verbosity configuration file. > > > 2. Logging source code filename and line number > ----------------------------------------------- > > If command line option -S or --log_source_info is specified, > OSM will add source code filename and line number to every > log message that is written to the log file. > By default, the OSM will not log this additional info. > > > Yevgeny Is it hard to find which file and line an opensm log message comes from ? Is this functionality really needed ? -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tom at opengridcomputing.com Sun Aug 13 13:22:39 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Sun, 13 Aug 2006 15:22:39 -0500 Subject: [openib-general] return error when rdma_listen fails In-Reply-To: <44DA4EA1.7000902@ichips.intel.com> References: <20060808181928.GA15075@osc.edu> <44DA4EA1.7000902@ichips.intel.com> Message-ID: <1155500559.29014.15.camel@trinity.ogc.int> On Wed, 2006-08-09 at 14:07 -0700, Sean Hefty wrote: > Pete Wyckoff wrote: > > Calling rdma_listen() on a cm_id bound to INADDR_ANY can fail, e.g. > > with EADDRINUSE, but report no error back to the user. This patch > > fixes that by propagating the error. Success occurs only if at > > least one of the possibly multiple devices in the system was able to > > listen. In the case of multiple devices reporting errors on listen, > > only the first error value is returned. iwarp branch. > > There's a problem if the listen is done before any devices have been added to > the system. In this case, the listen should succeed. I think this behavior is an artifact of the fact that the port spaces are not integrated. 
In order to fix this properly, IMO we need to use the Linux services that globally manage IP port spaces. This was discussed on netdev as part of our efforts to get the netdev notifier patch accepted. In absence of integration, we end up with this very strange behavior wherein a listen succeeds on one set of devices, but fails on another set. This is almost certainly not what the user expects or intends. How is the user supposed to interpret an error back from a listen request when the listen succeeded on device, but failed on another? Which device succeeded? Which failed? So all that blather aside, I think we should: - Implement an API into the existing Linux IP port space management database, - Use these services in the RDMA CM - Propose the API as a patch to the kernel on netdev. BTW, I actually think that Shawn could fix the current behavior so that it would be consistent within the RDMA_CM, however, we would still be inconsistent between sockets and iWARP, and IB/SDP and sockets. Thoughts? > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tom at opengridcomputing.com Sun Aug 13 13:36:22 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Sun, 13 Aug 2006 15:36:22 -0500 Subject: [openib-general] return error when rdma_listen fails In-Reply-To: <1155500559.29014.15.camel@trinity.ogc.int> References: <20060808181928.GA15075@osc.edu> <44DA4EA1.7000902@ichips.intel.com> <1155500559.29014.15.camel@trinity.ogc.int> Message-ID: <1155501382.29014.19.camel@trinity.ogc.int> Sorry, Sean, I can't spell today.... On Sun, 2006-08-13 at 15:22 -0500, Tom Tucker wrote: > On Wed, 2006-08-09 at 14:07 -0700, Sean Hefty wrote: > > Pete Wyckoff wrote: > > > Calling rdma_listen() on a cm_id bound to INADDR_ANY can fail, e.g. > > > with EADDRINUSE, but report no error back to the user. This patch > > > fixes that by propagating the error. Success occurs only if at > > > least one of the possibly multiple devices in the system was able to > > > listen. In the case of multiple devices reporting errors on listen, > > > only the first error value is returned. iwarp branch. > > > > There's a problem if the listen is done before any devices have been added to > > the system. In this case, the listen should succeed. > > I think this behavior is an artifact of the fact that the port spaces > are not integrated. In order to fix this properly, IMO we need to use > the Linux services that globally manage IP port spaces. This was > discussed on netdev as part of our efforts to get the netdev notifier > patch accepted. > > In absence of integration, we end up with this very strange behavior > wherein a listen succeeds on one set of devices, but fails on another > set. This is almost certainly not what the user expects or intends. > > How is the user supposed to interpret an error back from a listen > request when the listen succeeded on device, but failed on another? > Which device succeeded? Which failed? > > So all that blather aside, I think we should: > - Implement an API into the existing Linux IP port space > management database, > - Use these services in the RDMA CM > - Propose the API as a patch to the kernel on netdev. 
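For reference, the behaviour Pete's patch is described as giving -- the listen succeeds if at least one device can bind, otherwise the first per-device error is handed back -- comes down to a loop of roughly the following shape. This is an illustrative sketch only, not the actual rdma_cm code; listen_one() is a stand-in for whatever per-device listen helper the iwarp branch really uses.

    /* Illustrative only: succeed if any device listens, else return the
     * first error seen while trying. */
    static int listen_on_all(int (*listen_one)(int dev_index), int num_devs)
    {
            int i, ret, first_err = 0, listening = 0;

            for (i = 0; i < num_devs; i++) {
                    ret = listen_one(i);
                    if (ret == 0)
                            listening++;
                    else if (first_err == 0)
                            first_err = ret;        /* remember the first failure */
            }

            return listening ? 0 : first_err;
    }

Whether that partial-success semantic is what applications actually expect is exactly the concern being raised in this thread.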
> > BTW, I actually think that Shawn could fix the current behavior so that > it would be consistent within the RDMA_CM, however, we would still be > inconsistent between sockets and iWARP, and IB/SDP and sockets. > > Thoughts? > > > > > - Sean > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ogerlitz at voltaire.com Mon Aug 14 01:00:33 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 14 Aug 2006 11:00:33 +0300 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <52217.71.131.37.202.1155490316.squirrel@rocky.pathscale.com> References: <1155333276.20325.422.camel@brick.pathscale.com> <44DED800.7070703@voltaire.com> <52217.71.131.37.202.1155490316.squirrel@rocky.pathscale.com> Message-ID: <44E02DA1.5070405@voltaire.com> >> I find it somehow hard to review your patch set as of two >> reasons: (a) all the patches having the same subject line > a) I thought using the same subject line was the convention since > it is essentially one patch. I split it due to size and the > fact that each patch has a different owner. Nope, the convention is to have the subject line telling what it this patch role within the patch series very similarly to what you have stated in some beginning of most of the patches. Another useful practice is to have all the patches sent over the same thread. > The bulk of the changes are to the InfiniPath kernel driver (ib_ipath) > to support mmap'ing the CQ and receive queues (QP, SRQ) into > the user level verbs library. > The changes to the core IB were neccessary in order to allow > additional information (i.e., the mmap offset) to be returned > from the kernel driver to the user level verbs driver plugin. thanks for the clarification. Or. From ianjiang.ict at gmail.com Mon Aug 14 01:05:17 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Mon, 14 Aug 2006 16:05:17 +0800 Subject: [openib-general] where are the userspace sources after installing the ofed-1.0? In-Reply-To: <44DF3EFB.5060700@mellanox.co.il> References: <7b2fa1820608120007y433ed30ek282d765861291e2d@mail.gmail.com> <44DF3EFB.5060700@mellanox.co.il> Message-ID: <7b2fa1820608140105x1b06f217l7e0b12c8d9fd0ca1@mail.gmail.com> Thank you very much! On 8/13/06, Tziporet Koren wrote: > Ian Jiang wrote: > > I know the userspace sources could be got from > > OFED-1.0.tgz\SOURCES\openib-1.0\src\userspace. But I am wondering > > wheather these source will be installed when installing the OFED-1.0 > > RPMs. I cannot find them under /usr/local/ofed/ > > > > Thanks! > > > > > In OFED 1.0 sources are not installed at all after the installation. > In OFED 1.1 after installation the sources are located on: /src/ > > Tziporet > -- Ian Jiang From kliteyn at gmail.com Mon Aug 14 01:13:19 2006 From: kliteyn at gmail.com (Yevgeny Kliteynik) Date: Mon, 14 Aug 2006 11:13:19 +0300 Subject: [openib-general] [PATCHv2] osm: OSM crash TRIVIAL bug fix Message-ID: <842b8cdf0608140113idc689n5029b4fde98fb43e@mail.gmail.com> Hi Hal. This patch fixes an OSM crash when working with Cisco's stack. 
Cisco's stack doesn't follow the same TID convention when generating transaction ids, which in some bad flow revealed this bug in the get_madw lookup. The bug is in get_madw which does not clean up old pointers to retrieved madw and also does not detect lookup of its reserved "free" entry of key==0. (This better text replaces my previous patch: "OSM crash when working with Cisco's TopSpin stack") Yevgeny Signed-off-by: Yevgeny Kliteynik < kliteyn at mellanox.co.il> Index: osm/libvendor/osm_vendor_ibumad.c =================================================================== --- osm/libvendor/osm_vendor_ibumad.c (revision 8614) +++ osm/libvendor/osm_vendor_ibumad.c (working copy) @@ -141,12 +141,20 @@ get_madw(osm_vendor_t *p_vend, ib_net64_ ib_net64_t mtid = (*tid & cl_ntoh64(0x00000000ffffffffllu)); osm_madw_t *res; + /* + * Since mtid == 0 is the empty key we should not + * waste time looking for it + */ + if (mtid == 0) + return 0; + cl_spinlock_acquire( &p_vend->match_tbl_lock ); for (m = p_vend->mtbl.tbl, e = m + p_vend->mtbl.max; m < e; m++) { if (m->tid == mtid) { m->tid = 0; *tid = mtid; res = m->v; + m->v = NULL; /* just make sure we do not point to free'd madw */ cl_spinlock_release( &p_vend->match_tbl_lock ); return res; } -------------- next part -------------- An HTML attachment was scrubbed... URL: From kliteyn at gmail.com Mon Aug 14 01:24:14 2006 From: kliteyn at gmail.com (Yevgeny Kliteynik) Date: Mon, 14 Aug 2006 11:24:14 +0300 Subject: [openib-general] [PATCH] osm: TRIVIAL wrong description in log message Message-ID: <842b8cdf0608140124x3be6ec4ao239a7765120196c2@mail.gmail.com> Hi Hal, Inspecting the log messages of the error flow in osm_vendor_send I have noticed that the terms "request" and "response" are reversed: If we are sending with response_expected it means we are sending our request... Yevgeny Signed-off-by: Yevgeny Kliteynik < kliteyn at mellanox.co.il> Index: osm/libvendor/osm_vendor_ibumad.c =================================================================== --- osm/libvendor/osm_vendor_ibumad.c (revision 8614) +++ osm/libvendor/osm_vendor_ibumad.c (working copy) @@ -1148,7 +1156,7 @@ Resp: osm_log(p_vend->p_log, OSM_LOG_DEBUG, "osm_vendor_send: " "Completed sending %s p_madw = %p\n", - resp_expected ? "response" : "request", p_madw); + resp_expected ? "request" : "response", p_madw); Exit: OSM_LOG_EXIT( p_vend->p_log ); return( ret ); From ryszard.jurga at cern.ch Mon Aug 14 02:20:53 2006 From: ryszard.jurga at cern.ch (Ryszard Jurga) Date: Mon, 14 Aug 2006 11:20:53 +0200 Subject: [openib-general] DAPL and local_iov in RDMA RR/RW mode References: <025801c6bd5f$f77b8d10$3b388d80@cern.ch> <44DCE509.5090005@ichips.intel.com> Message-ID: <010001c6bf82$f174b890$3b388d80@cern.ch> Hi Arlin, Thank you for your quick reply. Both dat_ep_post_rdma_read and dat_ep_post_rdma_write return DAT_SUCCESS. When I read the 'transfered_length' field from DAT_DTO_COMPLETION_EVENT_DATA after calling a post function I receive the correct value, which equals num_segs*seg_size. Unfortunately, when I read the content of the local buffer, only the first segment is filled with appropriate data. I have tried to set the debug switch (by exporting DAPL_DBG_TYPE=0xffff before running my application) but unfortunately this does not produce any additional output for the post functions. Do you have any other ideas? I did not mention before, but the case with num_segments>1 works fine with a send/recv type of transmission. Best regards, Ryszard.
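For comparison, at the libibverbs level a gather list with more than one segment on an RDMA read is simply a work request whose sg_list has several entries. The sketch below is illustrative only: qp, mr1/mr2, buf1/buf2, remote_addr and remote_rkey are assumed to have been created and exchanged already, and the two 5-byte segments mirror the 10-byte, two-segment case described above.

    struct ibv_sge sge[2] = {
            { .addr = (uintptr_t) buf1, .length = 5, .lkey = mr1->lkey },
            { .addr = (uintptr_t) buf2, .length = 5, .lkey = mr2->lkey },
    };
    struct ibv_send_wr wr = {
            .wr_id               = 1,
            .sg_list             = sge,
            .num_sge             = 2,               /* two local segments */
            .opcode              = IBV_WR_RDMA_READ,
            .send_flags          = IBV_SEND_SIGNALED,
            .wr.rdma.remote_addr = remote_addr,
            .wr.rdma.rkey        = remote_rkey,
    };
    struct ibv_send_wr *bad_wr;

    if (ibv_post_send(qp, &wr, &bad_wr))
            fprintf(stderr, "ibv_post_send failed\n");
    /* on the work completion, both sge[0] and sge[1] should hold remote data */

If the equivalent verbs-level read fills both segments but the uDAPL call does not, the problem is more likely in how the provider translates the local_iov array than in the HCA itself.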
----- Original Message ----- From: "Arlin Davis" To: "Ryszard Jurga" Cc: "openib" Sent: Friday, August 11, 2006 10:14 PM Subject: Re: [openib-general] DAPL and local_iov in RDMA RR/RW mode > Ryszard Jurga wrote: > >> Hi everybody, >> I have one question about a number of segments in local_iov when using >> RDMA Write and Read mode. Is it possible to have num_segments>1? I am >> asking, because when I try to set up num_segments to a value > 1, then I >> can still only read/write one segment, even though I have an appropriate >> remote buffer already reserved. The size of transfered buffer is 10bytes, >> num_segs=2. The information, which is printed below, was obrained from >> network devices with one remark - I have set up manualy >> max_rdma_read_iov=10 and max_rdma_write_iov=10. Thank you in advance for >> your help. > > Yes, uDAPL will support num_segments up to the max counts returned on the > ep_attr. Can you be more specific? Does the post return immediate errors > or are you simply missing data on the remote node? Can you turn up the > uDAPL debug switch (export DAPL_DBG_TYPE=0xffff) and send output of the > post call? > > -arlin > >> Best regards, >> Ryszard. >> EP_ATTR: the same for both nodes: >> ---------------------------------- >> max_message_size=2147483648 >> max_rdma_size=2147483648 >> max_recv_dtos=16 >> max_request_dtos=16 >> max_recv_iov=4 >> max_request_iov=4 >> max_rdma_read_in=4 >> max_rdma_read_out=4 >> srq_soft_hw=0 >> max_rdma_read_iov=10 >> max_rdma_write_iov=10 >> ep_transport_specific_count=0 >> ep_provider_specific_count=0 >> ---------------------------------- >> IA_ATTR: different for nodes >> ---------------------------------- >> IA Info: >> max_eps=64512 >> max_dto_per_ep=65535 >> max_rdma_read_per_ep_in=4 >> max_rdma_read_per_ep_out=1610616831 >> max_evds=65408 >> max_evd_qlen=131071 >> max_iov_segments_per_dto=28 >> max_lmrs=131056 >> max_lmr_block_size=18446744073709551615 >> max_pzs=32768 >> max_message_size=2147483648 >> max_rdma_size=2147483648 >> max_rmrs=0 >> max_srqs=0 >> max_ep_per_srq=0 >> max_recv_per_srq=143263 >> max_iov_segments_per_rdma_read=1073741824 >> max_iov_segments_per_rdma_write=0 >> max_rdma_read_in=0 >> max_rdma_read_out=65535 >> max_rdma_read_per_ep_in_guaranteed=7286 >> max_rdma_read_per_ep_out_guaranteed=0 >> IA Info: >> max_eps=64512 >> max_dto_per_ep=65535 >> max_rdma_read_per_ep_in=4 >> max_rdma_read_per_ep_out=0 >> max_evds=65408 >> max_evd_qlen=131071 >> max_iov_segments_per_dto=28 >> max_lmrs=131056 >> max_lmr_block_size=18446744073709551615 >> max_pzs=32768 >> max_message_size=2147483648 >> max_rdma_size=2147483648 >> max_rmrs=0 >> max_srqs=0 >> max_ep_per_srq=0 >> max_recv_per_srq=142247 >> max_iov_segments_per_rdma_read=1073741824 >> max_iov_segments_per_rdma_write=0 >> max_rdma_read_in=0 >> max_rdma_read_out=65535 >> max_rdma_read_per_ep_in_guaranteed=7286 >> max_rdma_read_per_ep_out_guaranteed=28 >> >>------------------------------------------------------------------------ >> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit >>http://openib.org/mailman/listinfo/openib-general >> From mingo at redhat.com Mon Aug 14 04:07:51 2006 From: mingo at redhat.com (Ingo Molnar) Date: Mon, 14 Aug 2006 13:07:51 +0200 Subject: [openib-general] [PATCH] Fix potential deadlock in mthca In-Reply-To: <44DCAD34.5040502@linux.intel.com> References: <44DCAD34.5040502@linux.intel.com> Message-ID: 
<1155553671.22848.38.camel@earth> On Fri, 2006-08-11 at 09:15 -0700, Arjan van de Ven wrote: > Roland Dreier wrote: > > Here's a long-standing bug that lockdep found very nicely. > > > > Ingo/Arjan, can you confirm that the fix looks OK and I am using > > spin_lock_nested() properly? I couldn't find much documentation or > > many examples of it, so I'm not positive this is the right way to > > handle this fix. > > > > looks correct to me; > > Acked-by: Arjan van de Ven looks good to me too. Acked-by: Ingo Molnar btw., we could introduce a new spin-lock op: spin_lock_double(l1, l2); ... spin_unlock_double(l1, l2); because some other code, like kernel/sched.c, fs/dcache.c and kernel/futex.c uses quite similar locking. Ingo From dotanb at mellanox.co.il Mon Aug 14 04:36:48 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 14 Aug 2006 14:36:48 +0300 Subject: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table) Message-ID: <200608141436.48596.dotanb@mellanox.co.il> Hi. I noticed that the behavior of the openSM was changed in the latest driver: in the past, every HCA was configured (by the FW) with 0xffff in the first entry. today, the PKey table is being configured by the openSM: the first entry is being set to 0x7fff (except for the host that the SM is being executed from) This behavior is very problemtic because not all of the users would like to change the default PKey table (for example: MPI users). Users that will try to use OFED 1.1 (in the same way they used OFED 1.0) will get unexplained failures, because the connectivity because the nodes will be broken. (even the perfquery started to fail after executing the SM) I think that the default behavior of the openSM should be: not to change the PKey table, unless the user provided a PKey table policy file. Here are the props of the machines ************************************************************* Host Architecture : x86_64 Linux Distribution: Red Hat Enterprise Linux AS release 4 (Nahant Update 3) Kernel Version : 2.6.9-34.ELsmp GCC Version : gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2) Memory size : 4039892 kB Driver Version : gen2_linux-20060813-1905 (REV=8916) HCA ID(s) : mthca0 HCA model(s) : 23108 FW version(s) : 3.4.927 Board(s) : MT_0030000001 ************************************************************* thanks Dotan From sashak at voltaire.com Mon Aug 14 05:17:35 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 14 Aug 2006 15:17:35 +0300 Subject: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table) In-Reply-To: <200608141436.48596.dotanb@mellanox.co.il> References: <200608141436.48596.dotanb@mellanox.co.il> Message-ID: <20060814121735.GF24920@sashak.voltaire.com> Hi Dotan, On 14:36 Mon 14 Aug , Dotan Barak wrote: > Hi. > > I noticed that the behavior of the openSM was changed in the latest driver: > > in the past, every HCA was configured (by the FW) with 0xffff in the first entry. > today, the PKey table is being configured by the openSM: the first entry > is being set to 0x7fff (except for the host that the SM is being executed from) This is OpenSM default behavior in the case where partition policy file exists (/etc/osm-partitions.conf is default name), even if it is empty. When the partition policy file does not exist default 0xffff pkey value (full membership) should be inserted for all end-ports. I am not able to reproduce the reported behavior with my setup. 
If you are please describe your scenario. Thanks. Sasha > This behavior is very problemtic because not all of the users would like > to change the default PKey table (for example: MPI users). > Users that will try to use OFED 1.1 (in the same way they used OFED 1.0) will > get unexplained failures, because the connectivity because the nodes will be broken. > (even the perfquery started to fail after executing the SM) > > > I think that the default behavior of the openSM should be: not to change the > PKey table, unless the user provided a PKey table policy file. > > Here are the props of the machines > > ************************************************************* > Host Architecture : x86_64 > Linux Distribution: Red Hat Enterprise Linux AS release 4 (Nahant Update 3) > Kernel Version : 2.6.9-34.ELsmp > GCC Version : gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2) > Memory size : 4039892 kB > Driver Version : gen2_linux-20060813-1905 (REV=8916) > HCA ID(s) : mthca0 > HCA model(s) : 23108 > FW version(s) : 3.4.927 > Board(s) : MT_0030000001 > ************************************************************* > > > thanks > Dotan > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dotanb at mellanox.co.il Mon Aug 14 05:36:29 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 14 Aug 2006 15:36:29 +0300 Subject: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table) In-Reply-To: <20060814121735.GF24920@sashak.voltaire.com> References: <200608141436.48596.dotanb@mellanox.co.il> <20060814121735.GF24920@sashak.voltaire.com> Message-ID: <200608141536.29713.dotanb@mellanox.co.il> Thanks for the quick response. On Monday 14 August 2006 15:17, Sasha Khapyorsky wrote: > Hi Dotan, > > On 14:36 Mon 14 Aug , Dotan Barak wrote: > > Hi. > > > > I noticed that the behavior of the openSM was changed in the latest driver: > > > > in the past, every HCA was configured (by the FW) with 0xffff in the first entry. > > today, the PKey table is being configured by the openSM: the first entry > > is being set to 0x7fff (except for the host that the SM is being executed from) > > This is OpenSM default behavior in the case where partition policy file > exists (/etc/osm-partitions.conf is default name), even if it is empty. You are right, this file was exist in the host with the following content: Default=0x7fff : ALL, SELF=full ; YetAnotherOne = 0x300 : ALL, SELF=full ; partition1 = 0x1 : 0x0002c9020020b1c9=full; > > When the partition policy file does not exist default 0xffff pkey value > (full membership) should be inserted for all end-ports. > > I am not able to reproduce the reported behavior with my setup. If you > are please describe your scenario. Thanks. I have 2 machines connected b2b (without any switch in the middle) connected using one cable from port 1 to port 1. I executed the SM from one machine, and in the other machine "perfquery" i got the failure. Why doesn't the SM print that this file was found? 
this way, users can know that this file was found in their machine and the SM is using those rules (instead of the default rules, as you described) Thanks Dotan From sashak at voltaire.com Mon Aug 14 06:09:11 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 14 Aug 2006 16:09:11 +0300 Subject: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table) In-Reply-To: <200608141536.29713.dotanb@mellanox.co.il> References: <200608141436.48596.dotanb@mellanox.co.il> <20060814121735.GF24920@sashak.voltaire.com> <200608141536.29713.dotanb@mellanox.co.il> Message-ID: <20060814130911.GI24920@sashak.voltaire.com> On 15:36 Mon 14 Aug , Dotan Barak wrote: > Thanks for the quick response. > > On Monday 14 August 2006 15:17, Sasha Khapyorsky wrote: > > Hi Dotan, > > > > On 14:36 Mon 14 Aug , Dotan Barak wrote: > > > Hi. > > > > > > I noticed that the behavior of the openSM was changed in the latest driver: > > > > > > in the past, every HCA was configured (by the FW) with 0xffff in the first entry. > > > today, the PKey table is being configured by the openSM: the first entry > > > is being set to 0x7fff (except for the host that the SM is being executed from) > > > > This is OpenSM default behavior in the case where partition policy file > > exists (/etc/osm-partitions.conf is default name), even if it is empty. > > You are right, this file was exist in the host with the following content: > > Default=0x7fff : ALL, SELF=full ; > YetAnotherOne = 0x300 : ALL, SELF=full ; > partition1 = 0x1 : 0x0002c9020020b1c9=full; > > > > > > > When the partition policy file does not exist default 0xffff pkey value > > (full membership) should be inserted for all end-ports. > > > > I am not able to reproduce the reported behavior with my setup. If you > > are please describe your scenario. Thanks. > I have 2 machines connected b2b (without any switch in the middle) connected > using one cable from port 1 to port 1. > > I executed the SM from one machine, and in the other machine "perfquery" i got the failure. OpenSM configures pkey tables as requested in osm-partitions.conf file - this is the reason. Just remove (or rename) this file if you don't need it. > > > Why doesn't the SM print that this file was found? Yes, some prints may be helpful. Do you mean just log file or would prefer the message on stdout too? Sasha > this way, users can know that this file was found in their machine and the SM is using those rules > (instead of the default rules, as you described) > > Thanks > Dotan From halr at voltaire.com Mon Aug 14 06:05:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Aug 2006 09:05:28 -0400 Subject: [openib-general] [PATCHv2] osm: OSM crash TRIVIAL bug fix In-Reply-To: <842b8cdf0608140113idc689n5029b4fde98fb43e@mail.gmail.com> References: <842b8cdf0608140113idc689n5029b4fde98fb43e@mail.gmail.com> Message-ID: <1155560727.9532.39151.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2006-08-14 at 04:13, Yevgeny Kliteynik wrote: > Hi Hal. > > This patch fixes an OSM crash when working with Cisco's stack. > Cisco's doesn't follow the same TID convention when generating > transaction id which in some bad flow revealed this bug in the > get_madw lookup. > The bug is in get_madw which does not cleanup old pointers to > retrieved madw and also does not detect lookup of its reserved "free" > entry of key==0. > > (This better text replaces my previous patch: > "OSM crash when working with Cisco's TopSpin stack") > > Yevgeny Thanks. Good find. 
> Signed-off-by: Yevgeny Kliteynik < kliteyn at mellanox.co.il> > > > Index: osm/libvendor/osm_vendor_ibumad.c > =================================================================== > --- osm/libvendor/osm_vendor_ibumad.c (revision 8614) > +++ osm/libvendor/osm_vendor_ibumad.c (working copy) > @@ -141,12 +141,20 @@ get_madw(osm_vendor_t *p_vend, ib_net64_ > ib_net64_t mtid = (*tid & cl_ntoh64(0x00000000ffffffffllu)); > osm_madw_t *res; > > + /* > + * Since mtid == 0 is the empty key we should not > + * waste time looking for it > + */ > + if (mtid == 0) > + return 0; > + > cl_spinlock_acquire( &p_vend->match_tbl_lock ); > for (m = p_vend->mtbl.tbl, e = m + p_vend->mtbl.max; m < e; > m++) { > if (m->tid == mtid) { > m->tid = 0; > *tid = mtid; > res = m->v; > + m->v = NULL; /* just make sure we do not point > to free'd madw */ This line wrapped so there is something wrong with your mailer. Also, is this line really needed (and if so why) ? I know you did say "it cleans up old pointers to retrieved madw" but this shouldn't be accessed, right ? Also, if this is added here, there are other places where the same thing should be done ? > cl_spinlock_release( &p_vend->match_tbl_lock > ); > return res; > } > Applied to trunk and 1.1 with the exception noted above. -- Hal From dotanb at mellanox.co.il Mon Aug 14 06:14:36 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 14 Aug 2006 16:14:36 +0300 Subject: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table) In-Reply-To: <20060814130911.GI24920@sashak.voltaire.com> References: <200608141436.48596.dotanb@mellanox.co.il> <200608141536.29713.dotanb@mellanox.co.il> <20060814130911.GI24920@sashak.voltaire.com> Message-ID: <200608141614.36481.dotanb@mellanox.co.il> On Monday 14 August 2006 16:09, Sasha Khapyorsky wrote: > > > > Why doesn't the SM print that this file was found? > > Yes, some prints may be helpful. Do you mean just log file or would prefer > the message on stdout too? I believe that most of the users don't look at the log file, so a message in the stdout can be usefull. thanks Dotan From halr at voltaire.com Mon Aug 14 06:14:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Aug 2006 09:14:55 -0400 Subject: [openib-general] [PATCH] osm: TRIVIAL wrong description in log message In-Reply-To: <842b8cdf0608140124x3be6ec4ao239a7765120196c2@mail.gmail.com> References: <842b8cdf0608140124x3be6ec4ao239a7765120196c2@mail.gmail.com> Message-ID: <1155561294.9532.39433.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2006-08-14 at 04:24, Yevgeny Kliteynik wrote: > Hi Hal, > > Inspecting the log messages of the error flow in osm_vendor_send I have > noticed that the terms "request" and "response" are reversed: > If we are sending with response_expected it means we are sending our request... > > Yevgeny > > Signed-off-by: Yevgeny Kliteynik < kliteyn at mellanox.co.il> Thanks. Applied to trunk and 1.1 (with minor cosmetic change). 
-- Hal From halr at voltaire.com Mon Aug 14 06:15:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Aug 2006 09:15:49 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <44DF25A2.30405@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> <1155303417.4507.15009.camel@hal.voltaire.com> <44DF25A2.30405@mellanox.co.il> Message-ID: <1155561347.9532.39466.camel@hal.voltaire.com> On Sun, 2006-08-13 at 09:14, Tziporet Koren wrote: > Hal Rosenstock wrote: > >> Target release date: 12-Sep > >> > >> Intermediate milestones: > >> 1. Create 1.1 branch of user level: 27-Jul - done > >> 2. RC1: 8-Aug - done > >> 3. Feature freeze (RC2): 17-Aug > >> > > > > What is the start build date for RC2 ? When do developers need to have > > their code in by to make RC2 ? > > > We will start on Tue 15-Aug. Is this OK with you? Yes; I just needed to know when this needed to be done by. Thanks. -- Hal > > > > > >> 4. Code freeze (rc-x): 6-Sep > >> > > > > Is this 1 or 2 RCs beyond RC2 in order to make this ? > > > > > I hope one but I guess it will be two more RCs. > > Tziporet From tom at opengridcomputing.com Mon Aug 14 07:26:26 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 14 Aug 2006 09:26:26 -0500 Subject: [openib-general] RDMA_READ SGE In-Reply-To: <20060810185450.GA20994@mellanox.co.il> References: <1155230099.15374.38.camel@trinity.ogc.int> <20060810185450.GA20994@mellanox.co.il> Message-ID: <1155565586.29240.3.camel@trinity.ogc.int> [...snip...] > Practically I don't see reporting the exact values as a priority - > I think applications really can figure this out easier by attempting > operating with relevant parameters and fallback to smaller values > on failure. Perhaps it's not a priority, and it is certainly technically possible, however, I would say that this is quite a burden to place on every application to attempt to discover limits by submitting WR and checking when they fail. Who do they talk to while they're doing all this? It seems to me that it would be better to expand the set of attributes. Leave the current max_sge for backward compatibility, but let apps reliably query limits instead of attempting to discover them through trial and error. > But assuming that applications really need this information - > it seems we really should generalize this - maybe make the device provide > a function mapping QP attributes and operation kinds to the max set of values > allowed? > From halr at voltaire.com Mon Aug 14 08:04:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Aug 2006 11:04:29 -0400 Subject: [openib-general] [PATCH] osm: Dynamic verbosity control per file In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2ABC@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2ABC@mtlexch01.mtl.com> Message-ID: <1155567868.9532.42617.camel@hal.voltaire.com> Hi Yevgeny, On Sun, 2006-08-13 at 15:16, Yevgeny Kliteynik wrote: > Hi Hal. > > > Is it hard to find which file and line an opensm > > log message comes from > > ? Is this functionality really needed ? > > I'm guessing that your question refers to the second > bullet - logging source code filename and line number. > > IMHO, this is a nice-to-have functionality - when > debugging the SM that runs with a high verbosity, > the log has a lot of information, and it's much > easier to follow the SM flow when each message > tells where exactly it came from. OK. 
I'll work on adding it but may not get to it for another day or so (so it won't be in OFED 1.1 rc2) and may have other comments on the specifics of the patch. -- Hal > Regards, > > Yevgeny Kliteynik > > Mellanox Technologies LTD > Tel: +972-4-909-7200 ext: 394 > Fax: +972-4-959-3245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Tuesday, August 08, 2006 11:50 PM > To: Yevgeny Kliteynik > Cc: OPENIB > Subject: Re: [openib-general] [PATCH] osm: Dynamic verbosity control per > file > > Hi Yevgeny, > > On Wed, 2006-08-02 at 11:16, Yevgeny Kliteynik wrote: > > Hi Hal > > Just got back from vacation and am in the process of catching up. > > > This patch adds new verbosity functionality. > > > 1. Verbosity configuration file > > ------------------------------- > > > > The user is able to set verbosity level per source code file > > by supplying verbosity configuration file using the following > > command line arguments: > > > > -b filename > > --verbosity_file filename > > > > By default, the OSM will use the following file: /etc/opensmlog.conf > > Verbosity configuration file should contain zero or more lines of > > the following pattern: > > > > filename verbosity_level > > > > where 'filename' is the name of the source code file that the > > 'verbosity_level' refers to, and the 'verbosity_level' itself > > should be specified as an integer number (decimal or hexadecimal). > > > > One reserved filename is 'all' - it represents general verbosity > > level, that is used for all the files that are not specified in > > the verbosity configuration file. > > If 'all' is not specified, the verbosity level set in the > > command line will be used instead. > > Note: The 'all' file verbosity level will override any other > > general level that was specified by the command line arguments. > > > > Sending a SIGHUP signal to the OSM will cause it to reload > > the verbosity configuration file. > > > > > > 2. Logging source code filename and line number > > ----------------------------------------------- > > > > If command line option -S or --log_source_info is specified, > > OSM will add source code filename and line number to every > > log message that is written to the log file. > > By default, the OSM will not log this additional info. > > > > > > Yevgeny > > Is it hard to find which file and line an opensm log message comes from > ? Is this functionality really needed ? > > -- Hal > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From jlentini at netapp.com Mon Aug 14 08:32:21 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 14 Aug 2006 11:32:21 -0400 (EDT) Subject: [openib-general] [PATCH] include change Message-ID: Include the header file we really want. 
Signed-off-by: James Lentini Index: hw/mthca/mthca_cmd.c =================================================================== --- hw/mthca/mthca_cmd.c (revision 8850) +++ hw/mthca/mthca_cmd.c (working copy) @@ -34,7 +34,7 @@ * $Id$ */ -#include +#include #include #include #include From rdreier at cisco.com Mon Aug 14 09:16:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 09:16:48 -0700 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155333276.20325.422.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 11 Aug 2006 14:54:36 -0700") References: <1155333276.20325.422.camel@brick.pathscale.com> Message-ID: > int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, > - struct ibv_resize_cq *cmd, size_t cmd_size); > + struct ibv_resize_cq *cmd, size_t cmd_size, > + struct ibv_resize_cq_resp *resp, size_t resp_size); We can't make this change without a little more work -- as it stands this makes it impossible to have a low-level driver that works with both libibverbs 1.0 and 1.1, since there doesn't seem to be any autoconf way to check the number of parameters a function takes. I see two ways forward: either at a new ibv_cmd_resize_cq_resp() function (as you did originally), or add something like #define IBV_CMD_RESIZE_CQ_HAS_RESP_PARAMS and test that in libmthca and libehca. Personally I lean towards the second solution, although neither is very elegant. - R. From mlakshmanan at silverstorm.com Mon Aug 14 09:19:23 2006 From: mlakshmanan at silverstorm.com (Lakshmanan, Madhu) Date: Mon, 14 Aug 2006 12:19:23 -0400 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" Message-ID: In .../include/infiniband/mad.h,  it is: typedef uint8_t ib_gid_t[16]; In .../include/infiniband/iba/ib_types.h, it is: #include typedef union _ib_gid {      uint8_t raw[16];         struct _ib_gid_unicast         {          ib_gid_prefix_t prefix;             ib_net64_t      interface_id;         } PACK_SUFFIX unicast;         struct _ib_gid_multicast         {          uint8_t         header[2];             uint8_t         raw_group_id[14];         } PACK_SUFFIX multicast; } PACK_SUFFIX ib_gid_t; #include I need to include both files for a user space tool and I'm getting a compile error due to the conflict. Is it not the norm for a user space application to include both files? Appreciate any thoughts on this. Madhu Lakshmanan Silverstorm Technologies, Inc. mlakshmanan at silverstorm.com From rdreier at cisco.com Mon Aug 14 09:22:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 09:22:30 -0700 Subject: [openib-general] [PATCH 4/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155333489.20325.429.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 11 Aug 2006 14:58:09 -0700") References: <1155333489.20325.429.camel@brick.pathscale.com> Message-ID: > struct ib_uverbs_resize_cq_resp { > __u32 cqe; > + __u32 reserved; > + __u64 driver_data[0]; > }; I don't see any changes related to this in uverbs_cmd.c, and you don't bump the ABI version. So as far as I can tell, ib_uverbs_resize_cq() will silently corrupt the stack on an old libibverbs that passes in a pointer to a 4-byte response structure. In general I've resisted putting backwards compatibility stuff into the kernel side of uverbs, so maybe an ABI bump is the answer in this case too. But then I have to do another libibverbs 1.0 release etc., which is kind of a pain. 
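The alternative to bumping the ABI is to bound the copy by out_len, since an old libibverbs announces its 4-byte response buffer there. Inside ib_uverbs_resize_cq() that check might look roughly like the fragment below; this is a sketch, not the actual uverbs_cmd.c code.

    /* Old libibverbs passes out_len == 4 (room for the cqe field only);
     * never copy back more than the caller's buffer can hold. */
    resp.cqe = cq->cqe;

    if (copy_to_user((void __user *) (unsigned long) cmd.response,
                     &resp, min_t(size_t, sizeof resp, out_len)))
            ret = -EFAULT;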
So in this case it's probably OK to add a check in ib_uverbs_resize_cq() for when out_len == 4, and not overflow the response buffer in that case. - R. From rdreier at cisco.com Mon Aug 14 09:25:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 09:25:36 -0700 Subject: [openib-general] [PATCH 7/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155333676.20325.435.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 11 Aug 2006 15:01:16 -0700") References: <1155333676.20325.435.camel@brick.pathscale.com> Message-ID: > +/* > + * ipath_vma_nopage - handle a VMA page fault. > + */ > +static struct page *ipath_vma_nopage(struct vm_area_struct *vma, > + unsigned long address, int *type) > +{ > + struct ipath_mmap_info *ip = vma->vm_private_data; > + unsigned long offset = address - vma->vm_start; > + struct page *page = NOPAGE_SIGBUS; > + void *pageptr; > + > + if (offset >= ip->size) > + goto out; /* out of range */ > + > + /* > + * Convert the vmalloc address into a struct page. > + */ > + pageptr = (void *)(offset + (vma->vm_pgoff << PAGE_SHIFT)); > + page = vmalloc_to_page(pageptr); > + if (!page) > + goto out; > + > + /* Increment the reference count. */ > + get_page(page); > + if (type) > + *type = VM_FAULT_MINOR; > +out: > + return page; > +} > + > +static struct vm_operations_struct ipath_vm_ops = { > + .open = ipath_vma_open, > + .close = ipath_vma_close, > + .nopage = ipath_vma_nopage, > +}; It seems that all this would be much simpler with vmalloc_user() and remap_vmalloc_range(). And it would fix the information leak you introduce here: I don't see anywhere that you clear the memory you vmalloc and remap to userspace, so you're potentially exposing the cached contents of /etc/passwd or something like that. - R. From ralphc at pathscale.com Mon Aug 14 10:14:37 2006 From: ralphc at pathscale.com (ralphc at pathscale.com) Date: Mon, 14 Aug 2006 10:14:37 -0700 (PDT) Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: References: <1155333276.20325.422.camel@brick.pathscale.com> Message-ID: <55611.71.131.54.100.1155575677.squirrel@rocky.pathscale.com> > > int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, > > - struct ibv_resize_cq *cmd, size_t cmd_size); > > + struct ibv_resize_cq *cmd, size_t cmd_size, > > + struct ibv_resize_cq_resp *resp, size_t resp_size); > > We can't make this change without a little more work -- as it stands > this makes it impossible to have a low-level driver that works with > both libibverbs 1.0 and 1.1, since there doesn't seem to be any > autoconf way to check the number of parameters a function takes. What we really need is a version number for the device library plug-in to libibverbs.so interface verses the existing kernel device driver to plug-in or application to libibverbs.so version. > I see two ways forward: either at a new ibv_cmd_resize_cq_resp() > function (as you did originally), or add something like > > #define IBV_CMD_RESIZE_CQ_HAS_RESP_PARAMS > > and test that in libmthca and libehca. > > Personally I lean towards the second solution, although neither is > very elegant. > > - R. A #define won't help the plug-in know what parameters to pass, only a function name change will work if the semantics change. I can add another version argument to ibv_driver_init() if you agree. It seems to me that we have already made incompatible changes by moving to libibverbs.so.2. Couldn't we include this as part of the transition? 
Otherwise, I would vote for ibv_cmd_resize_cq_resp(). From ralphc at pathscale.com Mon Aug 14 10:30:21 2006 From: ralphc at pathscale.com (ralphc at pathscale.com) Date: Mon, 14 Aug 2006 10:30:21 -0700 (PDT) Subject: [openib-general] [PATCH 4/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: References: <1155333489.20325.429.camel@brick.pathscale.com> Message-ID: <46532.71.131.54.100.1155576621.squirrel@rocky.pathscale.com> > > struct ib_uverbs_resize_cq_resp { > > __u32 cqe; > > + __u32 reserved; > > + __u64 driver_data[0]; > > }; > > I don't see any changes related to this in uverbs_cmd.c, and you don't > bump the ABI version. So as far as I can tell, ib_uverbs_resize_cq() > will silently corrupt the stack on an old libibverbs that passes in a > pointer to a 4-byte response structure. > > In general I've resisted putting backwards compatibility stuff into > the kernel side of uverbs, so maybe an ABI bump is the answer in this > case too. > > But then I have to do another libibverbs 1.0 release etc., which is > kind of a pain. So in this case it's probably OK to add a check in > ib_uverbs_resize_cq() for when out_len == 4, and not overflow the > response buffer in that case. > > - R. This doesn't break compatibility. uverbs_cmd.c ib_uverbs_resize_cq() allocates a struct ib_uverbs_resize_cq_resp on the stack but only reads the first element in. The structure change isn't really needed at all since the INIT_UDATA() macro gets the start of driver_data from the struct ib_uverbs_resize_cq. The change to ib_uverbs_resize_cq_resp just matches the structure change used by libipathverbs to initialize ib_uverbs_resize_cq.response. If you want, I can remove this from the patch. From halr at voltaire.com Mon Aug 14 10:37:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 14 Aug 2006 20:37:50 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" References: Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD10@taurus.voltaire.com> Hi Madhu, This is a similar but more mainstream example of the conflicts. A previous one was reported last week in terms of CM. Still not sure of the best resolution for this. Do you really need both includes ? -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Lakshmanan, Madhu Sent: Mon 8/14/2006 12:19 PM To: openib-general at openib.org Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In .../include/infiniband/mad.h, it is: typedef uint8_t ib_gid_t[16]; In .../include/infiniband/iba/ib_types.h, it is: #include typedef union _ib_gid { uint8_t raw[16]; struct _ib_gid_unicast { ib_gid_prefix_t prefix; ib_net64_t interface_id; } PACK_SUFFIX unicast; struct _ib_gid_multicast { uint8_t header[2]; uint8_t raw_group_id[14]; } PACK_SUFFIX multicast; } PACK_SUFFIX ib_gid_t; #include I need to include both files for a user space tool and I'm getting a compile error due to the conflict. Is it not the norm for a user space application to include both files? Appreciate any thoughts on this. Madhu Lakshmanan Silverstorm Technologies, Inc. 
mlakshmanan at silverstorm.com _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ralphc at pathscale.com Mon Aug 14 10:41:58 2006 From: ralphc at pathscale.com (ralphc at pathscale.com) Date: Mon, 14 Aug 2006 10:41:58 -0700 (PDT) Subject: [openib-general] [PATCH 7/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: References: <1155333676.20325.435.camel@brick.pathscale.com> Message-ID: <48305.71.131.54.100.1155577318.squirrel@rocky.pathscale.com> > > +/* > > + * ipath_vma_nopage - handle a VMA page fault. > > + */ > > +static struct page *ipath_vma_nopage(struct vm_area_struct *vma, > > + unsigned long address, int *type) > > +{ > > + struct ipath_mmap_info *ip = vma->vm_private_data; > > + unsigned long offset = address - vma->vm_start; > > + struct page *page = NOPAGE_SIGBUS; > > + void *pageptr; > > + > > + if (offset >= ip->size) > > + goto out; /* out of range */ > > + > > + /* > > + * Convert the vmalloc address into a struct page. > > + */ > > + pageptr = (void *)(offset + (vma->vm_pgoff << PAGE_SHIFT)); > > + page = vmalloc_to_page(pageptr); > > + if (!page) > > + goto out; > > + > > + /* Increment the reference count. */ > > + get_page(page); > > + if (type) > > + *type = VM_FAULT_MINOR; > > +out: > > + return page; > > +} > > + > > +static struct vm_operations_struct ipath_vm_ops = { > > + .open = ipath_vma_open, > > + .close = ipath_vma_close, > > + .nopage = ipath_vma_nopage, > > +}; > > It seems that all this would be much simpler with vmalloc_user() and > remap_vmalloc_range(). And it would fix the information leak you > introduce here: I don't see anywhere that you clear the memory you > vmalloc and remap to userspace, so you're potentially exposing the > cached contents of /etc/passwd or something like that. > > - R. I was unaware of these functions. Looks like they were just recently added (7/23/2006). I will update the patch to use these. From vlad at mellanox.co.il Mon Aug 14 10:44:10 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 14 Aug 2006 20:44:10 +0300 Subject: [openib-general] [RFC] IPoIB high availability daemon Message-ID: <44E0B66A.4090605@mellanox.co.il> Hi, The first version of the IPoIB high availability daemon can be found at: https://openib.org/svn/trunk/contrib/mellanox/ipoibtools The daemon is a perl script ipoib_ha.pl that should get the primary and the backup IPoIB interfaces as a parameters (default values are ib0 as a primary and ib1 as a backup). The basic steps performed by IPoIB High Availability (HA) daemon: - Get names of the IPoIB primary and backup interfaces. - Get configuration of the primary interface from its standard place (ifcfg-ib from /etc/sysconfig/{network,network-scripts}). - Run 'ip monitor link all' and parse its output to monitor IPoIB primary interface. - When "NO-CARRIER" occur, check if it is a primary IPoIB interface and if "yes" then migrate its IPoIB configuration to the backup IPoIB interface. - Run 'arpingib' utility if configured to update neighbors with a new MAC address - Get the list of multicast groups from /proc/net/dev_mcast that the primary IPoIB interface was registered to. Then register the backup IPoIB interface to these multicast groups (using ipmaddr utility). Currently there is an issue with join to IPoIB multicast group using both ip and ipmaddr utilities. 
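For what it's worth, the first step of the daemon (watching the primary interface for loss of carrier) does not have to go through parsing 'ip monitor link' output: a small standalone program can subscribe to link events over rtnetlink, which is also what is suggested later in this thread. The sketch below is illustrative only and is not part of the posted script; error handling and matching against the configured primary/backup interface names are omitted.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>
    #include <net/if.h>

    int main(void)
    {
            struct sockaddr_nl addr = { .nl_family = AF_NETLINK,
                                        .nl_groups = RTMGRP_LINK };
            char buf[8192];
            int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

            bind(fd, (struct sockaddr *) &addr, sizeof addr);

            for (;;) {
                    ssize_t len = recv(fd, buf, sizeof buf, 0);
                    struct nlmsghdr *nh;

                    for (nh = (struct nlmsghdr *) buf; NLMSG_OK(nh, len);
                         nh = NLMSG_NEXT(nh, len)) {
                            struct ifinfomsg *ifi;
                            char name[IF_NAMESIZE];

                            if (nh->nlmsg_type != RTM_NEWLINK)
                                    continue;
                            ifi = NLMSG_DATA(nh);
                            if (!if_indextoname(ifi->ifi_index, name))
                                    continue;
                            printf("%s: %s\n", name,
                                   (ifi->ifi_flags & IFF_RUNNING) ?
                                   "carrier" : "no carrier");
                            /* a failover daemon would migrate the IPoIB
                             * configuration to the backup interface here */
                    }
            }
    }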
This daemon is going to be added to the OFED-1.1 release. Please comment. Thanks, Regards, Vladimir From mshefty at ichips.intel.com Mon Aug 14 10:18:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 10:18:31 -0700 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: References: <20060809070435.GN20848@mellanox.co.il> Message-ID: <44E0B067.6040605@ichips.intel.com> Roland Dreier wrote: > @@ -1502,7 +1506,7 @@ int mthca_tavor_post_send(struct ib_qp * > int i; > int size; > int size0 = 0; > - u32 f0 = 0; > + u32 f0; This causes compile warnings for me that 'f0' might be used uninitialized. > @@ -1843,7 +1849,7 @@ int mthca_arbel_post_send(struct ib_qp * > int i; > int size; > int size0 = 0; > - u32 f0 = 0; > + u32 f0; Same here. - Sean From halr at voltaire.com Mon Aug 14 11:46:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Aug 2006 14:46:10 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_resp.c: In osm_resp_make_resp_smp, set direction bit only if direct routed class Message-ID: <1155581168.9532.49010.camel@hal.voltaire.com> OpenSM/osm_resp.c: In osm_resp_make_resp_smp, set direction bit only if direct routed class This change fixes two minor issues with osm_resp_make_resp_smp: 1. Get/Set responses always had direction bit set. 2. Trap represses never had direction bit set. The direction bit needs setting in direct routed responses and it doesn't exist in LID routed responses. Signed-off-by: Hal Rosenstock Index: opensm/osm_resp.c =================================================================== --- opensm/osm_resp.c (revision 8931) +++ opensm/osm_resp.c (working copy) @@ -127,7 +127,7 @@ osm_resp_make_resp_smp( if (p_src_smp->method == IB_MAD_METHOD_GET || p_src_smp->method == IB_MAD_METHOD_SET ) { p_dest_smp->method = IB_MAD_METHOD_GET_RESP; - p_dest_smp->status = (ib_net16_t)(status | IB_SMP_DIRECTION); + p_dest_smp->status = status; } else if (p_src_smp->method == IB_MAD_METHOD_TRAP) { @@ -143,6 +143,9 @@ osm_resp_make_resp_smp( goto Exit; } + if (p_src_smp->mgmt_class == IB_MCLASS_SUBN_DIR) + p_dest_smp->status |= IB_SMP_DIRECTION; + p_dest_smp->dr_dlid = p_dest_smp->dr_slid; p_dest_smp->dr_slid = p_dest_smp->dr_dlid; memcpy( &p_dest_smp->data, p_payload, IB_SMP_DATA_SIZE ); From mlakshmanan at silverstorm.com Mon Aug 14 11:51:23 2006 From: mlakshmanan at silverstorm.com (mlakshmanan at silverstorm.com) Date: Mon, 14 Aug 2006 14:51:23 -0400 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AD10@taurus.voltaire.com> References: , <5CE025EE7D88BA4599A2C8FEFCF226F589AD10@taurus.voltaire.com> Message-ID: <44E08DEB.22871.1048D89@mlakshmanan.silverstorm.com> > > This is a similar but more mainstream example of the conflicts. A previous one was reported last week in terms of CM. Still not sure of the best resolution for this. > > Do you really need both includes ? > The userspace tool shows a textual representation of a HCA port's capability mask. So it requires the port capability bit definitions in ib_types.h. And I require mad.h for the MAD API. 
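Until the two packages stop sharing the ib_gid_t name, one stopgap for a tool that has to see both headers in the same translation unit is to rename the typedef coming from mad.h at preprocessing time. This is only a sketch of a workaround, not a sanctioned interface: umad_mad_gid_t is an arbitrary local name, and the include paths assume the usual installed layout.

    /* Workaround sketch: give mad.h's flat gid typedef a private name so it
     * cannot collide with the union ib_gid_t from iba/ib_types.h.  Calls into
     * libibmad that take a gid then use the renamed type in this file. */
    #define ib_gid_t umad_mad_gid_t
    #include <infiniband/mad.h>
    #undef ib_gid_t

    #include <iba/ib_types.h>   /* assumes the OpenSM include dirs are on -I */

The cleaner long-term fix is a prefixed type in the library itself, along the lines discussed further down this thread.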
> -- Hal Thanks, Madhu > > ________________________________ > > From: openib-general-bounces at openib.org on behalf of Lakshmanan, Madhu > Sent: Mon 8/14/2006 12:19 PM > To: openib-general at openib.org > Subject: [openib-general] Conflicting typedefs for "ib_gid_t" > > > > > > In .../include/infiniband/mad.h, it is: > > typedef uint8_t ib_gid_t[16]; > > In .../include/infiniband/iba/ib_types.h, it is: > > #include > typedef union _ib_gid > { > uint8_t raw[16]; > struct _ib_gid_unicast > { > ib_gid_prefix_t prefix; > ib_net64_t interface_id; > > } PACK_SUFFIX unicast; > > struct _ib_gid_multicast > { > uint8_t header[2]; > uint8_t raw_group_id[14]; > > } PACK_SUFFIX multicast; > > } PACK_SUFFIX ib_gid_t; > #include > > I need to include both files for a user space tool and I'm getting a compile error due to the conflict. Is it not the norm for a user space application to include both files? > Appreciate any thoughts on this. > > Madhu Lakshmanan > Silverstorm Technologies, Inc. > mlakshmanan at silverstorm.com > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From halr at voltaire.com Mon Aug 14 11:57:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 14 Aug 2006 21:57:40 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" References: , <5CE025EE7D88BA4599A2C8FEFCF226F589AD10@taurus.voltaire.com> <44E08DEB.22871.1048D89@mlakshmanan.silverstorm.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD14@taurus.voltaire.com> There already is another tool (OpenIB diag) which does that: smpquery portinfo -- Hal ________________________________ From: mlakshmanan at silverstorm.com [mailto:mlakshmanan at silverstorm.com] Sent: Mon 8/14/2006 2:51 PM To: Lakshmanan, Madhu; openib-general at openib.org; Hal Rosenstock Cc: eitan at mellanox.co.il; sean.hefty at intel.com Subject: RE: [openib-general] Conflicting typedefs for "ib_gid_t" > > This is a similar but more mainstream example of the conflicts. A previous one was reported last week in terms of CM. Still not sure of the best resolution for this. > > Do you really need both includes ? > The userspace tool shows a textual representation of a HCA port's capability mask. So it requires the port capability bit definitions in ib_types.h. And I require mad.h for the MAD API. > -- Hal Thanks, Madhu > > ________________________________ > > From: openib-general-bounces at openib.org on behalf of Lakshmanan, Madhu > Sent: Mon 8/14/2006 12:19 PM > To: openib-general at openib.org > Subject: [openib-general] Conflicting typedefs for "ib_gid_t" > > > > > > In .../include/infiniband/mad.h, it is: > > typedef uint8_t ib_gid_t[16]; > > In .../include/infiniband/iba/ib_types.h, it is: > > #include > typedef union _ib_gid > { > uint8_t raw[16]; > struct _ib_gid_unicast > { > ib_gid_prefix_t prefix; > ib_net64_t interface_id; > > } PACK_SUFFIX unicast; > > struct _ib_gid_multicast > { > uint8_t header[2]; > uint8_t raw_group_id[14]; > > } PACK_SUFFIX multicast; > > } PACK_SUFFIX ib_gid_t; > #include > > I need to include both files for a user space tool and I'm getting a compile error due to the conflict. Is it not the norm for a user space application to include both files? > Appreciate any thoughts on this. > > Madhu Lakshmanan > Silverstorm Technologies, Inc. 
> mlakshmanan at silverstorm.com > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From swise at opengridcomputing.com Mon Aug 14 12:21:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 Aug 2006 14:21:25 -0500 Subject: [openib-general] IB mcast question Message-ID: <1155583285.4676.37.camel@stevo-desktop> IB experts, I'm playing around with UD QPs over IB, and using the librdma services to join and send to multicast groups. I can run Sean's mckey example program and it works. I've written a simple program to echo stdin to a mcast address and another program to read mcast packets and echo to stdout. I run this on two hosts connected p2p via IB. And it works as expected. However, if I run 2 instances of the app that reads mcasts and dumps them to stdout, I only get the mcast packets delivered to one of the applications. Namely the first one who joins the group seems to get the mcasts. I know for UDP/IP multicast, all applications bound to the same port and joined to the IP mcast addr will get a copy of incoming mcast packets. Is this not true for IB mcast? It appears not based on my tests... Thanks, Stevo. From sean.hefty at intel.com Mon Aug 14 12:31:37 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 12:31:37 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155583285.4676.37.camel@stevo-desktop> Message-ID: <000001c6bfd8$42f017f0$8698070a@amr.corp.intel.com> >However, if I run 2 instances of the app that reads mcasts and dumps >them to stdout, I only get the mcast packets delivered to one of the >applications. Namely the first one who joins the group seems to get the >mcasts. I know for UDP/IP multicast, all applications bound to the same >port and joined to the IP mcast addr will get a copy of incoming mcast >packets. Is this not true for IB mcast? It appears not based on my >tests... My testing revealed the same issue, and I was unable to locate the root cause of the problem. I was not able to confirm that this configuration had ever been successfully tested. - Sean From rdreier at cisco.com Mon Aug 14 12:37:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 12:37:25 -0700 Subject: [openib-general] [PATCH 4/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <46532.71.131.54.100.1155576621.squirrel@rocky.pathscale.com> (ralphc@pathscale.com's message of "Mon, 14 Aug 2006 10:30:21 -0700 (PDT)") References: <1155333489.20325.429.camel@brick.pathscale.com> <46532.71.131.54.100.1155576621.squirrel@rocky.pathscale.com> Message-ID: ralphc> This doesn't break compatibility. uverbs_cmd.c ralphc> ib_uverbs_resize_cq() allocates a struct ralphc> ib_uverbs_resize_cq_resp on the stack but only reads the ralphc> first element in. The structure change isn't really ralphc> needed at all since the INIT_UDATA() macro gets the start ralphc> of driver_data from the struct ib_uverbs_resize_cq. The ralphc> change to ib_uverbs_resize_cq_resp just matches the ralphc> structure change used by libipathverbs to initialize ralphc> ib_uverbs_resize_cq.response. Am I missing something? Think about the case of old libibverbs, new kernel. libibverbs allocates a 4-byte response structure and passes the pointer to that to the kernel. The kernel allocates an 8-byte response structure and copies it back to userspace.
And the 4 bytes after the userspace response structure get zeroed. - R. From rdreier at cisco.com Mon Aug 14 12:37:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 12:37:53 -0700 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <44E0B067.6040605@ichips.intel.com> (Sean Hefty's message of "Mon, 14 Aug 2006 10:18:31 -0700") References: <20060809070435.GN20848@mellanox.co.il> <44E0B067.6040605@ichips.intel.com> Message-ID: Sean> This causes compile warnings for me that 'f0' might be used Sean> uninitialized. Yes, but they're bogus. - R. From rdreier at cisco.com Mon Aug 14 12:39:27 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 12:39:27 -0700 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: <44E0B66A.4090605@mellanox.co.il> (Vladimir Sokolovsky's message of "Mon, 14 Aug 2006 20:44:10 +0300") References: <44E0B66A.4090605@mellanox.co.il> Message-ID: Vladimir> Currently there is an issue with join to IPoIB multicast Vladimir> group using both ip and ipmaddr utilities. What's the issue? - R. From rdreier at cisco.com Mon Aug 14 12:41:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 12:41:34 -0700 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: <44E0B66A.4090605@mellanox.co.il> (Vladimir Sokolovsky's message of "Mon, 14 Aug 2006 20:44:10 +0300") References: <44E0B66A.4090605@mellanox.co.il> Message-ID: Vladimir> The daemon is a perl script ipoib_ha.pl that should get Vladimir> the primary and the backup IPoIB interfaces as a Vladimir> parameters (default values are ib0 as a primary and ib1 Vladimir> as a backup). Seems like a perl script that relies on the ip command is a little heavyweight. Why not a standalone program that uses rtnetlink? Vladimir> - Get configuration of the primary interface from its Vladimir> standard place (ifcfg-ib from Vladimir> /etc/sysconfig/{network,network-scripts}). This is not standard on all distributions. It would be better to have a more flexible method that worked on Debian/Ubuntu, etc. - R. Date: Mon, 14 Aug 2006 12:41:25 -0700 In-Reply-To: <44E0B66A.4090605 at mellanox.co.il> (Vladimir Sokolovsky's message of "Mon, 14 Aug 2006 20:44:10 +0300") Message-ID: User-Agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.4.18 (linux) From rdreier at cisco.com Mon Aug 14 12:42:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 12:42:36 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155583285.4676.37.camel@stevo-desktop> (Steve Wise's message of "Mon, 14 Aug 2006 14:21:25 -0500") References: <1155583285.4676.37.camel@stevo-desktop> Message-ID: Steve> However, if I run 2 instances of the app that reads mcasts Steve> and dumps them to stdout, I only get the mcast packets Steve> delivered to one of the applications. Namely the first one Steve> who joins the group seems to get the mcasts. I know for Steve> UDP/IP multicast, all applications bound to the same port Steve> and joined to the IP mcast addr will get a copy of incoming Steve> mcast packets. Is this not true for IB mcast? It appears Steve> not based on my tests... This should work -- multicast packets should be replicated to all attached UD QPs. There is likely a bug in the librdma multicast support. - R. 
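For reference, here is a minimal sketch of the configuration being debugged in this thread: two already-created UD QPs on the same HCA port attached to one multicast group through plain libibverbs. The helper name attach_both, and the assumption that the MGID/MLID were already obtained from a successful SA join (for example via the RDMA CM), are illustrative only and are not taken from the test programs discussed above.

/* Minimal sketch, not from the thread: attach two UD QPs on the same
 * HCA port to one multicast group.  Assumes qp1/qp2 are valid
 * IBV_QPT_UD queue pairs with receives posted, and that mgid/mlid come
 * from a successful SA join (e.g. one done by the RDMA CM). */
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

static int attach_both(struct ibv_qp *qp1, struct ibv_qp *qp2,
                       const union ibv_gid *mgid, uint16_t mlid)
{
	/* Each QP is attached individually; the HCA is then expected to
	 * replicate every packet sent to this MGID/MLID to both QPs. */
	if (ibv_attach_mcast(qp1, mgid, mlid)) {
		fprintf(stderr, "ibv_attach_mcast failed for qp1\n");
		return -1;
	}
	if (ibv_attach_mcast(qp2, mgid, mlid)) {
		fprintf(stderr, "ibv_attach_mcast failed for qp2\n");
		ibv_detach_mcast(qp1, mgid, mlid);
		return -1;
	}
	return 0;
}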
From rdreier at cisco.com Mon Aug 14 12:43:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 12:43:53 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <000001c6bfd8$42f017f0$8698070a@amr.corp.intel.com> (Sean Hefty's message of "Mon, 14 Aug 2006 12:31:37 -0700") References: <000001c6bfd8$42f017f0$8698070a@amr.corp.intel.com> Message-ID: Sean> My testing revealed the same issue, and I was unable to Sean> locate the root cause of the problem. I was not able to Sean> confirm that this configuration had ever been successfully Sean> tested. Are you positive ibv_attach_mcast() is called on all the QPs, and that the MGID is passed correctly in to all calls? - R. From mst at mellanox.co.il Mon Aug 14 12:44:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Aug 2006 22:44:51 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <44E08DEB.22871.1048D89@mlakshmanan.silverstorm.com> References: <44E08DEB.22871.1048D89@mlakshmanan.silverstorm.com> Message-ID: <20060814194451.GE16821@mellanox.co.il> Quoting r. mlakshmanan at silverstorm.com : > Subject: Re: Conflicting typedefs for "ib_gid_t" > > > > > This is a similar but more mainstream example of the conflicts. A previous > > one was reported last week in terms of CM. Still not sure of the best > > resolution for this. > > > > Do you really need both includes ? > > > > The userspace tool shows a textual representation of a HCA port's capability > mask. So it requires the port capability bit definitions in ib_types.h. And I > require mad.h for the MAD API. I don't think the way forward is using iba/ in all applications. I see it mostly as a legacy header for opensm and related apps that want their own layer for portability between stacks. Wrt issue at hand, using ib_ prefix anywhere is a mistake which will always lead to conflicts between libraries. Let us start prefixing types libibumad defines with umad_, just like ib verbs library prefixes types by ibv_. For example we have union ibv_gid, so can't mad.h have umad_gid_t? Hal, what do you say? -- MST From swise at opengridcomputing.com Mon Aug 14 12:49:46 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 Aug 2006 14:49:46 -0500 Subject: [openib-general] IB mcast question In-Reply-To: References: <000001c6bfd8$42f017f0$8698070a@amr.corp.intel.com> Message-ID: <1155584986.4676.39.camel@stevo-desktop> On Mon, 2006-08-14 at 12:43 -0700, Roland Dreier wrote: > Sean> My testing revealed the same issue, and I was unable to > Sean> locate the root cause of the problem. I was not able to > Sean> confirm that this configuration had ever been successfully > Sean> tested. > > Are you positive ibv_attach_mcast() is called on all the QPs, and that > the MGID is passed correctly in to all calls? > > - R. I'll let you know... From swise at opengridcomputing.com Mon Aug 14 12:50:08 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 Aug 2006 14:50:08 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <000001c6bfd8$42f017f0$8698070a@amr.corp.intel.com> References: <000001c6bfd8$42f017f0$8698070a@amr.corp.intel.com> Message-ID: <1155585008.4676.41.camel@stevo-desktop> On Mon, 2006-08-14 at 12:31 -0700, Sean Hefty wrote: > >However, if I run 2 instances of the app that reads mcasts and dumps > >them to stdout, I only get the mcast packets delivered to one of the > >applications. Namely the first one who joins the group seems to get the > >mcasts. 
I know for UDP/IP multicast, all applications bound to the same > >port and joined to the IP mcast addr will get a copy of incoming mcast > >packets. Is this not true for IB mcast? It appears not based on my > >tests... > > My testing revealed the same issue, and I was unable to locate the root cause of > the problem. I was not able to confirm that this configuration had ever been > successfully tested. > > - Sean Hmm. Ok. I'll debug this. I need to get this working... Steve. From mst at mellanox.co.il Mon Aug 14 12:50:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Aug 2006 22:50:55 +0300 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: References: Message-ID: <20060814195055.GF16821@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: make IB_SEND_FENCE work > > Sean> This causes compile warnings for me that 'f0' might be used > Sean> uninitialized. > > Yes, but they're bogus. Yes, I see lots of such bogus warnings in kernel code too. And it seems obvious that since the compiler isn't smart enough to figure out the initialization isn't needed, it will generate unnecessary code if we *do* add initialization just to shut it up. Maybe compile with -Wno-uninitialized? -- MST From mshefty at ichips.intel.com Mon Aug 14 13:01:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 13:01:48 -0700 Subject: [openib-general] IB mcast question In-Reply-To: References: <000001c6bfd8$42f017f0$8698070a@amr.corp.intel.com> Message-ID: <44E0D6AC.5050805@ichips.intel.com> Roland Dreier wrote: > Are you positive ibv_attach_mcast() is called on all the QPs, and that > the MGID is passed correctly in to all calls? Yes - ibv_attach_mcast() is being called with the same MLID, MGID by both receiving processes. That doesn't necessarily mean that there's not a bug in ib_multicast or the RDMA CM; I just couldn't locate any. - Sean From halr at voltaire.com Mon Aug 14 13:06:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 14 Aug 2006 23:06:05 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" References: <44E08DEB.22871.1048D89@mlakshmanan.silverstorm.com> <20060814194451.GE16821@mellanox.co.il> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD17@taurus.voltaire.com> I agree with your view on iba/ib_types.h. I'm not sure I understand what you mean in terms of libibumad. He's including libibmad rather than libibumad. So I suspect you mean changing this (ib_gid_t) to mad_gid_t ? -- Hal ________________________________ From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: Mon 8/14/2006 3:44 PM To: mlakshmanan at silverstorm.com Cc: openib-general at openib.org; Hal Rosenstock Subject: Re: Conflicting typedefs for "ib_gid_t" Quoting r. mlakshmanan at silverstorm.com : > Subject: Re: Conflicting typedefs for "ib_gid_t" > > > > > This is a similar but more mainstream example of the conflicts. A previous > > one was reported last week in terms of CM. Still not sure of the best > > resolution for this. > > > > Do you really need both includes ? > > > > The userspace tool shows a textual representation of a HCA port's capability > mask. So it requires the port capability bit definitions in ib_types.h. And I > require mad.h for the MAD API. I don't think the way forward is using iba/ in all applications. I see it mostly as a legacy header for opensm and related apps that want their own layer for portability between stacks. 
Wrt issue at hand, using ib_ prefix anywhere is a mistake which will always lead to conflicts between libraries. Let us start prefixing types libibumad defines with umad_, just like ib verbs library prefixes types by ibv_. For example we have union ibv_gid, so can't mad.h have umad_gid_t? Hal, what do you say? -- MST From mst at mellanox.co.il Mon Aug 14 13:10:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Aug 2006 23:10:22 +0300 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: References: Message-ID: <20060814201022.GH16821@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [RFC] IPoIB high availability daemon > > Vladimir> Currently there is an issue with join to IPoIB multicast > Vladimir> group using both ip and ipmaddr utilities. > > What's the issue? I think I heard that ip maddr shows truncated addresses, and an attempt to join also fails as the join is done to the wrong address. -- MST From swise at opengridcomputing.com Mon Aug 14 13:18:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 Aug 2006 15:18:34 -0500 Subject: [openib-general] IB mcast question In-Reply-To: References: <1155583285.4676.37.camel@stevo-desktop> Message-ID: <1155586714.4676.45.camel@stevo-desktop> On Mon, 2006-08-14 at 12:42 -0700, Roland Dreier wrote: > Steve> However, if I run 2 instances of the app that reads mcasts > Steve> and dumps them to stdout, I only get the mcast packets > Steve> delivered to one of the applications. Namely the first one > Steve> who joins the group seems to get the mcasts. I know for > Steve> UDP/IP multicast, all applications bound to the same port > Steve> and joined to the IP mcast addr will get a copy of incoming > Steve> mcast packets. Is this not true for IB mcast? It appears > Steve> not based on my tests... > > This should work -- multicast packets should be replicated to all > attached UD QPs. There is likely a bug in the librdma multicast support. > So is this replicating done in the mthca hca? Since one app is getting the mcast packet, can I assume the opensm code is doing the right thing switch/port wise? Should the SM get join requests for both applications that join the group on the same host? Or only the first one? Steve. From mst at mellanox.co.il Mon Aug 14 13:21:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Aug 2006 23:21:41 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AD17@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AD17@taurus.voltaire.com> Message-ID: <20060814202141.GI16821@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: RE: Conflicting typedefs for "ib_gid_t" > > I agree with your view on iba/ib_types.h > > I'm not sure I understand what you mean in terms of libibumad. He's including libibmad rather than libibumad. > So I suspect you mean changing this (ib_gid_t) to mad_gid_t ? Yes, or ib_mad_, or ibmad_. And same for other IB_ and ib_ names there. We really need to do something about names like ib_attr_t. I also would like the number of typedefs libibmad exposes reduced - it's not like you'll be able to convert these structs to anything else transparently without breaking users - anyone sticks their data in there anyway. But this is more a matter of taste. 
-- MST From mshefty at ichips.intel.com Mon Aug 14 13:30:29 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 13:30:29 -0700 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <20060814195055.GF16821@mellanox.co.il> References: <20060814195055.GF16821@mellanox.co.il> Message-ID: <44E0DD65.5020303@ichips.intel.com> Michael S. Tsirkin wrote: > Yes, I see lots of such bogus warnings in kernel code too. > And it seems obvious that since compiler isn't smart enough to > figure out the initialization isn't needed, it will generate > unecessary code if we *do* add initilization just to shut it up. Can we relocate the setting for op0 and f0 outside of the for-loop (maybe before we enter the loop)? I'm really not familiar with this code, but on first glance, it looks like these are set based on the send_flag of the first posted wr. This could eliminate the warning, and remove an if statement from executing on each iteration. - Sean From halr at voltaire.com Mon Aug 14 13:32:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 14 Aug 2006 23:32:27 +0300 Subject: [openib-general] IB mcast question References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD1A@taurus.voltaire.com> Steve, IB only replicates once per node (and also not on the incoming port if there are any members). The SM tracks join states (full, non member, send only member) for a port. It doesn't matter whether the SM gets duplicated join requests for a port. It would just indicate that was OK and the node would still only get one packet per send from another port in that group. It's up to the HCA to replicate the multicast packet to all its QPs which are part of that group. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Steve Wise Sent: Mon 8/14/2006 4:18 PM To: Roland Dreier Cc: openib-general Subject: Re: [openib-general] IB mcast question On Mon, 2006-08-14 at 12:42 -0700, Roland Dreier wrote: > Steve> However, if I run 2 instances of the app that reads mcasts > Steve> and dumps them to stdout, I only get the mcast packets > Steve> delivered to one of the applications. Namely the first one > Steve> who joins the group seems to get the mcasts. I know for > Steve> UDP/IP multicast, all applications bound to the same port > Steve> and joined to the IP mcast addr will get a copy of incoming > Steve> mcast packets. Is this not true for IB mcast? It appears > Steve> not based on my tests... > > This should work -- multicast packets should be replicated to all > attached UD QPs. There is likely a bug in the librdma multicast support. > So is this replicating done in the mthca hca? Since one app is getting the mcast packet, can I assume the opensm code is doing the right thing switch/port wise? Should the SM get join requests for both applications that join the group on the same host? Or only the first one? Steve. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Mon Aug 14 13:33:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Aug 2006 23:33:00 +0300 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: References: Message-ID: <20060814203300.GJ16821@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: [RFC] IPoIB high availability daemon > > Vladimir> The daemon is a perl script ipoib_ha.pl that should get > Vladimir> the primary and the backup IPoIB interfaces as a > Vladimir> parameters (default values are ib0 as a primary and ib1 > Vladimir> as a backup). > > Seems like a perl script that relies on the ip command is a little > heavyweight. Why not a standalone program that uses rtnetlink? We could go there yet, but there'd be a lot of issues to cover: note how we need to also set IP addresses, read configuration ... Further, the only system that might care about this I know about (bproc) dislikes running daemons on endnodes anyway. Finally - work is underway for kernel-level solutions involving bonding etc. So - let's see a working implementation first, optimize later. Makes sense? > Vladimir> - Get configuration of the primary interface from its > Vladimir> standard place (ifcfg-ib from > Vladimir> /etc/sysconfig/{network,network-scripts}). > > This is not standard on all distributions. It would be better to have > a more flexible method that worked on Debian/Ubuntu, etc. Yea, but just adding our own configuration would mean configuration overhead for the administrator, lack of convenient tools to do it ... Using standard interfaces is good. An idea how to solve this? -- MST From mlakshmanan at silverstorm.com Mon Aug 14 13:35:47 2006 From: mlakshmanan at silverstorm.com (mlakshmanan at silverstorm.com) Date: Mon, 14 Aug 2006 16:35:47 -0400 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <20060814202141.GI16821@mellanox.co.il> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AD17@taurus.voltaire.com>, <20060814202141.GI16821@mellanox.co.il> Message-ID: <44E0A663.24086.19D836@mlakshmanan.silverstorm.com> >>> I don't think the way forward is using iba/ in all applications. >>> I see it mostly as a legacy header for opensm and related apps >>> that want their own layer for portability between stacks. > > > > I agree with your view on iba/ib_types.h Does that imply that the definitions in iba/ib_types.h are not expected to be required or used by user-space applications other than those categories mentioned above? If iba/ib_types.h is only a legacy header, are the definitions also present in another header file? Madhu From mshefty at ichips.intel.com Mon Aug 14 13:33:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 13:33:33 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155586714.4676.45.camel@stevo-desktop> References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> Message-ID: <44E0DE1D.7090301@ichips.intel.com> Steve Wise wrote: > So is this replicating done in the mthca hca? As just an FYI, I didn't see anything wrong in the mthca driver either when I was looking at this problem. > Since one app is getting the mcast packet, can I assume the opensm code > is doing the right thing switch/port wise? That seems like a fairly safe assumption. > Should the SM get join requests for both applications that join the > group on the same host? Or only the first one? Only the first join request should make it to the SA. The second join request is fulfilled by ib_multicast. This is what makes ib_multicast suspect. 
- Sean From ralphc at pathscale.com Mon Aug 14 13:38:42 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 14 Aug 2006 13:38:42 -0700 Subject: [openib-general] [PATCH 4/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: References: <1155333489.20325.429.camel@brick.pathscale.com> <46532.71.131.54.100.1155576621.squirrel@rocky.pathscale.com> Message-ID: <1155587922.20325.486.camel@brick.pathscale.com> On Mon, 2006-08-14 at 12:37 -0700, Roland Dreier wrote: > ralphc> This doesn't break compatibility. uverbs_cmd.c > ralphc> ib_uverbs_resize_cq() allocates a struct > ralphc> ib_uverbs_resize_cq_resp on the stack but only reads the > ralphc> first element in. The structure change isn't really > ralphc> needed at all since the INIT_UDATA() macro gets the start > ralphc> of driver_data from the struct ib_uverbs_resize_cq. The > ralphc> change to ib_uverbs_resize_cq_resp just matches the > ralphc> structure change used by libipathverbs to initialize > ralphc> ib_uverbs_resize_cq.response. > > Am I missing something? Think about the case of old libibverbs, new > kernel. libibverbs allocates a 4-byte response structure and passes > the pointer to that to the kernel. The kernel allocates an 8-byte > response structure and copies it back to userspace. And the 4 bytes > after the userspace response structure get zeroed. > > - R. No, I was missing something :-) You are correct. The structure change was needed to get the alignment correct for returning a u64 after the struct ib_uverbs_resize_cq_resp. I can avoid the incompatibility a number of ways: 1) change ib_uverbs_resize_cq() to only copyout resp.cqe. 2) change ib_ipath to ALIGN the udata->outbuf address. 3) define two resp structures, check the version, copyout the right one. Seems like #1 is the simplest to me. From sean.hefty at intel.com Mon Aug 14 13:38:42 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 13:38:42 -0700 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <20060814202141.GI16821@mellanox.co.il> Message-ID: <000101c6bfe1$a1cdbe40$8698070a@amr.corp.intel.com> >Yes, or ib_mad_, or ibmad_. And same for other IB_ and ib_ names there. >We really need to do something about names like ib_attr_t. I like to move away from each library re-defining common IB data types. Something like ibv_gid should be picked up from libibverbs. IMO, the core of the problem is that opensm include files carry too many legacy typedefs. - Sean From mst at mellanox.co.il Mon Aug 14 13:41:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Aug 2006 23:41:54 +0300 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <44E0DD65.5020303@ichips.intel.com> References: <44E0DD65.5020303@ichips.intel.com> Message-ID: <20060814204154.GK16821@mellanox.co.il> Quoting r. Sean Hefty : > This could eliminate the warning, and remove an if statement from executing on > each iteration. We still need to test size0 to set size0 = size. So we just reuse the extra branch, and I agree with Roland this way code is clearer. 
-- MST From halr at voltaire.com Mon Aug 14 13:39:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 14 Aug 2006 23:39:00 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" References: <5CE025EE7D88BA4599A2C8FEFCF226F589AD17@taurus.voltaire.com>, <20060814202141.GI16821@mellanox.co.il> <44E0A663.24086.19D836@mlakshmanan.silverstorm.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD1B@taurus.voltaire.com> To answer your questions: I'm not totally sure about your application but it seems to me to fall in the category being discussed. Not all of the definitions in ib_types.h are elsewhere. I am working on a patch to get you past this issue. -- Hal ________________________________ From: mlakshmanan at silverstorm.com [mailto:mlakshmanan at silverstorm.com] Sent: Mon 8/14/2006 4:35 PM To: Hal Rosenstock; Michael S. Tsirkin Cc: mlakshmanan at silverstorm.com; openib-general at openib.org Subject: Re: Conflicting typedefs for "ib_gid_t" >>> I don't think the way forward is using iba/ in all applications. >>> I see it mostly as a legacy header for opensm and related apps >>> that want their own layer for portability between stacks. > > > > I agree with your view on iba/ib_types.h Does that imply that the definitions in iba/ib_types.h are not expected to be required or used by user-space applications other than those categories mentioned above? If iba/ib_types.h is only a legacy header, are the definitions also present in another header file? Madhu From mst at mellanox.co.il Mon Aug 14 13:45:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Aug 2006 23:45:21 +0300 Subject: [openib-general] sanity check on datapath Message-ID: <20060814204521.GL16821@mellanox.co.il> Roland, do we really need code like if (wr->opcode >= sizeof mthca_opcode / sizeof mthca_opcode[0]) { ret = -1; *bad_wr = wr; goto out; } in mthca on data path? Should this be put within ifdef DEBUG or something? -- MST From mshefty at ichips.intel.com Mon Aug 14 13:47:01 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 13:47:01 -0700 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <20060814204154.GK16821@mellanox.co.il> References: <44E0DD65.5020303@ichips.intel.com> <20060814204154.GK16821@mellanox.co.il> Message-ID: <44E0E145.60906@ichips.intel.com> Michael S. Tsirkin wrote: > We still need to test size0 to set size0 = size. > So we just reuse the extra branch, and I agree with Roland > this way code is clearer. You're right; I missed where size0 was used below the loop. Then I think we can also do without initializing op0 = 0, and we can eliminate the size0 initialization by changing if (!size0) to if (!nreq). - Sean From mlakshmanan at silverstorm.com Mon Aug 14 13:51:05 2006 From: mlakshmanan at silverstorm.com (Lakshmanan, Madhu) Date: Mon, 14 Aug 2006 16:51:05 -0400 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AD1B@taurus.voltaire.com> Message-ID: I concur with your categorization of the application I mentioned. I was curious about the way going forward, as to whether anyone anticipated this to be a more commonly recurring issue. Thanks in advance for the patch. -- Madhu -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, August 14, 2006 4:39 PM To: Lakshmanan, Madhu; Michael S. 
Tsirkin Cc: openib-general at openib.org Subject: RE: Conflicting typedefs for "ib_gid_t" To answer your questions: I'm not totally sure about your application but it seems to me to fall in the category being discussed. Not all of the definitions in ib_types.h are elsewhere. I am working on a patch to get you past this issue. -- Hal ________________________________ From: mlakshmanan at silverstorm.com [mailto:mlakshmanan at silverstorm.com] Sent: Mon 8/14/2006 4:35 PM To: Hal Rosenstock; Michael S. Tsirkin Cc: mlakshmanan at silverstorm.com; openib-general at openib.org Subject: Re: Conflicting typedefs for "ib_gid_t" >>> I don't think the way forward is using iba/ in all applications. >>> I see it mostly as a legacy header for opensm and related apps >>> that want their own layer for portability between stacks. > > > > I agree with your view on iba/ib_types.h Does that imply that the definitions in iba/ib_types.h are not expected to be required or used by user-space applications other than those categories mentioned above? If iba/ib_types.h is only a legacy header, are the definitions also present in another header file? Madhu From swise at opengridcomputing.com Mon Aug 14 13:51:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 Aug 2006 15:51:33 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <44E0DE1D.7090301@ichips.intel.com> References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> Message-ID: <1155588693.4676.48.camel@stevo-desktop> On Mon, 2006-08-14 at 13:33 -0700, Sean Hefty wrote: > Steve Wise wrote: > > So is this replicating done in the mthca hca? > > As just an FYI, I didn't see anything wrong in the mthca driver either when I > was looking at this problem. > Ok. I added printks in the mcast attach/detach and they're firing as expected: vic18:/home/swise/zip # dmesg mthca_multicast_attach qp_num 406 gid ff124001ffff0000:00000000000a0aff lid c003 mthca_multicast_attach qp_num 407 gid ff124001ffff0000:00000000000a0aff lid c003 mthca_multicast_detach qp_num 406 gid ff124001ffff0000:00000000000a0aff lid c003 mthca_multicast_detach qp_num 407 gid ff124001ffff0000:00000000000a0aff lid c003 > > Should the SM get join requests for both applications that join the > > group on the same host? Or only the first one? > > Only the first join request should make it to the SA. The second join request > is fulfilled by ib_multicast. This is what makes ib_multicast suspect. I'll look into this module... Thanks, Stevo. From mst at mellanox.co.il Mon Aug 14 13:59:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Aug 2006 23:59:29 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <000101c6bfe1$a1cdbe40$8698070a@amr.corp.intel.com> References: <000101c6bfe1$a1cdbe40$8698070a@amr.corp.intel.com> Message-ID: <20060814205929.GM16821@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: [openib-general] Conflicting typedefs for "ib_gid_t" > > >Yes, or ib_mad_, or ibmad_. And same for other IB_ and ib_ names there. > >We really need to do something about names like ib_attr_t. > > I like to move away from each library re-defining common IB data types. There are not that many "common IB types". verbs and management for example are more or less isolated. CM users mostly don't care about SMPs. etc. What's common? GID? It does not seem worth it for a "free format 16 byte network endianness" type. I don't necessary see a problem with what we have. 
Such defines mirror IB spec so are static and there is no overhead maintaining them. And this simplifies dependencies no end - think about testing tens of dependent libraries for breakage just 'cause you removed an used line in this "included by everyone" header. > Something like ibv_gid should be picked up from libibverbs. Hmm. User might not have uverbs even loaded, so adding verbs as a dependency to e.g. SM seems like a bad idea. > IMO, the core of the problem is that opensm include files carry too many legacy > typedefs. opensm really tries to be stack-agnostic, so it does need its own layer for things. -- MST From eitan at mellanox.co.il Mon Aug 14 14:08:31 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 15 Aug 2006 00:08:31 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2DF2@mtlexch01.mtl.com> Hi Sean, If we had all the IBTA related defines and structs in a well defined and tracked location there was no reason for OpenSM to have this header. To my best recollection I could not get any agreement to such common IBTA specifications file. Can I assume you are willing to have such common set of header files??? I would propose having a set of files (one common and one for each class) under include/infiniband/iba > > I like to move away from each library re-defining common IB data types. > Something like ibv_gid should be picked up from libibverbs. > > IMO, the core of the problem is that opensm include files carry too many > legacy typedefs. > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Mon Aug 14 14:08:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 00:08:05 +0300 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <44E0E145.60906@ichips.intel.com> References: <44E0E145.60906@ichips.intel.com> Message-ID: <20060814210805.GN16821@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] mthca: make IB_SEND_FENCE work > > Michael S. Tsirkin wrote: > > We still need to test size0 to set size0 = size. > > So we just reuse the extra branch, and I agree with Roland > > this way code is clearer. > > You're right; I missed where size0 was used below the loop. > > Then I think we can also do without initializing op0 = 0, and we can eliminate > the size0 initialization by changing if (!size0) to if (!nreq). > > - Sean > There's the ee_nds line above that tests size0 and should be changed then as well. -- MST From swise at opengridcomputing.com Mon Aug 14 14:17:27 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 Aug 2006 16:17:27 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <1155588693.4676.48.camel@stevo-desktop> References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <1155588693.4676.48.camel@stevo-desktop> Message-ID: <1155590247.4676.62.camel@stevo-desktop> > > > > Only the first join request should make it to the SA. The second join request > > is fulfilled by ib_multicast. This is what makes ib_multicast suspect. > > > I'll look into this module... > ib_multicast takes care of sending the join/leave info to the SA, right? It keeps track of _when_ to leave, for instance. 
So since opensm -is- getting the join and setting up the group, and the mcast packet is being passed to the first member who joined, then I don't think ib_multicast can mess up the subsequent members, can it? I confirmed that mthca was called to attach both qps to the mgid/mlid, so this makes me think ib_multicast worked ok. I'm new to IB mcast, so I'm learning, but it seems like the mthca firmware maybe isn't doing the right thing here. Any suggestions on how to further debug this? BTW my HCAs are at the latest firmware. I just had them upgraded. Steve. From mst at mellanox.co.il Mon Aug 14 14:16:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 00:16:36 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2DF2@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2DF2@mtlexch01.mtl.com> Message-ID: <20060814211636.GO16821@mellanox.co.il> Quoting r. Eitan Zahavi : > If we had all the IBTA related defines and structs in a well defined and tracked location there was no reason for OpenSM to have this header. Define well defined :) Seriously, won't opensm need a portability layer to be stack agnostic anyway? -- MST From mshefty at ichips.intel.com Mon Aug 14 14:15:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 14:15:31 -0700 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <20060814205929.GM16821@mellanox.co.il> References: <000101c6bfe1$a1cdbe40$8698070a@amr.corp.intel.com> <20060814205929.GM16821@mellanox.co.il> Message-ID: <44E0E7F3.1060905@ichips.intel.com> Michael S. Tsirkin wrote: > There are not that many "common IB types". verbs and management for example > are more or less isolated. CM users mostly don't care about SMPs. etc. > What's common? GID? It does not seem worth it for a "free format 16 byte network > endianness" type. Verbs and management are not isolated. Establishing a connection, joining a multicast group, or acquiring a path record are essential for actually using verbs correctly. > I don't necessary see a problem with what we have. Such defines mirror IB spec > so are static and there is no overhead maintaining them. And this simplifies > dependencies no end - think about testing tens of dependent libraries > for breakage just 'cause you removed an used line in this > "included by everyone" header. Having umad_gid, ibv_gid, ibv_sa_gid, mad_gid, cm_gid, some_other_random_library_gid is goofy. (Sorry, I'm completely ranting now.) > Hmm. User might not have uverbs even loaded, so adding verbs as a dependency > to e.g. SM seems like a bad idea. We only need the include file, not the library. > opensm really tries to be stack-agnostic, so it does need its own > layer for things. Then those things should be completely internal to opensm. - Sean From rdreier at cisco.com Mon Aug 14 14:19:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 14:19:44 -0700 Subject: [openib-general] [PATCH] mthca: make IB_SEND_FENCE work In-Reply-To: <20060814195055.GF16821@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Aug 2006 22:50:55 +0300") References: <20060814195055.GF16821@mellanox.co.il> Message-ID: Michael> Maybe compile with -Wno-uninitialized? This is discussed periodically on lkml. The problem with -Wno-uninitialized is that it shuts up the good "is used uninitialized" warnings also (in addition to the "may be used" warnings, which are often bogus). - R. 
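The kind of false positive being discussed here can be reproduced with a small, self-contained example. The function below is illustrative only and is not the mthca code under discussion: the variable is written only on the first loop iteration and read only when the loop actually ran, so the code is correct, yet (depending on gcc version and optimization level) it can still draw a "may be used uninitialized" warning, while -Wno-uninitialized would also hide genuine "is used uninitialized" bugs.

/* Illustrative only -- not mthca code.  'first' is written iff the loop
 * runs and read iff the loop ran, so there is no real bug, yet gcc may
 * still warn that 'first' may be used uninitialized. */
#include <stdio.h>

static int sum_and_report_first(const int *v, int n)
{
	int first;	/* the variable the compiler complains about */
	int i, sum = 0;

	for (i = 0; i < n; ++i) {
		if (i == 0)
			first = v[i];	/* set only on the first pass */
		sum += v[i];
	}

	if (i != 0)	/* only read when at least one pass happened */
		printf("first element was %d\n", first);

	return sum;
}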
From eitan at mellanox.co.il Mon Aug 14 14:26:55 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 15 Aug 2006 00:26:55 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2DF7@mtlexch01.mtl.com> OpenSM ib_types.h had nothing to do with stack implementation. It carries the constants and structures (MAD formats) defined in IBTA specification 1.2 (+ errata). Any application that needs to send MADs and decode their status etc will need to use similar headers. ib_gid_t is just another example for a struct defined by IBTA. It has nothing to do with specific stack implementation. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Michael S. Tsirkin > Sent: Tuesday, August 15, 2006 12:17 AM > To: Eitan Zahavi > Cc: Sean Hefty; Hal Rosenstock; openib-general at openib.org > Subject: Re: [openib-general] Conflicting typedefs for "ib_gid_t" > > Quoting r. Eitan Zahavi : > > If we had all the IBTA related defines and structs in a well defined and > tracked location there was no reason for OpenSM to have this header. > > Define well defined :) > > Seriously, won't opensm need a portability layer to be stack agnostic anyway? > > -- > MST From rdreier at cisco.com Mon Aug 14 14:24:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 14:24:43 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155586714.4676.45.camel@stevo-desktop> (Steve Wise's message of "Mon, 14 Aug 2006 15:18:34 -0500") References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> Message-ID: Steve> So is this replicating done in the mthca hca? Yes, it should be. There may be a bug in the mthca kernel multicast code for handling multiple QPs attached to the same group. Steve> Since one app is getting the mcast packet, can I assume the Steve> opensm code is doing the right thing switch/port wise? Yep. Steve> Should the SM get join requests for both applications that Steve> join the group on the same host? Or only the first one? No there should only be one join request for a given port. - R. From rdreier at cisco.com Mon Aug 14 14:26:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 14:26:31 -0700 Subject: [openib-general] sanity check on datapath In-Reply-To: <20060814204521.GL16821@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Aug 2006 23:45:21 +0300") References: <20060814204521.GL16821@mellanox.co.il> Message-ID: Michael> if (wr->opcode >= sizeof mthca_opcode / sizeof mthca_opcode[0]) Michael> { Michael> ret = -1; Michael> *bad_wr = wr; Michael> goto out; Michael> } Michael> in mthca on data path? Should this be put within ifdef Michael> DEBUG or something? Probably not needed -- I guess we can trust what the consumer gives us. - R. From mshefty at ichips.intel.com Mon Aug 14 14:30:14 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 14:30:14 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155590247.4676.62.camel@stevo-desktop> References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <1155588693.4676.48.camel@stevo-desktop> <1155590247.4676.62.camel@stevo-desktop> Message-ID: <44E0EB66.6090508@ichips.intel.com> Steve Wise wrote: > ib_multicast takes care of sending the join/leave info to the SA, right? 
> It keeps track of _when_ to leave, for instance. So since opensm -is- > getting the join and setting up the group, and the mcast packet is being > passed to the first member who joined, then I don't think ib_multicast > can mess up the subsequent members, can it? In theory, it shouldn't mess up subsequent members. While the first join is active, subsequent join / leave requests to that same group should be queued. After the first join completes, subsequent joins should get a copy of the MCMemberRecord that was returned by the SA. (This is a slight simplification, with the actual operation determined by the type of join operation that occurs. But for the RDMA CM, this is what should happen.) > I'm new to IB mcast, so I'm learning, but it seems like the mthca > firmware maybe isn't doing the right thing here. This was my suspicion, but I couldn't be certain. It would help if anyone can say that they've successfully tested this sort of multicast configuration. I.e. two QPs from the same HCA in the same group. - Sean From ishai at mellanox.co.il Mon Aug 14 14:38:32 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Tue, 15 Aug 2006 00:38:32 +0300 Subject: [openib-general] (no subject) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2DF8@mtlexch01.mtl.com> Hi Roland, In order to support High-Availability in OFED 1.1, we need more functionality in the srp daemon. Based on your code I implemented a new srp daemon (I listed its new features below). I put the code in https://openib.org/svn/trunk/contrib/mellanox/gen2/src/userspace/srptools/srp_daemon and I'm in an initial testing stage (there are some known bugs). Since I think that people may still want to use your original ibsrpdm, I think we should keep your version and start a new tool from my code. What do you think? Ishai ======== The new tool's main features: 1) Register to Traps 64 and 144. 2) Can ask ib_srp to connect to the targets it finds. 3) Can check if the target is already connected by ib_srp from the same port. 4) Can perform rescan of the fabric every X seconds. 5) Identify SA changes and other events and act accordingly. 6) Can get an hca name and a port number as input (not only a umad device). 7) Uses the umad package. From halr at voltaire.com Mon Aug 14 14:33:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 15 Aug 2006 00:33:48 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" References: <000101c6bfe1$a1cdbe40$8698070a@amr.corp.intel.com> <20060814205929.GM16821@mellanox.co.il> <44E0E7F3.1060905@ichips.intel.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD1D@taurus.voltaire.com> Sean, I think it was agreed a long time ago on OpenIB to have duplicated definitions for some of the ib_xxx things. The specific issue here is that the one in the gen2 user libraries/verbs is different from the one which OpenSM uses. If they both were the same, this would work, right ? -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Sean Hefty Sent: Mon 8/14/2006 5:15 PM To: Michael S. Tsirkin Cc: openib-general at openib.org Subject: Re: [openib-general] Conflicting typedefs for "ib_gid_t" Michael S. Tsirkin wrote: > There are not that many "common IB types". verbs and management for example > are more or less isolated. CM users mostly don't care about SMPs. etc. > What's common? GID? It does not seem worth it for a "free format 16 byte network > endianness" type. 
Verbs and management are not isolated. Establishing a connection, joining a multicast group, or acquiring a path record are essential for actually using verbs correctly. > I don't necessary see a problem with what we have. Such defines mirror IB spec > so are static and there is no overhead maintaining them. And this simplifies > dependencies no end - think about testing tens of dependent libraries > for breakage just 'cause you removed an used line in this > "included by everyone" header. Having umad_gid, ibv_gid, ibv_sa_gid, mad_gid, cm_gid, some_other_random_library_gid is goofy. (Sorry, I'm completely ranting now.) > Hmm. User might not have uverbs even loaded, so adding verbs as a dependency > to e.g. SM seems like a bad idea. We only need the include file, not the library. > opensm really tries to be stack-agnostic, so it does need its own > layer for things. Then those things should be completely internal to opensm. - Sean _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ishai at mellanox.co.il Mon Aug 14 14:41:07 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Tue, 15 Aug 2006 00:41:07 +0300 Subject: [openib-general] A new version for srp daemon Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2DFA@mtlexch01.mtl.com> Adding a subject ________________________________ From: Ishai Rabinovitz Sent: Tuesday, August 15, 2006 12:32 AM To: 'Roland Dreier' Cc: 'openib-general at openib.org'; Tziporet Koren Subject: Hi Roland, In order to support High-Availability in OFED 1.1, we need more functionality in the srp daemon. Based on your code I implemented a new srp daemon (I listed its new features below). I put the code in https://openib.org/svn/trunk/contrib/mellanox/gen2/src/userspace/srptools/srp_daemon and I'm in an initial testing stage (there are some known bugs). Since I think that people may still want to use your original ibsrpdm, I think we should keep your version and start a new tool from my code. What do you think? Ishai ======== The new tool's main features: 1) Register to Traps 64 and 144. 2) Can ask ib_srp to connect to the targets it finds. 3) Can check if the target is already connected by ib_srp from the same port. 4) Can perform rescan of the fabric every X seconds. 5) Identify SA changes and other events and act accordingly. 6) Can get an hca name and a port number as input (not only a umad device). 7) Uses the umad package. From halr at voltaire.com Mon Aug 14 14:38:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 15 Aug 2006 00:38:32 +0300 Subject: [openib-general] IB mcast question References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD1E@taurus.voltaire.com> This is not the main issue (the lack of replication is) but I don't think a subsequent join from the same port does any harm but ib_multicast shouldn't be doing this. It would matter in terms of the leave though. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Roland Dreier Sent: Mon 8/14/2006 5:24 PM To: Steve Wise Cc: openib-general Subject: Re: [openib-general] IB mcast question Steve> So is this replicating done in the mthca hca? Yes, it should be. 
There may be a bug in the mthca kernel multicast code for handling multiple QPs attached to the same group. Steve> Since one app is getting the mcast packet, can I assume the Steve> opensm code is doing the right thing switch/port wise? Yep. Steve> Should the SM get join requests for both applications that Steve> join the group on the same host? Or only the first one? No there should only be one join request for a given port. - R. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Mon Aug 14 14:51:56 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 Aug 2006 16:51:56 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AD1E@taurus.voltaire.com> References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <5CE025EE7D88BA4599A2C8FEFCF226F589AD1E@taurus.voltaire.com> Message-ID: <1155592316.4676.66.camel@stevo-desktop> On Tue, 2006-08-15 at 00:38 +0300, Hal Rosenstock wrote: > This is not the main issue (the lack of replication is) but I don't > think a subsequent join from the same port does any harm but > ib_multicast shouldn't be doing this. It would matter in terms of the > leave though. > The osm logs seem to show only one join_mgrp request, when the first app joins, and one leave_mgrp when the 2nd app exits. So I think the interaction with OSM is okeydokey. Steve. From mst at mellanox.co.il Mon Aug 14 15:10:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 01:10:01 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2DF7@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2DF7@mtlexch01.mtl.com> Message-ID: <20060814221001.GQ16821@mellanox.co.il> Quoting r. Eitan Zahavi : > ib_gid_t is just another example for a struct defined by IBTA. IBTA only defines a wire protocol. It's often wrong for applications to work in terms of raw packets - we should have libraries to encapsulate typical usage. While using on-the-wire values for library API sometimes makes sense, this is not necessarily always the right way and e.g. efficiency concerns might dictate otherwise. > It has nothing to do with specific stack implementation. The name is stack specific. Whether it's best as an array or a union or a struct is specific for an app. -- MST From weiny2 at llnl.gov Mon Aug 14 15:08:24 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 14 Aug 2006 15:08:24 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <44DF25A2.30405@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> <1155303417.4507.15009.camel@hal.voltaire.com> <44DF25A2.30405@mellanox.co.il> Message-ID: <20060814150824.11b28c05.weiny2@llnl.gov> Why is the OFED 1.1-rc1 source tar ball missing files when compared with the 1.1 branch? Of specific concern is the absence of autogen.sh in libibverbs. Ira On Sun, 13 Aug 2006 16:14:10 +0300 "Tziporet Koren" wrote: > Hal Rosenstock wrote: > >> Target release date: 12-Sep > >> > >> Intermediate milestones: > >> 1. Create 1.1 branch of user level: 27-Jul - done > >> 2. RC1: 8-Aug - done > >> 3. Feature freeze (RC2): 17-Aug > >> > > > > What is the start build date for RC2 ? 
When do developers need to have > > their code in by to make RC2 ? > > > We will start on Tue 15-Aug. Is this OK with you? > > > > > >> 4. Code freeze (rc-x): 6-Sep > >> > > > > Is this 1 or 2 RCs beyond RC2 in order to make this ? > > > > > I hope one but I guess it will be two more RCs. > > Tziporet > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Mon Aug 14 15:44:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 15:44:53 -0700 Subject: [openib-general] [PATCH] IB/srp: add port/device attributes In-Reply-To: <20060813115504.GA21712@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 13 Aug 2006 14:55:04 +0300") References: <20060813115504.GA21712@mellanox.co.il> Message-ID: Michael> Hi, Roland! There does not, at the moment, seem to exist Michael> a way to find out which HCA port the specific SRP host is Michael> connected through. Seems OK -- although I wonder about the names srp_port and srp_device. I think "ib_port" and "ib_device" would make more sense (or perhaps "local_ib_port" and "local_ib_device" although I don't think that's really required). Michael> While not really a bugfix, maybe the following is small Michael> enough for 2.6.18? We will use it in srptools that will Michael> ship with OFED. No, I think this is purely a new feature and I'll queue it for 2.6.19. From swise at opengridcomputing.com Mon Aug 14 16:02:51 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 Aug 2006 18:02:51 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <44E0EB66.6090508@ichips.intel.com> References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <1155588693.4676.48.camel@stevo-desktop> <1155590247.4676.62.camel@stevo-desktop> <44E0EB66.6090508@ichips.intel.com> Message-ID: <1155596571.4676.72.camel@stevo-desktop> I added some debug printks in mthca_multicast_attach(). Roland, does this look ok to you? It seems correct to me: # dmesg mthca_multicast_attach qp_num 406 gid ff124001ffff0000:00000000000a0a0a lid c003 mthca_multicast_attach line 167 - found mgm, hash a20, prev ffffffff, index a20 mthca_multicast_attach line 197 - updated mgm gid: mgm gid ff124001ffff0000:00000000000a0a0a mthca_multicast_attach line 219 - writing mgm: mgm->qp[0] 80000406 (BE) mthca_multicast_attach qp_num 407 gid ff124001ffff0000:00000000000a0a0a lid c003 mthca_multicast_attach line 167 - found mgm, hash a20, prev ffffffff, index a20 mthca_multicast_attach line 197 - updated mgm gid: mgm gid ff124001ffff0000:00000000000a0a0a mthca_multicast_attach line 219 - writing mgm: mgm->qp[1] 80000407 (BE) On Mon, 2006-08-14 at 14:30 -0700, Sean Hefty wrote: > Steve Wise wrote: > > ib_multicast takes care of sending the join/leave info to the SA, right? > > It keeps track of _when_ to leave, for instance. So since opensm -is- > > getting the join and setting up the group, and the mcast packet is being > > passed to the first member who joined, then I don't think ib_multicast > > can mess up the subsequent members, can it? > > It theory, it shouldn't mess up subsequent members. While the first join is > active, subsequent join / leave requests to that same group should be queued. 
> After the first join completes, subsequent joins should get a copy of the > MCMemberRecord that was returned by the SA. > > (This is a slight simplification, with the actual operation determined by the > type of join operation that occurs. But for the RDMA CM, this is what should > happen.) > > > I'm new to IB mcast, so I'm learning, but it seems like the mthca > > firmware maybe isn't doing the right thing here. > > This was my suspicion, but I couldn't be certain. It would help if anyone can > say that they've successfully tested this sort of multicast configuration. I.e. > two QPs from the same HCA in the same group. > > - SeanR From halr at voltaire.com Mon Aug 14 16:22:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 15 Aug 2006 02:22:59 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" References: <000101c6bfe1$a1cdbe40$8698070a@amr.corp.intel.com> <20060814205929.GM16821@mellanox.co.il> <44E0E7F3.1060905@ichips.intel.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD22@taurus.voltaire.com> Do you have a proposal for how to get to where you think this needs to be ? Have you looked at this ? I think you are proposing OpenSM include verbs.h. I think that's only part of what would need to be done (and has some other side effects). -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Sean Hefty Sent: Mon 8/14/2006 5:15 PM To: Michael S. Tsirkin Cc: openib-general at openib.org Subject: Re: [openib-general] Conflicting typedefs for "ib_gid_t" Michael S. Tsirkin wrote: > There are not that many "common IB types". verbs and management for example > are more or less isolated. CM users mostly don't care about SMPs. etc. > What's common? GID? It does not seem worth it for a "free format 16 byte network > endianness" type. Verbs and management are not isolated. Establishing a connection, joining a multicast group, or acquiring a path record are essential for actually using verbs correctly. > I don't necessary see a problem with what we have. Such defines mirror IB spec > so are static and there is no overhead maintaining them. And this simplifies > dependencies no end - think about testing tens of dependent libraries > for breakage just 'cause you removed an used line in this > "included by everyone" header. Having umad_gid, ibv_gid, ibv_sa_gid, mad_gid, cm_gid, some_other_random_library_gid is goofy. (Sorry, I'm completely ranting now.) > Hmm. User might not have uverbs even loaded, so adding verbs as a dependency > to e.g. SM seems like a bad idea. We only need the include file, not the library. > opensm really tries to be stack-agnostic, so it does need its own > layer for things. Then those things should be completely internal to opensm. 
- Sean _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Mon Aug 14 16:44:50 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Aug 2006 16:44:50 -0700 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AD22@taurus.voltaire.com> References: <000101c6bfe1$a1cdbe40$8698070a@amr.corp.intel.com> <20060814205929.GM16821@mellanox.co.il> <44E0E7F3.1060905@ichips.intel.com> <5CE025EE7D88BA4599A2C8FEFCF226F589AD22@taurus.voltaire.com> Message-ID: <44E10AF2.40502@ichips.intel.com> Hal Rosenstock wrote: > Do you have a proposal for how to get to where you think this needs to be ? > Have you looked at this ? I think defining include files similar to what we have for the kernel make sense. The problem is more complex than what definitions an application gets from which include file. The data in these structures are also exchanged between userspace and the kernel. For example, there's an sa.h file as part of libibverbs, since it marshals parameters from kernel to userspace. Wire structure definitions ended up working well for me for user to kernel transitions. > I think you are proposing OpenSM include verbs.h. I think that's only part of > what would need to be done (and has some other side effects). I would also have definitions in a libibsa and libibcm. The relevant CM definitions are there. (I don't see a need to expose most of the CM wire definition outside of the kernel.) I can't think of a reason why OpenSM would need any CM definitions, so I would remove those from from any OpenSM include files. SA attribute structures are also needed by libibsa, but since it seems overkill to include OpenSM on all nodes, I would define the SA attribute structures as part of libibsa, and let OpenSM use its include files. - Sean From halr at voltaire.com Mon Aug 14 16:53:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 15 Aug 2006 02:53:13 +0300 Subject: [openib-general] Conflicting typedefs for "ib_gid_t" References: <000101c6bfe1$a1cdbe40$8698070a@amr.corp.intel.com> <20060814205929.GM16821@mellanox.co.il> <44E0E7F3.1060905@ichips.intel.com> <5CE025EE7D88BA4599A2C8FEFCF226F589AD22@taurus.voltaire.com> <44E10AF2.40502@ichips.intel.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AD25@taurus.voltaire.com> I agree that SM (OpenSM) doesn't need CM definitions. ib_types.h was for convenience of other apps and having all the definitions in one place. It includes other definitions for other classes also unused by OpenSM. Once libibsa is present and supports (at least) all the SA attributes that OpenSM does, we could then talk about moving this over. However, at that point, things will be quite different between Linux and Windows versions of OpenSM unless Windows adopted more of what going on in Linux. -- Hal ________________________________ From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Mon 8/14/2006 7:44 PM To: Hal Rosenstock Cc: Michael S. Tsirkin; openib-general at openib.org Subject: Re: [openib-general] Conflicting typedefs for "ib_gid_t" Hal Rosenstock wrote: > Do you have a proposal for how to get to where you think this needs to be ? > Have you looked at this ? I think defining include files similar to what we have for the kernel make sense. 
The problem is more complex than what definitions an application gets from which include file. The data in these structures are also exchanged between userspace and the kernel. For example, there's an sa.h file as part of libibverbs, since it marshals parameters from kernel to userspace. Wire structure definitions ended up working well for me for user to kernel transitions. > I think you are proposing OpenSM include verbs.h. I think that's only part of > what would need to be done (and has some other side effects). I would also have definitions in a libibsa and libibcm. The relevant CM definitions are there. (I don't see a need to expose most of the CM wire definition outside of the kernel.) I can't think of a reason why OpenSM would need any CM definitions, so I would remove those from from any OpenSM include files. SA attribute structures are also needed by libibsa, but since it seems overkill to include OpenSM on all nodes, I would define the SA attribute structures as part of libibsa, and let OpenSM use its include files. - Sean From rdreier at cisco.com Mon Aug 14 17:15:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 Aug 2006 17:15:32 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155596571.4676.72.camel@stevo-desktop> (Steve Wise's message of "Mon, 14 Aug 2006 18:02:51 -0500") References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <1155588693.4676.48.camel@stevo-desktop> <1155590247.4676.62.camel@stevo-desktop> <44E0EB66.6090508@ichips.intel.com> <1155596571.4676.72.camel@stevo-desktop> Message-ID: > I added some debug printks in mthca_multicast_attach(). > > Roland, does this look ok to you? It seems correct to me: > > # dmesg > mthca_multicast_attach qp_num 406 gid ff124001ffff0000:00000000000a0a0a lid c003 > mthca_multicast_attach line 167 - found mgm, hash a20, prev ffffffff, index a20 > mthca_multicast_attach line 197 - updated mgm gid: mgm gid ff124001ffff0000:00000000000a0a0a > mthca_multicast_attach line 219 - writing mgm: mgm->qp[0] 80000406 (BE) > mthca_multicast_attach qp_num 407 gid ff124001ffff0000:00000000000a0a0a lid c003 > mthca_multicast_attach line 167 - found mgm, hash a20, prev ffffffff, index a20 > mthca_multicast_attach line 197 - updated mgm gid: mgm gid ff124001ffff0000:00000000000a0a0a > mthca_multicast_attach line 219 - writing mgm: mgm->qp[1] 80000407 (BE) You're two steps ahead. Yeah, that looks fine to me. - R. From mst at mellanox.co.il Mon Aug 14 21:07:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 07:07:54 +0300 Subject: [openib-general] [PATCH] IB/srp: add port/device attributes In-Reply-To: References: Message-ID: <20060815040754.GA14998@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] IB/srp: add port/device attributes > > Michael> Hi, Roland! There does not, at the moment, seem to exist > Michael> a way to find out which HCA port the specific SRP host is > Michael> connected through. > > Seems OK -- although I wonder about the names srp_port and > srp_device. I think "ib_port" and "ib_device" would make more sense > (or perhaps "local_ib_port" and "local_ib_device" although I don't > think that's really required). OK. Want a patch like that or will you fix it up? > Michael> While not really a bugfix, maybe the following is small > Michael> enough for 2.6.18? We will use it in srptools that will > Michael> ship with OFED. 
> > No, I think this is purely a new feature and I'll queue it for 2.6.19. > Fair enough. -- MST From dotanb at mellanox.co.il Mon Aug 14 23:04:06 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 15 Aug 2006 09:04:06 +0300 Subject: [openib-general] IB mcast question In-Reply-To: <44E0DE1D.7090301@ichips.intel.com> References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> Message-ID: <200608150904.06673.dotanb@mellanox.co.il> Hi guys. On Monday 14 August 2006 23:33, Sean Hefty wrote: > Steve Wise wrote: > > So is this replicating done in the mthca hca? > > As just an FYI, I didn't see anything wrong in the mthca driver either when I > was looking at this problem. > > > Since one app is getting the mcast packet, can I assume the opensm code > > is doing the right thing switch/port wise? > > That seems like a fairly safe assumption. > > > Should the SM get join requests for both applications that join the > > group on the same host? Or only the first one? > > Only the first join request should make it to the SA. The second join request > is fulfilled by ib_multicast. This is what makes ib_multicast suspect. What is exactly the scenario that you are doing? We have a test (over the verbs) that have 1 server and n clients. All of the clients create a QPs and attaches them to the (same) multicast group (without any join). The server sends m messages and all of the clients get those messages in every QP. This test passes when it being executed in one HCA, in two HCAs (without any switch in the middle). Dotan From zhushisongzhu at yahoo.com Tue Aug 15 02:13:24 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Tue, 15 Aug 2006 02:13:24 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: Message-ID: <20060815091324.68637.qmail@web36912.mail.mud.yahoo.com> (1) ibv_devinfo HCA: MHES18-XTC FW: 1.1.0 OFED: OFED-1.1-rc1 (2) Test Bed On Client: ib0: 193.12.10.24 test command: LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so SIMPLE_LIBSDP=1 ab -c m -n m -X 193.12.10.14:3129 http://www.sse.com.cn/sseportal/ps/zhs/home.shtml The web page is about 68K. On Server: ib0: 193.12.10.14 squid.sdp -d 10 -f squid2.conf (I have changed squid-cache to support listening on SDP port 3129) The test result is : Concurrent Conns(=m) Free Memory Requests completed 0 926980 0 100 712508 100 200 497372 200 300 282636 256 400 52868 256 500 kernel crashed because of "out of memory" >From above, every about 100 concurrent SDP connections will cost 210M memory. It's too vast for large scale applications. TCP costs very lower memory than SDP. The max concurrent connections completed successfully is 256. it is some bad limit. Who knows how and when will solve the problem? I'll test the performance of sdp connection and compare it with TCP further. tks zhu --- openib-general-request at openib.org wrote: > Send openib-general mailing list submissions to > openib-general at openib.org > > To subscribe or unsubscribe via the World Wide Web, > visit > http://openib.org/mailman/listinfo/openib-general > or, via email, send a message with subject or body > 'help' to > openib-general-request at openib.org > > You can reach the person managing the list at > openib-general-owner at openib.org > > When replying, please edit your Subject line so it > is more specific > than "Re: Contents of openib-general digest..." > > > Today's Topics: > > 1. Re: Conflicting typedefs for "ib_gid_t" (Sean > Hefty) > 2. 
Re: [PATCH] mthca: make IB_SEND_FENCE work > (Michael S. Tsirkin) > 3. Re: Conflicting typedefs for "ib_gid_t" (Hal > Rosenstock) > 4. sanity check on datapath (Michael S. Tsirkin) > 5. Re: [PATCH] mthca: make IB_SEND_FENCE work > (Sean Hefty) > 6. Re: Conflicting typedefs for "ib_gid_t" > (Lakshmanan, Madhu) > 7. Re: IB mcast question (Steve Wise) > 8. Re: Conflicting typedefs for "ib_gid_t" > (Michael S. Tsirkin) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 14 Aug 2006 13:38:42 -0700 > From: "Sean Hefty" > Subject: Re: [openib-general] Conflicting typedefs > for "ib_gid_t" > To: "'Michael S. Tsirkin'" , > "Hal Rosenstock" > > Cc: openib-general at openib.org > Message-ID: > <000101c6bfe1$a1cdbe40$8698070a at amr.corp.intel.com> > Content-Type: text/plain; charset=us-ascii > > >Yes, or ib_mad_, or ibmad_. And same for other IB_ > and ib_ names there. > >We really need to do something about names like > ib_attr_t. > > I like to move away from each library re-defining > common IB data types. > Something like ibv_gid should be picked up from > libibverbs. > > IMO, the core of the problem is that opensm include > files carry too many legacy > typedefs. > > - Sean > > > > ------------------------------ > > Message: 2 > Date: Mon, 14 Aug 2006 23:41:54 +0300 > From: "Michael S. Tsirkin" > Subject: Re: [openib-general] [PATCH] mthca: make > IB_SEND_FENCE work > To: "Sean Hefty" > Cc: Roland Dreier , > openib-general at openib.org > Message-ID: <20060814204154.GK16821 at mellanox.co.il> > Content-Type: text/plain; charset=us-ascii > > Quoting r. Sean Hefty : > > This could eliminate the warning, and remove an if > statement from executing on > > each iteration. > > We still need to test size0 to set size0 = size. > So we just reuse the extra branch, and I agree with > Roland > this way code is clearer. > > -- > MST > > > > ------------------------------ > > Message: 3 > Date: Mon, 14 Aug 2006 23:39:00 +0300 > From: "Hal Rosenstock" > Subject: Re: [openib-general] Conflicting typedefs > for "ib_gid_t" > To: mlakshmanan at silverstorm.com, "Michael S. > Tsirkin" > > Cc: openib-general at openib.org > Message-ID: > > <5CE025EE7D88BA4599A2C8FEFCF226F589AD1B at taurus.voltaire.com> > Content-Type: text/plain; charset=iso-8859-1 > > To answer your questions: > > I'm not totally sure about your application but it > seems to me to fall in the category being discussed. > > Not all of the definitions in ib_types.h are > elsewhere. > > I am working on a patch to get you past this issue. > > -- Hal > > ________________________________ > > From: mlakshmanan at silverstorm.com > [mailto:mlakshmanan at silverstorm.com] > Sent: Mon 8/14/2006 4:35 PM > To: Hal Rosenstock; Michael S. Tsirkin > Cc: mlakshmanan at silverstorm.com; > openib-general at openib.org > Subject: Re: Conflicting typedefs for "ib_gid_t" > > > > >>> I don't think the way forward is using iba/ in > all applications. > >>> I see it mostly as a legacy header for opensm > and related apps > >>> that want their own layer for portability > between stacks. > > > > > > I agree with your view on iba/ib_types.h > > Does that imply that the definitions in > iba/ib_types.h are not expected to be required or > used > by user-space applications other than those > categories mentioned above? If iba/ib_types.h is > only a legacy header, are the definitions also > present in another header file? 
> > Madhu > > > > > > > ------------------------------ > > Message: 4 > Date: Mon, 14 Aug 2006 23:45:21 +0300 > From: "Michael S. Tsirkin" > Subject: [openib-general] sanity check on datapath > To: "Roland Dreier" , > openib-general at openib.org > Message-ID: <20060814204521.GL16821 at mellanox.co.il> > Content-Type: text/plain; charset=us-ascii > > Roland, do we really need code like > > if (wr->opcode >= sizeof mthca_opcode / sizeof > mthca_opcode[0]) > { > ret = -1; > *bad_wr = wr; > goto out; > } > > in mthca on data path? Should this be put within > ifdef DEBUG or something? > > -- > MST > > > > ------------------------------ > > Message: 5 > Date: Mon, 14 Aug 2006 13:47:01 -0700 > From: "Sean Hefty" > Subject: Re: [openib-general] [PATCH] mthca: make > IB_SEND_FENCE work > To: "Michael S. Tsirkin" > Cc: Roland Dreier , > openib-general at openib.org > Message-ID: <44E0E145.60906 at ichips.intel.com> > === message truncated === __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Tue Aug 15 02:41:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 12:41:04 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060815091324.68637.qmail@web36912.mail.mud.yahoo.com> References: <20060815091324.68637.qmail@web36912.mail.mud.yahoo.com> Message-ID: <20060815094104.GA15917@mellanox.co.il> Quoting r. zhu shi song : > Subject: why sdp connections cost so much memory > > (1) ibv_devinfo > HCA: MHES18-XTC > FW: 1.1.0 > OFED: OFED-1.1-rc1 > (2) Test Bed > On Client: > ib0: 193.12.10.24 > test command: > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so > SIMPLE_LIBSDP=1 ab -c m -n m -X 193.12.10.14:3129 > http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > The web page is about 68K. > On Server: > ib0: 193.12.10.14 > squid.sdp -d 10 -f squid2.conf (I have changed > squid-cache to support listening on SDP port 3129) > > The test result is : > Concurrent Conns(=m) Free Memory Requests > completed > 0 926980 0 > 100 712508 100 > 200 497372 200 > 300 282636 256 > 400 52868 256 > 500 kernel crashed because of > "out of memory" > > >From above, every about 100 concurrent SDP connections > will cost 210M memory. It's too vast for large scale > applications. TCP costs very lower memory than SDP. > The max concurrent connections completed successfully > is 256. it is some bad limit. Who knows how and when > will solve the problem? > I'll test the performance of sdp connection and > compare it with TCP further. > tks > zhu Most memory in SDP goes into pre-posted receive buffers. Currently SDP pre-posts a fixed 64 32K buffers per connection, that is 2M per connection. To verify that's the issue, try opening drivers/infiniband/ulp/sdp/sdp.h and changing SDP_RX_SIZE from 0x40 to a smaller value. If this helps, as a quick work-around I can make this value globally configurable. TCP on the other hand scales down more gracefully, and so should SDP longer-term. 
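As a back-of-the-envelope check of the numbers above, the sketch below multiplies out the defaults quoted in this thread (SDP_RX_SIZE = 0x40 pre-posted receive buffers of 32K each); the constants are copied from the discussion, not read from drivers/infiniband/ulp/sdp/sdp.h.

/* Rough arithmetic only: per-connection receive-buffer footprint with the
 * defaults quoted above (0x40 buffers of 32K each). */
#include <stdio.h>

#define SDP_RX_SIZE     0x40            /* pre-posted rx buffers per connection */
#define SDP_RX_BUF_SIZE (32 * 1024)     /* bytes per pre-posted buffer */

int main(void)
{
        unsigned long per_conn = (unsigned long) SDP_RX_SIZE * SDP_RX_BUF_SIZE;
        int conns;

        printf("per connection: %lu KB\n", per_conn >> 10);
        for (conns = 100; conns <= 400; conns += 100)
                printf("%3d connections: %4lu MB\n",
                       conns, (conns * per_conn) >> 20);
        return 0;
}

That works out to 2 MB per connection and roughly 200 MB per 100 connections, which lines up with the free-memory drops reported above.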
> --- openib-general-request at openib.org wrote: > > > Send openib-general mailing list submissions to > > openib-general at openib.org > > > > To subscribe or unsubscribe via the World Wide Web, > > visit > > http://openib.org/mailman/listinfo/openib-general > > or, via email, send a message with subject or body > > 'help' to > > openib-general-request at openib.org > > > > You can reach the person managing the list at > > openib-general-owner at openib.org > > > > When replying, please edit your Subject line so it > > is more specific > > than "Re: Contents of openib-general digest..." Is this relevant somehow? -- MST From ishai at mellanox.co.il Tue Aug 15 04:14:51 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Tue, 15 Aug 2006 14:14:51 +0300 Subject: [openib-general] A new version for srp daemon Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2F2B@mtlexch01.mtl.com> See Below -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Tuesday, August 15, 2006 1:48 AM To: Ishai Rabinovitz Cc: openib-general at openib.org; Tziporet Koren Subject: Re: A new version for srp daemon > I put the code in > https://openib.org/svn/trunk/contrib/mellanox/gen2/src/userspace/srptool s/srp_daemon Seems like a bizarre place for it -- a package inside of the srptools package?? [ishai] This is another tool for srp. I thought that we want to put several tools in srptools directory. I will put it in https://openib.org/svn/gen2/trunk/src/userspace/srp_daemon > 7) Uses the umad package. This seems like it adds a fairly complicated dependency (since umad depends on something else, etc) for minimal gain. Was it really worth it in terms of your code? I found the umad API more trouble than it was worth for the original srptools. [ishai] You may be right, but this way I can gain from future improvements in the umad package (I'm optimistic) - R. From halr at voltaire.com Tue Aug 15 04:16:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 07:16:31 -0400 Subject: [openib-general] [opensm] the default behavior of the openSM causes problems (configure the PKey table) In-Reply-To: <200608141436.48596.dotanb@mellanox.co.il> References: <200608141436.48596.dotanb@mellanox.co.il> Message-ID: <1155640591.29378.312.camel@hal.voltaire.com> On Mon, 2006-08-14 at 07:36, Dotan Barak wrote: > Hi. > > I noticed that the behavior of the openSM was changed in the latest driver: > > in the past, every HCA was configured (by the FW) with 0xffff in the first entry. > today, Just as an FYI: I think that Anafas have this in the second entry on port 0. -- Hal > the PKey table is being configured by the openSM: the first entry > is being set to 0x7fff (except for the host that the SM is being executed from) > > This behavior is very problemtic because not all of the users would like > to change the default PKey table (for example: MPI users). > Users that will try to use OFED 1.1 (in the same way they used OFED 1.0) will > get unexplained failures, because the connectivity because the nodes will be broken. > (even the perfquery started to fail after executing the SM) > > > I think that the default behavior of the openSM should be: not to change the > PKey table, unless the user provided a PKey table policy file. 
> > Here are the props of the machines > > ************************************************************* > Host Architecture : x86_64 > Linux Distribution: Red Hat Enterprise Linux AS release 4 (Nahant Update 3) > Kernel Version : 2.6.9-34.ELsmp > GCC Version : gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2) > Memory size : 4039892 kB > Driver Version : gen2_linux-20060813-1905 (REV=8916) > HCA ID(s) : mthca0 > HCA model(s) : 23108 > FW version(s) : 3.4.927 > Board(s) : MT_0030000001 > ************************************************************* > > > thanks > Dotan From swise at opengridcomputing.com Tue Aug 15 06:00:12 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 08:00:12 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <200608150904.06673.dotanb@mellanox.co.il> References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <200608150904.06673.dotanb@mellanox.co.il> Message-ID: <1155646812.26332.4.camel@stevo-desktop> can you send me this code? I suspect the main difference is that I'm using librdmacm to join and leave mcast groups. From vlad at mellanox.co.il Tue Aug 15 06:28:06 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 15 Aug 2006 16:28:06 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <20060814150824.11b28c05.weiny2@llnl.gov> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> <1155303417.4507.15009.camel@hal.voltaire.com> <44DF25A2.30405@mellanox.co.il> <20060814150824.11b28c05.weiny2@llnl.gov> Message-ID: <44E1CBE6.2090406@mellanox.co.il> The OFED-1.1-rc1 source tar ball (openib-1.1.tgz ) created by build_ofed.sh script (from https://openib.org/svn/gen2/branches/1.1/ofed/build) build_ofed.sh script takes userspace libraries/binaries after executing: autogen.sh configure make dist Therefor, autogen.sh is not a part of it and also it is the reason that you see Makefiles there. Regards, Vladimir Ira Weiny wrote: > Why is the OFED 1.1-rc1 source tar ball missing files when compared with the 1.1 branch? > > Of specific question is the absence of autogen.sh in libibverbs. > > Ira > > On Sun, 13 Aug 2006 16:14:10 +0300 > "Tziporet Koren" wrote: > > >> Hal Rosenstock wrote: >> >>>> Target release date: 12-Sep >>>> >>>> Intermediate milestones: >>>> 1. Create 1.1 branch of user level: 27-Jul - done >>>> 2. RC1: 8-Aug - done >>>> 3. Feature freeze (RC2): 17-Aug >>>> >>>> >>> What is the start build date for RC2 ? When do developers need to have >>> their code in by to make RC2 ? >>> >>> >> We will start on Tue 15-Aug. Is this OK with you? >> >>> >>> >>> >>>> 4. Code freeze (rc-x): 6-Sep >>>> >>>> >>> Is this 1 or 2 RCs beyond RC2 in order to make this ? >>> >>> >>> >> I hope one but I guess it will be two more RCs. >> >> Tziporet >> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > > From thomas.bub at thomson.net Tue Aug 15 06:36:41 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Tue, 15 Aug 2006 15:36:41 +0200 Subject: [openib-general] Where are running examples using libibcm in OFED-1.0.1? 
Message-ID: Hi, I'm just trying to port my IB gen1 application to gen2 in order to get it running on a SLES 9 SP3 x86_64 Linux workstation. I got stuck in porting the usage of the libcm in gen1 to libibcm in gen2. My code fails on using the ib_cm_create_id call. Looked around for some running examples using the libibcm but could not find them. Got a pointer to: https://openib.org/svn/gen2/branches/1.0/src/userspace/libibcm/examples/ but those wont compile. Can someone help me out here? Here is some background about my application: Out application is running between a PowerPC Linux gen1 IB source and an x86 Linux PC data destination. All is running fine with gen1 drivers from Mellanox IBGD-1.8.2. In order to get our data destination application running on 64-Bit Workstations with SLES 9 SP3 and above as well I have to port it gen2. Thanks in advance. Thomas Bub ............................................................ Thomas Bub Grass Valley Germany GmbH Brunnenweg 9 64331 Weiterstadt, Germany Tel: +49 6150 104 147 Fax: +49 6150 104 656 Email: Thomas.Bub at thomson.net www.GrassValley.com ............................................................ From rdreier at cisco.com Tue Aug 15 06:43:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 06:43:52 -0700 Subject: [openib-general] A new version for srp daemon In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2F2B@mtlexch01.mtl.com> (Ishai Rabinovitz's message of "Tue, 15 Aug 2006 14:14:51 +0300") References: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2F2B@mtlexch01.mtl.com> Message-ID: Ishai> [ishai] This is another tool for srp. I thought that Ishai> we want to put several tools in srptools directory. I will Ishai> put it in Ishai> https://openib.org/svn/gen2/trunk/src/userspace/srp_daemon It would be fine to have the srp daemon be built as part of srptools. But putting another package with its own Makefile etc as a subdirectory of srptools is just weird. - R. From ryszard.jurga at cern.ch Tue Aug 15 06:46:03 2006 From: ryszard.jurga at cern.ch (Ryszard Jurga) Date: Tue, 15 Aug 2006 15:46:03 +0200 Subject: [openib-general] DAPL and local_iov in RDMA RR/RW mode Message-ID: <014f01c6c071$2697d540$3b388d80@cern.ch> Hi Arlin, Thank you for your quick reply. Both dat_ep_post_rdma_read nad dat_ep_post_rdma_write return DAT_SUCCESS. When I read a field 'transfered_length' from DAT_DTO_COMPLETION_EVENT_DATA after calling a post function I receive the correct value which equals num_segs*seg_size. Unfortunately, when I read a content of a local buffer, only first segment is filled by appropriete data. I have tried to set up debug switch (by export DAPL_DBG_TYPE=0xffff before running my application) but unfortunately this does not produce any additional output for post functions. Do you have any other ideas? I did not mention before, but the case with num_segments>1 works fine with a send/recv type of transmision. Best regards, Ryszard. ----- Original Message ----- From: "Arlin Davis" To: "Ryszard Jurga" Cc: "openib" Sent: Friday, August 11, 2006 10:14 PM Subject: Re: [openib-general] DAPL and local_iov in RDMA RR/RW mode > Ryszard Jurga wrote: > >> Hi everybody, >> I have one question about a number of segments in local_iov when using >> RDMA Write and Read mode. Is it possible to have num_segments>1? I am >> asking, because when I try to set up num_segments to a value > 1, then I >> can still only read/write one segment, even though I have an appropriate >> remote buffer already reserved. 
The size of transfered buffer is 10bytes, >> num_segs=2. The information, which is printed below, was obrained from >> network devices with one remark - I have set up manualy >> max_rdma_read_iov=10 and max_rdma_write_iov=10. Thank you in advance for >> your help. > > Yes, uDAPL will support num_segments up to the max counts returned on the > ep_attr. Can you be more specific? Does the post return immediate errors > or are you simply missing data on the remote node? Can you turn up the > uDAPL debug switch (export DAPL_DBG_TYPE=0xffff) and send output of the > post call? > > -arlin > >> Best regards, >> Ryszard. >> EP_ATTR: the same for both nodes: >> ---------------------------------- >> max_message_size=2147483648 >> max_rdma_size=2147483648 >> max_recv_dtos=16 >> max_request_dtos=16 >> max_recv_iov=4 >> max_request_iov=4 >> max_rdma_read_in=4 >> max_rdma_read_out=4 >> srq_soft_hw=0 >> max_rdma_read_iov=10 >> max_rdma_write_iov=10 >> ep_transport_specific_count=0 >> ep_provider_specific_count=0 >> ---------------------------------- >> IA_ATTR: different for nodes >> ---------------------------------- >> IA Info: >> max_eps=64512 >> max_dto_per_ep=65535 >> max_rdma_read_per_ep_in=4 >> max_rdma_read_per_ep_out=1610616831 >> max_evds=65408 >> max_evd_qlen=131071 >> max_iov_segments_per_dto=28 >> max_lmrs=131056 >> max_lmr_block_size=18446744073709551615 >> max_pzs=32768 >> max_message_size=2147483648 >> max_rdma_size=2147483648 >> max_rmrs=0 >> max_srqs=0 >> max_ep_per_srq=0 >> max_recv_per_srq=143263 >> max_iov_segments_per_rdma_read=1073741824 >> max_iov_segments_per_rdma_write=0 >> max_rdma_read_in=0 >> max_rdma_read_out=65535 >> max_rdma_read_per_ep_in_guaranteed=7286 >> max_rdma_read_per_ep_out_guaranteed=0 >> IA Info: >> max_eps=64512 >> max_dto_per_ep=65535 >> max_rdma_read_per_ep_in=4 >> max_rdma_read_per_ep_out=0 >> max_evds=65408 >> max_evd_qlen=131071 >> max_iov_segments_per_dto=28 >> max_lmrs=131056 >> max_lmr_block_size=18446744073709551615 >> max_pzs=32768 >> max_message_size=2147483648 >> max_rdma_size=2147483648 >> max_rmrs=0 >> max_srqs=0 >> max_ep_per_srq=0 >> max_recv_per_srq=142247 >> max_iov_segments_per_rdma_read=1073741824 >> max_iov_segments_per_rdma_write=0 >> max_rdma_read_in=0 >> max_rdma_read_out=65535 >> max_rdma_read_per_ep_in_guaranteed=7286 >> max_rdma_read_per_ep_out_guaranteed=28 >> >>------------------------------------------------------------------------ >> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit >>http://openib.org/mailman/listinfo/openib-general >> From swise at opengridcomputing.com Tue Aug 15 07:12:10 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 09:12:10 -0500 Subject: [openib-general] IB mcast question In-Reply-To: References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <1155588693.4676.48.camel@stevo-desktop> <1155590247.4676.62.camel@stevo-desktop> <44E0EB66.6090508@ichips.intel.com> <1155596571.4676.72.camel@stevo-desktop> Message-ID: <1155651130.26332.32.camel@stevo-desktop> Just throwing out ideas here: Maybe something in the ib_sa_mcmember_rec is prohibiting replication on the HCA? And maybe ib_multicast is incorrectly building this record... 
struct ib_sa_mcmember_rec { union ib_gid mgid; union ib_gid port_gid; __be32 qkey; __be16 mlid; u8 mtu_selector; u8 mtu; u8 traffic_class; __be16 pkey; u8 rate_selector; u8 rate; u8 packet_life_time_selector; u8 packet_life_time; u8 sl; __be32 flow_label; u8 hop_limit; u8 scope; u8 join_state; int proxy_join; }; From rdreier at cisco.com Tue Aug 15 07:15:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 07:15:45 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155651130.26332.32.camel@stevo-desktop> (Steve Wise's message of "Tue, 15 Aug 2006 09:12:10 -0500") References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <1155588693.4676.48.camel@stevo-desktop> <1155590247.4676.62.camel@stevo-desktop> <44E0EB66.6090508@ichips.intel.com> <1155596571.4676.72.camel@stevo-desktop> <1155651130.26332.32.camel@stevo-desktop> Message-ID: Steve> Just throwing out ideas here: Maybe something in the Steve> ib_sa_mcmember_rec is prohibiting replication on the HCA? Steve> And maybe ib_multicast is incorrectly building this Steve> record... Shouldn't make a difference -- if one copy of the packet arrives at the HCA then none of the SA stuff matters as far as replicating it to multiple QPs. - R. From mst at mellanox.co.il Tue Aug 15 07:20:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 17:20:50 +0300 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change with client reregister set Message-ID: <20060815142050.GE15917@mellanox.co.il> Hi, Roland! Please consider the following patch for 2.6.18 - this fixes a regression from 2.6.17 for us. After commit 12bbb2b7be7f5564952ebe0196623e97464b8ac5, when SM LID change or LID change MAD also has a client reregistration bit set, only CLIENT_REREGISTER event is generated. As a result, the sa_query module and the cache module don't update the port information, and ULPs (e.g. IPoIB) stop working. This is the regression we observe as compared to 2.6.17. Rather than generate multiple events (which would have negative performance impact), let us simply let cache and sa query respond to reregister event in the same way as to LID and SM change events. --- IB/core: fix SM LID/LID change with client reregister set If PortInfo set (e.g. LID change) MAD has the reregister bit set, IB_EVENT_LID_CHANGE event is no longer generated. So sa_query and cache must respond to IB_EVENT_CLIENT_REREGISTER event in the same way as to IB_EVENT_LID_CHANGE. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: ofed_1_1/drivers/infiniband/core/cache.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/core/cache.c 2006-08-03 14:30:20.000000000 +0300 +++ ofed_1_1/drivers/infiniband/core/cache.c 2006-08-15 16:31:36.880294000 +0300 @@ -301,7 +301,8 @@ static void ib_cache_event(struct ib_eve event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || - event->event == IB_EVENT_SM_CHANGE) { + event->event == IB_EVENT_SM_CHANGE || + event->event == IB_EVENT_CLIENT_REREGISTER) { work = kmalloc(sizeof *work, GFP_ATOMIC); if (work) { INIT_WORK(&work->work, ib_cache_task, work); Index: ofed_1_1/drivers/infiniband/core/sa_query.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/core/sa_query.c 2006-08-03 14:30:20.000000000 +0300 +++ ofed_1_1/drivers/infiniband/core/sa_query.c 2006-08-15 16:32:35.100728000 +0300 @@ -405,7 +405,8 @@ static void ib_sa_event(struct ib_event_ event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || - event->event == IB_EVENT_SM_CHANGE) { + event->event == IB_EVENT_SM_CHANGE || + event->event == IB_EVENT_CLIENT_REREGISTER) { struct ib_sa_device *sa_dev; sa_dev = container_of(handler, typeof(*sa_dev), event_handler); -- MST From swise at opengridcomputing.com Tue Aug 15 07:22:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 09:22:34 -0500 Subject: [openib-general] IB mcast question In-Reply-To: References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <1155588693.4676.48.camel@stevo-desktop> <1155590247.4676.62.camel@stevo-desktop> <44E0EB66.6090508@ichips.intel.com> <1155596571.4676.72.camel@stevo-desktop> <1155651130.26332.32.camel@stevo-desktop> Message-ID: <1155651754.26332.34.camel@stevo-desktop> How about qp attributes? pkeys? qkeys? On Tue, 2006-08-15 at 07:15 -0700, Roland Dreier wrote: > Steve> Just throwing out ideas here: Maybe something in the > Steve> ib_sa_mcmember_rec is prohibiting replication on the HCA? > Steve> And maybe ib_multicast is incorrectly building this > Steve> record... > > Shouldn't make a difference -- if one copy of the packet arrives at > the HCA then none of the SA stuff matters as far as replicating it to > multiple QPs. > > - R. From rdreier at cisco.com Tue Aug 15 07:27:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 07:27:46 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155651754.26332.34.camel@stevo-desktop> (Steve Wise's message of "Tue, 15 Aug 2006 09:22:34 -0500") References: <1155583285.4676.37.camel@stevo-desktop> <1155586714.4676.45.camel@stevo-desktop> <44E0DE1D.7090301@ichips.intel.com> <1155588693.4676.48.camel@stevo-desktop> <1155590247.4676.62.camel@stevo-desktop> <44E0EB66.6090508@ichips.intel.com> <1155596571.4676.72.camel@stevo-desktop> <1155651130.26332.32.camel@stevo-desktop> <1155651754.26332.34.camel@stevo-desktop> Message-ID: Steve> How about qp attributes? pkeys? qkeys? Good question -- yes, the QPs will need be to set up with the right keys for packets to appear. It's definitely something to check. If different mcmembers are used for the first join of the group and subsequent joins by another QP, that could explain the problem. - R. 
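For anyone wanting to reproduce this outside the RDMA CM, here is a minimal libibverbs sketch of the two things being checked in this thread: every member UD QP needs the group's P_Key and Q_Key, and each one must be attached with the MGID/MLID returned by the SA join. The function name and the pkey_index/port values below are illustrative assumptions, not taken from the test code.

/* Sketch only: bring a fresh UD QP to INIT with the group's keys and attach
 * it, so the HCA replicates group traffic to this QP as well.  mgid, mlid
 * and qkey are assumed to come from the MCMemberRecord of the first join. */
#include <infiniband/verbs.h>

static int add_qp_to_group(struct ibv_qp *qp, uint8_t port,
                           union ibv_gid *mgid, uint16_t mlid, uint32_t qkey)
{
        struct ibv_qp_attr attr = {
                .qp_state   = IBV_QPS_INIT,
                .pkey_index = 0,        /* must select the group's P_Key */
                .port_num   = port,
                .qkey       = qkey,     /* same group Q_Key for every member QP */
        };

        if (ibv_modify_qp(qp, &attr,
                          IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                          IBV_QP_PORT | IBV_QP_QKEY))
                return -1;

        /* deliver packets for this MGID/MLID to this QP too */
        return ibv_attach_mcast(qp, mgid, mlid);
}

If the second QP ends up with a different Q_Key or P_Key than the first, the attach still succeeds but its receives are silently dropped, which would look much like the symptom described here.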
From vlad at mellanox.co.il Tue Aug 15 07:30:35 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 15 Aug 2006 17:30:35 +0300 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: References: <44E0B66A.4090605@mellanox.co.il> Message-ID: <44E1DA8B.8010200@mellanox.co.il> When I run: # ip maddres add 224.0.0.9 dev ib0 I got (dmesg): ib0: joining MGID 0000:0000:ffff:0000:0000:7eb5:25c0:40f1 ib0: multicast join failed for: 0000:0000:ffff:0000:0000:7eb5:25c0:40f1, status -22 And from the SM side: ERR 1B01: Wrong MGID Prefix 0x00 must be 0xFF Regards, Vladimir Roland Dreier wrote: > Vladimir> Currently there is an issue with join to IPoIB multicast > Vladimir> group using both ip and ipmaddr utilities. > > What's the issue? > > - R. > > From mst at mellanox.co.il Tue Aug 15 07:34:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 17:34:52 +0300 Subject: [openib-general] [PATCHv2] IB/srp: add port/device attributes In-Reply-To: References: Message-ID: <20060815143452.GF15917@mellanox.co.il> Quoting r. Roland Dreier : > Michael> Hi, Roland! There does not, at the moment, seem to exist > Michael> a way to find out which HCA port the specific SRP host is > Michael> connected through. Here's the updated version of the patch. Pls queue for 2.6.19. Add local_ib_device/local_ib_port attributes to srp scsi_host. Needed for when we want to connect to the same target through multiple distinct ports. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. Tsirkin Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-08-15 16:55:55.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-08-15 16:58:22.000000000 +0300 @@ -1467,12 +1467,29 @@ static ssize_t show_zero_req_lim(struct return sprintf(buf, "%d\n", target->zero_req_lim); } -static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); -static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); -static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); -static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); -static CLASS_DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); -static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); +static ssize_t show_local_ib_port(struct class_device *cdev, char *buf) +{ + struct srp_target_port *target = host_to_target(class_to_shost(cdev)); + + return sprintf(buf, "%d\n", target->srp_host->port); +} + +static ssize_t show_local_ib_device(struct class_device *cdev, char *buf) +{ + struct srp_target_port *target = host_to_target(class_to_shost(cdev)); + + return sprintf(buf, "%s\n", target->srp_host->dev->dev->name); +} + + +static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); +static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); +static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); +static CLASS_DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); +static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); +static CLASS_DEVICE_ATTR(local_ib_port, S_IRUGO, show_local_ib_port, NULL); +static CLASS_DEVICE_ATTR(local_ib_device, S_IRUGO, show_local_ib_device, NULL); static struct class_device_attribute *srp_host_attrs[] = { &class_device_attr_id_ext, @@ -1481,6 +1498,8 @@ static struct class_device_attribute *sr &class_device_attr_pkey, &class_device_attr_dgid, 
&class_device_attr_zero_req_lim, + &class_device_attr_local_ib_port, + &class_device_attr_local_ib_device, NULL }; -- MST From halr at voltaire.com Tue Aug 15 07:42:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 10:42:34 -0400 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change with client reregister set In-Reply-To: <20060815142050.GE15917@mellanox.co.il> References: <20060815142050.GE15917@mellanox.co.il> Message-ID: <1155652951.29378.6460.camel@hal.voltaire.com> On Tue, 2006-08-15 at 10:20, Michael S. Tsirkin wrote: > Hi, Roland! > Please consider the following patch for 2.6.18 - this fixes a regression > from 2.6.17 for us. > > After commit 12bbb2b7be7f5564952ebe0196623e97464b8ac5, when > SM LID change or LID change MAD also has a client reregistration bit > set, only CLIENT_REREGISTER event is generated. > > As a result, the sa_query module and the cache module don't update > the port information, and ULPs (e.g. IPoIB) stop working. > This is the regression we observe as compared to 2.6.17. > > Rather than generate multiple events (which would have negative performance > impact), let us simply let cache and sa query respond to reregister event > in the same way as to LID and SM change events. Are these two events equivalent ? e.g. does LID change require reregistration ? (That's a potential overhead as well). What about deregistration of the old registrations when this occurs ? Is that handled ? -- Hal > --- > > IB/core: fix SM LID/LID change with client reregister set > > If PortInfo set (e.g. LID change) MAD has the reregister bit set, > IB_EVENT_LID_CHANGE event is no longer generated. > So sa_query and cache must respond to IB_EVENT_CLIENT_REREGISTER event > in the same way as to IB_EVENT_LID_CHANGE. > > Signed-off-by: Jack Morgenstein > Signed-off-by: Michael S. 
Tsirkin > > Index: ofed_1_1/drivers/infiniband/core/cache.c > =================================================================== > --- ofed_1_1.orig/drivers/infiniband/core/cache.c 2006-08-03 14:30:20.000000000 +0300 > +++ ofed_1_1/drivers/infiniband/core/cache.c 2006-08-15 16:31:36.880294000 +0300 > @@ -301,7 +301,8 @@ static void ib_cache_event(struct ib_eve > event->event == IB_EVENT_PORT_ACTIVE || > event->event == IB_EVENT_LID_CHANGE || > event->event == IB_EVENT_PKEY_CHANGE || > - event->event == IB_EVENT_SM_CHANGE) { > + event->event == IB_EVENT_SM_CHANGE || > + event->event == IB_EVENT_CLIENT_REREGISTER) { > work = kmalloc(sizeof *work, GFP_ATOMIC); > if (work) { > INIT_WORK(&work->work, ib_cache_task, work); > Index: ofed_1_1/drivers/infiniband/core/sa_query.c > =================================================================== > --- ofed_1_1.orig/drivers/infiniband/core/sa_query.c 2006-08-03 14:30:20.000000000 +0300 > +++ ofed_1_1/drivers/infiniband/core/sa_query.c 2006-08-15 16:32:35.100728000 +0300 > @@ -405,7 +405,8 @@ static void ib_sa_event(struct ib_event_ > event->event == IB_EVENT_PORT_ACTIVE || > event->event == IB_EVENT_LID_CHANGE || > event->event == IB_EVENT_PKEY_CHANGE || > - event->event == IB_EVENT_SM_CHANGE) { > + event->event == IB_EVENT_SM_CHANGE || > + event->event == IB_EVENT_CLIENT_REREGISTER) { > struct ib_sa_device *sa_dev; > sa_dev = container_of(handler, typeof(*sa_dev), event_handler); > From halr at voltaire.com Tue Aug 15 07:44:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 10:44:00 -0400 Subject: [openib-general] [PATCH v2 1/2] sa_query: add generic query interfaces capable of supporting RMPP In-Reply-To: <000001c6b723$73cd2380$e598070a@amr.corp.intel.com> References: <000001c6b723$73cd2380$e598070a@amr.corp.intel.com> Message-ID: <1155653040.29378.6534.camel@hal.voltaire.com> On Thu, 2006-08-03 at 13:37, Sean Hefty wrote: > The following patch adds a generic interface to send MADs to the SA. > The primary motivation of adding these calls is to expand the SA query > interface to include RMPP responses for users wanting more than a > single attribute returned from a query (e.g. multipath record queries). Do you mean multiple path records rather than multipath record queries here ? > The implementation of existing SA query routines was layered on top of > the generic query interface. > > Signed-off-by: Sean Hefty [snip...] > +/* Return size of SA attributes on the wire. */ > +static int sa_mad_attr_size(int attr_id) > +{ > + int size; > + > + switch (attr_id) { > + case IB_SA_ATTR_SERVICE_REC: > + size = 176; > + break; > + case IB_SA_ATTR_PATH_REC: > + size = 64; > + break; > + case IB_SA_ATTR_MC_MEMBER_REC: > + size = 52; You probably already found this but this should be 56 as SA attributes are required to be modulo 8 in size. 
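As an aside, the padding rule referred to here is easy to capture; the helper below is illustrative only and not part of the patch under review.

/* SA attributes are padded to a multiple of 8 bytes on the wire, so an
 * attribute whose defined fields total 52 bytes occupies 56. */
static int sa_attr_wire_size(int defined_bytes)
{
        return (defined_bytes + 7) & ~7;   /* round up to a multiple of 8 */
}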
> + break; > + default: > + size = 0; > + break; > + } > + return size; > +} > + -- Hal From halr at voltaire.com Tue Aug 15 07:48:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 10:48:03 -0400 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: <44E1DA8B.8010200@mellanox.co.il> References: <44E0B66A.4090605@mellanox.co.il> <44E1DA8B.8010200@mellanox.co.il> Message-ID: <1155653282.29378.6641.camel@hal.voltaire.com> On Tue, 2006-08-15 at 10:30, Vladimir Sokolovsky wrote: > When I run: > # ip maddres add 224.0.0.9 dev ib0 > I got (dmesg): > ib0: joining MGID 0000:0000:ffff:0000:0000:7eb5:25c0:40f1 > ib0: multicast join failed for: 0000:0000:ffff:0000:0000:7eb5:25c0:40f1, > status -22 > > And from the SM side: > ERR 1B01: Wrong MGID Prefix 0x00 must be 0xFF That means that something is forming the MGID improperly (from the IPmc address 224.0.0.9). Specifically, the MGID must start with 0xFF in the first byte. Is the MGID displayed slightly earlier in the log ? If no, can you run with -V and it should be there to see what the SM is seeing for the invalid MGID. -- Hal > Regards, > Vladimir > > Roland Dreier wrote: > > Vladimir> Currently there is an issue with join to IPoIB multicast > > Vladimir> group using both ip and ipmaddr utilities. > > > > What's the issue? > > > > - R. > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Tue Aug 15 07:59:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 17:59:44 +0300 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change with client reregisterset In-Reply-To: <1155652951.29378.6460.camel@hal.voltaire.com> References: <1155652951.29378.6460.camel@hal.voltaire.com> Message-ID: <20060815145944.GG15917@mellanox.co.il> Quoting r. Hal Rosenstock : > Are these two events equivalent ? e.g. does LID change require > reregistration ? (That's a potential overhead as well). Client reregistration support is an optional SM feature > What about deregistration of the old registrations when this occurs ? > Is that handled ? ??? 
sa_query and cache do not have any old registrations -- MST From vlad at mellanox.co.il Tue Aug 15 07:58:19 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 15 Aug 2006 17:58:19 +0300 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: <1155653282.29378.6641.camel@hal.voltaire.com> References: <44E0B66A.4090605@mellanox.co.il> <44E1DA8B.8010200@mellanox.co.il> <1155653282.29378.6641.camel@hal.voltaire.com> Message-ID: <44E1E10B.4040709@mellanox.co.il> MGID from the log: Aug 15 17:57:20 737657 [40796BB0] -> MCMember Record dump: MGID....................0x00000000ffff0000 : 0x00007eb525c040f1 PortGid.................0xfe80000000000000 : 0x0002c9020020d8ed qkey....................0xB1B mlid....................0x0 mtu.....................0x0 TClass..................0x0 pkey....................0xFFFF rate....................0x0 pkt_life................0x0 SLFlowLabelHopLimit.....0x0 ScopeState..............0x1 ProxyJoin...............0x0 Aug 15 17:57:20 737735 [40796BB0] -> osm_mcmr_rcv_create_new_mgrp: [ Aug 15 17:57:20 737755 [40796BB0] -> __get_new_mlid: [ Aug 15 17:57:20 737776 [40796BB0] -> __get_new_mlid: Found mgrp with lid:0xC000 MGID: 0xff12401bffff0000 : 0x00000000ffffffff Aug 15 17:57:20 737797 [40796BB0] -> __get_new_mlid: Found mgrp with lid:0xC001 MGID: 0xff12401bffff0000 : 0x0000000000000001 Aug 15 17:57:20 737817 [40796BB0] -> __get_new_mlid: Found mgrp with lid:0xC002 MGID: 0xff12601bffff0000 : 0x00000001ff20d8ee Aug 15 17:57:20 737836 [40796BB0] -> __get_new_mlid: Found mgrp with lid:0xC003 MGID: 0xff12601bffff0000 : 0x0000000000000001 Aug 15 17:57:20 737856 [40796BB0] -> __get_new_mlid: Found mgrp with lid:0xC004 MGID: 0xff12601bffff0000 : 0x00000001ff20d8ed Aug 15 17:57:20 737876 [40796BB0] -> __get_new_mlid: Found mgrp with lid:0xC005 MGID: 0xff12401bffff0000 : 0x00000000000000fb Aug 15 17:57:20 737895 [40796BB0] -> __get_new_mlid: Found mgrp with lid:0xC006 MGID: 0xff12601bffff0000 : 0x00000001ff20d901 Aug 15 17:57:20 737916 [40796BB0] -> __get_new_mlid: Found available mlid:0xC007 at idx:7 Aug 15 17:57:20 737936 [40796BB0] -> __get_new_mlid: ] Regards, Vladimir Hal Rosenstock wrote: > On Tue, 2006-08-15 at 10:30, Vladimir Sokolovsky wrote: > >> When I run: >> # ip maddres add 224.0.0.9 dev ib0 >> I got (dmesg): >> ib0: joining MGID 0000:0000:ffff:0000:0000:7eb5:25c0:40f1 >> ib0: multicast join failed for: 0000:0000:ffff:0000:0000:7eb5:25c0:40f1, >> status -22 >> >> And from the SM side: >> ERR 1B01: Wrong MGID Prefix 0x00 must be 0xFF >> > > That means that something is forming the MGID improperly (from the IPmc > address 224.0.0.9). Specifically, the MGID must start with 0xFF in the > first byte. Is the MGID displayed slightly earlier in the log ? If no, > can you run with -V and it should be there to see what the SM is seeing > for the invalid MGID. > > -- Hal > > >> Regards, >> Vladimir >> >> Roland Dreier wrote: >> >>> Vladimir> Currently there is an issue with join to IPoIB multicast >>> Vladimir> group using both ip and ipmaddr utilities. >>> >>> What's the issue? >>> >>> - R. 
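For reference, the groups already present in the SM log above follow the IPoIB mapping of IPv4 multicast addresses (ff12:401b:<P_Key> prefix, low 28 bits of the group address in the last word), whereas the rejected MGID starts with 0x00, which is exactly what ERR 1B01 complains about. Below is a sketch of that mapping, assuming the RFC 4391 layout and the 0xffff P_Key seen in the log.

/* Illustrative only: the MGID IPoIB itself would join for an IPv4 multicast
 * group, matching the ff12:401b:ffff:... entries in the SM log above.  A
 * well-formed MGID always begins with 0xff. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

static void ipv4_mcast_to_mgid(uint32_t group_be, uint16_t pkey, uint8_t mgid[16])
{
        uint32_t low28 = ntohl(group_be) & 0x0fffffff;

        memset(mgid, 0, 16);
        mgid[0]  = 0xff;                /* required MGID prefix */
        mgid[1]  = 0x12;                /* flags + link-local scope */
        mgid[2]  = 0x40;                /* IPv4 signature 0x401b */
        mgid[3]  = 0x1b;
        mgid[4]  = pkey >> 8;           /* partition, 0xffff in the log above */
        mgid[5]  = pkey & 0xff;
        mgid[12] = low28 >> 24;         /* low 28 bits of the group address */
        mgid[13] = (low28 >> 16) & 0xff;
        mgid[14] = (low28 >> 8) & 0xff;
        mgid[15] = low28 & 0xff;
}

With this mapping 224.0.0.9 becomes ff12:401b:ffff::9, the same pattern as the 224.0.0.1 and 224.0.0.251 groups the SM already knows about; the 0000:0000:ffff:... GID in the failing request is not a multicast GID at all, so the SM rejects it.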
>>> >>> >>> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > From rdreier at cisco.com Tue Aug 15 08:08:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 08:08:29 -0700 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: <44E1DA8B.8010200@mellanox.co.il> (Vladimir Sokolovsky's message of "Tue, 15 Aug 2006 17:30:35 +0300") References: <44E0B66A.4090605@mellanox.co.il> <44E1DA8B.8010200@mellanox.co.il> Message-ID: Vladimir> When I run: # ip maddres add 224.0.0.9 dev ib0 It's poorly documented, but "ip maddr add" needs a link level address, not an IP address. - R. From jackm at mellanox.co.il Tue Aug 15 08:06:22 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 15 Aug 2006 18:06:22 +0300 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change with client reregister set In-Reply-To: <1155652951.29378.6460.camel@hal.voltaire.com> References: <20060815142050.GE15917@mellanox.co.il> <1155652951.29378.6460.camel@hal.voltaire.com> Message-ID: <200608151806.22228.jackm@mellanox.co.il> On Tuesday 15 August 2006 17:42, Hal Rosenstock wrote: > > Are these two events equivalent ? e.g. does LID change require > reregistration ? (That's a potential overhead as well). > Before the change in mthca_mad.c (sm_snoop), the LID CHANGE event was generated in all cases where a SET port-info MAD was received. After the change, either LID_CHANGE -- or -- CLIENT_REREGISTER is generated. (i.e., CLIENT REREGISTER replaced LID_CHANGE in those cases where the reregistration bit was set in the MAD). This patch simply restores the previous behavior of the sa_query and cache modules in responding to events. No reregistration is required in these modules. > What about deregistration of the old registrations when this occurs ? > Is that handled ? > There is no deregistration involvement. - Jack From halr at voltaire.com Tue Aug 15 08:05:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 11:05:46 -0400 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change with client reregisterset In-Reply-To: <20060815145944.GG15917@mellanox.co.il> References: <1155652951.29378.6460.camel@hal.voltaire.com> <20060815145944.GG15917@mellanox.co.il> Message-ID: <1155654346.29378.7226.camel@hal.voltaire.com> On Tue, 2006-08-15 at 10:59, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > Are these two events equivalent ? e.g. does LID change require > > reregistration ? (That's a potential overhead as well). > > Client reregistration support is an optional SM feature I don't think that answers my question. Yes, client reregistration is optional but when it is supported, are the two events equivalent ? > > What about deregistration of the old registrations when this occurs ? > > Is that handled ? > > ??? What's the ??? for ? > sa_query and cache do not have any old registrations Currently; it could be an issue in the future. -- Hal From vlad at mellanox.co.il Tue Aug 15 08:15:10 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 15 Aug 2006 18:15:10 +0300 Subject: [openib-general] [RFC] IPoIB high availability daemon In-Reply-To: References: <44E0B66A.4090605@mellanox.co.il> <44E1DA8B.8010200@mellanox.co.il> Message-ID: <44E1E4FE.1000909@mellanox.co.il> It is the same result for link level address. 
- Vladimir Roland Dreier wrote: > Vladimir> When I run: # ip maddres add 224.0.0.9 dev ib0 > > It's poorly documented, but "ip maddr add" needs a link level address, > not an IP address. > > - R. > > From rdreier at cisco.com Tue Aug 15 08:34:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 08:34:30 -0700 Subject: [openib-general] A new version for srp daemon In-Reply-To: <44E1E929.4090306@mellanox.co.il> (Tziporet Koren's message of "Tue, 15 Aug 2006 18:32:57 +0300") References: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2F2B@mtlexch01.mtl.com> <44E1E929.4090306@mellanox.co.il> Message-ID: Tziporet> So do you suggest we change the current srptools Tziporet> Makefile to include this daemon too? Sure, that would be fine. - R. From sean.hefty at intel.com Tue Aug 15 08:35:40 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 08:35:40 -0700 Subject: [openib-general] Where are running examples using libibcm in OFED-1.0.1? In-Reply-To: Message-ID: <000201c6c080$76ed06f0$3ccd180a@amr.corp.intel.com> >My code fails on using the ib_cm_create_id call. Not sure why this happened. >Looked around for some running examples using the libibcm but could not >find them. >Got a pointer to: > >https://openib.org/svn/gen2/branches/1.0/src/userspace/libibcm/examples/ >but those wont compile. >Can someone help me out here? The cmpost test program was never updated to create rdma_event_channels. It's not too difficult to add that in and get it to run. I have a fix for this included with some other changes in my local repository. I'll see if I can extract out those changes for you. - Sean From tziporet at mellanox.co.il Tue Aug 15 08:32:57 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 15 Aug 2006 18:32:57 +0300 Subject: [openib-general] A new version for srp daemon In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E302AE2F2B@mtlexch01.mtl.com> Message-ID: <44E1E929.4090306@mellanox.co.il> Roland Dreier wrote: > Ishai> [ishai] This is another tool for srp. I thought that > Ishai> we want to put several tools in srptools directory. I will > Ishai> put it in > Ishai> https://openib.org/svn/gen2/trunk/src/userspace/srp_daemon > > It would be fine to have the srp daemon be built as part of srptools. > But putting another package with its own Makefile etc as a > subdirectory of srptools is just weird. > > - R. So do you suggest we change the current srptools Makefile to include this daemon too? Tziporet From halr at voltaire.com Tue Aug 15 08:34:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 11:34:22 -0400 Subject: [openib-general] [PATCH] osm: Dynamic verbosity control per file In-Reply-To: <842b8cdf0608020816y3fdfa145nea876171f58650d9@mail.gmail.com> References: <842b8cdf0608020816y3fdfa145nea876171f58650d9@mail.gmail.com> Message-ID: <1155656058.29378.8180.camel@hal.voltaire.com> Hi Yevgeny, On Wed, 2006-08-02 at 11:16, Yevgeny Kliteynik wrote: > Hi Hal > > This patch adds new verbosity functionality. I finally got some time to look this over and have a few comments on it (see embedded text). Generally, this looks good. One other thing that should be added to this patch is an update to the opensm man page for this functionality. So can you address the comment above as well as the embedded ones and resubmit ? Also, is this functionality needed for OFED 1.1 or is this trunk only ? Thanks. -- Hal > 1. 
Verbosity configuration file > ------------------------------- > > The user is able to set verbosity level per source code file > by supplying verbosity configuration file using the following > command line arguments: > > -b filename > --verbosity_file filename > > By default, the OSM will use the following file: /etc/opensmlog.conf Nit: For consistency in naming, this would be better as osmlog.conf (or osm-log.conf) rather than opensmlog.conf > Verbosity configuration file should contain zero or more lines of > the following pattern: > > filename verbosity_level > > where 'filename' is the name of the source code file that the > 'verbosity_level' refers to, and the 'verbosity_level' itself > should be specified as an integer number (decimal or hexadecimal). > > One reserved filename is 'all' - it represents general verbosity > level, that is used for all the files that are not specified in > the verbosity configuration file. > If 'all' is not specified, the verbosity level set in the > command line will be used instead. > Note: The 'all' file verbosity level will override any other > general level that was specified by the command line arguments. > > Sending a SIGHUP signal to the OSM will cause it to reload > the verbosity configuration file. > > > 2. Logging source code filename and line number > ----------------------------------------------- > > If command line option -S or --log_source_info is specified, > OSM will add source code filename and line number to every > log message that is written to the log file. > By default, the OSM will not log this additional info. > > > Yevgeny > > Signed-off-by: Yevgeny Kliteynik > > Index: include/opensm/osm_subnet.h > =================================================================== > --- include/opensm/osm_subnet.h (revision 8614) > +++ include/opensm/osm_subnet.h (working copy) > @@ -285,6 +285,8 @@ typedef struct _osm_subn_opt > osm_qos_options_t qos_sw0_options; > osm_qos_options_t qos_swe_options; > osm_qos_options_t qos_rtr_options; > + boolean_t src_info; > + char * verbosity_file; > } osm_subn_opt_t; > /* > * FIELDS > @@ -463,6 +465,27 @@ typedef struct _osm_subn_opt > * qos_rtr_options > * QoS options for router ports > * > +* src_info > +* If TRUE - the source code filename and line number will be > +* added to each log message. > +* Default value - FALSE. > +* > +* verbosity_file > +* OSM log configuration file - the file that describes > +* verbosity level per source code file. > +* The file may containg zero or more lines of the following > +* pattern: > +* filename verbosity_level > +* where 'filename' is the name of the source code file that > +* the 'verbosity_level' refers to. > +* Filename "all" represents general verbosity level, that is > +* used for all the files that are not specified in the > +* verbosity file. > +* If "all" is not specified, the general verbosity level will > +* be used instead. > +* Note: the "all" file verbosity level will override any other > +* general level that was specified by the command line > arguments. 
> +* > * SEE ALSO > * Subnet object > *********/ > Index: include/opensm/osm_base.h > =================================================================== > --- include/opensm/osm_base.h (revision 8614) > +++ include/opensm/osm_base.h (working copy) > @@ -222,6 +222,22 @@ BEGIN_C_DECLS > #endif > /***********/ > > +/****d* OpenSM: Base/OSM_DEFAULT_VERBOSITY_FILE > +* NAME > +* OSM_DEFAULT_VERBOSITY_FILE > +* > +* DESCRIPTION > +* Specifies the default verbosity config file name > +* > +* SYNOPSIS > +*/ > +#ifdef __WIN__ > +#define OSM_DEFAULT_VERBOSITY_FILE strcat(GetOsmPath(), " > opensmlog.conf") > +#else > +#define OSM_DEFAULT_VERBOSITY_FILE "/etc/opensmlog.conf" > +#endif > +/***********/ > + > /****d* OpenSM: Base/OSM_DEFAULT_PARTITION_CONFIG_FILE > * NAME > * OSM_DEFAULT_PARTITION_CONFIG_FILE > Index: include/opensm/osm_log.h > =================================================================== > --- include/opensm/osm_log.h (revision 8652) > +++ include/opensm/osm_log.h (working copy) > @@ -57,6 +57,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -123,9 +124,45 @@ typedef struct _osm_log > cl_spinlock_t lock; > boolean_t flush; > FILE* out_port; > + boolean_t src_info; > + st_table * table; > } osm_log_t; > /*********/ > > +/****f* OpenSM: Log/osm_log_read_verbosity_file > +* NAME > +* osm_log_read_verbosity_file > +* > +* DESCRIPTION > +* This function reads the verbosity configuration file > +* and constructs a verbosity data structure. > +* > +* SYNOPSIS > +*/ > +void > +osm_log_read_verbosity_file( > + IN osm_log_t* p_log, > + IN const char * const verbosity_file); > +/* > +* PARAMETERS > +* p_log > +* [in] Pointer to a Log object to construct. > +* > +* verbosity_file > +* [in] verbosity configuration file > +* > +* RETURN VALUE > +* None > +* > +* NOTES > +* If the verbosity configuration file is not found, default > +* verbosity value is used for all files. > +* If there is an error in some line of the verbosity > +* configuration file, the line is ignored. > +* > +*********/ > + > + > /****f* OpenSM: Log/osm_log_construct > * NAME > * osm_log_construct > @@ -201,9 +238,13 @@ osm_log_destroy( > * osm_log_init > *********/ > > -/****f* OpenSM: Log/osm_log_init > +#define osm_log_init(p_log, flush, log_flags, log_file, > accum_log_file) \ > + osm_log_init_ext(p_log, flush, (log_flags), log_file, \ > + accum_log_file, FALSE, OSM_DEFAULT_VERBOSITY_FILE) > + > +/****f* OpenSM: Log/osm_log_init_ext > * NAME > -* osm_log_init > +* osm_log_init_ext > * > * DESCRIPTION > * The osm_log_init function initializes a > @@ -211,50 +252,15 @@ osm_log_destroy( > * > * SYNOPSIS > */ > -static inline ib_api_status_t > -osm_log_init( > +ib_api_status_t > +osm_log_init_ext( > IN osm_log_t* const p_log, > IN const boolean_t flush, > IN const uint8_t log_flags, > IN const char *log_file, > - IN const boolean_t accum_log_file ) > -{ > - p_log->level = log_flags; > - p_log->flush = flush; > - > - if (log_file == NULL || !strcmp(log_file, "-") || > - !strcmp(log_file, "stdout")) > - { > - p_log->out_port = stdout; > - } > - else if (!strcmp(log_file, "stderr")) > - { > - p_log->out_port = stderr; > - } > - else > - { > - if (accum_log_file) > - p_log->out_port = fopen(log_file, "a+"); > - else > - p_log->out_port = fopen(log_file, "w+"); > - > - if (!p_log->out_port) > - { > - if (accum_log_file) > - printf("Cannot open %s for appending. Permission denied\n", > log_file); > - else > - printf("Cannot open %s for writing. 
Permission denied\n", > log_file); These lines above are line wrapped so they don't apply. This is an email issue on your side. > - > - return(IB_UNKNOWN_ERROR); > - } > - } > - openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); > - > - if (cl_spinlock_init( &p_log->lock ) == CL_SUCCESS) > - return IB_SUCCESS; > - else > - return IB_ERROR; > -} > + IN const boolean_t accum_log_file, > + IN const boolean_t src_info, > + IN const char *verbosity_file); > /* > * PARAMETERS > * p_log > @@ -271,6 +277,16 @@ osm_log_init( > * log_file > * [in] if not NULL defines the name of the log file. Otherwise it > is stdout. > * > +* accum_log_file > +* [in] Whether the log file should be accumulated. > +* > +* src_info > +* [in] Set to TRUE directs the log to add filename and line > number > +* to each log message. > +* > +* verbosity_file > +* [in] Log configuration file location. > +* > * RETURN VALUES > * CL_SUCCESS if the Log object was initialized > * successfully. > @@ -283,26 +299,32 @@ osm_log_init( > * osm_log_destroy > *********/ > > -/****f* OpenSM: Log/osm_log_get_level > +#define osm_log_get_level(p_log) \ > + osm_log_get_level_ext(p_log, __FILE__) > + > +/****f* OpenSM: Log/osm_log_get_level_ext > * NAME > -* osm_log_get_level > +* osm_log_get_level_ext > * > * DESCRIPTION > -* Returns the current log level. > +* Returns the current log level for the file. > +* If the file is not specified in the log config file, > +* the general verbosity level will be returned. > * > * SYNOPSIS > */ > -static inline osm_log_level_t > -osm_log_get_level( > - IN const osm_log_t* const p_log ) > -{ > - return( p_log->level ); > -} > +osm_log_level_t > +osm_log_get_level_ext( > + IN const osm_log_t* const p_log, > + IN const char* const p_filename ); > /* > * PARAMETERS > * p_log > * [in] Pointer to the log object. > * > +* p_filename > +* [in] Source code file name. > +* > * RETURN VALUES > * Returns the current log level. > * > @@ -310,7 +332,7 @@ osm_log_get_level( > * > * SEE ALSO > * Log object, osm_log_construct, > -* osm_log_destroy > +* osm_log_destroy, osm_log_get_level > *********/ > > /****f* OpenSM: Log/osm_log_set_level > @@ -318,7 +340,7 @@ osm_log_get_level( > * osm_log_set_level > * > * DESCRIPTION > -* Sets the current log level. > +* Sets the current general log level. > * > * SYNOPSIS > */ > @@ -338,7 +360,7 @@ osm_log_set_level( > * [in] New level to set. > * > * RETURN VALUES > -* Returns the current log level. > +* None. > * > * NOTES > * > @@ -347,9 +369,12 @@ osm_log_set_level( > * osm_log_destroy > *********/ > > -/****f* OpenSM: Log/osm_log_is_active > +#define osm_log_is_active(p_log, level) \ > + osm_log_is_active_ext(p_log, __FILE__, level) > + > +/****f* OpenSM: Log/osm_log_is_active_ext > * NAME > -* osm_log_is_active > +* osm_log_is_active_ext > * > * DESCRIPTION > * Returns TRUE if the specified log level would be logged. > @@ -357,18 +382,19 @@ osm_log_set_level( > * > * SYNOPSIS > */ > -static inline boolean_t > -osm_log_is_active( > +boolean_t > +osm_log_is_active_ext( > IN const osm_log_t* const p_log, > - IN const osm_log_level_t level ) > -{ > - return( (p_log->level & level) != 0 ); > -} > + IN const char* const p_filename, > + IN const osm_log_level_t level ); > /* > * PARAMETERS > * p_log > * [in] Pointer to the log object. > * > +* p_filename > +* [in] Source code file name. > +* > * level > * [in] Level to check. > * > @@ -383,17 +409,125 @@ osm_log_is_active( > * osm_log_destroy > *********/ > > + > +#define osm_log(p_log, verbosity, p_str, args...) 
\ > + osm_log_ext(p_log, verbosity, __FILE__, __LINE__, p_str , ## > args) > + > +/****f* OpenSM: Log/osm_log_ext > +* NAME > +* osm_log_ext > +* > +* DESCRIPTION > +* Logs the formatted specified message. > +* > +* SYNOPSIS > +*/ > void > -osm_log( > +osm_log_ext( > IN osm_log_t* const p_log, > IN const osm_log_level_t verbosity, > + IN const char *p_filename, > + IN int line, > IN const char *p_str, ... ); > +/* > +* PARAMETERS > +* p_log > +* [in] Pointer to the log object. > +* > +* verbosity > +* [in] Current message verbosity level > + > + p_filename > + [in] Name of the file that is logging this message > + > + line > + [in] Line number in the file that is logging this message > + > + p_str > + [in] Format string of the message > +* > +* RETURN VALUES > +* None. > +* > +* NOTES > +* > +* SEE ALSO > +* Log object, osm_log_construct, > +* osm_log_destroy > +*********/ > > +#define osm_log_raw(p_log, verbosity, p_buff) \ > + osm_log_raw_ext(p_log, verbosity, __FILE__, p_buff) > + > +/****f* OpenSM: Log/osm_log_raw_ext > +* NAME > +* osm_log_ext > +* > +* DESCRIPTION > +* Logs the specified message. > +* > +* SYNOPSIS > +*/ > void > -osm_log_raw( > +osm_log_raw_ext( > IN osm_log_t* const p_log, > IN const osm_log_level_t verbosity, > + IN const char * p_filename, > IN const char *p_buf ); > +/* > +* PARAMETERS > +* p_log > +* [in] Pointer to the log object. > +* > +* verbosity > +* [in] Current message verbosity level > + > + p_filename > + [in] Name of the file that is logging this message > + > + p_buf > + [in] Message string > +* > +* RETURN VALUES > +* None. > +* > +* NOTES > +* > +* SEE ALSO > +* Log object, osm_log_construct, > +* osm_log_destroy > +*********/ > + > + > +/****f* OpenSM: Log/osm_log_flush > +* NAME > +* osm_log_flush > +* > +* DESCRIPTION > +* Flushes the log. > +* > +* SYNOPSIS > +*/ > +static inline void > +osm_log_flush( > + IN osm_log_t* const p_log) > +{ > + fflush(p_log->out_port); > +} > +/* > +* PARAMETERS > +* p_log > +* [in] Pointer to the log object. > +* > +* RETURN VALUES > +* None. 
> +* > +* NOTES > +* > +* SEE ALSO > +* > +*********/ > + > > #define DBG_CL_LOCK 0 > > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 8614) > +++ opensm/osm_subnet.c (working copy) > @@ -493,6 +493,8 @@ osm_subn_set_default_opt( > p_opt->ucast_dump_file = NULL; > p_opt->updn_guid_file = NULL; > p_opt->exit_on_fatal = TRUE; > + p_opt->src_info = FALSE; > + p_opt->verbosity_file = OSM_DEFAULT_VERBOSITY_FILE; > subn_set_default_qos_options(&p_opt->qos_options); > subn_set_default_qos_options(&p_opt->qos_hca_options); > subn_set_default_qos_options(&p_opt->qos_sw0_options); > @@ -959,6 +961,13 @@ osm_subn_parse_conf_file( > "honor_guid2lid_file", > p_key, p_val, &p_opts->honor_guid2lid_file); > > + __osm_subn_opts_unpack_boolean( > + "log_source_info", > + p_key, p_val, &p_opts->src_info); > + > + __osm_subn_opts_unpack_charp( > + "verbosity_file", p_key, p_val, &p_opts->verbosity_file); > + > subn_parse_qos_options("qos", > p_key, p_val, &p_opts->qos_options); > > @@ -1182,7 +1191,11 @@ osm_subn_write_conf_file( > "# No multicast routing is performed if TRUE\n" > "disable_multicast %s\n\n" > "# If TRUE opensm will exit on fatal initialization issues\n" > - "exit_on_fatal %s\n\n", > + "exit_on_fatal %s\n\n" > + "# If TRUE OpenSM will log filename and line numbers\n" > + "log_source_info %s\n\n" > + "# Verbosity configuration file to be used\n" > + "verbosity_file %s\n\n", > p_opts->log_flags, > p_opts->force_log_flush ? "TRUE" : "FALSE", > p_opts->log_file, > @@ -1190,7 +1203,9 @@ osm_subn_write_conf_file( > p_opts->dump_files_dir, > p_opts->no_multicast_option ? "TRUE" : "FALSE", > p_opts->disable_multicast ? "TRUE" : "FALSE", > - p_opts->exit_on_fatal ? "TRUE" : "FALSE" > + p_opts->exit_on_fatal ? "TRUE" : "FALSE", > + p_opts->src_info ? "TRUE" : "FALSE", > + p_opts->verbosity_file > ); > > fprintf( > Index: opensm/osm_opensm.c > =================================================================== > --- opensm/osm_opensm.c (revision 8614) > +++ opensm/osm_opensm.c (working copy) > @@ -180,8 +180,10 @@ osm_opensm_init( > /* Can't use log macros here, since we're initializing the log. */ > osm_opensm_construct( p_osm ); > > - status = osm_log_init( &p_osm->log, p_opt->force_log_flush, > - p_opt->log_flags, p_opt->log_file, > p_opt->accum_log_file ); > + status = osm_log_init_ext( &p_osm->log, p_opt->force_log_flush, > + p_opt->log_flags, p_opt->log_file, > + p_opt->accum_log_file, p_opt->src_info, > + p_opt->verbosity_file); > if( status != IB_SUCCESS ) > return ( status ); > > Index: opensm/libopensm.map > =================================================================== > --- opensm/libopensm.map (revision 8614) > +++ opensm/libopensm.map (working copy) > @@ -1,6 +1,11 @@ > -OPENSM_1.0 { > +OPENSM_2.0 { > global: > - osm_log; > + osm_log_init_ext; > + osm_log_ext; > + osm_log_raw_ext; > + osm_log_get_level_ext; > + osm_log_is_active_ext; > + osm_log_read_verbosity_file; > osm_is_debug; > osm_mad_pool_construct; > osm_mad_pool_destroy; > @@ -39,7 +44,6 @@ OPENSM_1.0 { > osm_dump_dr_path; > osm_dump_smp_dr_path; > osm_dump_pkey_block; > - osm_log_raw; > osm_get_sm_state_str; > osm_get_sm_signal_str; > osm_get_disp_msg_str; Rather than remove osm_log and osm_log_raw, these should be deprecated. There are other applications outside of OpenSM (like osmtest and others) that need this. 
> @@ -51,5 +55,11 @@ OPENSM_1.0 { > osm_get_lsa_str; > osm_get_sm_mgr_signal_str; > osm_get_sm_mgr_state_str; > + st_init_strtable; > + st_delete; > + st_insert; > + st_lookup; > + st_foreach; > + st_free_table; > local: *; > }; > Index: opensm/osm_log.c > =================================================================== > --- opensm/osm_log.c (revision 8614) > +++ opensm/osm_log.c (working copy) > @@ -80,17 +80,365 @@ static char *month_str[] = { > }; > #endif /* ndef WIN32 */ > > + > +/*************************************************************************** > + > ***************************************************************************/ > + > +#define OSM_VERBOSITY_ALL "all" > + > +static void > +__osm_log_free_verbosity_table( > + IN osm_log_t* p_log); > +static void > +__osm_log_print_verbosity_table( > + IN osm_log_t* const p_log); > + > +/*************************************************************************** > + > ***************************************************************************/ > + > +osm_log_level_t > +osm_log_get_level_ext( > + IN const osm_log_t* const p_log, > + IN const char* const p_filename ) > +{ > + osm_log_level_t * p_curr_file_level = NULL; > + > + if (!p_filename || !p_log->table) > + return p_log->level; > + > + if ( st_lookup( p_log->table, > + (st_data_t) p_filename, > + (st_data_t*) &p_curr_file_level) ) > + return *p_curr_file_level; > + else > + return p_log->level; > +} > + > +/*************************************************************************** > + > ***************************************************************************/ > + > +ib_api_status_t > +osm_log_init_ext( > + IN osm_log_t* const p_log, > + IN const boolean_t flush, > + IN const uint8_t log_flags, > + IN const char *log_file, > + IN const boolean_t accum_log_file, > + IN const boolean_t src_info, > + IN const char *verbosity_file) > +{ > + p_log->level = log_flags; > + p_log->flush = flush; > + p_log->src_info = src_info; > + p_log->table = NULL; > + > + if (log_file == NULL || !strcmp(log_file, "-") || > + !strcmp(log_file, "stdout")) > + { > + p_log->out_port = stdout; > + } > + else if (!strcmp(log_file, "stderr")) > + { > + p_log->out_port = stderr; > + } > + else > + { > + if (accum_log_file) > + p_log->out_port = fopen(log_file, "a+"); > + else > + p_log->out_port = fopen(log_file, "w+"); > + > + if (!p_log->out_port) > + { > + if (accum_log_file) > + printf("Cannot open %s for appending. Permission denied\n", > log_file); > + else > + printf("Cannot open %s for writing. Permission denied\n", > log_file); > + > + return(IB_UNKNOWN_ERROR); > + } > + } > + openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); > + > + if (cl_spinlock_init( &p_log->lock ) != CL_SUCCESS) > + return IB_ERROR; > + > + osm_log_read_verbosity_file(p_log,verbosity_file); > + return IB_SUCCESS; > +} > + > +/*************************************************************************** > + > ***************************************************************************/ > + > +void > +osm_log_read_verbosity_file( > + IN osm_log_t* p_log, > + IN const char * const verbosity_file) > +{ > + FILE *infile; > + char line[500]; > + struct stat buf; > + boolean_t table_empty = TRUE; > + char * tmp_str = NULL; > + > + if (p_log->table) > + { > + /* > + * Free the existing table. 
> + * Note: if the verbosity config file will not be found, this > will > + * effectivly reset the existing verbosity configuration and > set > + * all the files to the same verbosity level > + */ > + __osm_log_free_verbosity_table(p_log); > + } > + > + if (!verbosity_file) > + return; > + > + if ( stat(verbosity_file, &buf) != 0 ) > + { > + /* > + * Verbosity configuration file doesn't exist. > + */ > + if (strcmp(verbosity_file,OSM_DEFAULT_VERBOSITY_FILE) == 0) > + { > + /* > + * Verbosity configuration file wasn't explicitly specified. > + * No need to issue any error message. > + */ > + return; > + } > + else > + { > + /* > + * Verbosity configuration file was explicitly specified. > + */ > + osm_log(p_log, OSM_LOG_SYS, > + "ERROR: Verbosity configuration file (%s) doesn't > exist.\n", > + verbosity_file); > + osm_log(p_log, OSM_LOG_SYS, > + " Using general verbosity value.\n"); > + return; > + } > + } > + > + infile = fopen(verbosity_file, "r"); > + if ( infile == NULL ) > + { > + osm_log(p_log, OSM_LOG_SYS, > + "ERROR: Failed opening verbosity configuration file > (%s).\n", > + verbosity_file); > + osm_log(p_log, OSM_LOG_SYS, > + " Using general verbosity value.\n"); > + return; > + } > + > + p_log->table = st_init_strtable(); > + if (p_log->table == NULL) > + { > + osm_log(p_log, OSM_LOG_SYS, "ERROR: Verbosity table > initialization failed.\n"); > + return; > + } > + > + /* > + * Read the file line by line, parse the lines, and > + * add each line to p_log->table. > + */ > + while ( fgets(line, sizeof(line), infile) != NULL ) > + { > + char * str = line; > + char * name = NULL; > + char * value = NULL; > + osm_log_level_t * p_log_level_value = NULL; > + int res; > + > + name = strtok_r(str," \t\n",&tmp_str); > + if (name == NULL || strlen(name) == 0) { > + /* > + * empty line - ignore it > + */ > + continue; > + } > + value = strtok_r(NULL," \t\n",&tmp_str); > + if (value == NULL || strlen(value) == 0) > + { > + /* > + * No verbosity value - wrong syntax. > + * This line will be ignored. > + */ > + continue; > + } > + > + /* > + * If the conversion will fail, the log_level_value will get 0, > + * so the only way to check that the syntax is correct is to > + * scan value for any non-digit (which we're not doing here). > + */ > + p_log_level_value = malloc (sizeof(osm_log_level_t)); > + if (!p_log_level_value) > + { > + osm_log(p_log, OSM_LOG_SYS, "ERROR: malloc failed.\n"); > + p_log->table = NULL; > + fclose(infile); > + return; > + } > + *p_log_level_value = strtoul(value, NULL, 0); > + > + if (strcasecmp(name,OSM_VERBOSITY_ALL) == 0) > + { > + osm_log_set_level(p_log, *p_log_level_value); > + free(p_log_level_value); > + } > + else > + { > + res = st_insert( p_log->table, > + (st_data_t) strdup(name), > + (st_data_t) p_log_level_value); > + if (res != 0) > + { > + /* > + * Something is wrong with the verbosity table. > + * We won't try to free the table, because there's > + * clearly something corrupted there. 
> + */ > + osm_log(p_log, OSM_LOG_SYS, "ERROR: Failed adding > verbosity table element.\n"); > + p_log->table = NULL; > + fclose(infile); > + return; > + } > + table_empty = FALSE; > + } > + > + } > + > + if (table_empty) > + __osm_log_free_verbosity_table(p_log); > + > + fclose(infile); > + > + __osm_log_print_verbosity_table(p_log); > +} > + > +/*************************************************************************** > + > ***************************************************************************/ > + > +static int > +__osm_log_print_verbosity_table_element( > + IN st_data_t key, > + IN st_data_t val, > + IN st_data_t arg) > +{ > + osm_log( (osm_log_t* const) arg, > + OSM_LOG_INFO, > + "[verbosity] File: %s, Level: 0x%x\n", > + (char *) key, *((osm_log_level_t *) val)); > + > + return ST_CONTINUE; > +} > + > +static void > +__osm_log_print_verbosity_table( > + IN osm_log_t* const p_log) > +{ > + osm_log( p_log, OSM_LOG_INFO, > + "[verbosity] Verbosity table loaded\n" ); > + osm_log( p_log, OSM_LOG_INFO, > + "[verbosity] General level: > 0x%x\n",osm_log_get_level_ext(p_log,NULL)); > + > + if (p_log->table) > + { > + st_foreach( p_log->table, > + __osm_log_print_verbosity_table_element, > + (st_data_t) p_log ); > + } > + osm_log_flush(p_log); > +} > + > +/*************************************************************************** > + > ***************************************************************************/ > + > +static int > +__osm_log_free_verbosity_table_element( > + IN st_data_t key, > + IN st_data_t val, > + IN st_data_t arg) > +{ > + free( (char *) key ); > + free( (osm_log_level_t *) val ); > + return ST_DELETE; > +} > + > +static void > +__osm_log_free_verbosity_table( > + IN osm_log_t* p_log) > +{ > + if (!p_log->table) > + return; > + > + st_foreach( p_log->table, > + __osm_log_free_verbosity_table_element, > + (st_data_t) NULL); > + > + st_free_table(p_log->table); > + p_log->table = NULL; > +} > + > +/*************************************************************************** > + > ***************************************************************************/ > + > +static inline const char * > +__osm_log_get_base_name( > + IN const char * const p_filename) > +{ > +#ifdef WIN32 > + char dir_separator = '\\'; > +#else > + char dir_separator = '/'; > +#endif > + char * tmp_ptr; > + > + if (!p_filename) > + return NULL; > + > + tmp_ptr = strrchr(p_filename,dir_separator); > + > + if (!tmp_ptr) > + return p_filename; > + return tmp_ptr+1; > +} > + > +/*************************************************************************** > + > ***************************************************************************/ > + > +boolean_t > +osm_log_is_active_ext( > + IN const osm_log_t* const p_log, > + IN const char* const p_filename, > + IN const osm_log_level_t level ) > +{ > + osm_log_level_t tmp_lvl; > + tmp_lvl = level & > + > osm_log_get_level_ext(p_log,__osm_log_get_base_name(p_filename)); > + return ( tmp_lvl != 0 ); > +} > + > +/*************************************************************************** > + > ***************************************************************************/ > + > static int log_exit_count = 0; > > void > -osm_log( > +osm_log_ext( > IN osm_log_t* const p_log, > IN const osm_log_level_t verbosity, > + IN const char *p_filename, > + IN int line, > IN const char *p_str, ... 
) > { > char buffer[LOG_ENTRY_SIZE_MAX]; > va_list args; > int ret; > + osm_log_level_t file_verbosity; > > #ifdef WIN32 > SYSTEMTIME st; > @@ -108,69 +456,89 @@ osm_log( > localtime_r(&tim, &result); > #endif /* WIN32 */ > > - /* If this is a call to syslog - always print it */ > - if ( verbosity & OSM_LOG_SYS ) > + /* > + * Extract only the file name out of the full path > + */ > + p_filename = __osm_log_get_base_name(p_filename); > + /* > + * Get the verbosity level for this file. > + * If the file is not specified in the log config file, > + * the general verbosity level will be returned. > + */ > + file_verbosity = osm_log_get_level_ext(p_log, p_filename); > + > + if ( ! (verbosity & OSM_LOG_SYS) && > + ! (file_verbosity & verbosity) ) > { > - /* this is a call to the syslog */ > + /* > + * This is not a syslog message (which is always printed) > + * and doesn't have the required verbosity level. > + */ > + return; > + } > + > va_start( args, p_str ); > vsprintf( buffer, p_str, args ); > va_end(args); > - cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); > > + > + if ( verbosity & OSM_LOG_SYS ) > + { > + /* this is a call to the syslog */ > + cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); > /* SYSLOG should go to stdout too */ > if (p_log->out_port != stdout) > { > - printf("%s\n", buffer); > + printf("%s", buffer); > fflush( stdout ); > } > + } > + /* SYSLOG also goes to to the log file */ > + > + cl_spinlock_acquire( &p_log->lock ); > > - /* send it also to the log file */ > #ifdef WIN32 > GetLocalTime(&st); > - fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", > + if (p_log->src_info) > + { > + ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] > [%s:%d] -> %s", > st.wHour, st.wMinute, st.wSecond, > st.wMilliseconds, > - pid, buffer); > -#else > - fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> > %s\n", > - (result.tm_mon < 12 ? month_str[result.tm_mon] : "???"), > - result.tm_mday, result.tm_hour, > - result.tm_min, result.tm_sec, > - usecs, pid, buffer); > - fflush( p_log->out_port ); > -#endif > + pid, p_filename, line, buffer); > } > - > - /* SYS messages go to the log anyways */ > - if (p_log->level & verbosity) > + else > { > - > - va_start( args, p_str ); > - vsprintf( buffer, p_str, args ); > - va_end(args); > - > - /* regular log to default out_port */ > - cl_spinlock_acquire( &p_log->lock ); > -#ifdef WIN32 > - GetLocalTime(&st); > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> > %s", > st.wHour, st.wMinute, st.wSecond, > st.wMilliseconds, > pid, buffer); > - > + } > #else > pid = pthread_self(); > tim = time(NULL); > + if (p_log->src_info) > + { > + ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d > [%04X] [%s:%d] -> %s", > + ((result.tm_mon < 12) && (result.tm_mon >= 0) ? > + month_str[ result.tm_mon] : "???"), > + result.tm_mday, result.tm_hour, > + result.tm_min, result.tm_sec, > + usecs, pid, p_filename, line, buffer); > + } > + else > + { > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d > [%04X] -> %s", > ((result.tm_mon < 12) && (result.tm_mon >= 0) ? > month_str[ result.tm_mon] : "???"), > result.tm_mday, result.tm_hour, > result.tm_min, result.tm_sec, > usecs, pid, buffer); > -#endif /* WIN32 */ > - > + } > +#endif > /* > - Flush log on errors too. > + * Flush log on errors and SYSLOGs too. 
> */ > - if( p_log->flush || (verbosity & OSM_LOG_ERROR) ) > + if ( p_log->flush || > + (verbosity & OSM_LOG_ERROR) || > + (verbosity & OSM_LOG_SYS) ) > fflush( p_log->out_port ); > > cl_spinlock_release( &p_log->lock ); > @@ -183,15 +551,30 @@ osm_log( > } > } > } > -} > + > +/*************************************************************************** > + > ***************************************************************************/ > > void > -osm_log_raw( > +osm_log_raw_ext( > IN osm_log_t* const p_log, > IN const osm_log_level_t verbosity, > + IN const char * p_filename, > IN const char *p_buf ) > { > - if( p_log->level & verbosity ) > + osm_log_level_t file_verbosity; > + /* > + * Extract only the file name out of the full path > + */ > + p_filename = __osm_log_get_base_name(p_filename); > + /* > + * Get the verbosity level for this file. > + * If the file is not specified in the log config file, > + * the general verbosity level will be returned. > + */ > + file_verbosity = osm_log_get_level_ext(p_log, p_filename); > + > + if ( file_verbosity & verbosity ) > { > cl_spinlock_acquire( &p_log->lock ); > printf( "%s", p_buf ); > @@ -205,6 +588,9 @@ osm_log_raw( > } > } > > +/*************************************************************************** > + > ***************************************************************************/ > + > boolean_t > osm_is_debug(void) > { > @@ -214,3 +600,7 @@ osm_is_debug(void) > return FALSE; > #endif /* defined( _DEBUG_ ) */ > } > + > +/*************************************************************************** > + > ***************************************************************************/ > + > Index: opensm/main.c > =================================================================== > --- opensm/main.c (revision 8652) > +++ opensm/main.c (working copy) > @@ -296,6 +296,33 @@ show_usage(void) > " -d3 - Disable multicast support\n" > " -d10 - Put OpenSM in testability mode\n" > " Without -d, no debug options are enabled\n\n" ); > + printf( "-S\n" > + "--log_source_info\n" > + " This option tells SM to add source code > filename\n" > + " and line number to every log message.\n" > + " By default, the SM will not log this additional > info.\n\n"); > + printf( "-b\n" > + "--verbosity_file \n" > + " This option specifies name of the verbosity\n" > + " configuration file, which describes verbosity > level\n" > + " per source code file. 
The file may contain zero > or\n" > + " more lines of the following pattern:\n" > + " filename verbosity_level\n" > + " where 'filename' is the name of the source code > file\n" > + " that the 'verbosity_level' refers to, and the > \n" > + " 'verbosity_level' itself should be specified as > a\n" > + " number (decimal or hexadecimal).\n" > + " Filename 'all' represents general verbosity > level,\n" > + " that is used for all the files that are not > specified\n" > + " in the verbosity file.\n" > + " Note: The 'all' file verbosity level will > override any\n" > + " other general level that was specified by the > command\n" > + " line arguments.\n" > + " By default, the SM will use the following > file:\n" > + " %s\n" > + " Sending a SIGHUP signal to the SM will cause it > to\n" > + " re-read the verbosity configuration file.\n" > + "\n\n", OSM_DEFAULT_VERBOSITY_FILE); > printf( "-h\n" > "--help\n" > " Display this usage info then exit.\n\n" ); > @@ -527,7 +554,7 @@ main( > boolean_t cache_options = FALSE; > char *ignore_guids_file_name = NULL; > uint32_t val; > - const char * const short_option = > "i:f:ed:g:l:s:t:a:R:U:P:NQvVhorcyx"; > + const char * const short_option = > "i:f:ed:g:l:s:t:a:R:U:P:b:SNQvVhorcyx"; > > /* > In the array below, the 2nd parameter specified the number > @@ -565,6 +592,8 @@ main( > { "cache-options", 0, NULL, 'c'}, > { "stay_on_fatal", 0, NULL, 'y'}, > { "honor_guid2lid", 0, NULL, 'x'}, > + { "log_source_info",0,NULL, 'S'}, > + { "verbosity_file",1, NULL, 'b'}, > { NULL, 0, NULL, 0 } /* Required at the end of > the array */ > }; > > @@ -808,6 +837,16 @@ main( > printf (" Honor guid2lid file, if possible\n"); > break; > > + case 'S': > + opt.src_info = TRUE; > + printf(" Logging source code filename and line number\n"); > + break; > + > + case 'b': > + opt.verbosity_file = optarg; > + printf(" Verbosity Configuration File: %s\n", optarg); > + break; > + > case 'h': > case '?': > case ':': > @@ -920,9 +959,13 @@ main( > > if (osm_hup_flag) { > osm_hup_flag = 0; > - /* a HUP signal should only start a new heavy sweep */ > + /* > + * A HUP signal should cause OSM to re-read the log > + * configuration file and start a new heavy sweep > + */ > osm.subn.force_immediate_heavy_sweep = TRUE; > osm_opensm_sweep( &osm ); > + osm_log_read_verbosity_file(&osm.log,opt.verbosity_file); > } > } > } > Index: opensm/Makefile.am > =================================================================== > --- opensm/Makefile.am (revision 8614) > +++ opensm/Makefile.am (working copy) > @@ -43,7 +43,7 @@ else > libopensm_version_script = > endif > > -libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c > +libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c st.c > libopensm_la_LDFLAGS = -version-info $(opensm_api_version) \ > -export-dynamic $(libopensm_version_script) > libopensm_la_DEPENDENCIES = $(srcdir)/libopensm.map > @@ -90,7 +90,7 @@ opensm_SOURCES = main.c osm_console.c os > osm_trap_rcv.c osm_trap_rcv_ctrl.c \ > osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \ > osm_vl15intf.c osm_vl_arb_rcv.c \ > - osm_vl_arb_rcv_ctrl.c st.c > + osm_vl_arb_rcv_ctrl.c > if OSMV_OPENIB > opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing > -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) > -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT > -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > Index: doc/verbosity-config.txt > =================================================================== > --- 
doc/verbosity-config.txt (revision 0) > +++ doc/verbosity-config.txt (revision 0) > @@ -0,0 +1,43 @@ > + > +This patch adds new verbosity functionality. > + > +1. Verbosity configuration file > +------------------------------- > + > +The user is able to set verbosity level per source code file > +by supplying verbosity configuration file using the following > +command line arguments: > + > + -b filename > + --verbosity_file filename > + > +By default, the OSM will use the following file: /etc/opensmlog.conf > +Verbosity configuration file should contain zero or more lines of > +the following pattern: > + > + filename verbosity_level > + > +where 'filename' is the name of the source code file that the > +'verbosity_level' refers to, and the 'verbosity_level' itself > +should be specified as an integer number (decimal or hexadecimal). > + > +One reserved filename is 'all' - it represents general verbosity > +level, that is used for all the files that are not specified in > +the verbosity configuration file. > +If 'all' is not specified, the verbosity level set in the > +command line will be used instead. > +Note: The 'all' file verbosity level will override any other > +general level that was specified by the command line arguments. > + > +Sending a SIGHUP signal to the OSM will cause it to reload > +the verbosity configuration file. > + > + > +2. Logging source code filename and line number > +----------------------------------------------- > + > +If command line option -S or --log_source_info is specified, > +OSM will add source code filename and line number to every > +log message that is written to the log file. > +By default, the OSM will not log this additional info. > + > > From tziporet at mellanox.co.il Tue Aug 15 08:36:22 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 15 Aug 2006 18:36:22 +0300 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change with client reregister set In-Reply-To: <200608151806.22228.jackm@mellanox.co.il> References: <20060815142050.GE15917@mellanox.co.il> <1155652951.29378.6460.camel@hal.voltaire.com> <200608151806.22228.jackm@mellanox.co.il> Message-ID: <44E1E9F6.1070007@mellanox.co.il> Jack Morgenstein wrote: > Before the change in mthca_mad.c (sm_snoop), the LID CHANGE event was > generated in all cases where a SET port-info MAD was received. > > After the change, either LID_CHANGE -- or -- CLIENT_REREGISTER is generated. > (i.e., CLIENT REREGISTER replaced LID_CHANGE in those cases where the > reregistration bit was set in the MAD). > > This patch simply restores the previous behavior of the sa_query and cache > modules in responding to events. > > > I wish to emphases that without this fix opensm handover is NOT working now in OFED (and in kernel 2.6.18) and this is a must feature. Tziporet From halr at voltaire.com Tue Aug 15 08:40:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 11:40:14 -0400 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change with client reregister set In-Reply-To: <200608151806.22228.jackm@mellanox.co.il> References: <20060815142050.GE15917@mellanox.co.il> <1155652951.29378.6460.camel@hal.voltaire.com> <200608151806.22228.jackm@mellanox.co.il> Message-ID: <1155656413.29378.8380.camel@hal.voltaire.com> On Tue, 2006-08-15 at 11:06, Jack Morgenstein wrote: > On Tuesday 15 August 2006 17:42, Hal Rosenstock wrote: > > > > Are these two events equivalent ? e.g. does LID change require > > reregistration ? (That's a potential overhead as well). 
> > > > Before the change in mthca_mad.c (sm_snoop), the LID CHANGE event was > generated in all cases where a SET port-info MAD was received. > > After the change, either LID_CHANGE -- or -- CLIENT_REREGISTER is generated. > (i.e., CLIENT REREGISTER replaced LID_CHANGE in those cases where the > reregistration bit was set in the MAD). > > This patch simply restores the previous behavior of the sa_query and cache > modules in responding to events. Understood but doesn't it also has the effect of doing more than this for certain cases of Set PortInfo ? > No reregistration is required in these modules. Currently but I think it is a possibility in the future. -- Hal > > What about deregistration of the old registrations when this occurs ? > > Is that handled ? > > > There is no deregistration involvement. > > - Jack From sean.hefty at intel.com Tue Aug 15 08:52:27 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 08:52:27 -0700 Subject: [openib-general] [PATCH v2 1/2] sa_query: add generic query interfaces capableof supporting RMPP In-Reply-To: <1155653040.29378.6534.camel@hal.voltaire.com> Message-ID: <000301c6c082$cf389070$3ccd180a@amr.corp.intel.com> >Do you mean multiple path records rather than multipath record queries >here ? I meant multiple path records (GET_TABLE), but the interface is designed to handle both. The current implementation doesn't handle multipath record queries with more than 1 SGID and 1 DGID. (I.e. it uses a size of 56 bytes.) It shouldn't be hard to calculate the correct size based on the S/DGID counts; I just didn't do it yet. >You probably already found this but this should be 56 as SA attributes >are required to be modulo 8 in size. Isn't this an attribute offset issue, rather than a size issue? (I did have to account for the offset when walking attributes.) - Sean From halr at voltaire.com Tue Aug 15 08:57:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 11:57:11 -0400 Subject: [openib-general] [PATCH v2 1/2] sa_query: add generic query interfaces capableof supporting RMPP In-Reply-To: <000301c6c082$cf389070$3ccd180a@amr.corp.intel.com> References: <000301c6c082$cf389070$3ccd180a@amr.corp.intel.com> Message-ID: <1155657431.29378.8770.camel@hal.voltaire.com> On Tue, 2006-08-15 at 11:52, Sean Hefty wrote: > >Do you mean multiple path records rather than multipath record queries > >here ? > > I meant multiple path records (GET_TABLE), but the interface is designed to > handle both. The current implementation doesn't handle multipath record queries > with more than 1 SGID and 1 DGID. (I.e. it uses a size of 56 bytes.) It > shouldn't be hard to calculate the correct size based on the S/DGID counts; I > just didn't do it yet. I only saw SA SRs, PRs. and MCMs not MPRs. Did I miss the MPR support ? > >You probably already found this but this should be 56 as SA attributes > >are required to be modulo 8 in size. > > Isn't this an attribute offset issue, rather than a size issue? (I did have to > account for the offset when walking attributes.) Yes. -- Hal > > - Sean From sean.hefty at intel.com Tue Aug 15 09:06:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 09:06:47 -0700 Subject: [openib-general] [PATCH v2 1/2] sa_query: add generic query interfacescapableof supporting RMPP In-Reply-To: <1155657431.29378.8770.camel@hal.voltaire.com> Message-ID: <000401c6c084$cf8b8080$3ccd180a@amr.corp.intel.com> >I only saw SA SRs, PRs. and MCMs not MPRs. Did I miss the MPR support ? Ah - this is in my local copy. 
I defined all attribute lengths (in an enum). I wanted to get the libibsa support completed to make sure that the SA interfaces provided what I needed. I will need to extract this patch, and SA registration back out from my other changes and re-submit the patches. - Sean From sean.hefty at intel.com Tue Aug 15 09:10:31 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 09:10:31 -0700 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM In-Reply-To: Message-ID: <000501c6c085$55590ac0$3ccd180a@amr.corp.intel.com> Can you see if this patch lets you build compost? Signed-off-by: Sean Hefty --- Index: examples/cmpost.c =================================================================== --- examples/cmpost.c (revision 8215) +++ examples/cmpost.c (working copy) @@ -614,6 +614,7 @@ out: static int query_for_path(char *dst) { + struct rdma_event_channel *channel; struct rdma_cm_id *id; struct sockaddr_in addr_in; struct rdma_cm_event *event; @@ -623,15 +624,19 @@ static int query_for_path(char *dst) if (ret) return ret; - ret = rdma_create_id(&id, NULL); + channel = rdma_create_event_channel(); + if (!channel) + return -1; + + ret = rdma_create_id(channel, &id, NULL, RDMA_PS_TCP); if (ret) - return ret; + goto destroy_channel; ret = rdma_resolve_addr(id, NULL, (struct sockaddr *) &addr_in, 2000); if (ret) goto out; - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (!ret && event->event != RDMA_CM_EVENT_ADDR_RESOLVED) ret = event->status; rdma_ack_cm_event(event); @@ -642,7 +647,7 @@ static int query_for_path(char *dst) if (ret) goto out; - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (!ret && event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) ret = event->status; rdma_ack_cm_event(event); @@ -652,6 +657,8 @@ static int query_for_path(char *dst) test.path_rec = id->route.path_rec[0]; out: rdma_destroy_id(id); +destroy_channel: + rdma_destroy_event_channel(channel); return ret; } From sean.hefty at intel.com Tue Aug 15 09:24:44 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 09:24:44 -0700 Subject: [openib-general] IB mcast question In-Reply-To: Message-ID: <000601c6c087$51e83c10$3ccd180a@amr.corp.intel.com> > Steve> How about qp attributes? pkeys? qkeys? > >Good question -- yes, the QPs will need be to set up with the right >keys for packets to appear. It's definitely something to check. The qkeys used by the RDMA CM sound like they may be the problem. I'll verify this and see how to fix it if so. - Sean From mst at mellanox.co.il Tue Aug 15 09:45:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 19:45:42 +0300 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change withclient reregister set In-Reply-To: <1155656413.29378.8380.camel@hal.voltaire.com> References: <1155656413.29378.8380.camel@hal.voltaire.com> Message-ID: <20060815164542.GA21551@mellanox.co.il> Quoting r. Hal Rosenstock : > > This patch simply restores the previous behavior of the sa_query and cache > > modules in responding to events. > > Understood but doesn't it also has the effect of doing more than this > for certain cases of Set PortInfo ? Not that I can see. > > No reregistration is required in these modules. > > Currently but I think it is a possibility in the future. So then we'll have to do it. 
-- MST From sean.hefty at intel.com Tue Aug 15 09:58:39 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 09:58:39 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <000601c6c087$51e83c10$3ccd180a@amr.corp.intel.com> Message-ID: <000701c6c08c$0f129bb0$3ccd180a@amr.corp.intel.com> >The qkeys used by the RDMA CM sound like they may be the problem. I'll verify >this and see how to fix it if so. If I set the qkeys for the QPs and MCMemberRecord to 0, I can get this to work now. The RDMA CM uses a qkey = port number for UD QPs, and a qkey = IPv4 address for MCMemberRecords. A potential fix I see for this is to use the same qkey for all UD QPs and multicast groups created by the RDMA CM. Otherwise we restrict UD QPs to using a single destination (remote UD QP or multicast group.) - Sean From swise at opengridcomputing.com Tue Aug 15 10:20:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 12:20:26 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <000701c6c08c$0f129bb0$3ccd180a@amr.corp.intel.com> References: <000701c6c08c$0f129bb0$3ccd180a@amr.corp.intel.com> Message-ID: <1155662426.26332.78.camel@stevo-desktop> On Tue, 2006-08-15 at 09:58 -0700, Sean Hefty wrote: > >The qkeys used by the RDMA CM sound like they may be the problem. I'll verify > >this and see how to fix it if so. > > If I set the qkeys for the QPs and MCMemberRecord to 0, I can get this to work > now. The RDMA CM uses a qkey = port number for UD QPs, and a qkey = IPv4 > address for MCMemberRecords. > > A potential fix I see for this is to use the same qkey for all UD QPs and > multicast groups created by the RDMA CM. Otherwise we restrict UD QPs to using > a single destination (remote UD QP or multicast group.) > I was marching to the same tune! But I have a few points needing clarification. In my IP-centric mind, the sender specifies the ip mcast address and a remote port. All hosts with subscribers to the ip mcast address get the packet, and all sockets on those hosts who are bound to the dst_port receive a copy. Other sockets on those hosts that joined the ipmcast group but are bound to different ports will _not_ get a copy of the packet. In addition, the sender's local port number doesn't matter at all in the equation. Now how does that translate to qkeys, udqops, and ib mcast? It sounds to me like the remote_qkey is used to identify the mcast group when sending a mcast -and- to identify the set of qps on each host that should receive the incoming mcast packets. Is this true? From halr at voltaire.com Tue Aug 15 10:22:01 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 13:22:01 -0400 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change withclient reregister set In-Reply-To: <20060815164542.GA21551@mellanox.co.il> References: <1155656413.29378.8380.camel@hal.voltaire.com> <20060815164542.GA21551@mellanox.co.il> Message-ID: <1155662516.29378.11285.camel@hal.voltaire.com> On Tue, 2006-08-15 at 12:45, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > > This patch simply restores the previous behavior of the sa_query and cache > > > modules in responding to events. > > > > Understood but doesn't it also has the effect of doing more than this > > for certain cases of Set PortInfo ? > > Not that I can see. It's more based on what the driver(s) is/are doing which is not your change but is the one (back on May 31) which caused this change to be needed. 
Client reregister takes precedence over LID change so client reregister needs to do everything LID change does and possibly more. Your change looks fine to me. What about local_sa ? Does it need this change too ? -- Hal > > > No reregistration is required in these modules. > > > > Currently but I think it is a possibility in the future. > > So then we'll have to do it. From halr at voltaire.com Tue Aug 15 10:37:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 13:37:32 -0400 Subject: [openib-general] IB mcast question In-Reply-To: <000701c6c08c$0f129bb0$3ccd180a@amr.corp.intel.com> References: <000701c6c08c$0f129bb0$3ccd180a@amr.corp.intel.com> Message-ID: <1155663451.29378.11776.camel@hal.voltaire.com> On Tue, 2006-08-15 at 12:58, Sean Hefty wrote: > >The qkeys used by the RDMA CM sound like they may be the problem. I'll verify > >this and see how to fix it if so. > > If I set the qkeys for the QPs and MCMemberRecord to 0, I can get this to work > now. The RDMA CM uses a qkey = port number for UD QPs, and a qkey = IPv4 > address for MCMemberRecords. > > A potential fix I see for this is to use the same qkey for all UD QPs and > multicast groups created by the RDMA CM. Otherwise we restrict UD QPs to using > a single destination (remote UD QP or multicast group.) Doesn't the QKey need to be the same as the one used for the IPoIB broadcast group (for the partition in question) per IPoIB RFC ? It should also be the one returned in the SA MCMemberRecord response. -- Hal > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Tue Aug 15 10:53:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 20:53:05 +0300 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID changewithclient reregister set In-Reply-To: <1155662516.29378.11285.camel@hal.voltaire.com> References: <1155662516.29378.11285.camel@hal.voltaire.com> Message-ID: <20060815175305.GA21780@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: [PATCH] IB/core: fix SM LID/LID changewithclient reregister set > > On Tue, 2006-08-15 at 12:45, Michael S. Tsirkin wrote: > > Quoting r. Hal Rosenstock : > > > > This patch simply restores the previous behavior of the sa_query and cache > > > > modules in responding to events. > > > > > > Understood but doesn't it also has the effect of doing more than this > > > for certain cases of Set PortInfo ? > > > > Not that I can see. > > It's more based on what the driver(s) is/are doing which is not your > change but is the one (back on May 31) which caused this change to be > needed. Client reregister takes precedence over LID change so client > reregister needs to do everything LID change does and possibly more. Yes, I dislike this too - this seems to set policy in low level driver, of all places. In hindsight, it seems that we should have had a single PORT_INFO_SET event with additional bits marking which fields were set. Something along the lines of IB_EVENT_PORTINFO_SET = 0x100, IB_EVENT_LID_CHANGE = 0x101, IB_EVENT_PKEY_CHANGE = 0x102, IB_EVENT_SM_CHANGE = 0x104, IB_EVENT_CLIENT_REREGISTER = 0x108 and now each event can pass full information to users, and you can test specific bits to figure out if you are interested, or just IB_EVENT_PORTINFO_SET to handle all such events identically. 
Clearly not 2.6.18, but - how does this sound? > Your change looks fine to me. > > What about local_sa ? Does it need this change too ? No idea. I'm focusing on 2.6.18 mainly. local sa revocation policy is not really clear to me, maybe it's already covered? -- MST From halr at voltaire.com Tue Aug 15 10:55:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 13:55:39 -0400 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID changewithclient reregister set In-Reply-To: <20060815175305.GA21780@mellanox.co.il> References: <1155662516.29378.11285.camel@hal.voltaire.com> <20060815175305.GA21780@mellanox.co.il> Message-ID: <1155664538.29378.12334.camel@hal.voltaire.com> On Tue, 2006-08-15 at 13:53, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > Subject: Re: [PATCH] IB/core: fix SM LID/LID changewithclient reregister set > > > > On Tue, 2006-08-15 at 12:45, Michael S. Tsirkin wrote: > > > Quoting r. Hal Rosenstock : > > > > > This patch simply restores the previous behavior of the sa_query and cache > > > > > modules in responding to events. > > > > > > > > Understood but doesn't it also has the effect of doing more than this > > > > for certain cases of Set PortInfo ? > > > > > > Not that I can see. > > > > It's more based on what the driver(s) is/are doing which is not your > > change but is the one (back on May 31) which caused this change to be > > needed. Client reregister takes precedence over LID change so client > > reregister needs to do everything LID change does and possibly more. > > Yes, I dislike this too - this seems to set policy in low level > driver, of all places. > > In hindsight, it seems that we should have had a single PORT_INFO_SET > event with additional bits marking which fields were set. > > Something along the lines of > > > IB_EVENT_PORTINFO_SET = 0x100, > IB_EVENT_LID_CHANGE = 0x101, > IB_EVENT_PKEY_CHANGE = 0x102, > IB_EVENT_SM_CHANGE = 0x104, > IB_EVENT_CLIENT_REREGISTER = 0x108 > > and now each event can pass full information to users, > and you can test specific bits to figure out if you are > interested, or just IB_EVENT_PORTINFO_SET to handle > all such events identically. > > Clearly not 2.6.18, but - how does this sound? A bit mask sounds better. Is there any other info along with the events ? > > Your change looks fine to me. > > > > What about local_sa ? Does it need this change too ? > > No idea. I'm focusing on 2.6.18 mainly. > local sa revocation policy is not really clear to me, > maybe it's already covered? Sean ? -- Hal From ardavis at ichips.intel.com Tue Aug 15 11:02:53 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 15 Aug 2006 11:02:53 -0700 Subject: [openib-general] DAPL and local_iov in RDMA RR/RW mode In-Reply-To: <010001c6bf82$f174b890$3b388d80@cern.ch> References: <025801c6bd5f$f77b8d10$3b388d80@cern.ch> <44DCE509.5090005@ichips.intel.com> <010001c6bf82$f174b890$3b388d80@cern.ch> Message-ID: <44E20C4D.9030309@ichips.intel.com> Ryszard Jurga wrote: > Hi Arlin, > > Thank you for your quick reply. Both dat_ep_post_rdma_read nad > dat_ep_post_rdma_write return DAT_SUCCESS. When I read a field > 'transfered_length' from DAT_DTO_COMPLETION_EVENT_DATA after calling a > post function I receive the correct value which equals > num_segs*seg_size. Unfortunately, when I read a content of a local > buffer, only first segment is filled by appropriete data. 
I have tried > to set up debug switch (by export DAPL_DBG_TYPE=0xffff before running > my application) but unfortunately this does not produce any additional > output for post functions. Do you have any other ideas? I did not > mention before, but the case with num_segments>1 works fine with a > send/recv type of transmision. > You have to "configure --enable-debug" to get the debug information. You may want to pick up the latest dapl/test/dtest/dtest.c and take a look at the rdma write section for a simple multi-segment uDAPL example. I recently made a few modifications to include multiple segments in the test. You can also use dapltest to verify that RDMA with multple segments are working properly. Look at cl.sh for example script and "dapltest -TT --help" for usage. -arlin From mst at mellanox.co.il Tue Aug 15 11:11:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 21:11:18 +0300 Subject: [openib-general] [PATCH] IB/mthca: recover from device errors Message-ID: <20060815181118.GB21780@mellanox.co.il> Hello, Roland! The following makes it possible to recover from catastrophic errors through device reset. Could you please queue it for 2.6.19? Implementation detail: Catastrophic event device reset. Implemented via a fatal list, in which device objects are queued for resetting. A spinlock guarantees list insertion/deletion protection, while a mutex guarantees that we don't perform device resets while a device add/remove operation is in progress (and vice versa). Added a workqueue to the mthca driver to perform the reset in a thread context. -- Trigger devie remove and then add once a catastrophic error was detected in hardware. This, in turn, will cause a device reset typically recovering from the catastrophic condition. Since this might interefere with debugging the error root cause, add a module option to suppress this behaviour. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: ofed_1_1/drivers/infiniband/hw/mthca/mthca_catas.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/hw/mthca/mthca_catas.c 2006-08-03 14:30:21.645701000 +0300 +++ ofed_1_1/drivers/infiniband/hw/mthca/mthca_catas.c 2006-08-10 16:46:57.418864000 +0300 @@ -34,6 +34,7 @@ #include #include +#include #include "mthca_dev.h" @@ -48,9 +49,42 @@ enum { static DEFINE_SPINLOCK(catas_lock); +static struct workqueue_struct *catas_wq; +static struct list_head catas_list; +static struct work_struct catas_work; + +static int catas_reset_disable = 0; +module_param_named(catas_reset_disable, catas_reset_disable, int, 0644); +MODULE_PARM_DESC(catas_reset_disable, "disable reset on catastrophic event if > 0"); + +static void catas_reset(void *work_ptr) +{ + struct mthca_dev *dev, *tmpdev; + LIST_HEAD(local_catas); + unsigned long flags; + int rc; + + mutex_lock(&mthca_device_mutex); + + spin_lock_irqsave(&catas_lock, flags); + list_for_each_entry_safe(dev, tmpdev, &catas_list, catas_err.list) + list_move_tail(&dev->catas_err.list, &local_catas); + spin_unlock_irqrestore(&catas_lock, flags); + + list_for_each_entry_safe(dev, tmpdev, &local_catas, catas_err.list) { + rc = mthca_restart_one(dev->pdev); + if (rc) + mthca_err(dev, "Reset failed (%d)\n", rc); + else + mthca_dbg(dev, "Reset succeeded\n"); + } + mutex_unlock(&mthca_device_mutex); +} + static void handle_catas(struct mthca_dev *dev) { struct ib_event event; + unsigned long flags; const char *type; int i; @@ -82,6 +116,14 @@ static void handle_catas(struct mthca_de for (i = 0; i < dev->catas_err.size; ++i) mthca_err(dev, " buf[%02x]: %08x\n", i, swab32(readl(dev->catas_err.map + i))); + + if (catas_reset_disable) + return; + + spin_lock_irqsave(&catas_lock, flags); + list_add(&dev->catas_err.list, &catas_list); + queue_work(catas_wq, &catas_work); + spin_unlock_irqrestore(&catas_lock, flags); } static void poll_catas(unsigned long dev_ptr) @@ -135,11 +177,14 @@ void mthca_start_catas_poll(struct mthca dev->catas_err.timer.data = (unsigned long) dev; dev->catas_err.timer.function = poll_catas; dev->catas_err.timer.expires = jiffies + MTHCA_CATAS_POLL_INTERVAL; + INIT_LIST_HEAD(&dev->catas_err.list); add_timer(&dev->catas_err.timer); } void mthca_stop_catas_poll(struct mthca_dev *dev) { + unsigned long flags; + spin_lock_irq(&catas_lock); dev->catas_err.stop = 1; spin_unlock_irq(&catas_lock); @@ -153,4 +198,23 @@ void mthca_stop_catas_poll(struct mthca_ dev->catas_err.addr), dev->catas_err.size * 4); } + + spin_lock_irqsave(&catas_lock, flags); + list_del(&dev->catas_err.list); + spin_unlock_irqrestore(&catas_lock, flags); +} + +int __init mthca_catas_init(void) +{ + INIT_LIST_HEAD(&catas_list); + INIT_WORK(&catas_work, catas_reset, NULL); + catas_wq = create_singlethread_workqueue("mthcacatas"); + if (!catas_wq) + return -ENOMEM; + return 0; +} + +void mthca_catas_cleanup(void) +{ + destroy_workqueue(catas_wq); } Index: ofed_1_1/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/hw/mthca/mthca_main.c 2006-08-03 14:30:21.747701000 +0300 +++ ofed_1_1/drivers/infiniband/hw/mthca/mthca_main.c 2006-08-10 16:46:16.770946000 +0300 @@ -80,6 +80,8 @@ static int tune_pci = 0; module_param(tune_pci, int, 0444); MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if nonzero"); +struct mutex mthca_device_mutex; + static const char mthca_version[] __devinitdata = DRV_NAME ": 
Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; @@ -978,28 +980,15 @@ static struct { MTHCA_FLAG_SINAI_OPT } }; -static int __devinit mthca_init_one(struct pci_dev *pdev, - const struct pci_device_id *id) +static int __mthca_init_one(struct pci_dev *pdev, int hca_type) { - static int mthca_version_printed = 0; int ddr_hidden = 0; int err; struct mthca_dev *mdev; - if (!mthca_version_printed) { - printk(KERN_INFO "%s", mthca_version); - ++mthca_version_printed; - } - printk(KERN_INFO PFX "Initializing %s\n", pci_name(pdev)); - if (id->driver_data >= ARRAY_SIZE(mthca_hca_table)) { - printk(KERN_ERR PFX "%s has invalid driver data %lx\n", - pci_name(pdev), id->driver_data); - return -ENODEV; - } - err = pci_enable_device(pdev); if (err) { dev_err(&pdev->dev, "Cannot enable PCI device, " @@ -1065,7 +1054,7 @@ static int __devinit mthca_init_one(stru mdev->pdev = pdev; - mdev->mthca_flags = mthca_hca_table[id->driver_data].flags; + mdev->mthca_flags = mthca_hca_table[hca_type].flags; if (ddr_hidden) mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; @@ -1099,13 +1088,13 @@ static int __devinit mthca_init_one(stru if (err) goto err_cmd; - if (mdev->fw_ver < mthca_hca_table[id->driver_data].latest_fw) { + if (mdev->fw_ver < mthca_hca_table[hca_type].latest_fw) { mthca_warn(mdev, "HCA FW version %d.%d.%d is old (%d.%d.%d is current).\n", (int) (mdev->fw_ver >> 32), (int) (mdev->fw_ver >> 16) & 0xffff, (int) (mdev->fw_ver & 0xffff), - (int) (mthca_hca_table[id->driver_data].latest_fw >> 32), - (int) (mthca_hca_table[id->driver_data].latest_fw >> 16) & 0xffff, - (int) (mthca_hca_table[id->driver_data].latest_fw & 0xffff)); + (int) (mthca_hca_table[hca_type].latest_fw >> 32), + (int) (mthca_hca_table[hca_type].latest_fw >> 16) & 0xffff, + (int) (mthca_hca_table[hca_type].latest_fw & 0xffff)); mthca_warn(mdev, "If you have problems, try updating your HCA FW.\n"); } @@ -1122,6 +1111,7 @@ static int __devinit mthca_init_one(stru goto err_unregister; pci_set_drvdata(pdev, mdev); + mdev->hca_type = hca_type; return 0; @@ -1166,7 +1156,7 @@ err_disable_pdev: return err; } -static void __devexit mthca_remove_one(struct pci_dev *pdev) +static void __mthca_remove_one(struct pci_dev *pdev) { struct mthca_dev *mdev = pci_get_drvdata(pdev); u8 status; @@ -1211,6 +1201,49 @@ static void __devexit mthca_remove_one(s } } +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int rc; + + mutex_lock(&mthca_device_mutex); + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + if (id->driver_data >= ARRAY_SIZE(mthca_hca_table)) { + printk(KERN_ERR PFX "%s has invalid driver data %lx\n", + pci_name(pdev), id->driver_data); + mutex_unlock(&mthca_device_mutex); + return -ENODEV; + } + + rc = __mthca_init_one(pdev, id->driver_data); + mutex_unlock(&mthca_device_mutex); + return rc; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + mutex_lock(&mthca_device_mutex); + __mthca_remove_one(pdev); + mutex_unlock(&mthca_device_mutex); + return; +} + +int mthca_restart_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev; + + mdev = pci_get_drvdata(pdev); + if (!mdev) + return -ENODEV; + __mthca_remove_one(pdev); + return __mthca_init_one(pdev, mdev->hca_type); +} + static struct pci_device_id mthca_pci_table[] = { { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), .driver_data = TAVOR }, @@ -1248,13 +1281,22 @@ static int __init 
mthca_init(void) { int ret; + mutex_init(&mthca_device_mutex); + if (mthca_catas_init()) + return -ENOMEM; + ret = pci_register_driver(&mthca_driver); - return ret < 0 ? ret : 0; + if (ret < 0) { + mthca_catas_cleanup(); + return ret; + } + return 0; } static void __exit mthca_cleanup(void) { pci_unregister_driver(&mthca_driver); + mthca_catas_cleanup(); } module_init(mthca_init); Index: ofed_1_1/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- ofed_1_1.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2006-08-03 14:30:21.706704000 +0300 +++ ofed_1_1/drivers/infiniband/hw/mthca/mthca_dev.h 2006-08-10 16:47:05.666648000 +0300 @@ -45,6 +45,7 @@ #include #include #include +#include #include @@ -283,8 +284,11 @@ struct mthca_catas_err { unsigned long stop; u32 size; struct timer_list timer; + struct list_head list; }; +extern struct mutex mthca_device_mutex; + struct mthca_dev { struct ib_device ib_dev; struct pci_dev *pdev; @@ -450,6 +454,9 @@ void mthca_unregister_device(struct mthc void mthca_start_catas_poll(struct mthca_dev *dev); void mthca_stop_catas_poll(struct mthca_dev *dev); +int mthca_restart_one(struct pci_dev *pdev); +int mthca_catas_init(void); +void mthca_catas_cleanup(void); int mthca_uar_alloc(struct mthca_dev *dev, struct mthca_uar *uar); void mthca_uar_free(struct mthca_dev *dev, struct mthca_uar *uar); -- MST From sean.hefty at intel.com Tue Aug 15 11:16:31 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 11:16:31 -0700 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID changewithclient reregister set In-Reply-To: <1155664538.29378.12334.camel@hal.voltaire.com> Message-ID: <000001c6c096$ef91e060$5f248686@amr.corp.intel.com> >> > What about local_sa ? Does it need this change too ? >> >> No idea. I'm focusing on 2.6.18 mainly. >> local sa revocation policy is not really clear to me, >> maybe it's already covered? ib_multicast is okay, but ib_local_sa could probably use the change. - Sean From sean.hefty at intel.com Tue Aug 15 11:18:44 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 11:18:44 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155663451.29378.11776.camel@hal.voltaire.com> Message-ID: <000101c6c097$3e844b90$5f248686@amr.corp.intel.com> >> A potential fix I see for this is to use the same qkey for all UD QPs and >> multicast groups created by the RDMA CM. Otherwise we restrict UD QPs to >using >> a single destination (remote UD QP or multicast group.) > >Doesn't the QKey need to be the same as the one used for the IPoIB >broadcast group (for the partition in question) per IPoIB RFC ? It >should also be the one returned in the SA MCMemberRecord response. It shouldn't. The RDMA CM multicast groups are separate from those used by ipoib. - Sean From halr at voltaire.com Tue Aug 15 11:23:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 14:23:13 -0400 Subject: [openib-general] IB mcast question In-Reply-To: <000101c6c097$3e844b90$5f248686@amr.corp.intel.com> References: <000101c6c097$3e844b90$5f248686@amr.corp.intel.com> Message-ID: <1155666192.29378.13180.camel@hal.voltaire.com> On Tue, 2006-08-15 at 14:18, Sean Hefty wrote: > >> A potential fix I see for this is to use the same qkey for all UD QPs and > >> multicast groups created by the RDMA CM. Otherwise we restrict UD QPs to > >using > >> a single destination (remote UD QP or multicast group.) 
> > > >Doesn't the QKey need to be the same as the one used for the IPoIB > >broadcast group (for the partition in question) per IPoIB RFC ? It > >should also be the one returned in the SA MCMemberRecord response. > > It shouldn't. The RDMA CM multicast groups are separate from those used by > ipoib. Is the IP address only used locally to construct the MGID ? What does the MGID look like ? What signature does it use if any ? -- Hal > - Sean From sean.hefty at intel.com Tue Aug 15 11:29:22 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 11:29:22 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155662426.26332.78.camel@stevo-desktop> Message-ID: <000201c6c098$bae91fc0$5f248686@amr.corp.intel.com> >In my IP-centric mind, the sender specifies the ip mcast address and a >remote port. All hosts with subscribers to the ip mcast address get the >packet, and all sockets on those hosts who are bound to the dst_port >receive a copy. Other sockets on those hosts that joined the ipmcast >group but are bound to different ports will _not_ get a copy of the >packet. In addition, the sender's local port number doesn't matter at >all in the equation. Now how does that translate to qkeys, udqops, and >ib mcast? Currently, the IP address is mapped to an MGID. Senders and receivers are required to subscribe to the multicast group in order to receive packets from the multicast group. (The UD QPs must be attached to the group to get the packet.) The port number is not used. Is it possible for an IP socket to receive packets from multiple multicast groups? >It sounds to me like the remote_qkey is used to identify the mcast group >when sending a mcast -and- to identify the set of qps on each host that >should receive the incoming mcast packets. Is this true? I think the QKey usage in the RDMA CM needs to be redone. If we look at just UD QP transfers, in order to support one-to-many data transfers, all of the QPs need to have the same QKey. - Sean From sean.hefty at intel.com Tue Aug 15 11:33:32 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 11:33:32 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155666192.29378.13180.camel@hal.voltaire.com> Message-ID: <000301c6c099$4fdf2070$5f248686@amr.corp.intel.com> >Is the IP address only used locally to construct the MGID ? What does >the MGID look like ? What signature does it use if any ? The IP address may also used be used to lookup routing information in order to bind to a local device. The address is then used locally construct the MGID. The MGID looks a lot like the ipoib MGIDs, with a byte 8 being 0x01. - Sean From halr at voltaire.com Tue Aug 15 12:05:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 15:05:59 -0400 Subject: [openib-general] IB mcast question In-Reply-To: <000301c6c099$4fdf2070$5f248686@amr.corp.intel.com> References: <000301c6c099$4fdf2070$5f248686@amr.corp.intel.com> Message-ID: <1155668758.29378.14483.camel@hal.voltaire.com> On Tue, 2006-08-15 at 14:33, Sean Hefty wrote: > >Is the IP address only used locally to construct the MGID ? What does > >the MGID look like ? What signature does it use if any ? > > The IP address may also used be used to lookup routing information in order to > bind to a local device. The address is then used locally construct the MGID. > The MGID looks a lot like the ipoib MGIDs, with a byte 8 being 0x01. One of the reserved bytes in the MGID is 1 rather than 0 and it's using an IPv4 signature (0x401b) ? 
Where does the qkey come from on the creation of the group ? -- Hal > - Sean From sean.hefty at intel.com Tue Aug 15 12:13:31 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 12:13:31 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155668758.29378.14483.camel@hal.voltaire.com> Message-ID: <000501c6c09e$e5ddc540$5f248686@amr.corp.intel.com> >One of the reserved bytes in the MGID is 1 rather than 0 and it's using >an IPv4 signature (0x401b) ? It uses a signature of 0x4001 to avoid conflicts with ipoib groups. >Where does the qkey come from on the creation of the group ? The qkey is the same as the IPv4 address. I need to spend some time looking at the QKeys of the QPs and the multicast group to understand how one of the receivers worked. - Sean From mst at mellanox.co.il Tue Aug 15 12:23:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 22:23:05 +0300 Subject: [openib-general] question: ib_umem page_size Message-ID: <20060815192305.GC22363@mellanox.co.il> Roland, could you please clarify what does the page_size field in struct ib_mem do? -- MST From swise at opengridcomputing.com Tue Aug 15 12:23:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 14:23:52 -0500 Subject: [openib-general] [Fwd: RE: IB mcast question] Message-ID: <1155669832.26332.129.camel@stevo-desktop> accidentally dropped off the list... -------- Forwarded Message -------- From: Sean Hefty To: 'Steve Wise' Subject: RE: [openib-general] IB mcast question Date: Tue, 15 Aug 2006 11:59:12 -0700 FYI - your reply dropped off the list. >For type SOCK_DGRAM (UDP), the socket will receive packets from multiple >subscribed ip mcast groups iff the dst_port of the incoming packet >matches the port to which the socket is bound... This is what I was referring to. I'm really not familiar with IP multicast beyond what I read in a book while coding the RDMA CM. It sounds like we might be able to use the QKey as the port number for the QP to mimic the behavior. The RDMA CM sets the QKey for UD QPs to the port number, but sets the QKey of a multicast group to the IPv4 address. >NOTE: I'm just trying to understand how this works in IB. I'm not >necessarily advocating it should behave exactly like ip mcast/udp. Clients need to create an UD QP. When they join a multicast group, they get an MGID, MLID, and QKey. The UD QP needs to attach to the MGID / MLID, and have a matching QKey. Today, the RDMA CM assigns a QKey to a UD QP when it's created; it doesn't know if it will join a multicast group or not. >And if you want to support a single UD QP receiving from multiple >subscribed groups, you'll have to have the same Qkey for all the groups >and QPs. Right? I believe that this will be the case. A send can specify the remote QKey, so it may be that UD QP transfers are okay, and only multicast has an issue. 
- Sean From swise at opengridcomputing.com Tue Aug 15 12:25:15 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 14:25:15 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <000401c6c09c$e5b73c60$5f248686@amr.corp.intel.com> References: <000401c6c09c$e5b73c60$5f248686@amr.corp.intel.com> Message-ID: <1155669915.26332.132.camel@stevo-desktop> [adding back to list] On Tue, 2006-08-15 at 11:59 -0700, Sean Hefty wrote: > > >For type SOCK_DGRAM (UDP), the socket will receive packets from multiple > >subscribed ip mcast groups iff the dst_port of the incoming packet > >matches the port to which the socket is bound... > > This is what I was referring to. I'm really not familiar with IP multicast > beyond what I read in a book while coding the RDMA CM. It sounds like we might > be able to use the QKey as the port number for the QP to mimic the behavior. > > The RDMA CM sets the QKey for UD QPs to the port number, but sets the QKey of a > multicast group to the IPv4 address. > > >NOTE: I'm just trying to understand how this works in IB. I'm not > >necessarily advocating it should behave exactly like ip mcast/udp. > > Clients need to create an UD QP. When they join a multicast group, they get an > MGID, MLID, and QKey. The UD QP needs to attach to the MGID / MLID, and have a > matching QKey. Today, the RDMA CM assigns a QKey to a UD QP when it's created; > it doesn't know if it will join a multicast group or not. > Looking at the mckey code, I see that the code calls rdma_get_dst_attr() to get the remote qpn/qkey + the ah_attrs for the mcast group (which is the dst addr in this case). Then it creates an ibv_ah. Later when sending, the SEND WR contains both the ah and the remote qpn/qkey. Why are these separated? Isn't an address handle needed for each destination QP? If so, then why is the remote qpn/qkey also needed to transmit a datagram? Trying to understand how ah's relate to qpn/qkeys... From sean.hefty at intel.com Tue Aug 15 12:39:43 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 12:39:43 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155669915.26332.132.camel@stevo-desktop> Message-ID: <000601c6c0a2$8e9ee210$5f248686@amr.corp.intel.com> >Why are these separated? Isn't an address handle needed for each >destination QP? If so, then why is the remote qpn/qkey also needed to >transmit a datagram? The address handle doesn't include QPN/QKey information. Maybe think of them more as specifying the path to some port. - Sean From rdreier at cisco.com Tue Aug 15 12:59:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 12:59:16 -0700 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155669832.26332.129.camel@stevo-desktop> (Steve Wise's message of "Tue, 15 Aug 2006 14:23:52 -0500") References: <1155669832.26332.129.camel@stevo-desktop> Message-ID: > This is what I was referring to. I'm really not familiar with IP multicast > beyond what I read in a book while coding the RDMA CM. It sounds like we might > be able to use the QKey as the port number for the QP to mimic the behavior. I don't see how this could work-- to mimic IP, an app has to be able to multicast a datagram to multiple destination ports using the same multicast group. Since IB multicast groups have a Q_Key associated, the only way this could work is if all QPs and MCGs share one Q_Key. 
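[Editorial note: to make the AH vs. QPN/Q_Key split concrete, here is a minimal libibverbs-style sketch of a UD send to a multicast group. The handles and group parameters (pd, qp, mr, mgid, mlid, group_qkey) are placeholders, not values from this thread: the address handle describes only the path to the group, while the destination QPN and Q_Key are carried per work request.]

/* Sketch: post a UD send to a multicast group.
 * The address handle only describes the path (DGID/DLID/port);
 * the destination QPN and Q_Key travel separately in the work request.
 * Assumes qp is IBV_QPT_UD and mgid/mlid/group_qkey came from the
 * SA MCMemberRecord (e.g. via the RDMA CM join).
 */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int send_to_mcast_group(struct ibv_pd *pd, struct ibv_qp *qp,
                               struct ibv_mr *mr, void *buf, uint32_t len,
                               union ibv_gid *mgid, uint16_t mlid,
                               uint32_t group_qkey, uint8_t port)
{
        struct ibv_ah_attr ah_attr;
        struct ibv_ah *ah;
        struct ibv_sge sge;
        struct ibv_send_wr wr, *bad_wr;

        memset(&ah_attr, 0, sizeof ah_attr);
        ah_attr.is_global     = 1;
        ah_attr.grh.dgid      = *mgid;        /* multicast GID */
        ah_attr.grh.hop_limit = 1;            /* stay within the local subnet */
        ah_attr.dlid          = mlid;         /* multicast LID */
        ah_attr.port_num      = port;

        ah = ibv_create_ah(pd, &ah_attr);
        if (!ah)
                return -1;

        sge.addr   = (uintptr_t) buf;
        sge.length = len;
        sge.lkey   = mr->lkey;

        memset(&wr, 0, sizeof wr);
        wr.opcode            = IBV_WR_SEND;
        wr.send_flags        = IBV_SEND_SIGNALED;
        wr.sg_list           = &sge;
        wr.num_sge           = 1;
        wr.wr.ud.ah          = ah;            /* path to the group */
        wr.wr.ud.remote_qpn  = 0xffffff;      /* the multicast QPN */
        wr.wr.ud.remote_qkey = group_qkey;    /* Q_Key from the MCMemberRecord */

        return ibv_post_send(qp, &wr, &bad_wr);
}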
From rdreier at cisco.com Tue Aug 15 13:04:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 13:04:03 -0700 Subject: [openib-general] question: ib_umem page_size In-Reply-To: <20060815192305.GC22363@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 Aug 2006 22:23:05 +0300") References: <20060815192305.GC22363@mellanox.co.il> Message-ID: Michael> Roland, could you please clarify what does the page_size Michael> field in struct ib_mem do? It gives the page size for the user memory described by the struct. The idea was that if/when someone tries to optimize for huge pages, then the low-level driver can know that a region is using huge pages without having to walk through the page list and search for the minimum physically contiguous size. - R. From swise at opengridcomputing.com Tue Aug 15 13:07:43 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 15:07:43 -0500 Subject: [openib-general] IB mcast question In-Reply-To: <000601c6c0a2$8e9ee210$5f248686@amr.corp.intel.com> References: <000601c6c0a2$8e9ee210$5f248686@amr.corp.intel.com> Message-ID: <1155672463.26332.165.camel@stevo-desktop> On Tue, 2006-08-15 at 12:39 -0700, Sean Hefty wrote: > >Why are these separated? Isn't an address handle needed for each > >destination QP? If so, then why is the remote qpn/qkey also needed to > >transmit a datagram? > > The address handle doesn't include QPN/QKey information. Maybe think of them > more as specifying the path to some port. > Ok. >From what I can tell via experimentation, the qkey of the mcast group doesn't need to have any relation to the qkeys of the qps. I was able to create a mcast group with the mc qkey==0xe00a0a0a, and 3 apps joined this group, but their qp qkeys were 0 (I changed ucma_init_ud_qp() to set the qp qkey to 0). One app sent to the mcgroup ah/qkey/qpn and the other two received the packet. Does that make sense? So maybe all we need is the concept of REUSE_PORT to allow multiple librdma users to create cm_ids with the same local port. currently this isn't allowed. If we do this, then all processes that want to exchange mcast packets would create cm_ids and do rdma_resolve_addr() with the same src port number on all systems. Senders send to the ah/remote_qpn/remote_qkey of the mcast group. This routes packets to all IB ports that have subscribers. Then since the sender's qp has the same qkey as all the group participants each qp will receive a copy of the packet. The mcast setup code in librdma doesn't need to change. IE the qkey can remain the ip mcast address. I think this will work. It is similar to UDP/IP/MCAST... Or am I all wet? whatchathink? From mst at mellanox.co.il Tue Aug 15 13:08:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Aug 2006 23:08:54 +0300 Subject: [openib-general] question: ib_umem page_size In-Reply-To: References: Message-ID: <20060815200854.GD22363@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: question: ib_umem page_size > > Michael> Roland, could you please clarify what does the page_size > Michael> field in struct ib_mem do? > > It gives the page size for the user memory described by the struct. > The idea was that if/when someone tries to optimize for huge pages, > then the low-level driver can know that a region is using huge pages > without having to walk through the page list and search for the > minimum physically contiguous size. Thoguth though. Cool, that's exactly what I'm trying to do. 
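[Editorial note: a minimal standalone sketch of that boundary trick, illustrative only and not the uverbs patch posted later in the thread. OR together every interior segment boundary of the pinned region; the lowest set bit of the result is the largest page size that divides them all.]

/* Sketch: find the largest power-of-two "page size" such that every
 * discontiguous boundary in a segment list is aligned to it.
 * seg[i] are (dma address, length) pairs; assumes nseg >= 1.
 * A real implementation would also account for the first segment's
 * start offset and clamp to the page sizes the HCA supports.
 */
struct seg {
        unsigned long addr;
        unsigned long len;
};

static unsigned long common_page_size(const struct seg *seg, int nseg,
                                      unsigned long max_page_size)
{
        unsigned long mask = 0;
        int i;

        for (i = 0; i < nseg; ++i) {
                if (i > 0)
                        mask |= seg[i].addr;              /* start of each later segment */
                if (i < nseg - 1)
                        mask |= seg[i].addr + seg[i].len; /* end of each earlier segment */
        }

        if (!mask)                      /* region is fully contiguous */
                return max_page_size;

        return mask & ~(mask - 1);      /* lowest set bit == largest common alignment */
}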
-- MST From halr at voltaire.com Tue Aug 15 13:12:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 16:12:14 -0400 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: References: <1155669832.26332.129.camel@stevo-desktop> Message-ID: <1155672734.29378.16471.camel@hal.voltaire.com> On Tue, 2006-08-15 at 15:59, Roland Dreier wrote: > > This is what I was referring to. I'm really not familiar with IP multicast > > beyond what I read in a book while coding the RDMA CM. It sounds like we might > > be able to use the QKey as the port number for the QP to mimic the behavior. > > I don't see how this could work-- to mimic IP, an app has to be able > to multicast a datagram to multiple destination ports using the same > multicast group. Since IB multicast groups have a Q_Key associated, > the only way this could work is if all QPs and MCGs share one Q_Key. Isn't it all QPs on the same MCG need to share one Q_Key ? -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Tue Aug 15 13:18:16 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 15:18:16 -0500 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155672734.29378.16471.camel@hal.voltaire.com> References: <1155669832.26332.129.camel@stevo-desktop> <1155672734.29378.16471.camel@hal.voltaire.com> Message-ID: <1155673096.26332.175.camel@stevo-desktop> On Tue, 2006-08-15 at 16:12 -0400, Hal Rosenstock wrote: > On Tue, 2006-08-15 at 15:59, Roland Dreier wrote: > > > This is what I was referring to. I'm really not familiar with IP multicast > > > beyond what I read in a book while coding the RDMA CM. It sounds like we might > > > be able to use the QKey as the port number for the QP to mimic the behavior. > > > > I don't see how this could work-- to mimic IP, an app has to be able > > to multicast a datagram to multiple destination ports using the same > > multicast group. Since IB multicast groups have a Q_Key associated, > > the only way this could work is if all QPs and MCGs share one Q_Key. > > Isn't it all QPs on the same MCG need to share one Q_Key ? > >From my experimentation, only the qkeys of the QPs need match. The MCG qkey is only needed for sending to the group. Am I confused? (often the case :) From rdreier at cisco.com Tue Aug 15 13:17:35 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 13:17:35 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155672463.26332.165.camel@stevo-desktop> (Steve Wise's message of "Tue, 15 Aug 2006 15:07:43 -0500") References: <000601c6c0a2$8e9ee210$5f248686@amr.corp.intel.com> <1155672463.26332.165.camel@stevo-desktop> Message-ID: Steve> I was able to create a mcast group with the mc Steve> qkey==0xe00a0a0a, and 3 apps joined this group, but their Steve> qp qkeys were 0 (I changed ucma_init_ud_qp() to set the qp Steve> qkey to 0). One app sent to the mcgroup ah/qkey/qpn and Steve> the other two received the packet. Does that make sense? In theory the Q_Key of a multicast group record is the Q_Key you're supposed to use when sending to the group. Of course nothing enforces this but I don't really like abusing things this way. - R. 
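[Editorial note: for reference, the receive side of that arrangement — every QP in the group carries the group's Q_Key and is attached to the MGID/MLID — looks roughly like the following libibverbs sketch. Parameters are placeholders, not code from this thread.]

/* Sketch: prepare a UD QP to receive from a multicast group.
 * The QP's own Q_Key is what the HCA checks on ingress, so it is set
 * to the group's Q_Key; ibv_attach_mcast() then asks the HCA to
 * deliver packets sent to this MGID/MLID to the QP.
 * Assumes qp is a freshly created IBV_QPT_UD QP in the RESET state.
 */
#include <string.h>
#include <infiniband/verbs.h>

static int join_receive_side(struct ibv_qp *qp, uint8_t port,
                             uint16_t pkey_index, uint32_t group_qkey,
                             union ibv_gid *mgid, uint16_t mlid)
{
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof attr);
        attr.qp_state   = IBV_QPS_INIT;
        attr.pkey_index = pkey_index;
        attr.port_num   = port;
        attr.qkey       = group_qkey;   /* must match the Q_Key in incoming packets */

        if (ibv_modify_qp(qp, &attr,
                          IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                          IBV_QP_PORT | IBV_QP_QKEY))
                return -1;

        /* ...transition to RTR/RTS and post receive buffers as usual... */

        return ibv_attach_mcast(qp, mgid, mlid);
}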
From swise at opengridcomputing.com Tue Aug 15 13:20:20 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 15:20:20 -0500 Subject: [openib-general] IB mcast question In-Reply-To: References: <000601c6c0a2$8e9ee210$5f248686@amr.corp.intel.com> <1155672463.26332.165.camel@stevo-desktop> Message-ID: <1155673220.26332.179.camel@stevo-desktop> On Tue, 2006-08-15 at 13:17 -0700, Roland Dreier wrote: > Steve> I was able to create a mcast group with the mc > Steve> qkey==0xe00a0a0a, and 3 apps joined this group, but their > Steve> qp qkeys were 0 (I changed ucma_init_ud_qp() to set the qp > Steve> qkey to 0). One app sent to the mcgroup ah/qkey/qpn and > Steve> the other two received the packet. Does that make sense? > > In theory the Q_Key of a multicast group record is the Q_Key you're > supposed to use when sending to the group. Of course nothing enforces > this but I don't really like abusing things this way. > I _did_ send the message to the qkey of the mcg. From halr at voltaire.com Tue Aug 15 13:28:16 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 16:28:16 -0400 Subject: [openib-general] IB mcast question In-Reply-To: <1155672463.26332.165.camel@stevo-desktop> References: <000601c6c0a2$8e9ee210$5f248686@amr.corp.intel.com> <1155672463.26332.165.camel@stevo-desktop> Message-ID: <1155673695.29378.16946.camel@hal.voltaire.com> On Tue, 2006-08-15 at 16:07, Steve Wise wrote: > On Tue, 2006-08-15 at 12:39 -0700, Sean Hefty wrote: > > >Why are these separated? Isn't an address handle needed for each > > >destination QP? If so, then why is the remote qpn/qkey also needed to > > >transmit a datagram? > > > > The address handle doesn't include QPN/QKey information. Maybe think of them > > more as specifying the path to some port. > > > > Ok. > > >From what I can tell via experimentation, the qkey of the mcast group > doesn't need to have any relation to the qkeys of the qps. That may be what is happening but I don't think that is correct (per the IBA spec). > I was able to create a mcast group with the mc qkey==0xe00a0a0a, Don't we need to be careful about controlled Q_Keys as well ? -- Hal > and 3 apps joined this group, but their qp qkeys were 0 (I changed > ucma_init_ud_qp() to set the qp qkey to 0). One app sent to the mcgroup > ah/qkey/qpn and the other two received the packet. Does that make > sense? > > So maybe all we need is the concept of REUSE_PORT to allow multiple > librdma users to create cm_ids with the same local port. currently this > isn't allowed. If we do this, then all processes that want to exchange > mcast packets would create cm_ids and do rdma_resolve_addr() with the > same src port number on all systems. > > Senders send to the ah/remote_qpn/remote_qkey of the mcast group. This > routes packets to all IB ports that have subscribers. Then since the > sender's qp has the same qkey as all the group participants each qp will > receive a copy of the packet. > > The mcast setup code in librdma doesn't need to change. IE the qkey can > remain the ip mcast address. > > I think this will work. It is similar to UDP/IP/MCAST... > > Or am I all wet? > > whatchathink? 
> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Tue Aug 15 13:26:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 13:26:46 -0700 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: Message-ID: <000701c6c0a9$21c1c1b0$5f248686@amr.corp.intel.com> >I don't see how this could work-- to mimic IP, an app has to be able >to multicast a datagram to multiple destination ports using the same >multicast group. Since IB multicast groups have a Q_Key associated, >the only way this could work is if all QPs and MCGs share one Q_Key. Hmm... I'd like to find a way to avoid using the same QKey for everything, but maybe it's not any worse than what we have with IP anyway. I guess if we know if a UD QP will not join a multicast group, we can use any QKey, but use a well known QKey if it will join a multicast group. What would happen in the following situation: App 1: Creates QP with QKey=22 Joins multicast group 1 with QKey=33 App 2: Creates QP with QKey=44 Joins multicast group 1 Sends to multicast group but with QKey=22 - Sean From sean.hefty at intel.com Tue Aug 15 13:33:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 13:33:46 -0700 Subject: [openib-general] IB mcast question In-Reply-To: <1155673220.26332.179.camel@stevo-desktop> Message-ID: <000801c6c0aa$1bd64b80$5f248686@amr.corp.intel.com> >> Steve> I was able to create a mcast group with the mc >> Steve> qkey==0xe00a0a0a, and 3 apps joined this group, but their >> Steve> qp qkeys were 0 (I changed ucma_init_ud_qp() to set the qp >> Steve> qkey to 0). One app sent to the mcgroup ah/qkey/qpn and >> Steve> the other two received the packet. Does that make sense? >> >> In theory the Q_Key of a multicast group record is the Q_Key you're >> supposed to use when sending to the group. Of course nothing enforces >> this but I don't really like abusing things this way. >> > >I _did_ send the message to the qkey of the mcg. I didn't think that this was supposed to work. Is the QKey going out on the wire the QKey from the send WR, or that associated with the QP? I think the QKey going out on the wire is the latter, which just happens to make it work. - Sean From rdreier at cisco.com Tue Aug 15 13:55:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 13:55:28 -0700 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <000701c6c0a9$21c1c1b0$5f248686@amr.corp.intel.com> (Sean Hefty's message of "Tue, 15 Aug 2006 13:26:46 -0700") References: <000701c6c0a9$21c1c1b0$5f248686@amr.corp.intel.com> Message-ID: > App 1: > Creates QP with QKey=22 > Joins multicast group 1 with QKey=33 > > App 2: > Creates QP with QKey=44 > Joins multicast group 1 > Sends to multicast group but with QKey=22 I think that last send is technically an IBA spec violation. - R. 
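[Editorial note: a small sketch of the Q_Key mechanics being argued about here, assuming the controlled-Q_Key (high-order bit) rule of C10-15 as quoted below and hardware that only validates the receiving QP's Q_Key. Function names are illustrative, not driver code.]

/* Sketch of UD Q_Key handling:
 * - egress: the WR's Q_Key goes on the wire, unless its high-order
 *   bit is set, in which case the sending QP's own Q_Key is used
 *   (controlled Q_Keys are reserved for privileged consumers);
 * - ingress: the packet is delivered only if the Q_Key in the BTH
 *   matches the receiving QP's Q_Key.
 */
#include <stdint.h>
#include <stdbool.h>

#define QKEY_CONTROLLED_BIT 0x80000000u

static uint32_t wire_qkey(uint32_t wr_remote_qkey, uint32_t sender_qp_qkey)
{
        if (wr_remote_qkey & QKEY_CONTROLLED_BIT)
                return sender_qp_qkey;
        return wr_remote_qkey;
}

static bool receiver_accepts(uint32_t packet_qkey, uint32_t receiver_qp_qkey)
{
        return packet_qkey == receiver_qp_qkey;
}

/* In the scenario above: App 2 posts a WR with Q_Key 22 from a QP whose
 * Q_Key is 44; 22 has the high bit clear, so 22 goes on the wire, and
 * App 1's QP (Q_Key 22) would accept it even though the group's Q_Key
 * is 33 -- the group Q_Key itself is never checked by the hardware.
 */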
From swise at opengridcomputing.com Tue Aug 15 14:13:45 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 16:13:45 -0500 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: References: <000701c6c0a9$21c1c1b0$5f248686@amr.corp.intel.com> Message-ID: <1155676425.6241.10.camel@stevo-desktop> On Tue, 2006-08-15 at 13:55 -0700, Roland Dreier wrote: > > App 1: > > Creates QP with QKey=22 > > Joins multicast group 1 with QKey=33 > > > > App 2: > > Creates QP with QKey=44 > > Joins multicast group 1 > > Sends to multicast group but with QKey=22 > > I think that last send is technically an IBA spec violation. > > - R. You guys are confusing me. "sends to multicast group but with QKey=22"... Does that mean you posted a SEND with the remote_qkey=22? According to the spec, C10-15 sez the qkey in the outgoing packet will be the qkey from the QP context IF the high order bit of the qkey is set. If the high order bit is _not_ set, then the outgoing packet will contain the qkey from the WR. (why? why?) Now, in my experiment, my mcast qkey was 0xe00a0a0a and the qp qkeys were zero. So when the sender posted a SEND with remote_qkey=0xe00a0a0a, the interface placed the qkey from sender's qp, which was zero, in the packet. That's why it worked I guess... Now my head hurts... From mst at mellanox.co.il Tue Aug 15 14:13:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Aug 2006 00:13:19 +0300 Subject: [openib-general] RFC: [PATCH untested] IB/uverbs: optimize registration for huge pages In-Reply-To: References: Message-ID: <20060815211319.GE22363@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: question: ib_umem page_size > > Michael> Roland, could you please clarify what does the page_size > Michael> field in struct ib_mem do? > > It gives the page size for the user memory described by the struct. > The idea was that if/when someone tries to optimize for huge pages, > then the low-level driver can know that a region is using huge pages > without having to walk through the page list and search for the > minimum physically contiguous size. OK, so here's a patch [warning: untested] that attempts to do this - we have customers that run out of resources when they register lots of huge pages, and this will help. How does this look? Is this the intended usage? uverbs_mem.c | 14 +++++++++++++- 1 files changed, 13 insertions(+), 1 deletion(-) -- Optimize memory registration for huge pages, by walking through the page list and searching for the minimum physically contiguous size. Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/core/uverbs_mem.c b/drivers/infiniband/core/uverbs_mem.c index efe147d..f750652 100644 --- a/drivers/infiniband/core/uverbs_mem.c +++ b/drivers/infiniband/core/uverbs_mem.c @@ -73,6 +73,8 @@ int ib_umem_get(struct ib_device *dev, s unsigned long lock_limit; unsigned long cur_base; unsigned long npages; + dma_addr_t a, seg_end; + u32 mask = 0; int ret = 0; int off; int i; @@ -87,7 +89,6 @@ int ib_umem_get(struct ib_device *dev, s mem->user_base = (unsigned long) addr; mem->length = size; mem->offset = (unsigned long) addr & ~PAGE_MASK; - mem->page_size = PAGE_SIZE; mem->writable = write; INIT_LIST_HEAD(&mem->chunk_list); @@ -149,6 +150,15 @@ int ib_umem_get(struct ib_device *dev, s goto out; } + for (i = 0; i < chunk->nents; ++i) { + a = sg_dma_adress(chunk->page_list[i]); + if ((i || off) && a != seg_end) { + mask |= seg_end; + mask |= a; + } + seg_end = a + sg_dma_len(chunk->page_list[i]); + } + ret -= chunk->nents; off += chunk->nents; list_add_tail(&chunk->list, &mem->chunk_list); @@ -157,6 +167,8 @@ int ib_umem_get(struct ib_device *dev, s ret = 0; } + mem->page_size = ffs(mask) ? 1 << (ffs(mask) - 1) : (1 << 31); + out: if (ret < 0) __ib_umem_release(dev, mem, 0); -- MST From sean.hefty at intel.com Tue Aug 15 14:18:51 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 14:18:51 -0700 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: Message-ID: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> > > App 1: > > Creates QP with QKey=22 > > Joins multicast group 1 with QKey=33 > > > > App 2: > > Creates QP with QKey=44 > > Joins multicast group 1 > > Sends to multicast group but with QKey=22 > >I think that last send is technically an IBA spec violation. I'll admit that this definitely seems like a hack, but I haven't found where it is violation. (I'll keep looking.) From 10.5.2.1, a QP doesn't need to be attached to a multicast group to initiate a send, so it seems like any potential violation is only that the QKey and MLID wouldn't match. - Sean From swise at opengridcomputing.com Tue Aug 15 14:26:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 16:26:03 -0500 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> References: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> Message-ID: <1155677163.6241.23.camel@stevo-desktop> On Tue, 2006-08-15 at 14:18 -0700, Sean Hefty wrote: > > > App 1: > > > Creates QP with QKey=22 > > > Joins multicast group 1 with QKey=33 > > > > > > App 2: > > > Creates QP with QKey=44 > > > Joins multicast group 1 > > > Sends to multicast group but with QKey=22 > > > >I think that last send is technically an IBA spec violation. > > I'll admit that this definitely seems like a hack, but I haven't found where it > is violation. (I'll keep looking.) From 10.5.2.1, a QP doesn't need to be > attached to a multicast group to initiate a send, so it seems like any potential > violation is only that the QKey and MLID wouldn't match. > > - Sean And I can't find in the IBTA spec where they talk about the qkey field of a mcast group at all. qkeys are used to validate an incoming datagram. If the qkey in the packet doesn't match the qkey of the qp, then its dropped. This is an HCA ingress issue, not a mcast issue really. 
>From what I can gather, the sender: 1) Doesn't need to be in the group (as sean pointed out) 2) posts a SEND WR with an address handle for the the mcast group, the remote_qpn == 0xfffff, and the remote_qkey == to remote qp's qkey. Now why C10-15 is there, I dunno...but i'm sure somebody around here does... :) From halr at voltaire.com Tue Aug 15 14:29:37 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 17:29:37 -0400 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> References: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> Message-ID: <1155677376.29378.18690.camel@hal.voltaire.com> On Tue, 2006-08-15 at 17:18, Sean Hefty wrote: > > > App 1: > > > Creates QP with QKey=22 > > > Joins multicast group 1 with QKey=33 > > > > > > App 2: > > > Creates QP with QKey=44 > > > Joins multicast group 1 > > > Sends to multicast group but with QKey=22 > > > >I think that last send is technically an IBA spec violation. > > I'll admit that this definitely seems like a hack, but I haven't found where it > is violation. (I'll keep looking.) The first occurrence that I find is the following: p.145 13) where it states that "Further, in addition to having the same MGID, all members of the multicast group must share the same P_Key and Q_Key. That is compliance per C4-3 on p. 143. There may be others. -- Hal > From 10.5.2.1, a QP doesn't need to be > attached to a multicast group to initiate a send, so it seems like any potential > violation is only that the QKey and MLID wouldn't match. > > - Sean From halr at voltaire.com Tue Aug 15 14:40:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 17:40:10 -0400 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155676425.6241.10.camel@stevo-desktop> References: <000701c6c0a9$21c1c1b0$5f248686@amr.corp.intel.com> <1155676425.6241.10.camel@stevo-desktop> Message-ID: <1155678009.29378.19011.camel@hal.voltaire.com> On Tue, 2006-08-15 at 17:13, Steve Wise wrote: > On Tue, 2006-08-15 at 13:55 -0700, Roland Dreier wrote: > > > > App 1: > > > Creates QP with QKey=22 > > > Joins multicast group 1 with QKey=33 > > > > > > App 2: > > > Creates QP with QKey=44 > > > Joins multicast group 1 > > > Sends to multicast group but with QKey=22 > > > > I think that last send is technically an IBA spec violation. > > > > - R. > > You guys are confusing me. "sends to multicast group but with > QKey=22"... Does that mean you posted a SEND with the remote_qkey=22? > > According to the spec, C10-15 sez the qkey in the outgoing packet will > be the qkey from the QP context IF the high order bit of the qkey is > set. If the high order bit is _not_ set, then the outgoing packet will > contain the qkey from the WR. (why? why?) The high order bit is for controlled QKeys which are only supposed to be sent by "privileged" consumers (e.g. MAD layer for QP1 handing). Also, the C10-15 requirement does not speak to additional requirements on the Q_Keys themselves like the one I cited to Sean in a previous email in terms of multicast group addressing requirements. > Now, in my experiment, my mcast qkey was 0xe00a0a0a I don't think that should be allowed from user space. -- Hal > and the qp qkeys > were zero. So when the sender posted a SEND with remote_qkey=0xe00a0a0a, > the interface placed the qkey from sender's qp, which was zero, in the > packet. That's why it worked I guess... > > Now my head hurts... 
> > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Tue Aug 15 14:40:26 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 14:40:26 -0700 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155677376.29378.18690.camel@hal.voltaire.com> Message-ID: <000001c6c0b3$6c167d50$5f248686@amr.corp.intel.com> >p.145 13) where it states that "Further, in addition to having the same >MGID, all members of the multicast group must share the same P_Key and >Q_Key. > >That is compliance per C4-3 on p. 143. Thanks - my take is that the compliance breaks when the QPs are first created with the wrong QKey. Then, as Roland said, in order to support a QP using multiple multicast groups, we need all QKeys to be the same (for the RDMA CM). If we want to do anything with port numbers, we probably need to fold those into the MGID. - Sean From swise at opengridcomputing.com Tue Aug 15 14:41:23 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 16:41:23 -0500 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155677376.29378.18690.camel@hal.voltaire.com> References: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> <1155677376.29378.18690.camel@hal.voltaire.com> Message-ID: <1155678083.6241.32.camel@stevo-desktop> On Tue, 2006-08-15 at 17:29 -0400, Hal Rosenstock wrote: > On Tue, 2006-08-15 at 17:18, Sean Hefty wrote: > > > > App 1: > > > > Creates QP with QKey=22 > > > > Joins multicast group 1 with QKey=33 > > > > > > > > App 2: > > > > Creates QP with QKey=44 > > > > Joins multicast group 1 > > > > Sends to multicast group but with QKey=22 > > > > > >I think that last send is technically an IBA spec violation. > > > > I'll admit that this definitely seems like a hack, but I haven't found where it > > is violation. (I'll keep looking.) > > The first occurrence that I find is the following: > > p.145 13) where it states that "Further, in addition to having the same > MGID, all members of the multicast group must share the same P_Key and > Q_Key. > Where does it say that a mcast group is associated with any qkey at all? 13) just sez that all qp's that want to exchange datagrams must have the same pkey/qkey. That's the normal UD paradigm. From halr at voltaire.com Tue Aug 15 14:44:38 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 17:44:38 -0400 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155677163.6241.23.camel@stevo-desktop> References: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> <1155677163.6241.23.camel@stevo-desktop> Message-ID: <1155678276.29378.19140.camel@hal.voltaire.com> On Tue, 2006-08-15 at 17:26, Steve Wise wrote: > On Tue, 2006-08-15 at 14:18 -0700, Sean Hefty wrote: > > > > App 1: > > > > Creates QP with QKey=22 > > > > Joins multicast group 1 with QKey=33 > > > > > > > > App 2: > > > > Creates QP with QKey=44 > > > > Joins multicast group 1 > > > > Sends to multicast group but with QKey=22 > > > > > >I think that last send is technically an IBA spec violation. > > > > I'll admit that this definitely seems like a hack, but I haven't found where it > > is violation. (I'll keep looking.) 
From 10.5.2.1, a QP doesn't need to be > > attached to a multicast group to initiate a send, so it seems like any potential > > violation is only that the QKey and MLID wouldn't match. > > > > - Sean > > And I can't find in the IBTA spec where they talk about the qkey field > of a mcast group at all. Not so fast... I know the requirements are spread out through the spec and it's hard to piece them all together. I cited where this one is located. > qkeys are used to validate an incoming > datagram. If the qkey in the packet doesn't match the qkey of the qp, > then its dropped. This is an HCA ingress issue, not a mcast issue > really. > > >From what I can gather, the sender: > > 1) Doesn't need to be in the group (as sean pointed out) Yes it does. There are send only ("non") members in the JoinState component of the MCMemberRecord. This can be used to optimize the MC routing (as that port never needs to receive but needs to be able to send on the MLID). -- Hal > 2) posts a SEND WR with an address handle for the the mcast group, the > remote_qpn == 0xfffff, and the remote_qkey == to remote qp's qkey. > > > Now why C10-15 is there, I dunno...but i'm sure somebody around here > does... > > :) > > > > > > > From mdidomenico at silverstorm.com Tue Aug 15 14:46:48 2006 From: mdidomenico at silverstorm.com (Di Domenico, Michael) Date: Tue, 15 Aug 2006 17:46:48 -0400 Subject: [openib-general] debug version In-Reply-To: <1155678009.29378.19011.camel@hal.voltaire.com> Message-ID: Is there a way to use the build.sh script that comes with OFED to build a debug version of all the things listed in the ofed.conf file? From halr at voltaire.com Tue Aug 15 14:47:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 17:47:27 -0400 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <000001c6c0b3$6c167d50$5f248686@amr.corp.intel.com> References: <000001c6c0b3$6c167d50$5f248686@amr.corp.intel.com> Message-ID: <1155678446.29378.19237.camel@hal.voltaire.com> On Tue, 2006-08-15 at 17:40, Sean Hefty wrote: > >p.145 13) where it states that "Further, in addition to having the same > >MGID, all members of the multicast group must share the same P_Key and > >Q_Key. > > > >That is compliance per C4-3 on p. 143. > > Thanks - my take is that the compliance breaks when the QPs are first created > with the wrong QKey. Yes, I think so too. > Then, as Roland said, in order to support a QP using multiple multicast groups, > we need all QKeys to be the same (for the RDMA CM). Does 1 QP need to support multiple multicast groups ? If so, I agree. That was the point I missed before. > If we want to do anything > with port numbers, we probably need to fold those into the MGID. I'm not quite following you yet on this. 
-- Hal > - Sean From halr at voltaire.com Tue Aug 15 14:49:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 17:49:04 -0400 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155678083.6241.32.camel@stevo-desktop> References: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> <1155677376.29378.18690.camel@hal.voltaire.com> <1155678083.6241.32.camel@stevo-desktop> Message-ID: <1155678543.29378.19271.camel@hal.voltaire.com> On Tue, 2006-08-15 at 17:41, Steve Wise wrote: > On Tue, 2006-08-15 at 17:29 -0400, Hal Rosenstock wrote: > > On Tue, 2006-08-15 at 17:18, Sean Hefty wrote: > > > > > App 1: > > > > > Creates QP with QKey=22 > > > > > Joins multicast group 1 with QKey=33 > > > > > > > > > > App 2: > > > > > Creates QP with QKey=44 > > > > > Joins multicast group 1 > > > > > Sends to multicast group but with QKey=22 > > > > > > > >I think that last send is technically an IBA spec violation. > > > > > > I'll admit that this definitely seems like a hack, but I haven't found where it > > > is violation. (I'll keep looking.) > > > > The first occurrence that I find is the following: > > > > p.145 13) where it states that "Further, in addition to having the same > > MGID, all members of the multicast group must share the same P_Key and > > Q_Key. > > > > Where does it say that a mcast group is associated with any qkey at all? In order to create a multicast group, the Q_Key needs to be supplied. See p. 912-3 o15-0.2.3 -- Hal > 13) just sez that all qp's that want to exchange datagrams must have the > same pkey/qkey. That's the normal UD paradigm. > > > > From sean.hefty at intel.com Tue Aug 15 14:48:39 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 15 Aug 2006 14:48:39 -0700 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155678276.29378.19140.camel@hal.voltaire.com> Message-ID: <000101c6c0b4$9210d220$5f248686@amr.corp.intel.com> >> 1) Doesn't need to be in the group (as sean pointed out) > >Yes it does. There are send only ("non") members in the JoinState >component of the MCMemberRecord. This can be used to optimize the MC >routing (as that port never needs to receive but needs to be able to >send on the MLID). I believe that the local port needs to be at least a SendOnly member (for routing purposes), but the QP does not need to be attached to the group. - Sean From swise at opengridcomputing.com Tue Aug 15 14:49:08 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 16:49:08 -0500 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155678009.29378.19011.camel@hal.voltaire.com> References: <000701c6c0a9$21c1c1b0$5f248686@amr.corp.intel.com> <1155676425.6241.10.camel@stevo-desktop> <1155678009.29378.19011.camel@hal.voltaire.com> Message-ID: <1155678548.6241.37.camel@stevo-desktop> > > According to the spec, C10-15 sez the qkey in the outgoing packet will > > be the qkey from the QP context IF the high order bit of the qkey is > > set. If the high order bit is _not_ set, then the outgoing packet will > > contain the qkey from the WR. (why? why?) > > The high order bit is for controlled QKeys which are only supposed to be > sent by "privileged" consumers (e.g. MAD layer for QP1 handing). I see. > Also, the C10-15 requirement does not speak to additional requirements > on the Q_Keys themselves like the one I cited to Sean in a previous > email in terms of multicast group addressing requirements. 
> > > Now, in my experiment, my mcast qkey was 0xe00a0a0a > > I don't think that should be allowed from user space. > Ah. So rdma cm needs to not use these special qkeys... From swise at opengridcomputing.com Tue Aug 15 14:51:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 15 Aug 2006 16:51:34 -0500 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155678543.29378.19271.camel@hal.voltaire.com> References: <000901c6c0b0$68084f70$5f248686@amr.corp.intel.com> <1155677376.29378.18690.camel@hal.voltaire.com> <1155678083.6241.32.camel@stevo-desktop> <1155678543.29378.19271.camel@hal.voltaire.com> Message-ID: <1155678694.6241.40.camel@stevo-desktop> > > > > Where does it say that a mcast group is associated with any qkey at all? > > In order to create a multicast group, the Q_Key needs to be supplied. > See p. 912-3 o15-0.2.3 > Thanks Hal. I'm beginning to understand... :-) From louis.laborde at hp.com Tue Aug 15 14:53:13 2006 From: louis.laborde at hp.com (Louis Laborde) Date: Tue, 15 Aug 2006 14:53:13 -0700 Subject: [openib-general] ib_get_dma_mr and remote access Message-ID: <44E24249.20402@hp.com> Hi there, I would like to know if any application today uses ib_get_dma_mr verb with remote access flag(s). It seems to me that such a dependency could first, create a security hole and second, make this verb hard to implement for some RNICs. If only local access is required for this "special" memory region, can it be implemented with the "Reserved LKey" or "STag0", whichever way it's called? Thanks, Louis From halr at voltaire.com Tue Aug 15 14:54:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2006 17:54:13 -0400 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155678548.6241.37.camel@stevo-desktop> References: <000701c6c0a9$21c1c1b0$5f248686@amr.corp.intel.com> <1155676425.6241.10.camel@stevo-desktop> <1155678009.29378.19011.camel@hal.voltaire.com> <1155678548.6241.37.camel@stevo-desktop> Message-ID: <1155678853.29378.19434.camel@hal.voltaire.com> On Tue, 2006-08-15 at 17:49, Steve Wise wrote: > > > According to the spec, C10-15 sez the qkey in the outgoing packet will > > > be the qkey from the QP context IF the high order bit of the qkey is > > > set. If the high order bit is _not_ set, then the outgoing packet will > > > contain the qkey from the WR. (why? why?) > > > > The high order bit is for controlled QKeys which are only supposed to be > > sent by "privileged" consumers (e.g. MAD layer for QP1 handing). > > I see. > > > Also, the C10-15 requirement does not speak to additional requirements > > on the Q_Keys themselves like the one I cited to Sean in a previous > > email in terms of multicast group addressing requirements. > > > > > Now, in my experiment, my mcast qkey was 0xe00a0a0a > > > > I don't think that should be allowed from user space. > > > > Ah. So rdma cm needs to not use these special qkeys... Yes, that's what I think. -- Hal > > > From ralphc at pathscale.com Tue Aug 15 15:16:12 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 15 Aug 2006 15:16:12 -0700 Subject: [openib-general] [PATCHv2] IB/ipath - performance improvements via mmap of queues Message-ID: <1155680172.20325.519.camel@brick.pathscale.com> This updated version of the ib_ipath mmap patch uses vmalloc_user() and remap_vmalloc_range() as Roland suggested. 
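[Editorial note: for readers unfamiliar with those two helpers, the general pattern the patch builds on is roughly the following generic sketch; the structure and function names are illustrative, not the ipath code itself.]

/* Sketch: export a kernel buffer to user space via mmap.
 * The buffer is allocated with vmalloc_user(), which zeroes it and
 * marks it suitable for user mapping; the driver's mmap handler then
 * maps the whole allocation with remap_vmalloc_range().
 */
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

struct shared_queue {
        void *buf;              /* vmalloc_user() allocation */
        unsigned long size;     /* page-aligned size of buf */
};

static int shared_queue_alloc(struct shared_queue *q, unsigned long size)
{
        q->size = PAGE_ALIGN(size);
        q->buf  = vmalloc_user(q->size);
        return q->buf ? 0 : -ENOMEM;
}

static int shared_queue_mmap(struct shared_queue *q,
                             struct vm_area_struct *vma)
{
        unsigned long len = vma->vm_end - vma->vm_start;

        if (len > q->size)
                return -EINVAL;

        return remap_vmalloc_range(vma, q->buf, 0);
}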
Signed-off-by: Ralph Campbell diff -r dcc321d1340a drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/Makefile Mon Aug 14 14:56:07 2006 -0700 @@ -25,6 +25,7 @@ ib_ipath-y := \ ipath_cq.o \ ipath_keys.o \ ipath_mad.o \ + ipath_mmap.o \ ipath_mr.o \ ipath_qp.o \ ipath_rc.o \ diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_cq.c --- a/drivers/infiniband/hw/ipath/ipath_cq.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_cq.c Mon Aug 14 14:56:07 2006 -0700 @@ -42,20 +42,28 @@ * @entry: work completion entry to add * @sig: true if @entry is a solicitated entry * - * This may be called with one of the qp->s_lock or qp->r_rq.lock held. + * This may be called with qp->s_lock held. */ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited) { + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; + u32 head; u32 next; spin_lock_irqsave(&cq->lock, flags); - if (cq->head == cq->ibcq.cqe) + /* + * Note that the head pointer might be writable by user processes. + * Take care to verify it is a sane value. + */ + head = wc->head; + if (head >= (unsigned) cq->ibcq.cqe) { + head = cq->ibcq.cqe; next = 0; - else - next = cq->head + 1; - if (unlikely(next == cq->tail)) { + } else + next = head + 1; + if (unlikely(next == wc->tail)) { spin_unlock_irqrestore(&cq->lock, flags); if (cq->ibcq.event_handler) { struct ib_event ev; @@ -67,8 +75,8 @@ void ipath_cq_enter(struct ipath_cq *cq, } return; } - cq->queue[cq->head] = *entry; - cq->head = next; + wc->queue[head] = *entry; + wc->head = next; if (cq->notify == IB_CQ_NEXT_COMP || (cq->notify == IB_CQ_SOLICITED && solicited)) { @@ -101,19 +109,20 @@ int ipath_poll_cq(struct ib_cq *ibcq, in int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) { struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; int npolled; spin_lock_irqsave(&cq->lock, flags); for (npolled = 0; npolled < num_entries; ++npolled, ++entry) { - if (cq->tail == cq->head) + if (wc->tail == wc->head) break; - *entry = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + *entry = wc->queue[wc->tail]; + if (wc->tail >= cq->ibcq.cqe) + wc->tail = 0; else - cq->tail++; + wc->tail++; } spin_unlock_irqrestore(&cq->lock, flags); @@ -160,38 +169,74 @@ struct ib_cq *ipath_create_cq(struct ib_ { struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_cq *cq; - struct ib_wc *wc; + struct ipath_cq_wc *wc; struct ib_cq *ret; if (entries > ib_ipath_max_cqes) { ret = ERR_PTR(-EINVAL); - goto bail; + goto done; } if (dev->n_cqs_allocated == ib_ipath_max_cqs) { ret = ERR_PTR(-ENOMEM); - goto bail; - } - - /* - * Need to use vmalloc() if we want to support large #s of - * entries. - */ + goto done; + } + + /* Allocate the completion queue structure. */ cq = kmalloc(sizeof(*cq), GFP_KERNEL); if (!cq) { ret = ERR_PTR(-ENOMEM); - goto bail; - } - - /* - * Need to use vmalloc() if we want to support large #s of entries. - */ - wc = vmalloc(sizeof(*wc) * (entries + 1)); + goto done; + } + + /* + * Allocate the completion queue entries and head/tail pointers. + * This is allocated separately so that it can be resized and + * also mapped into user space. + * We need to use vmalloc() in order to support mmap and large + * numbers of entries. 
+ */ + wc = vmalloc_user(sizeof(*wc) + sizeof(struct ib_wc) * entries); if (!wc) { - kfree(cq); ret = ERR_PTR(-ENOMEM); - goto bail; - } + goto bail_cq; + } + + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->outlen >= sizeof(__u64)) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) wc; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto bail_wc; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto bail_wc; + } + cq->ip = ip; + ip->context = context; + ip->obj = wc; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * entries); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + cq->ip = NULL; + /* * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe. * The number of entries should be >= the number requested or return @@ -202,15 +247,22 @@ struct ib_cq *ipath_create_cq(struct ib_ cq->triggered = 0; spin_lock_init(&cq->lock); tasklet_init(&cq->comptask, send_complete, (unsigned long)cq); - cq->head = 0; - cq->tail = 0; + wc->head = 0; + wc->tail = 0; cq->queue = wc; ret = &cq->ibcq; dev->n_cqs_allocated++; - -bail: + goto done; + +bail_wc: + vfree(wc); + +bail_cq: + kfree(cq); + +done: return ret; } @@ -229,7 +281,10 @@ int ipath_destroy_cq(struct ib_cq *ibcq) tasklet_kill(&cq->comptask); dev->n_cqs_allocated--; - vfree(cq->queue); + if (cq->ip) + kref_put(&cq->ip->ref, ipath_release_mmap_info); + else + vfree(cq->queue); kfree(cq); return 0; @@ -253,7 +308,7 @@ int ipath_req_notify_cq(struct ib_cq *ib spin_lock_irqsave(&cq->lock, flags); /* * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow - * any other transitions. + * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). */ if (cq->notify != IB_CQ_NEXT_COMP) cq->notify = notify; @@ -264,46 +319,81 @@ int ipath_resize_cq(struct ib_cq *ibcq, int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); - struct ib_wc *wc, *old_wc; - u32 n; + struct ipath_cq_wc *old_wc = cq->queue; + struct ipath_cq_wc *wc; + u32 head, tail, n; int ret; /* * Need to use vmalloc() if we want to support large #s of entries. */ - wc = vmalloc(sizeof(*wc) * (cqe + 1)); + wc = vmalloc_user(sizeof(*wc) + sizeof(struct ib_wc) * cqe); if (!wc) { ret = -ENOMEM; goto bail; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->outlen >= sizeof(__u64)) { + __u64 offset = (__u64) wc; + + ret = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (ret) + goto bail; + } + spin_lock_irq(&cq->lock); - if (cq->head < cq->tail) - n = cq->ibcq.cqe + 1 + cq->head - cq->tail; + /* + * Make sure head and tail are sane since they + * might be user writable. 
+ */ + head = old_wc->head; + if (head > (u32) cq->ibcq.cqe) + head = (u32) cq->ibcq.cqe; + tail = old_wc->tail; + if (tail > (u32) cq->ibcq.cqe) + tail = (u32) cq->ibcq.cqe; + if (head < tail) + n = cq->ibcq.cqe + 1 + head - tail; else - n = cq->head - cq->tail; + n = head - tail; if (unlikely((u32)cqe < n)) { spin_unlock_irq(&cq->lock); vfree(wc); ret = -EOVERFLOW; goto bail; } - for (n = 0; cq->tail != cq->head; n++) { - wc[n] = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + for (n = 0; tail != head; n++) { + wc->queue[n] = old_wc->queue[tail]; + if (tail == (u32) cq->ibcq.cqe) + tail = 0; else - cq->tail++; + tail++; } cq->ibcq.cqe = cqe; - cq->head = n; - cq->tail = 0; - old_wc = cq->queue; + wc->head = n; + wc->tail = 0; cq->queue = wc; spin_unlock_irq(&cq->lock); vfree(old_wc); + if (cq->ip) { + struct ipath_ibdev *dev = to_idev(ibcq->device); + struct ipath_mmap_info *ip = cq->ip; + + ip->obj = wc; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * cqe); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + ret = 0; bail: diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_mmap.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_mmap.c Mon Aug 14 14:56:07 2006 -0700 @@ -0,0 +1,122 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include + +#include "ipath_verbs.h" + +/** + * ipath_release_mmap_info - free mmap info structure + * @ref: a pointer to the kref within struct ipath_mmap_info + */ +void ipath_release_mmap_info(struct kref *ref) +{ + struct ipath_mmap_info *ip = + container_of(ref, struct ipath_mmap_info, ref); + + vfree(ip->obj); + kfree(ip); +} + +/* + * open and close keep track of how many times the CQ is mapped, + * to avoid releasing it. 
+ */ +static void ipath_vma_open(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + kref_get(&ip->ref); + ip->mmap_cnt++; +} + +static void ipath_vma_close(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + ip->mmap_cnt--; + kref_put(&ip->ref, ipath_release_mmap_info); +} + +static struct vm_operations_struct ipath_vm_ops = { + .open = ipath_vma_open, + .close = ipath_vma_close, +}; + +/** + * ipath_mmap - create a new mmap region + * @context: the IB user context of the process making the mmap() call + * @vma: the VMA to be initialized + * Return zero if the mmap is OK. Otherwise, return an errno. + */ +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + struct ipath_ibdev *dev = to_idev(context->device); + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + unsigned long size = vma->vm_end - vma->vm_start; + struct ipath_mmap_info *ip, **pp; + int ret = -EINVAL; + + /* + * Search the device's list of objects waiting for a mmap call. + * Normally, this list is very short since a call to create a + * CQ, QP, or SRQ is soon followed by a call to mmap(). + */ + spin_lock_irq(&dev->pending_lock); + for (pp = &dev->pending_mmaps; (ip = *pp); pp = &ip->next) { + /* Only the creator is allowed to mmap the object */ + if (context != ip->context || (void *) offset != ip->obj) + continue; + /* Don't allow a mmap larger than the object. */ + if (size > ip->size) + break; + + *pp = ip->next; + spin_unlock_irq(&dev->pending_lock); + + ret = remap_vmalloc_range(vma, ip->obj, 0); + if (ret) + goto done; + vma->vm_ops = &ipath_vm_ops; + vma->vm_private_data = ip; + ipath_vma_open(vma); + goto done; + } + spin_unlock_irq(&dev->pending_lock); +done: + return ret; +} diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Mon Aug 14 14:56:07 2006 -0700 @@ -35,7 +35,7 @@ #include #include "ipath_verbs.h" -#include "ipath_common.h" +#include "ipath_kernel.h" #define BITS_PER_PAGE (PAGE_SIZE*BITS_PER_BYTE) #define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1) @@ -43,19 +43,6 @@ (off)) #define find_next_offset(map, off) find_next_zero_bit((map)->page, \ BITS_PER_PAGE, off) - -#define TRANS_INVALID 0 -#define TRANS_ANY2RST 1 -#define TRANS_RST2INIT 2 -#define TRANS_INIT2INIT 3 -#define TRANS_INIT2RTR 4 -#define TRANS_RTR2RTS 5 -#define TRANS_RTS2RTS 6 -#define TRANS_SQERR2RTS 7 -#define TRANS_ANY2ERR 8 -#define TRANS_RTS2SQD 9 /* XXX Wait for expected ACKs & signal event */ -#define TRANS_SQD2SQD 10 /* error if not drained & parameter change */ -#define TRANS_SQD2RTS 11 /* error if not drained */ /* * Convert the AETH credit code into the number of credits. 
@@ -355,8 +342,10 @@ static void ipath_reset_qp(struct ipath_ qp->s_last = 0; qp->s_ssn = 1; qp->s_lsn = 0; - qp->r_rq.head = 0; - qp->r_rq.tail = 0; + if (qp->r_rq.wq) { + qp->r_rq.wq->head = 0; + qp->r_rq.wq->tail = 0; + } qp->r_reuse_sge = 0; } @@ -410,15 +399,32 @@ void ipath_error_qp(struct ipath_qp *qp) qp->s_hdrwords = 0; qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - wc.opcode = IB_WC_RECV; - spin_lock(&qp->r_rq.lock); - while (qp->r_rq.tail != qp->r_rq.head) { - wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; - if (++qp->r_rq.tail >= qp->r_rq.size) - qp->r_rq.tail = 0; - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); - } - spin_unlock(&qp->r_rq.lock); + if (qp->r_rq.wq) { + struct ipath_rwq *wq; + u32 head; + u32 tail; + + spin_lock(&qp->r_rq.lock); + + /* sanity check pointers before trusting them */ + wq = qp->r_rq.wq; + head = wq->head; + if (head >= qp->r_rq.size) + head = 0; + tail = wq->tail; + if (tail >= qp->r_rq.size) + tail = 0; + wc.opcode = IB_WC_RECV; + while (tail != head) { + wc.wr_id = get_rwqe_ptr(&qp->r_rq, tail)->wr_id; + if (++tail >= qp->r_rq.size) + tail = 0; + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + } + wq->tail = tail; + + spin_unlock(&qp->r_rq.lock); + } } /** @@ -426,11 +432,12 @@ void ipath_error_qp(struct ipath_qp *qp) * @ibqp: the queue pair who's attributes we're modifying * @attr: the new attributes * @attr_mask: the mask of attributes to modify + * @udata: user data for ipathverbs.so * * Returns 0 on success, otherwise returns an errno. */ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask) + int attr_mask, struct ib_udata *udata) { struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); @@ -543,7 +550,7 @@ int ipath_query_qp(struct ib_qp *ibqp, s attr->dest_qp_num = qp->remote_qpn; attr->qp_access_flags = qp->qp_access_flags; attr->cap.max_send_wr = qp->s_size - 1; - attr->cap.max_recv_wr = qp->r_rq.size - 1; + attr->cap.max_recv_wr = qp->ibqp.srq ? 0 : qp->r_rq.size - 1; attr->cap.max_send_sge = qp->s_max_sge; attr->cap.max_recv_sge = qp->r_rq.max_sge; attr->cap.max_inline_data = 0; @@ -596,13 +603,23 @@ __be32 ipath_compute_aeth(struct ipath_q } else { u32 min, max, x; u32 credits; - + struct ipath_rwq *wq = qp->r_rq.wq; + u32 head; + u32 tail; + + /* sanity check pointers before trusting them */ + head = wq->head; + if (head >= qp->r_rq.size) + head = 0; + tail = wq->tail; + if (tail >= qp->r_rq.size) + tail = 0; /* * Compute the number of credits available (RWQEs). * XXX Not holding the r_rq.lock here so there is a small * chance that the pair of reads are not atomic. 
*/ - credits = qp->r_rq.head - qp->r_rq.tail; + credits = head - tail; if ((int)credits < 0) credits += qp->r_rq.size; /* @@ -679,27 +696,37 @@ struct ib_qp *ipath_create_qp(struct ib_ case IB_QPT_UD: case IB_QPT_SMI: case IB_QPT_GSI: - qp = kmalloc(sizeof(*qp), GFP_KERNEL); + sz = sizeof(*qp); + if (init_attr->srq) { + struct ipath_srq *srq = to_isrq(init_attr->srq); + + sz += sizeof(*qp->r_sg_list) * + srq->rq.max_sge; + } else + sz += sizeof(*qp->r_sg_list) * + init_attr->cap.max_recv_sge; + qp = kmalloc(sz, GFP_KERNEL); if (!qp) { - vfree(swq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto bail_swq; } if (init_attr->srq) { + sz = 0; qp->r_rq.size = 0; qp->r_rq.max_sge = 0; qp->r_rq.wq = NULL; + init_attr->cap.max_recv_wr = 0; + init_attr->cap.max_recv_sge = 0; } else { qp->r_rq.size = init_attr->cap.max_recv_wr + 1; qp->r_rq.max_sge = init_attr->cap.max_recv_sge; - sz = (sizeof(struct ipath_sge) * qp->r_rq.max_sge) + + sz = (sizeof(struct ib_sge) * qp->r_rq.max_sge) + sizeof(struct ipath_rwqe); - qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); + qp->r_rq.wq = vmalloc_user(sizeof(struct ipath_rwq) + + qp->r_rq.size * sz); if (!qp->r_rq.wq) { - kfree(qp); - vfree(swq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto bail_qp; } } @@ -725,12 +752,10 @@ struct ib_qp *ipath_create_qp(struct ib_ err = ipath_alloc_qpn(&dev->qp_table, qp, init_attr->qp_type); if (err) { - vfree(swq); - vfree(qp->r_rq.wq); - kfree(qp); ret = ERR_PTR(err); - goto bail; - } + goto bail_rwq; + } + qp->ip = NULL; ipath_reset_qp(qp); /* Tell the core driver that the kernel SMA is present. */ @@ -747,8 +772,51 @@ struct ib_qp *ipath_create_qp(struct ib_ init_attr->cap.max_inline_data = 0; + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->outlen >= sizeof(__u64)) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) qp->r_rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto bail_rwq; + } + + if (qp->r_rq.wq) { + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto bail_rwq; + } + qp->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = qp->r_rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + qp->r_rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + } + ret = &qp->ibqp; - + goto bail; + +bail_rwq: + vfree(qp->r_rq.wq); +bail_qp: + kfree(qp); +bail_swq: + vfree(swq); bail: return ret; } @@ -772,11 +840,9 @@ int ipath_destroy_qp(struct ib_qp *ibqp) if (qp->ibqp.qp_type == IB_QPT_SMI) ipath_layer_set_verbs_flags(dev->dd, 0); - spin_lock_irqsave(&qp->r_rq.lock, flags); - spin_lock(&qp->s_lock); + spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); + spin_unlock_irqrestore(&qp->s_lock, flags); /* Stop the sending tasklet. 
*/ tasklet_kill(&qp->s_task); @@ -797,8 +863,11 @@ int ipath_destroy_qp(struct ib_qp *ibqp) if (atomic_read(&qp->refcount) != 0) ipath_free_qp(&dev->qp_table, qp); + if (qp->ip) + kref_put(&qp->ip->ref, ipath_release_mmap_info); + else + vfree(qp->r_rq.wq); vfree(qp->s_wq); - vfree(qp->r_rq.wq); kfree(qp); return 0; } diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Mon Aug 14 14:56:07 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ipath_common.h" +#include "ipath_kernel.h" /* * Convert the AETH RNR timeout code into the number of milliseconds. @@ -106,6 +106,54 @@ void ipath_insert_rnr_queue(struct ipath spin_unlock_irqrestore(&dev->pending_lock, flags); } +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + qp->r_len = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + &qp->r_sg_list[j], &wqe->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + qp->r_len += wqe->sg_list[i].length; + j++; + } + qp->r_sge.sge = qp->r_sg_list[0]; + qp->r_sge.sg_list = qp->r_sg_list + 1; + qp->r_sge.num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} + /** * ipath_get_rwqe - copy the next RWQE into the QP's RWQE * @qp: the QP @@ -119,71 +167,71 @@ int ipath_get_rwqe(struct ipath_qp *qp, { unsigned long flags; struct ipath_rq *rq; + struct ipath_rwq *wq; struct ipath_srq *srq; struct ipath_rwqe *wqe; - int ret = 1; - - if (!qp->ibqp.srq) { + void (*handler)(struct ib_event *, void *); + u32 tail; + int ret; + + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; + rq = &srq->rq; + } else { + srq = NULL; + handler = NULL; rq = &qp->r_rq; - spin_lock_irqsave(&rq->lock, flags); - - if (unlikely(rq->tail == rq->head)) { + } + + spin_lock_irqsave(&rq->lock, flags); + wq = rq->wq; + tail = wq->tail; + /* Validate tail before using it since it is user writable. 
*/ + if (tail >= rq->size) + tail = 0; + do { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); ret = 0; - goto done; - } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - goto done; - } - - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - spin_lock_irqsave(&rq->lock, flags); - - if (unlikely(rq->tail == rq->head)) { - ret = 0; - goto done; - } - wqe = get_rwqe_ptr(rq, rq->tail); + goto bail; + } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + } while (!wr_id_only && !init_sge(qp, wqe)); qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq->ibsrq.event_handler) { - struct ib_event ev; + wq->tail = tail; + + ret = 1; + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. + */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { + struct ib_event ev; + srq->limit = 0; spin_unlock_irqrestore(&rq->lock, flags); ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); + handler(&ev, srq->ibsrq.srq_context); goto bail; } } - -done: spin_unlock_irqrestore(&rq->lock, flags); + bail: return ret; } diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_srq.c --- a/drivers/infiniband/hw/ipath/ipath_srq.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_srq.c Mon Aug 14 14:56:07 2006 -0700 @@ -48,66 +48,39 @@ int ipath_post_srq_receive(struct ib_srq struct ib_recv_wr **bad_wr) { struct ipath_srq *srq = to_isrq(ibsrq); - struct ipath_ibdev *dev = to_idev(ibsrq->device); + struct ipath_rwq *wq; unsigned long flags; int ret; for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; - - if (wr->num_sge > srq->rq.max_sge) { + int i; + + if ((unsigned) wr->num_sge > srq->rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&srq->rq.lock, flags); - next = srq->rq.head + 1; + wq = srq->rq.wq; + next = wq->head + 1; if (next >= srq->rq.size) next = 0; - if (next == srq->rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&srq->rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); + wqe = get_rwqe_ptr(&srq->rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(srq->ibsrq.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok(&dev->lk_table, - &wqe->sg_list[j], - &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge 
= j; - srq->rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&srq->rq.lock, flags); } ret = 0; @@ -133,53 +106,95 @@ struct ib_srq *ipath_create_srq(struct i if (dev->n_srqs_allocated == ib_ipath_max_srqs) { ret = ERR_PTR(-ENOMEM); - goto bail; + goto done; } if (srq_init_attr->attr.max_wr == 0) { ret = ERR_PTR(-EINVAL); - goto bail; + goto done; } if ((srq_init_attr->attr.max_sge > ib_ipath_max_srq_sges) || (srq_init_attr->attr.max_wr > ib_ipath_max_srq_wrs)) { ret = ERR_PTR(-EINVAL); - goto bail; + goto done; } srq = kmalloc(sizeof(*srq), GFP_KERNEL); if (!srq) { ret = ERR_PTR(-ENOMEM); - goto bail; + goto done; } /* * Need to use vmalloc() if we want to support large #s of entries. */ srq->rq.size = srq_init_attr->attr.max_wr + 1; - sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge + + srq->rq.max_sge = srq_init_attr->attr.max_sge; + sz = sizeof(struct ib_sge) * srq->rq.max_sge + sizeof(struct ipath_rwqe); - srq->rq.wq = vmalloc(srq->rq.size * sz); + srq->rq.wq = vmalloc_user(sizeof(struct ipath_rwq) + srq->rq.size * sz); if (!srq->rq.wq) { - kfree(srq); ret = ERR_PTR(-ENOMEM); - goto bail; - } + goto bail_srq; + } + + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->outlen >= sizeof(__u64)) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) srq->rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto bail_wq; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto bail_wq; + } + srq->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = srq->rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + srq->rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + srq->ip = NULL; /* * ib_create_srq() will initialize srq->ibsrq. */ spin_lock_init(&srq->rq.lock); - srq->rq.head = 0; - srq->rq.tail = 0; + srq->rq.wq->head = 0; + srq->rq.wq->tail = 0; srq->rq.max_sge = srq_init_attr->attr.max_sge; srq->limit = srq_init_attr->attr.srq_limit; + dev->n_srqs_allocated++; + ret = &srq->ibsrq; - - dev->n_srqs_allocated++; - -bail: + goto done; + +bail_wq: + vfree(srq->rq.wq); + +bail_srq: + kfree(srq); + +done: return ret; } @@ -188,83 +203,130 @@ bail: * @ibsrq: the SRQ to modify * @attr: the new attributes of the SRQ * @attr_mask: indicates which attributes to modify + * @udata: user data for ipathverbs.so */ int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) -{ - struct ipath_srq *srq = to_isrq(ibsrq); - unsigned long flags; - int ret; - - if (attr_mask & IB_SRQ_MAX_WR) + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + int ret = 0; + + if (attr_mask & IB_SRQ_MAX_WR) { + struct ipath_rwq *owq; + struct ipath_rwq *wq; + struct ipath_rwqe *p; + u32 sz, size, n, head, tail; + + /* Check that the requested sizes are below the limits. */ if ((attr->max_wr > ib_ipath_max_srq_wrs) || - (attr->max_sge > srq->rq.max_sge)) { + ((attr_mask & IB_SRQ_LIMIT) ? 
+ attr->srq_limit : srq->limit) > attr->max_wr) { ret = -EINVAL; goto bail; } - if (attr_mask & IB_SRQ_LIMIT) - if (attr->srq_limit >= srq->rq.size) { - ret = -EINVAL; - goto bail; - } - - if (attr_mask & IB_SRQ_MAX_WR) { - struct ipath_rwqe *wq, *p; - u32 sz, size, n; - sz = sizeof(struct ipath_rwqe) + - attr->max_sge * sizeof(struct ipath_sge); + srq->rq.max_sge * sizeof(struct ib_sge); size = attr->max_wr + 1; - wq = vmalloc(size * sz); + wq = vmalloc_user(sizeof(struct ipath_rwq) + size * sz); if (!wq) { ret = -ENOMEM; goto bail; } - spin_lock_irqsave(&srq->rq.lock, flags); - if (srq->rq.head < srq->rq.tail) - n = srq->rq.size + srq->rq.head - srq->rq.tail; + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata && udata->inlen >= sizeof(__u64)) { + __u64 offset_addr; + __u64 offset = (__u64) wq; + + ret = ib_copy_from_udata(&offset_addr, udata, + sizeof(offset_addr)); + if (ret) { + vfree(wq); + goto bail; + } + udata->outbuf = (void __user *) offset_addr; + ret = ib_copy_to_udata(udata, &offset, + sizeof(offset)); + if (ret) { + vfree(wq); + goto bail; + } + } + + spin_lock_irq(&srq->rq.lock); + /* + * validate head pointer value and compute + * the number of remaining WQEs. + */ + owq = srq->rq.wq; + head = owq->head; + if (head >= srq->rq.size) + head = 0; + tail = owq->tail; + if (tail >= srq->rq.size) + tail = 0; + n = head; + if (n < tail) + n += srq->rq.size - tail; else - n = srq->rq.head - srq->rq.tail; - if (size <= n || size <= srq->limit) { - spin_unlock_irqrestore(&srq->rq.lock, flags); + n -= tail; + if (size <= n) { + spin_unlock_irq(&srq->rq.lock); vfree(wq); ret = -EINVAL; goto bail; } n = 0; - p = wq; - while (srq->rq.tail != srq->rq.head) { + p = wq->wq; + while (tail != head) { struct ipath_rwqe *wqe; int i; - wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail); + wqe = get_rwqe_ptr(&srq->rq, tail); p->wr_id = wqe->wr_id; - p->length = wqe->length; p->num_sge = wqe->num_sge; for (i = 0; i < wqe->num_sge; i++) p->sg_list[i] = wqe->sg_list[i]; n++; p = (struct ipath_rwqe *)((char *) p + sz); - if (++srq->rq.tail >= srq->rq.size) - srq->rq.tail = 0; - } - vfree(srq->rq.wq); + if (++tail >= srq->rq.size) + tail = 0; + } srq->rq.wq = wq; srq->rq.size = size; - srq->rq.head = n; - srq->rq.tail = 0; - srq->rq.max_sge = attr->max_sge; - spin_unlock_irqrestore(&srq->rq.lock, flags); - } - - if (attr_mask & IB_SRQ_LIMIT) { - spin_lock_irqsave(&srq->rq.lock, flags); - srq->limit = attr->srq_limit; - spin_unlock_irqrestore(&srq->rq.lock, flags); - } - ret = 0; + wq->head = n; + wq->tail = 0; + if (attr_mask & IB_SRQ_LIMIT) + srq->limit = attr->srq_limit; + spin_unlock_irq(&srq->rq.lock); + + vfree(owq); + + if (srq->ip) { + struct ipath_mmap_info *ip = srq->ip; + struct ipath_ibdev *dev = to_idev(srq->ibsrq.device); + + ip->obj = wq; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + } else if (attr_mask & IB_SRQ_LIMIT) { + spin_lock_irq(&srq->rq.lock); + if (attr->srq_limit >= srq->rq.size) + ret = -EINVAL; + else + srq->limit = attr->srq_limit; + spin_unlock_irq(&srq->rq.lock); + } bail: return ret; diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Mon Aug 14 14:56:07 2006 -0700 @@ -34,7 +34,54 @@ #include #include "ipath_verbs.h" -#include 
"ipath_common.h" +#include "ipath_kernel.h" + +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, + u32 *lengthp, struct ipath_sge_state *ss) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + *lengthp = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + j ? &ss->sg_list[j - 1] : &ss->sge, + &wqe->sg_list[i], IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + *lengthp += wqe->sg_list[i].length; + j++; + } + ss->num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} /** * ipath_ud_loopback - handle send on loopback QPs @@ -46,6 +93,8 @@ * * This is called from ipath_post_ud_send() to forward a WQE addressed * to the same HCA. + * Note that the receive interrupt handler may be calling ipath_ud_rcv() + * while this is being called. */ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, @@ -60,7 +109,11 @@ static void ipath_ud_loopback(struct ipa struct ipath_srq *srq; struct ipath_sge_state rsge; struct ipath_sge *sge; + struct ipath_rwq *wq; struct ipath_rwqe *wqe; + void (*handler)(struct ib_event *, void *); + u32 tail; + u32 rlen; qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn); if (!qp) @@ -94,6 +147,13 @@ static void ipath_ud_loopback(struct ipa wc->imm_data = 0; } + if (wr->num_sge > 1) { + rsge.sg_list = kmalloc((wr->num_sge - 1) * + sizeof(struct ipath_sge), + GFP_ATOMIC); + } else + rsge.sg_list = NULL; + /* * Get the next work request entry to find where to put the data. * Note that it is safe to drop the lock after changing rq->tail @@ -101,37 +161,52 @@ static void ipath_ud_loopback(struct ipa */ if (qp->ibqp.srq) { srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; rq = &srq->rq; } else { srq = NULL; + handler = NULL; rq = &qp->r_rq; } + spin_lock_irqsave(&rq->lock, flags); - if (rq->tail == rq->head) { + wq = rq->wq; + tail = wq->tail; + while (1) { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto bail_sge; + } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + if (init_sge(qp, wqe, &rlen, &rsge)) + break; + wq->tail = tail; + } + /* Silently drop packets which are too big. */ + if (wc->byte_len > rlen) { spin_unlock_irqrestore(&rq->lock, flags); dev->n_pkt_drops++; - goto done; - } - /* Silently drop packets which are too big. */ - wqe = get_rwqe_ptr(rq, rq->tail); - if (wc->byte_len > wqe->length) { - spin_unlock_irqrestore(&rq->lock, flags); - dev->n_pkt_drops++; - goto done; - } + goto bail_sge; + } + wq->tail = tail; wc->wr_id = wqe->wr_id; - rsge.sge = wqe->sg_list[0]; - rsge.sg_list = wqe->sg_list + 1; - rsge.num_sge = wqe->num_sge; - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq && srq->ibsrq.event_handler) { + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { struct ib_event ev; @@ -140,12 +215,12 @@ static void ipath_ud_loopback(struct ipa ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); + handler(&ev, srq->ibsrq.srq_context); } else spin_unlock_irqrestore(&rq->lock, flags); } else spin_unlock_irqrestore(&rq->lock, flags); + ah_attr = &to_iah(wr->wr.ud.ah)->attr; if (ah_attr->ah_flags & IB_AH_GRH) { ipath_copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh)); @@ -186,7 +261,7 @@ static void ipath_ud_loopback(struct ipa wc->src_qp = sqp->ibqp.qp_num; /* XXX do we know which pkey matched? Only needed for GSI. */ wc->pkey_index = 0; - wc->slid = ipath_layer_get_lid(dev->dd) | + wc->slid = dev->dd->ipath_lid | (ah_attr->src_path_bits & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1)); wc->sl = ah_attr->sl; @@ -196,6 +271,8 @@ static void ipath_ud_loopback(struct ipa ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, wr->send_flags & IB_SEND_SOLICITED); +bail_sge: + kfree(rsge.sg_list); done: if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); @@ -433,13 +510,9 @@ void ipath_ud_rcv(struct ipath_ibdev *de int opcode; u32 hdrsize; u32 pad; - unsigned long flags; struct ib_wc wc; u32 qkey; u32 src_qp; - struct ipath_rq *rq; - struct ipath_srq *srq; - struct ipath_rwqe *wqe; u16 dlid; int header_in_data; @@ -547,19 +620,10 @@ void ipath_ud_rcv(struct ipath_ibdev *de /* * Get the next work request entry to find where to put the data. - * Note that it is safe to drop the lock after changing rq->tail - * since ipath_post_receive() won't fill the empty slot. - */ - if (qp->ibqp.srq) { - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - } else { - srq = NULL; - rq = &qp->r_rq; - } - spin_lock_irqsave(&rq->lock, flags); - if (rq->tail == rq->head) { - spin_unlock_irqrestore(&rq->lock, flags); + */ + if (qp->r_reuse_sge) + qp->r_reuse_sge = 0; + else if (!ipath_get_rwqe(qp, 0)) { /* * Count VL15 packets dropped due to no receive buffer. * Otherwise, count them as buffer overruns since usually, @@ -573,39 +637,11 @@ void ipath_ud_rcv(struct ipath_ibdev *de goto bail; } /* Silently drop packets which are too big. 
*/ - wqe = get_rwqe_ptr(rq, rq->tail); - if (wc.byte_len > wqe->length) { - spin_unlock_irqrestore(&rq->lock, flags); + if (wc.byte_len > qp->r_len) { + qp->r_reuse_sge = 1; dev->n_pkt_drops++; goto bail; } - wc.wr_id = wqe->wr_id; - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq && srq->ibsrq.event_handler) { - u32 n; - - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; - else - n = rq->head - rq->tail; - if (n < srq->limit) { - struct ib_event ev; - - srq->limit = 0; - spin_unlock_irqrestore(&rq->lock, flags); - ev.device = qp->ibqp.device; - ev.element.srq = qp->ibqp.srq; - ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); - } else - spin_unlock_irqrestore(&rq->lock, flags); - } else - spin_unlock_irqrestore(&rq->lock, flags); if (has_grh) { ipath_copy_sge(&qp->r_sge, &hdr->u.l.grh, sizeof(struct ib_grh)); @@ -614,6 +650,7 @@ void ipath_ud_rcv(struct ipath_ibdev *de ipath_skip_sge(&qp->r_sge, sizeof(struct ib_grh)); ipath_copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); + wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; wc.vendor_err = 0; diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Mon Aug 14 14:56:07 2006 -0700 @@ -277,11 +277,12 @@ static int ipath_post_receive(struct ib_ struct ib_recv_wr **bad_wr) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_rwq *wq = qp->r_rq.wq; unsigned long flags; int ret; /* Check that state is OK to post receive. */ - if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK) || !wq) { *bad_wr = wr; ret = -EINVAL; goto bail; @@ -290,59 +291,31 @@ static int ipath_post_receive(struct ib_ for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; - - if (wr->num_sge > qp->r_rq.max_sge) { + int i; + + if ((unsigned) wr->num_sge > qp->r_rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&qp->r_rq.lock, flags); - next = qp->r_rq.head + 1; + next = wq->head + 1; if (next >= qp->r_rq.size) next = 0; - if (next == qp->r_rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&qp->r_rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head); + wqe = get_rwqe_ptr(&qp->r_rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(qp->ibqp.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok( - &to_idev(qp->ibqp.device)->lk_table, - &wqe->sg_list[j], &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - qp->r_rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&qp->r_rq.lock, flags); } ret = 0; @@ -1137,6 +1110,7 @@ static void 
*ipath_register_ib_device(in dev->attach_mcast = ipath_multicast_attach; dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; + dev->mmap = ipath_mmap; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s kernel_SMA", system_utsname.nodename); diff -r dcc321d1340a drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Sun Aug 06 19:00:05 2006 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Mon Aug 14 14:56:07 2006 -0700 @@ -38,6 +38,7 @@ #include #include #include +#include #include #include "ipath_layer.h" @@ -50,7 +51,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define IPATH_UVERBS_ABI_VERSION 1 +#define IPATH_UVERBS_ABI_VERSION 2 /* * Define an ib_cq_notify value that is not valid so we know when CQ @@ -178,58 +179,41 @@ struct ipath_ah { }; /* - * Quick description of our CQ/QP locking scheme: - * - * We have one global lock that protects dev->cq/qp_table. Each - * struct ipath_cq/qp also has its own lock. An individual qp lock - * may be taken inside of an individual cq lock. Both cqs attached to - * a qp may be locked, with the send cq locked first. No other - * nesting should be done. - * - * Each struct ipath_cq/qp also has an atomic_t ref count. The - * pointer from the cq/qp_table to the struct counts as one reference. - * This reference also is good for access through the consumer API, so - * modifying the CQ/QP etc doesn't need to take another reference. - * Access because of a completion being polled does need a reference. - * - * Finally, each struct ipath_cq/qp has a wait_queue_head_t for the - * destroy function to sleep on. - * - * This means that access from the consumer API requires nothing but - * taking the struct's lock. - * - * Access because of a completion event should go as follows: - * - lock cq/qp_table and look up struct - * - increment ref count in struct - * - drop cq/qp_table lock - * - lock struct, do your thing, and unlock struct - * - decrement ref count; if zero, wake up waiters - * - * To destroy a CQ/QP, we can do the following: - * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock - * - decrement ref count - * - wait_event until ref count is zero - * - * It is the consumer's responsibilty to make sure that no QP - * operations (WQE posting or state modification) are pending when the - * QP is destroyed. Also, the consumer must make sure that calls to - * qp_modify are serialized. - * - * Possible optimizations (wait for profile data to see if/where we - * have locks bouncing between CPUs): - * - split cq/qp table lock into n separate (cache-aligned) locks, - * indexed (say) by the page in the table - */ - + * This structure is used by ipath_mmap() to validate an offset + * when an mmap() request is made. The vm_area_struct then uses + * this as its vm_private_data. + */ +struct ipath_mmap_info { + struct ipath_mmap_info *next; + struct ib_ucontext *context; + void *obj; + struct kref ref; + unsigned size; + unsigned mmap_cnt; +}; + +/* + * This structure is used to contain the head pointer, tail pointer, + * and completion queue entries as a single memory allocation so + * it can be mmap'ed into user space. + */ +struct ipath_cq_wc { + u32 head; /* index of next entry to fill */ + u32 tail; /* index of next ib_poll_cq() entry */ + struct ib_wc queue[1]; /* this is actually size ibcq.cqe + 1 */ +}; + +/* + * The completion queue structure. 
+ */ struct ipath_cq { struct ib_cq ibcq; struct tasklet_struct comptask; spinlock_t lock; u8 notify; u8 triggered; - u32 head; /* new records added to the head */ - u32 tail; /* poll_cq() reads from here. */ - struct ib_wc *queue; /* this is actually ibcq.cqe + 1 */ + struct ipath_cq_wc *queue; + struct ipath_mmap_info *ip; }; /* @@ -248,28 +232,40 @@ struct ipath_swqe { /* * Receive work request queue entry. - * The size of the sg_list is determined when the QP is created and stored - * in qp->r_max_sge. + * The size of the sg_list is determined when the QP (or SRQ) is created + * and stored in qp->r_rq.max_sge (or srq->rq.max_sge). */ struct ipath_rwqe { u64 wr_id; - u32 length; /* total length of data in sg_list */ u8 num_sge; - struct ipath_sge sg_list[0]; -}; - -struct ipath_rq { - spinlock_t lock; + struct ib_sge sg_list[0]; +}; + +/* + * This structure is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { u32 head; /* new work requests posted to the head */ u32 tail; /* receives pull requests from here. */ + struct ipath_rwqe wq[0]; +}; + +struct ipath_rq { + struct ipath_rwq *wq; + spinlock_t lock; u32 size; /* size of RWQE array */ u8 max_sge; - struct ipath_rwqe *wq; /* RWQE array */ }; struct ipath_srq { struct ib_srq ibsrq; struct ipath_rq rq; + struct ipath_mmap_info *ip; /* send signal when number of RWQEs < limit */ u32 limit; }; @@ -293,6 +289,7 @@ struct ipath_qp { atomic_t refcount; wait_queue_head_t wait; struct tasklet_struct s_task; + struct ipath_mmap_info *ip; struct ipath_sge_state *s_cur_sge; struct ipath_sge_state s_sge; /* current send request data */ /* current RDMA read send data */ @@ -345,7 +342,8 @@ struct ipath_qp { u32 s_ssn; /* SSN of tail entry */ u32 s_lsn; /* limit sequence number (credit) */ struct ipath_swqe *s_wq; /* send work queue */ - struct ipath_rq r_rq; /* receive work queue */ + struct ipath_rq r_rq; /* receive work queue */ + struct ipath_sge r_sg_list[0]; /* verified SGEs */ }; /* @@ -369,15 +367,15 @@ static inline struct ipath_swqe *get_swq /* * Since struct ipath_rwqe is not a fixed size, we can't simply index into - * struct ipath_rq.wq. This function does the array index computation. + * struct ipath_rwq.wq. This function does the array index computation. 
*/ static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, unsigned n) { return (struct ipath_rwqe *) - ((char *) rq->wq + + ((char *) rq->wq->wq + (sizeof(struct ipath_rwqe) + - rq->max_sge * sizeof(struct ipath_sge)) * n); + rq->max_sge * sizeof(struct ib_sge)) * n); } /* @@ -417,6 +415,7 @@ struct ipath_ibdev { struct ib_device ibdev; struct list_head dev_list; struct ipath_devdata *dd; + struct ipath_mmap_info *pending_mmaps; int ib_unit; /* This is the device number */ u16 sm_lid; /* in host order */ u8 sm_sl; @@ -579,7 +578,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp) int ipath_destroy_qp(struct ib_qp *ibqp); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask); + int attr_mask, struct ib_udata *udata); int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); @@ -638,7 +637,8 @@ struct ib_srq *ipath_create_srq(struct i struct ib_udata *udata); int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); int ipath_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr); @@ -680,6 +680,10 @@ int ipath_unmap_fmr(struct list_head *fm int ipath_dealloc_fmr(struct ib_fmr *ibfmr); +void ipath_release_mmap_info(struct kref *ref); + +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); + void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev); void ipath_insert_rnr_queue(struct ipath_qp *qp); From rdreier at cisco.com Tue Aug 15 15:18:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 Aug 2006 15:18:52 -0700 Subject: [openib-general] ib_get_dma_mr and remote access In-Reply-To: <44E24249.20402@hp.com> (Louis Laborde's message of "Tue, 15 Aug 2006 14:53:13 -0700") References: <44E24249.20402@hp.com> Message-ID: Louis> Hi there, I would like to know if any application today Louis> uses ib_get_dma_mr verb with remote access flag(s). Yes, the SRP initiator does. - R. From tom at opengridcomputing.com Tue Aug 15 15:31:51 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 15 Aug 2006 17:31:51 -0500 Subject: [openib-general] ib_get_dma_mr and remote access In-Reply-To: References: <44E24249.20402@hp.com> Message-ID: <1155681111.16709.8.camel@trinity.ogc.int> On Tue, 2006-08-15 at 15:18 -0700, Roland Dreier wrote: > Louis> Hi there, I would like to know if any application today > Louis> uses ib_get_dma_mr verb with remote access flag(s). > > Yes, the SRP initiator does. So does NFSoRDMA... > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From johnt1johnt2 at gmail.com Tue Aug 15 23:01:57 2006 From: johnt1johnt2 at gmail.com (john t) Date: Wed, 16 Aug 2006 11:31:57 +0530 Subject: [openib-general] openIB question Message-ID: Hi, I am new to openIB and have some questions. - A QP has two work queues, send work queue and receive work queue. When host A sends a data buffer to host B using a QP (RC), I want to know if data is copied into the send work queue of host A and then into the receive work queue of host B or is it only the address of data buffer that is pushed into the send work q and receive work q. - I want to develop an application (using uverbs or any such appropriate tool) that uses raw datagrams. 
Basically I want to send an IB packet from host A to host B and receive the entire IB packet (along with headers + data) at host B. Is such a thing supported in openIB? Is there any example which I can look at? regards, John T -------------- next part -------------- An HTML attachment was scrubbed... URL: From krkumar2 at in.ibm.com Tue Aug 15 23:12:35 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Wed, 16 Aug 2006 11:42:35 +0530 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: <1155307600.15374.72.camel@trinity.ogc.int> Message-ID: Hi James, Sorry for the delay, we had a long weekend. > > > My opinion is that the create_qp taking generic parameters is > > > correct, only subsequent calls may need to use transport specific > > > calls/arguments. Infact rdma_create_qp uses the ibv_create_qp (now > > > changed to rdmav_create_qp) call internally. > > > > If you want to have a generic rdmav_create_qp() call, there needs to > > be programmatic way for the API consumer to determine what type of QP > > (iWARP vs. IB) was created. > > > > I don't see any way to do that in your patch: > > I think the QP is associated with the transport type indirectly through > the context. It can be queried with ibv_get_transport_type verb. A > renamed rdma_get_transport type would probably suffice. Correct. Opening the device using rdmav_open_device with argument provided by the ULP will provide the context, which is used by subsequent calls to transparently make use of other calls. Either Steve or I can provide the rdmav_get_transport_type() call to return the actual device (transport) type. > > I like the new approach you are taking (keeping 1 verbs library and > > adding rdmav_ symbol names). This change to transport neutral names is > > long overdue. > > > > When you finish with the userspace APIs, I hope you will update the > > kernel APIs as well. Sure. Thanks, - KK From dotanb at mellanox.co.il Tue Aug 15 23:17:50 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 16 Aug 2006 09:17:50 +0300 Subject: [openib-general] openIB question In-Reply-To: References: Message-ID: <200608160917.50585.dotanb@mellanox.co.il> Hi and welcome John :) On Wednesday 16 August 2006 09:01, john t wrote: > Hi, > > I am new to openIB and have some questions. > > - A QP has two work queues, send work queue and receive work queue. When > host A sends a data buffer to host B using a QP (RC), I want to know if data > is copied into the send work queue of host A and then into the receive work > queue of host B or is it only the address of data buffer that is pushed into > the send work q and receive work q. If you use RC QP and send the data using the SEND opcode, if you'll get a completion (with good status) in the sender, that means that the data was written in the receiver buffer. > > - I want to develop an application (using uverbs or any such appropriate > tool) that uses raw datagrams. Basically I want to send an IB packet from > host A to host B and receive the entire IB packet (along with headers + > data) at host B. Is such a thing supported in openIB? Is there any example > which I can look at? In IB, you won't see the various headers (expect in UD QPs, in which you will get the GRH). 
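A rough illustration of what that means in practice (only a sketch; cq and recv_buf are assumed to come from the surrounding program):

#include <infiniband/verbs.h>

/* Sketch: on a UD QP the first 40 bytes of a posted receive buffer are
 * reserved for the GRH; IBV_WC_GRH in the work completion says whether
 * those bytes actually hold a valid GRH. */
static void *ud_recv_payload(struct ibv_cq *cq, char *recv_buf,
			     struct ibv_grh **grh_out)
{
	struct ibv_wc wc;

	if (ibv_poll_cq(cq, 1, &wc) <= 0 || wc.status != IBV_WC_SUCCESS)
		return NULL;
	*grh_out = (wc.wc_flags & IBV_WC_GRH) ?
		(struct ibv_grh *) recv_buf : NULL;
	return recv_buf + sizeof(struct ibv_grh);  /* payload after 40 bytes */
}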
You can look at the example in the following url: https://openib.org/svn/gen2/trunk/src/userspace/libibverbs/examples/rc_pingpong.c > regards, > John T > Dotan From sweitzen at cisco.com Tue Aug 15 23:25:26 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 15 Aug 2006 23:25:26 -0700 Subject: [openib-general] Cisco SQA results for OFED 1.1rc1 Message-ID: Cisco SQA has started testing DDR, I've enclosed spreadsheets for both OFED test results and MPI performance. All our testing so far has been on RHEL4 U3. We plan to start testing SLES10 this week. Some improvements we see for OFED 1.1rc1 vs 1.0fcs: * SDP latency is the best I've ever seen it, although throughput is still not back to where it was. * SDP stability seems greatly improved compared to OFED 1.0 rc5-fcs. * Intel MPI 2.0.1 seems to be more scalable compared to OFED 1.0. There are some major regressions though: * Open MPI does not work at all on 64-bit RHEL4 (bug 197). * tvflash no longer works on RHEL4 ppc64 (bug 198). I'd like to request we try to also fix the following bugs for OFED 1.1fcs: * 175 (add /proc, /sys, and ideally netstat support for SDP), as it's hard to debug SDP with just strace. * 183 (node_desc does not alway have hostname in it), as we have a customer who depends on this info. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ofed_sqa_results.xls Type: application/vnd.ms-excel Size: 56832 bytes Desc: ofed_sqa_results.xls URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi_perf.xls Type: application/vnd.ms-excel Size: 36352 bytes Desc: mpi_perf.xls URL: From tziporet at mellanox.co.il Tue Aug 15 23:45:32 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 16 Aug 2006 09:45:32 +0300 Subject: [openib-general] debug version In-Reply-To: References: Message-ID: <44E2BF0C.9050800@mellanox.co.il> Di Domenico, Michael wrote: > Is there a way to use the build.sh script that comes with OFED to build > a debug version of all the things listed in the ofed.conf file? > > > > Please see OFED RN (https://openib.org/svn/gen2/branches/1.0/ofed/docs/OFED_release_notes.txt) section 3.2: 3.2 Install Driver in Debug Mode -------------------------------- You can install the driver with debug messages enabled. To do this, enter prior to running install.sh: export OPENIB_PARAMS="--debug" Alternatively, you can place this export command in the file ofed.conf. Tziporet From johnt1johnt2 at gmail.com Wed Aug 16 01:47:21 2006 From: johnt1johnt2 at gmail.com (john t) Date: Wed, 16 Aug 2006 14:17:21 +0530 Subject: [openib-general] openIB question In-Reply-To: <200608160917.50585.dotanb@mellanox.co.il> References: <200608160917.50585.dotanb@mellanox.co.il> Message-ID: Hi Dotan, Thanks for your prompt reply. The example that u specified takes address of data buffer during send and receive and directly copies data into the data buffer of target machine. In my case, the target machine has 2 QPs on two different ports. It receives data from one QP on one port and sends data (same data wihout modification) through other QP on other port. I want to directly pass the data from one QP to other QP without storing it locally. The example code does not seem to support this. 
It first copies data to a local buffer (which is not required in my case) and only then it could send it over other QP. Is there a more efficient way to do this? regards, John T On 8/16/06, Dotan Barak wrote: > > Hi and welcome John > :) > > On Wednesday 16 August 2006 09:01, john t wrote: > > Hi, > > > > I am new to openIB and have some questions. > > > > - A QP has two work queues, send work queue and receive work queue. When > > host A sends a data buffer to host B using a QP (RC), I want to know if > data > > is copied into the send work queue of host A and then into the receive > work > > queue of host B or is it only the address of data buffer that is pushed > into > > the send work q and receive work q. > If you use RC QP and send the data using the SEND opcode, if you'll get a > completion > (with good status) in the sender, that means that the data was written in > the receiver buffer. > > > > - I want to develop an application (using uverbs or any such appropriate > > tool) that uses raw datagrams. Basically I want to send an IB packet > from > > host A to host B and receive the entire IB packet (along with headers + > > data) at host B. Is such a thing supported in openIB? Is there any > example > > which I can look at? > In IB, you won't see the various headers (expect in UD QPs, in which you > will get the GRH). > > You can look at the example in the following url: > > https://openib.org/svn/gen2/trunk/src/userspace/libibverbs/examples/rc_pingpong.c > > > regards, > > John T > > > > > Dotan > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yhkim93 at keti.re.kr Wed Aug 16 02:18:15 2006 From: yhkim93 at keti.re.kr (=?ks_c_5601-1987?B?sei/tciv?=) Date: Wed, 16 Aug 2006 18:18:15 +0900 Subject: [openib-general] infiniband driver problem on AMCC-440SPe yucca board Message-ID: <20060816092035.BBD9A3B0001@sentry-two.sandia.gov> I am developing infiniband controller on AMCC Powerpc 440SPe yucca board. I have cross-compiled kernel-2.6.18-rc2 with infiniband driver. But kernel is panic as show the following error text. What is problem? ============================================================================ =============== ## Booting image at 00200000 ... Image Name: Linux-2.6.18-rc2 Image Type: PowerPC Linux Kernel Image (gzip compressed) Data Size: 1226673 Bytes = 1.2 MB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK Uncompressing Kernel Image ... OK Linux version 2.6.18-rc2 (root at yhkim-devpc) (gcc version 4.0.0) #1 Wed Aug 16 16:04:28 KST 2006 PCIX0 host bridge: resources allocated PCIE0: card present PCIE: SDR0_PLLLCT1 already reset. PCIE initialization OK PCIE0 host bridge: resources allocated Yucca port (Roland Dreier ) Built 1 zonelists. 
Total pages: 131072 Kernel command line: root=/dev/nfs rw nfsroot=192.168.1.1:/tftpboot/yucca/ppc_4xx ip=192.168.1.10:192.168.1.1::255.255.255.0:yucca:eth0:off0PID hash table entries: 4096 (order: 12, 16384 bytes) Dentry cache hash table entries: 65536 (order: 6, 262144 bytes) Inode-cache hash table entries: 32768 (order: 5, 131072 bytes) Memory: 517120k available (1856k kernel code, 600k data, 144k init, 0k highmem) Mount-cache hash table entries: 512 NET: Registered protocol family 16 PCI: Probing PCI hardware NET: Registered protocol family 2 IP route cache hash table entries: 16384 (order: 4, 65536 bytes) TCP established hash table entries: 65536 (order: 6, 262144 bytes) TCP bind hash table entries: 32768 (order: 5, 131072 bytes) TCP: Hash tables configured (established 65536 bind 32768) TCP reno registered io scheduler noop registered io scheduler anticipatory registered (default) io scheduler deadline registered io scheduler cfq registered Generic RTC Driver v1.07 Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled serial8250: ttyS0 at MMIO 0x0 (irq = 0) is a 16550A serial8250: ttyS1 at MMIO 0x0 (irq = 1) is a 16550A serial8250: ttyS2 at MMIO 0x0 (irq = 37) is a 16550A RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize PPC 4xx OCP EMAC driver, version 3.54 mal0: initialized, 1 TX channels, 1 RX channels eth0: emac0, MAC 00:01:73:01:d0:f2 eth0: found CIS8201 Gigabit Ethernet PHY (0x01) IBM IIC driver v2.1 ibm-iic0: using standard (100 kHz) mode ibm-iic1: using standard (100 kHz) mode ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) ib_mthca: Initializing 0001:01:01.0 ib_mthca 0001:01:01.0: HCA FW version 5.0.1 is old (5.1.0 is current). ib_mthca 0001:01:01.0: If you have problems, try updating your HCA FW. kernel BUG in __dma_alloc_coherent at arch/ppc/kernel/dma-mapping.c:233! Oops: Exception in kernel mode, sig: 5 [#1] NIP: C0004A94 LR: C0004A60 CTR: 00000000 REGS: c0721cb0 TRAP: 0700 Not tainted (2.6.18-rc2) MSR: 00029000 CR: 88B04F82 XER: 20000000 TASK = c070ab70[1] 'swapper' THREAD: c0720000 GPR00: 00000001 C0721D60 C070AB70 C067BD40 00000001 0000001F DF5E2FFC 00029000 GPR08: FFFFFFFF 00000000 C0213E04 00000000 28B04F88 00000000 1FFF4900 007FFF93 GPR16: 00000000 00000001 007FFF00 1FFEF2F0 FFFFFFFF C0270000 C0210000 00000000 GPR24: DF5AB260 C021240C FF2FF000 C0721DBC C067BD60 C0720000 C067BD40 00001000 NIP [C0004A94] __dma_alloc_coherent+0x218/0x2e4 LR [C0004A60] __dma_alloc_coherent+0x1e4/0x2e4 Call Trace: [C0721D60] [C00049B8] __dma_alloc_coherent+0x13c/0x2e4 (unreliable) [C0721DA0] [C02672D8] mthca_create_eq+0x338/0x438 [C0721E00] [C026753C] mthca_init_eq_table+0x164/0x6c0 [C0721E40] [C0266B54] mthca_init_one+0x990/0xc60 [C0721E90] [C00F16B4] pci_device_probe+0x7c/0xa0 [C0721EB0] [C01098B0] driver_probe_device+0x60/0x118 [C0721ED0] [C0109AE8] __driver_attach+0xcc/0xf8 [C0721EF0] [C0108D54] bus_for_each_dev+0x54/0x90 [C0721F20] [C0109718] driver_attach+0x24/0x34 [C0721F30] [C010918C] bus_add_driver+0x84/0x13c [C0721F50] [C0109FD0] driver_register+0x70/0xb8 [C0721F60] [C00F1348] __pci_register_driver+0x44/0x54 [C0721F70] [C0265408] mthca_init+0x1c/0x40 [C0721F80] [C0001118] init+0x8c/0x298 [C0721FF0] [C0003F5C] kernel_thread+0x44/0x60 Instruction dump: 3d20c027 816900dc 7c00f050 54003826 7c005a14 901b0000 815d0004 39200000 7d205379 38000000 41820008 38000001 <0f000000> 38000400 7d60f028 7d6b0378 Kernel panic - not syncing: Attempted to kill init! <0>Rebooting in 1 seconds.. 
Korea Electronics Technology Institutes(KETI) Intelligent IT System Research Center Researcher kim, Young Hwan 031-789-7535 -------------- next part -------------- An HTML attachment was scrubbed... URL: From yhkim93 at keti.re.kr Wed Aug 16 02:50:46 2006 From: yhkim93 at keti.re.kr (=?ks_c_5601-1987?B?sei/tciv?=) Date: Wed, 16 Aug 2006 18:50:46 +0900 Subject: [openib-general] DMA allocation problem on AMCC-440SPe yucca board Message-ID: <20060816095131.BB09D3B0008@sentry-two.sandia.gov> I am developing infiniband controller on AMCC Powerpc 440SPe yucca board. I have cross-compiled kernel-2.6.18-rc2 with infiniband driver. But kernel is panic as show the following error text. What is problem? ============================================================================ =============== ## Booting image at 00200000 ... Image Name: Linux-2.6.18-rc2 Image Type: PowerPC Linux Kernel Image (gzip compressed) Data Size: 1226673 Bytes = 1.2 MB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK Uncompressing Kernel Image ... OK Linux version 2.6.18-rc2 (root at yhkim-devpc) (gcc version 4.0.0) #1 Wed Aug 16 16:04:28 KST 2006 PCIX0 host bridge: resources allocated PCIE0: card present PCIE: SDR0_PLLLCT1 already reset. PCIE initialization OK PCIE0 host bridge: resources allocated Yucca port (Roland Dreier ) Built 1 zonelists. Total pages: 131072 Kernel command line: root=/dev/nfs rw nfsroot=192.168.1.1:/tftpboot/yucca/ppc_4xx ip=192.168.1.10:192.168.1.1::255.255.255.0:yucca:eth0:off0PID hash table entries: 4096 (order: 12, 16384 bytes) Dentry cache hash table entries: 65536 (order: 6, 262144 bytes) Inode-cache hash table entries: 32768 (order: 5, 131072 bytes) Memory: 517120k available (1856k kernel code, 600k data, 144k init, 0k highmem) Mount-cache hash table entries: 512 NET: Registered protocol family 16 PCI: Probing PCI hardware NET: Registered protocol family 2 IP route cache hash table entries: 16384 (order: 4, 65536 bytes) TCP established hash table entries: 65536 (order: 6, 262144 bytes) TCP bind hash table entries: 32768 (order: 5, 131072 bytes) TCP: Hash tables configured (established 65536 bind 32768) TCP reno registered io scheduler noop registered io scheduler anticipatory registered (default) io scheduler deadline registered io scheduler cfq registered Generic RTC Driver v1.07 Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled serial8250: ttyS0 at MMIO 0x0 (irq = 0) is a 16550A serial8250: ttyS1 at MMIO 0x0 (irq = 1) is a 16550A serial8250: ttyS2 at MMIO 0x0 (irq = 37) is a 16550A RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize PPC 4xx OCP EMAC driver, version 3.54 mal0: initialized, 1 TX channels, 1 RX channels eth0: emac0, MAC 00:01:73:01:d0:f2 eth0: found CIS8201 Gigabit Ethernet PHY (0x01) IBM IIC driver v2.1 ibm-iic0: using standard (100 kHz) mode ibm-iic1: using standard (100 kHz) mode ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) ib_mthca: Initializing 0001:01:01.0 ib_mthca 0001:01:01.0: HCA FW version 5.0.1 is old (5.1.0 is current). ib_mthca 0001:01:01.0: If you have problems, try updating your HCA FW. kernel BUG in __dma_alloc_coherent at arch/ppc/kernel/dma-mapping.c:233! 
Oops: Exception in kernel mode, sig: 5 [#1] NIP: C0004A94 LR: C0004A60 CTR: 00000000 REGS: c0721cb0 TRAP: 0700 Not tainted (2.6.18-rc2) MSR: 00029000 CR: 88B04F82 XER: 20000000 TASK = c070ab70[1] 'swapper' THREAD: c0720000 GPR00: 00000001 C0721D60 C070AB70 C067BD40 00000001 0000001F DF5E2FFC 00029000 GPR08: FFFFFFFF 00000000 C0213E04 00000000 28B04F88 00000000 1FFF4900 007FFF93 GPR16: 00000000 00000001 007FFF00 1FFEF2F0 FFFFFFFF C0270000 C0210000 00000000 GPR24: DF5AB260 C021240C FF2FF000 C0721DBC C067BD60 C0720000 C067BD40 00001000 NIP [C0004A94] __dma_alloc_coherent+0x218/0x2e4 LR [C0004A60] __dma_alloc_coherent+0x1e4/0x2e4 Call Trace: [C0721D60] [C00049B8] __dma_alloc_coherent+0x13c/0x2e4 (unreliable) [C0721DA0] [C02672D8] mthca_create_eq+0x338/0x438 [C0721E00] [C026753C] mthca_init_eq_table+0x164/0x6c0 [C0721E40] [C0266B54] mthca_init_one+0x990/0xc60 [C0721E90] [C00F16B4] pci_device_probe+0x7c/0xa0 [C0721EB0] [C01098B0] driver_probe_device+0x60/0x118 [C0721ED0] [C0109AE8] __driver_attach+0xcc/0xf8 [C0721EF0] [C0108D54] bus_for_each_dev+0x54/0x90 [C0721F20] [C0109718] driver_attach+0x24/0x34 [C0721F30] [C010918C] bus_add_driver+0x84/0x13c [C0721F50] [C0109FD0] driver_register+0x70/0xb8 [C0721F60] [C00F1348] __pci_register_driver+0x44/0x54 [C0721F70] [C0265408] mthca_init+0x1c/0x40 [C0721F80] [C0001118] init+0x8c/0x298 [C0721FF0] [C0003F5C] kernel_thread+0x44/0x60 Instruction dump: 3d20c027 816900dc 7c00f050 54003826 7c005a14 901b0000 815d0004 39200000 7d205379 38000000 41820008 38000001 <0f000000> 38000400 7d60f028 7d6b0378 Kernel panic - not syncing: Attempted to kill init! <0>Rebooting in 1 seconds.. Korea Electronics Technology Institutes(KETI) Intelligent IT System Research Center Researcher kim, Young Hwan 031-789-7535 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Wed Aug 16 04:06:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Aug 2006 14:06:28 +0300 Subject: [openib-general] madeye backport patch for OFED 1.1 In-Reply-To: <20060815194648.GX24920@sashak.voltaire.com> References: <20060815194648.GX24920@sashak.voltaire.com> Message-ID: <20060816110628.GA2863@mellanox.co.il> Quoting r. Sasha Khapyorsky : > > > git-checkout -b madeye-for-ofed v2.6.18-rc4 > > > git-applymbox /path/to/hals_message > > > > Yea, will do. > > Thanks. By the way, I think that longer-term we really want to be able to let userspace expose a madeye-like functionality, snooping on all packets. It seems this could be a trivial extension to the umad module - any objections to this approach? -- MST From eitan at mellanox.co.il Wed Aug 16 04:34:25 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 16 Aug 2006 14:34:25 +0300 Subject: [openib-general] madeye backport patch for OFED 1.1 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302B99BD2@mtlexch01.mtl.com> Hi Michael, > Quoting r. Sasha Khapyorsky : > > > > git-checkout -b madeye-for-ofed v2.6.18-rc4 git-applymbox > > > > /path/to/hals_message > > > > > > Yea, will do. > > > > Thanks. > > By the way, I think that longer-term we really want to be able to let userspace > expose a madeye-like functionality, snooping on all packets. > > It seems this could be a trivial extension to the umad module - any objections > to this approach? [EZ] As long as one can control user access to that interface using different access control then regular umad. 
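For reference, madeye gets its view of the MAD stream through the kernel MAD layer's snoop registration, which is what a umad extension would have to plumb through to userspace. A rough sketch of that in-kernel hook follows; the function and flag names are recalled from ib_mad.h and the madeye module and should be double-checked there, and the empty handlers are placeholders rather than madeye's actual decode logic.

    #include <linux/err.h>
    #include <rdma/ib_mad.h>

    /* Called for MADs flowing through the core layer on this port,
     * in addition to normal delivery to the registered agents. */
    static void snoop_send(struct ib_mad_agent *agent,
                           struct ib_mad_send_buf *send_buf,
                           struct ib_mad_send_wc *send_wc)
    {
            /* decode/log the outgoing MAD here */
    }

    static void snoop_recv(struct ib_mad_agent *agent,
                           struct ib_mad_recv_wc *recv_wc)
    {
            /* decode/log the incoming MAD here */
    }

    static struct ib_mad_agent *snoop_agent;

    static int start_snoop(struct ib_device *dev, u8 port)
    {
            snoop_agent = ib_register_mad_snoop(dev, port, IB_QPT_GSI,
                                                IB_MAD_SNOOP_SEND_COMPLETIONS |
                                                IB_MAD_SNOOP_RECVS,
                                                snoop_send, snoop_recv, NULL);
            return IS_ERR(snoop_agent) ? PTR_ERR(snoop_agent) : 0;
    }

Exposing something equivalent through the umad character device is the part that does not exist yet; the access control Eitan asks for would apply to that new interface.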
> > -- > MST > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Wed Aug 16 05:13:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Aug 2006 15:13:32 +0300 Subject: [openib-general] madeye backport patch for OFED 1.1 In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302B99BD2@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302B99BD2@mtlexch01.mtl.com> Message-ID: <20060816121332.GB2863@mellanox.co.il> Quoting r. Eitan Zahavi : > > By the way, I think that longer-term we really want to be able to let > > userspace expose a madeye-like functionality, snooping on all packets. > > > > It seems this could be a trivial extension to the umad module - any > > objections to this approach? > > [EZ] As long as one can control user access to that interface using different > access control then regular umad. Sure, we don't want to break the ABI. -- MST From thomas.bub at thomson.net Wed Aug 16 06:49:18 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Wed, 16 Aug 2006 15:49:18 +0200 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM Message-ID: Sean, Thanks for the quick reply. Some errors got resolved some new added. Some of the errors but not all are caused by the picky gcc 3.4.4 that I could avoid by some casting. I left them all in to give you the whole picture. Here is the actual output: cmpost.c:65: error: field `path_rec' has incomplete type cmpost.c: In function `int modify_to_rtr(cmtest_node*)': cmpost.c:130: error: invalid conversion from `int' to `ibv_qp_attr_mask' cmpost.c:130: error: initializing argument 3 of `int ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' cmpost.c:142: error: invalid conversion from `int' to `ibv_qp_attr_mask' cmpost.c:142: error: initializing argument 3 of `int ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' cmpost.c: In function `int modify_to_rts(cmtest_node*)': cmpost.c:161: error: invalid conversion from `int' to `ibv_qp_attr_mask' cmpost.c:161: error: initializing argument 3 of `int ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' cmpost.c: In function `void cm_handler(ib_cm_id*, ib_cm_event*)': cmpost.c:267: error: invalid conversion from `void*' to `cmtest_node*' cmpost.c: In function `int create_nodes()': cmpost.c:353: error: invalid conversion from `void*' to `cmtest_node*' cmpost.c: In function `int query_for_path(char*)': cmpost.c:631: error: `RDMA_PS_TCP' undeclared (first use this function) cmpost.c:631: error: (Each undeclared identifier is reported only once for each function it appears in.) cmpost.c:657: error: 'struct cmtest' has no member named 'path_rec' cmpost.c:657: error: invalid use of undefined type `struct ibv_sa_path_rec' /vob/pkgs/sys/linux-include/rdma/rdma_cma.h:84: error: forward declaration of `struct ibv_sa_path_rec' cmpost.c: In function `void run_client(char*)': cmpost.c:678: error: 'struct cmtest' has no member named 'path_rec' *** Error code 1 clearmake: Error: Build script failed for "cmpost.o" Thomas -----Original Message----- From: Sean Hefty [mailto:sean.hefty at intel.com] Sent: Tuesday, August 15, 2006 6:11 PM To: Bub Thomas; openib-general at openib.org Cc: Erez Cohen Subject: [PATCH] cmpost: allow cmpost to build with latest RDMA CM Can you see if this patch lets you build compost? 
Signed-off-by: Sean Hefty --- Index: examples/cmpost.c =================================================================== --- examples/cmpost.c (revision 8215) +++ examples/cmpost.c (working copy) @@ -614,6 +614,7 @@ out: static int query_for_path(char *dst) { + struct rdma_event_channel *channel; struct rdma_cm_id *id; struct sockaddr_in addr_in; struct rdma_cm_event *event; @@ -623,15 +624,19 @@ static int query_for_path(char *dst) if (ret) return ret; - ret = rdma_create_id(&id, NULL); + channel = rdma_create_event_channel(); + if (!channel) + return -1; + + ret = rdma_create_id(channel, &id, NULL, RDMA_PS_TCP); if (ret) - return ret; + goto destroy_channel; ret = rdma_resolve_addr(id, NULL, (struct sockaddr *) &addr_in, 2000); if (ret) goto out; - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (!ret && event->event != RDMA_CM_EVENT_ADDR_RESOLVED) ret = event->status; rdma_ack_cm_event(event); @@ -642,7 +647,7 @@ static int query_for_path(char *dst) if (ret) goto out; - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (!ret && event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) ret = event->status; rdma_ack_cm_event(event); @@ -652,6 +657,8 @@ static int query_for_path(char *dst) test.path_rec = id->route.path_rec[0]; out: rdma_destroy_id(id); +destroy_channel: + rdma_destroy_event_channel(channel); return ret; } From swise at opengridcomputing.com Wed Aug 16 06:59:56 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 16 Aug 2006 08:59:56 -0500 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <000001c6c0b3$6c167d50$5f248686@amr.corp.intel.com> References: <000001c6c0b3$6c167d50$5f248686@amr.corp.intel.com> Message-ID: <1155736796.30529.7.camel@stevo-desktop> On Tue, 2006-08-15 at 14:40 -0700, Sean Hefty wrote: > Then, as Roland said, in order to support a QP using multiple multicast groups, > we need all QKeys to be the same (for the RDMA CM). If we want to do anything > with port numbers, we probably need to fold those into the MGID. > > - Sean Ok, (now that I've crawled back out of my qkey rathole ;-), what's the plan for fixing this? I suggest we get it working with a single qkey and not worry about trying to support the UDP/IP/mcast port concept: Proposal: - UD QPs created via librdma will have the same qkey - mcast groups created via librdma will use this same qkey as well - qkey value will not have the high order bit set Sound good? From pw at osc.edu Wed Aug 16 07:15:37 2006 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 16 Aug 2006 10:15:37 -0400 Subject: [openib-general] [PATCH] make "prefix" the default where to find libraries. In-Reply-To: <20060811145620.7e9a762d.weiny2@llnl.gov> References: <20060811145620.7e9a762d.weiny2@llnl.gov> Message-ID: <20060816141537.GA10902@osc.edu> weiny2 at llnl.gov wrote on Fri, 11 Aug 2006 14:56 -0700: > I have been using this patch to allow me to install somewhere other than > /usr/local. Without this the dependancies do not work out right and here at > LLNL /usr/local is a NFS mounted volume. Not good for building and testing on > a single node. > > While I am not a configure expert I think this would make things easier for > building the trunk. You might find it easier to set these values in your environment rather than editing all the configure files. 
For example, prefix=/usr/local/openib-r8682 export CFLAGS="-g -I$prefix/include" export CXXFLAGS=$CFLAGS export LDFLAGS=-L$prefix/lib Then run autogen.sh, configure --prefix=$prefix, make, make install in each dir as usual. -- Pete From Thomas.Talpey at netapp.com Wed Aug 16 07:16:27 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 16 Aug 2006 10:16:27 -0400 Subject: [openib-general] ib_get_dma_mr and remote access In-Reply-To: <44E24249.20402@hp.com> References: <44E24249.20402@hp.com> Message-ID: <7.0.1.0.2.20060816100841.08057ba8@netapp.com> At 05:53 PM 8/15/2006, Louis Laborde wrote: >Hi there, > >I would like to know if any application today uses ib_get_dma_mr verb with >remote access flag(s). The NFS/RDMA client does this, if configured to do so. Otherwise, it registers specific byte regions when remote access is required. The client supports numerous memory registration strategies, to suit user requirements and HCA/RNIC limitations. >It seems to me that such a dependency could first, create a security hole >and second, make this verb hard to implement for some RNICs. Yes, and yes. >If only local access is required for this "special" memory region, can >it be implemented with the "Reserved LKey" or "STag0", whichever way it's >called? Sure, and I expect many consumers would be fine with this. Note however that iWARP RDMA Read requires remote write access to be granted on the destination sge's, unlike IB RDMA Read, which requires only local. Tom. From zhushisongzhu at yahoo.com Wed Aug 16 07:33:57 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Wed, 16 Aug 2006 07:33:57 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060815094104.GA15917@mellanox.co.il> Message-ID: <20060816143357.48475.qmail@web36904.mail.mud.yahoo.com> I have changed SDP_RX_SIZE from 0x40 to 1 and rebuilt ib_sdp.ko. But kernel always crashed. zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > Subject: why sdp connections cost so much memory > > > > (1) ibv_devinfo > > HCA: MHES18-XTC > > FW: 1.1.0 > > OFED: OFED-1.1-rc1 > > (2) Test Bed > > On Client: > > ib0: 193.12.10.24 > > test command: > > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so > > SIMPLE_LIBSDP=1 ab -c m -n m -X 193.12.10.14:3129 > > http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > > The web page is about 68K. > > On Server: > > ib0: 193.12.10.14 > > squid.sdp -d 10 -f squid2.conf (I have changed > > squid-cache to support listening on SDP port 3129) > > > > The test result is : > > Concurrent Conns(=m) Free Memory Requests > > completed > > 0 926980 0 > > 100 712508 100 > > 200 497372 200 > > 300 282636 256 > > 400 52868 256 > > 500 kernel crashed because > of > > "out of memory" > > > > >From above, every about 100 concurrent SDP > connections > > will cost 210M memory. It's too vast for large > scale > > applications. TCP costs very lower memory than > SDP. > > The max concurrent connections completed > successfully > > is 256. it is some bad limit. Who knows how and > when > > will solve the problem? > > I'll test the performance of sdp connection and > > compare it with TCP further. > > tks > > zhu > > Most memory in SDP goes into pre-posted receive > buffers. > Currently SDP pre-posts a fixed 64 32K buffers per > connection, that is > 2M per connection. > > To verify that's the issue, try opening > drivers/infiniband/ulp/sdp/sdp.h > and changing SDP_RX_SIZE from 0x40 to a smaller > value. 
> If this helps, as a quick work-around I can make > this value > globally configurable. > > TCP on the other hand scales down more gracefully, > and so should > SDP longer-term. > > > --- openib-general-request at openib.org wrote: > > > > > Send openib-general mailing list submissions to > > > openib-general at openib.org > > > > > > To subscribe or unsubscribe via the World Wide > Web, > > > visit > > > > http://openib.org/mailman/listinfo/openib-general > > > or, via email, send a message with subject or > body > > > 'help' to > > > openib-general-request at openib.org > > > > > > You can reach the person managing the list at > > > openib-general-owner at openib.org > > > > > > When replying, please edit your Subject line so > it > > > is more specific > > > than "Re: Contents of openib-general digest..." > > Is this relevant somehow? > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From halr at voltaire.com Wed Aug 16 08:10:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Aug 2006 11:10:21 -0400 Subject: [openib-general] [PATCH] [MINOR] OpenSM/osm_sm_mad_ctrl.c: Properly handle status based on whether direct routed or LID routed SMP Message-ID: <1155741019.9855.689.camel@hal.voltaire.com> OpenSM/osm_sm_mad_ctrl.c: Properly handle status based on whether direct routed or LID routed SMP Signed-off-by: Hal Rosenstock Index: opensm/osm_sm_mad_ctrl.c =================================================================== --- opensm/osm_sm_mad_ctrl.c (revision 8934) +++ opensm/osm_sm_mad_ctrl.c (working copy) @@ -34,7 +34,6 @@ * $Id$ */ - /* * Abstract: * Implementation of osm_sm_mad_ctrl_t. @@ -253,12 +252,15 @@ __osm_sm_mad_ctrl_process_get_resp( p_smp = osm_madw_get_smp_ptr( p_madw ); - if( !ib_smp_is_d( p_smp ) ) + if( p_smp->mgmt_class == IB_MCLASS_SUBN_DIR ) { - osm_log( p_ctrl->p_log, OSM_LOG_ERROR, - "__osm_sm_mad_ctrl_process_get_resp: ERR 3102: " - "'D' bit not set in returned SMP\n" ); - osm_dump_dr_smp( p_ctrl->p_log, p_smp, OSM_LOG_ERROR ); + if( !ib_smp_is_d( p_smp ) ) + { + osm_log( p_ctrl->p_log, OSM_LOG_ERROR, + "__osm_sm_mad_ctrl_process_get_resp: ERR 3102: " + "'D' bit not set in returned SMP\n" ); + osm_dump_dr_smp( p_ctrl->p_log, p_smp, OSM_LOG_ERROR ); + } } p_old_madw = (osm_madw_t*)transaction_context; @@ -667,6 +669,7 @@ __osm_sm_mad_ctrl_rcv_callback( { osm_sm_mad_ctrl_t* p_ctrl = (osm_sm_mad_ctrl_t*)bind_context; ib_smp_t* p_smp; + ib_net16_t status; OSM_LOG_ENTER( p_ctrl->p_log, __osm_sm_mad_ctrl_rcv_callback ); @@ -717,11 +720,20 @@ __osm_sm_mad_ctrl_rcv_callback( if( osm_log_is_active( p_ctrl->p_log, OSM_LOG_FRAMES ) ) osm_dump_dr_smp( p_ctrl->p_log, p_smp, OSM_LOG_FRAMES ); - if( ib_smp_get_status( p_smp ) != 0 ) + if( p_smp->mgmt_class == IB_MCLASS_SUBN_DIR ) + { + status = ib_smp_get_status( p_smp ); + } + else + { + status = p_smp->status; + } + + if( status != 0 ) { osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sm_mad_ctrl_rcv_callback: ERR 3111: " - "Error status = 0x%X\n", ib_smp_get_status( p_smp ) ); + "Error status = 0x%X\n", status ); osm_dump_dr_smp( p_ctrl->p_log, p_smp, OSM_LOG_ERROR ); } From mshefty at ichips.intel.com Wed Aug 16 09:20:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Aug 2006 09:20:33 -0700 Subject: [openib-general] [Fwd: RE: IB mcast question] In-Reply-To: <1155736796.30529.7.camel@stevo-desktop> References: <000001c6c0b3$6c167d50$5f248686@amr.corp.intel.com> 
<1155736796.30529.7.camel@stevo-desktop> Message-ID: <44E345D1.1010306@ichips.intel.com> Steve Wise wrote: > - UD QPs created via librdma will have the same qkey > - mcast groups created via librdma will use this same qkey as well > - qkey value will not have the high order bit set This is what I would do as well. - Sean From mshefty at ichips.intel.com Wed Aug 16 09:30:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Aug 2006 09:30:59 -0700 Subject: [openib-general] openIB question In-Reply-To: References: <200608160917.50585.dotanb@mellanox.co.il> Message-ID: <44E34843.3030105@ichips.intel.com> john t wrote: > The example code does not seem to support this. It first copies data to > a local buffer (which is not required in my case) and only then it could > send it over other QP. Is there a more efficient way to do this? Maybe I'm not following you here, but the data should go directly from the wire into the receive buffer on the first QP. If you post a send that references that same buffer, the data should go directly into the receive buffer on the second QP. - Sean From mshefty at ichips.intel.com Wed Aug 16 09:33:05 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Aug 2006 09:33:05 -0700 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM In-Reply-To: References: Message-ID: <44E348C1.7060004@ichips.intel.com> Bub Thomas wrote: > cmpost.c:65: error: field `path_rec' has incomplete type > cmpost.c: In function `int modify_to_rtr(cmtest_node*)': > cmpost.c:130: error: invalid conversion from `int' to `ibv_qp_attr_mask' > cmpost.c:130: error: initializing argument 3 of `int > ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' > cmpost.c:142: error: invalid conversion from `int' to `ibv_qp_attr_mask' > cmpost.c:142: error: initializing argument 3 of `int > ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' > cmpost.c: In function `int modify_to_rts(cmtest_node*)': > cmpost.c:161: error: invalid conversion from `int' to `ibv_qp_attr_mask' > cmpost.c:161: error: initializing argument 3 of `int > ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' > cmpost.c: In function `void cm_handler(ib_cm_id*, ib_cm_event*)': > cmpost.c:267: error: invalid conversion from `void*' to `cmtest_node*' > cmpost.c: In function `int create_nodes()': > cmpost.c:353: error: invalid conversion from `void*' to `cmtest_node*' > cmpost.c: In function `int query_for_path(char*)': > cmpost.c:631: error: `RDMA_PS_TCP' undeclared (first use this function) > cmpost.c:631: error: (Each undeclared identifier is reported only once > for each function it appears in.) > cmpost.c:657: error: 'struct cmtest' has no member named 'path_rec' > cmpost.c:657: error: invalid use of undefined type `struct > ibv_sa_path_rec' > /vob/pkgs/sys/linux-include/rdma/rdma_cma.h:84: error: forward > declaration of `struct ibv_sa_path_rec' > cmpost.c: In function `void run_client(char*)': > cmpost.c:678: error: 'struct cmtest' has no member named 'path_rec' > *** Error code 1 > clearmake: Error: Build script failed for "cmpost.o" Some of these look like missing include files. Are libibverbs and librdmacm installed? 
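Going back to the zero-copy question in the "openIB question" exchange above: Sean's point is that the receive buffer filled by the first QP can be handed, untouched, to the second QP as the send buffer. A minimal libibverbs sketch follows; it assumes both QPs were created on the same PD, that "mr" registered the buffer with IBV_ACCESS_LOCAL_WRITE, and the helper names are illustrative only, not code from the thread.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Post a receive on the inbound QP; wr_id carries the buffer address. */
    static int post_relay_recv(struct ibv_qp *in, struct ibv_mr *mr,
                               void *buf, uint32_t len)
    {
            struct ibv_sge sge = {
                    .addr   = (uintptr_t) buf,
                    .length = len,
                    .lkey   = mr->lkey,
            };
            struct ibv_recv_wr wr = { .wr_id = (uintptr_t) buf,
                                      .sg_list = &sge, .num_sge = 1 };
            struct ibv_recv_wr *bad;

            return ibv_post_recv(in, &wr, &bad);
    }

    /* On a receive completion from the inbound CQ, post the same buffer
     * as a send on the outbound QP -- no memcpy in between. */
    static int relay_forward(struct ibv_qp *out, struct ibv_mr *mr,
                             struct ibv_wc *wc)
    {
            struct ibv_sge sge = {
                    .addr   = wc->wr_id,
                    .length = wc->byte_len,
                    .lkey   = mr->lkey,
            };
            struct ibv_send_wr wr = { .wr_id = wc->wr_id, .sg_list = &sge,
                                      .num_sge = 1, .opcode = IBV_WR_SEND,
                                      .send_flags = IBV_SEND_SIGNALED };
            struct ibv_send_wr *bad;

            return ibv_post_send(out, &wr, &bad);
    }

Note the buffer cannot be reposted as a receive until its send completion comes back, so in practice a ring of such buffers is needed.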
- Sean From rdreier at cisco.com Wed Aug 16 09:43:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 09:43:46 -0700 Subject: [openib-general] [PATCH] IB/uverbs: Fix lockdep warning when QP is created with 2 CQs Message-ID: Arjan, here's a case that disproves your rule of thumb that rwsems are equivalent to mutexes as far as correctness goes: you can't have an AB-BA deadlock with nested rwsems when using down_read(). In other words, the following: down_read(&lock_1); down_read(&lock_2); down_read(&lock_2); down_read(&lock_1); is perfectly safe, but it cannot be converted to mutex_lock(&lock_1); mutex_lock(&lock_2); mutex_lock(&lock_2); mutex_lock(&lock_1); But this is a pretty small corner case I guess... --- Lockdep warns when userspace creates a QP that uses different CQs for send completions and receive completions, because both CQs are locked and their mutexes belong to the same lock class. However, we know that the mutexes are distinct and the nesting is safe (there is no possibility of AB-BA deadlock because the mutexes are locked with down_read()), so annotate the situation with SINGLE_DEPTH_NESTING to get rid of the lockdep warning. Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index b81307b..8b6df7c 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -155,7 +155,7 @@ static struct ib_uobject *__idr_get_uobj } static struct ib_uobject *idr_read_uobj(struct idr *idr, int id, - struct ib_ucontext *context) + struct ib_ucontext *context, int nested) { struct ib_uobject *uobj; @@ -163,7 +163,10 @@ static struct ib_uobject *idr_read_uobj( if (!uobj) return NULL; - down_read(&uobj->mutex); + if (nested) + down_read_nested(&uobj->mutex, SINGLE_DEPTH_NESTING); + else + down_read(&uobj->mutex); if (!uobj->live) { put_uobj_read(uobj); return NULL; @@ -190,17 +193,18 @@ static struct ib_uobject *idr_write_uobj return uobj; } -static void *idr_read_obj(struct idr *idr, int id, struct ib_ucontext *context) +static void *idr_read_obj(struct idr *idr, int id, struct ib_ucontext *context, + int nested) { struct ib_uobject *uobj; - uobj = idr_read_uobj(idr, id, context); + uobj = idr_read_uobj(idr, id, context, nested); return uobj ? 
uobj->object : NULL; } static struct ib_pd *idr_read_pd(int pd_handle, struct ib_ucontext *context) { - return idr_read_obj(&ib_uverbs_pd_idr, pd_handle, context); + return idr_read_obj(&ib_uverbs_pd_idr, pd_handle, context, 0); } static void put_pd_read(struct ib_pd *pd) @@ -208,9 +212,9 @@ static void put_pd_read(struct ib_pd *pd put_uobj_read(pd->uobject); } -static struct ib_cq *idr_read_cq(int cq_handle, struct ib_ucontext *context) +static struct ib_cq *idr_read_cq(int cq_handle, struct ib_ucontext *context, int nested) { - return idr_read_obj(&ib_uverbs_cq_idr, cq_handle, context); + return idr_read_obj(&ib_uverbs_cq_idr, cq_handle, context, nested); } static void put_cq_read(struct ib_cq *cq) @@ -220,7 +224,7 @@ static void put_cq_read(struct ib_cq *cq static struct ib_ah *idr_read_ah(int ah_handle, struct ib_ucontext *context) { - return idr_read_obj(&ib_uverbs_ah_idr, ah_handle, context); + return idr_read_obj(&ib_uverbs_ah_idr, ah_handle, context, 0); } static void put_ah_read(struct ib_ah *ah) @@ -230,7 +234,7 @@ static void put_ah_read(struct ib_ah *ah static struct ib_qp *idr_read_qp(int qp_handle, struct ib_ucontext *context) { - return idr_read_obj(&ib_uverbs_qp_idr, qp_handle, context); + return idr_read_obj(&ib_uverbs_qp_idr, qp_handle, context, 0); } static void put_qp_read(struct ib_qp *qp) @@ -240,7 +244,7 @@ static void put_qp_read(struct ib_qp *qp static struct ib_srq *idr_read_srq(int srq_handle, struct ib_ucontext *context) { - return idr_read_obj(&ib_uverbs_srq_idr, srq_handle, context); + return idr_read_obj(&ib_uverbs_srq_idr, srq_handle, context, 0); } static void put_srq_read(struct ib_srq *srq) @@ -867,7 +871,7 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - cq = idr_read_cq(cmd.cq_handle, file->ucontext); + cq = idr_read_cq(cmd.cq_handle, file->ucontext, 0); if (!cq) return -EINVAL; @@ -914,7 +918,7 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver goto out_wc; } - cq = idr_read_cq(cmd.cq_handle, file->ucontext); + cq = idr_read_cq(cmd.cq_handle, file->ucontext, 0); if (!cq) { ret = -EINVAL; goto out; @@ -962,7 +966,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - cq = idr_read_cq(cmd.cq_handle, file->ucontext); + cq = idr_read_cq(cmd.cq_handle, file->ucontext, 0); if (!cq) return -EINVAL; @@ -1060,9 +1064,9 @@ ssize_t ib_uverbs_create_qp(struct ib_uv srq = cmd.is_srq ? idr_read_srq(cmd.srq_handle, file->ucontext) : NULL; pd = idr_read_pd(cmd.pd_handle, file->ucontext); - scq = idr_read_cq(cmd.send_cq_handle, file->ucontext); + scq = idr_read_cq(cmd.send_cq_handle, file->ucontext, 0); rcq = cmd.recv_cq_handle == cmd.send_cq_handle ? - scq : idr_read_cq(cmd.recv_cq_handle, file->ucontext); + scq : idr_read_cq(cmd.recv_cq_handle, file->ucontext, 1); if (!pd || !scq || !rcq || (cmd.is_srq && !srq)) { ret = -EINVAL; From rdreier at cisco.com Wed Aug 16 10:07:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 10:07:59 -0700 Subject: [openib-general] [PATCH] include change In-Reply-To: (James Lentini's message of "Mon, 14 Aug 2006 11:32:21 -0400 (EDT)") References: Message-ID: Thanks, applied and queued for 2.6.19 From rdreier at cisco.com Wed Aug 16 09:55:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 09:55:24 -0700 Subject: [openib-general] [PATCH] IB/core: fix SM LID/LID change with client reregister set In-Reply-To: <20060815142050.GE15917@mellanox.co.il> (Michael S. 
Tsirkin's message of "Tue, 15 Aug 2006 17:20:50 +0300") References: <20060815142050.GE15917@mellanox.co.il> Message-ID: Thanks, applied and queued for 2.6.18 From rdreier at cisco.com Wed Aug 16 09:49:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 09:49:50 -0700 Subject: [openib-general] RFC: [PATCH untested] IB/uverbs: optimize registration for huge pages In-Reply-To: <20060815211319.GE22363@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 16 Aug 2006 00:13:19 +0300") References: <20060815211319.GE22363@mellanox.co.il> Message-ID: Michael> How does this look? Is this the intended usage? Looks OK, although I would like to see test results too. However I wonder what get_user_pages() does with a huge page region -- does it give back huge pages or does it waste a lot of effort creating 4KB pages to cover the region? Also, this > + u32 mask = 0; and this > + mem->page_size = ffs(mask) ? 1 << (ffs(mask) - 1) : (1 << 31); makes me think that maybe we should change mem->page_size to an unsigned long, since it's quite possible that a memory region spans more than 2GB physically contiguous (especially if support for GB pages on amd64 is added). - R. From arjan at infradead.org Wed Aug 16 10:11:23 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Wed, 16 Aug 2006 19:11:23 +0200 Subject: [openib-general] [PATCH] IB/uverbs: Fix lockdep warning when QP is created with 2 CQs In-Reply-To: References: Message-ID: <1155748283.3023.76.camel@laptopd505.fenrus.org> On Wed, 2006-08-16 at 09:43 -0700, Roland Dreier wrote: > Arjan, here's a case that disproves your rule of thumb that rwsems > are equivalent to mutexes as far as correctness goes: you can't have > an AB-BA deadlock with nested rwsems when using down_read(). In other > words, the following: > > down_read(&lock_1); > down_read(&lock_2); > down_read(&lock_2); > down_read(&lock_1); > > is perfectly safe it's safe as long as you never ever do a down_write nested inside or outside a down_read of any of these locks.... -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com From mst at mellanox.co.il Wed Aug 16 10:15:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Aug 2006 20:15:21 +0300 Subject: [openib-general] RFC: [PATCH untested] IB/uverbs: optimize registration for huge pages In-Reply-To: References: Message-ID: <20060816171521.GA5566@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: [PATCH untested] IB/uverbs: optimize registration for huge pages > > Michael> How does this look? Is this the intended usage? > > Looks OK, although I would like to see test results too. However I > wonder what get_user_pages() does with a huge page region -- does it > give back huge pages or does it waste a lot of effort creating 4KB > pages to cover the region? It creates 4KB pages. > Also, this > > > + u32 mask = 0; > > and this > > > + mem->page_size = ffs(mask) ? 1 << (ffs(mask) - 1) : (1 << 31); > > makes me think that maybe we should change mem->page_size to an > unsigned long, since it's quite possible that a memory region spans > more than 2GB physically contiguous (especially if support for GB > pages on amd64 is added). 
We can, although I don't know how practical it is, and ffs only works on int :( -- MST From rdreier at cisco.com Wed Aug 16 10:12:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 10:12:56 -0700 Subject: [openib-general] [PATCHv2] IB/srp: add port/device attributes In-Reply-To: <20060815143452.GF15917@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 Aug 2006 17:34:52 +0300") References: <20060815143452.GF15917@mellanox.co.il> Message-ID: Thanks, applied and queued for 2.6.19 From halr at voltaire.com Wed Aug 16 10:31:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Aug 2006 13:31:47 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <44E1CBE6.2090406@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> <1155303417.4507.15009.camel@hal.voltaire.com> <44DF25A2.30405@mellanox.co.il> <20060814150824.11b28c05.weiny2@llnl.gov> <44E1CBE6.2090406@mellanox.co.il> Message-ID: <1155749506.9855.4518.camel@hal.voltaire.com> On Tue, 2006-08-15 at 09:28, Vladimir Sokolovsky wrote: > The OFED-1.1-rc1 source tar ball (openib-1.1.tgz ) created by build_ofed.sh script (from https://openib.org/svn/gen2/branches/1.1/ofed/build) > > build_ofed.sh script takes userspace libraries/binaries after executing: > > autogen.sh > configure > make dist > > Therefor, autogen.sh is not a part of it and also it is the reason > that you see Makefiles there. Why shouldn't they be included ? Weren't they included in OFED 1.0 ? This makes life difficult for those who want to rebuild based on the OFED released sources. -- Hal > Regards, > Vladimir > > > > Ira Weiny wrote: > > Why is the OFED 1.1-rc1 source tar ball missing files when compared with the 1.1 branch? > > > > Of specific question is the absence of autogen.sh in libibverbs. > > > > Ira > > > > On Sun, 13 Aug 2006 16:14:10 +0300 > > "Tziporet Koren" wrote: > > > > > >> Hal Rosenstock wrote: > >> > >>>> Target release date: 12-Sep > >>>> > >>>> Intermediate milestones: > >>>> 1. Create 1.1 branch of user level: 27-Jul - done > >>>> 2. RC1: 8-Aug - done > >>>> 3. Feature freeze (RC2): 17-Aug > >>>> > >>>> > >>> What is the start build date for RC2 ? When do developers need to have > >>> their code in by to make RC2 ? > >>> > >>> > >> We will start on Tue 15-Aug. Is this OK with you? > >> > >>> > >>> > >>> > >>>> 4. Code freeze (rc-x): 6-Sep > >>>> > >>>> > >>> Is this 1 or 2 RCs beyond RC2 in order to make this ? > >>> > >>> > >>> > >> I hope one but I guess it will be two more RCs. > >> > >> Tziporet > >> > >> _______________________________________________ > >> openib-general mailing list > >> openib-general at openib.org > >> http://openib.org/mailman/listinfo/openib-general > >> > >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >> > >> > > > > _______________________________________________ > > openfabrics-ewg mailing list > > openfabrics-ewg at openib.org > > http://openib.org/mailman/listinfo/openfabrics-ewg > > > > > > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > From ralphc at pathscale.com Wed Aug 16 10:38:03 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 16 Aug 2006 10:38:03 -0700 Subject: [openib-general] FYI - on vacation Message-ID: <1155749883.20325.545.camel@brick.pathscale.com> I plan to be on vacation starting this evening and back on Sunday night. 
If anyone has further comments or questions about the ipath patches, I will answer when I get back to work on Monday. I will be around until about 3:30 today if there are any comments. From mst at mellanox.co.il Wed Aug 16 10:47:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Aug 2006 20:47:29 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <1155749506.9855.4518.camel@hal.voltaire.com> References: <1155749506.9855.4518.camel@hal.voltaire.com> Message-ID: <20060816174729.GA5796@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: [openfabrics-ewg] OFED 1.1-rc1 is available > > On Tue, 2006-08-15 at 09:28, Vladimir Sokolovsky wrote: > > The OFED-1.1-rc1 source tar ball (openib-1.1.tgz ) created by build_ofed.sh script (from https://openib.org/svn/gen2/branches/1.1/ofed/build) > > > > build_ofed.sh script takes userspace libraries/binaries after executing: > > > > autogen.sh > > configure > > make dist > > > > Therefor, autogen.sh is not a part of it and also it is the reason > > that you see Makefiles there. > > Why shouldn't they be included ? Weren't they included in OFED 1.0 ? I think that's the way autotools work. You are *supposed* to do make dist before redistributing the package. > This makes life difficult for those who want to rebuild based on the > OFED released sources. Seems to be the reverse is true - you just run configure/make/make install, no need to play with autotools which are also broken in interesting ways on many platforms. And of course there's always the git/svn branch if you want the pristine sources. -- MST From rdreier at cisco.com Wed Aug 16 10:52:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 10:52:55 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <1155749506.9855.4518.camel@hal.voltaire.com> (Hal Rosenstock's message of "16 Aug 2006 13:31:47 -0400") References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> <1155303417.4507.15009.camel@hal.voltaire.com> <44DF25A2.30405@mellanox.co.il> <20060814150824.11b28c05.weiny2@llnl.gov> <44E1CBE6.2090406@mellanox.co.il> <1155749506.9855.4518.camel@hal.voltaire.com> Message-ID: Hal> Why shouldn't they be included ? Weren't they included in Hal> OFED 1.0 ? Hal> This makes life difficult for those who want to rebuild based Hal> on the OFED released sources. Why? Distributing tarballs made with "make dist" seems like the sanest thing to me. - R. From vlad at mellanox.co.il Wed Aug 16 10:57:44 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 16 Aug 2006 20:57:44 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available In-Reply-To: <1155749506.9855.4518.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA767F@mtlexch01.mtl.com> <1155303417.4507.15009.camel@hal.voltaire.com> <44DF25A2.30405@mellanox.co.il> <20060814150824.11b28c05.weiny2@llnl.gov> <44E1CBE6.2090406@mellanox.co.il> <1155749506.9855.4518.camel@hal.voltaire.com> Message-ID: <44E35C98.2090006@mellanox.co.il> autogen.sh and some other autotools files are not included because they are not listed in the corresponding Makefile.am (EXTRA_DIST variable). They weren't included in OFED-1.0 also. 
If openib sources will be taken as is (without running 'make dist') then it will require from users to install updated autotools : (autoconf-2.59, automake-1.9.6, libtool-1.5.20, m4-1.4.4) Regards, Vladimir Hal Rosenstock wrote: > On Tue, 2006-08-15 at 09:28, Vladimir Sokolovsky wrote: > >> The OFED-1.1-rc1 source tar ball (openib-1.1.tgz ) created by build_ofed.sh script (from https://openib.org/svn/gen2/branches/1.1/ofed/build) >> >> build_ofed.sh script takes userspace libraries/binaries after executing: >> >> autogen.sh >> configure >> make dist >> >> Therefor, autogen.sh is not a part of it and also it is the reason >> that you see Makefiles there. >> > > Why shouldn't they be included ? Weren't they included in OFED 1.0 ? > > This makes life difficult for those who want to rebuild based on the > OFED released sources. > > -- Hal > > >> Regards, >> Vladimir >> >> >> >> Ira Weiny wrote: >> >>> Why is the OFED 1.1-rc1 source tar ball missing files when compared with the 1.1 branch? >>> >>> Of specific question is the absence of autogen.sh in libibverbs. >>> >>> Ira >>> >>> On Sun, 13 Aug 2006 16:14:10 +0300 >>> "Tziporet Koren" wrote: >>> >>> >>> >>>> Hal Rosenstock wrote: >>>> >>>> >>>>>> Target release date: 12-Sep >>>>>> >>>>>> Intermediate milestones: >>>>>> 1. Create 1.1 branch of user level: 27-Jul - done >>>>>> 2. RC1: 8-Aug - done >>>>>> 3. Feature freeze (RC2): 17-Aug >>>>>> >>>>>> >>>>>> >>>>> What is the start build date for RC2 ? When do developers need to have >>>>> their code in by to make RC2 ? >>>>> >>>>> >>>>> >>>> We will start on Tue 15-Aug. Is this OK with you? >>>> >>>> >>>>> >>>>> >>>>> >>>>> >>>>>> 4. Code freeze (rc-x): 6-Sep >>>>>> >>>>>> >>>>>> >>>>> Is this 1 or 2 RCs beyond RC2 in order to make this ? >>>>> >>>>> >>>>> >>>>> >>>> I hope one but I guess it will be two more RCs. >>>> >>>> Tziporet >>>> >>>> _______________________________________________ >>>> openib-general mailing list >>>> openib-general at openib.org >>>> http://openib.org/mailman/listinfo/openib-general >>>> >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>> >>>> >>>> >>> _______________________________________________ >>> openfabrics-ewg mailing list >>> openfabrics-ewg at openib.org >>> http://openib.org/mailman/listinfo/openfabrics-ewg >>> >>> >>> >> _______________________________________________ >> openfabrics-ewg mailing list >> openfabrics-ewg at openib.org >> http://openib.org/mailman/listinfo/openfabrics-ewg >> >> > > From robert.j.woodruff at intel.com Wed Aug 16 11:02:03 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 16 Aug 2006 11:02:03 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED Message-ID: Betsy wrote, >OK, that's good news - we'll definitely go for getting the changes to >you before then. -Betsy >On Wed, 2006-08-16 at 17:02 +0300, Tziporet Koren wrote: >> > >> RC2 will be available only on Monday Aug-21. If we can get the patches >> on Sunday we can include them in RC2. >> >> Tziporet Could you guys also update the svn trunk with the latest driver and provide backport patches for redhat 2.6.9-EL kernels. I have people that would like to try to use your cards but cannot get the driver that is in SVN to work. Thanks in advance woody From mst at mellanox.co.il Wed Aug 16 11:33:29 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 16 Aug 2006 21:33:29 +0300 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: References: Message-ID: <20060816183329.GB5796@mellanox.co.il> Quoting r. Woodruff, Robert J : > Subject: Re: [openfabrics-ewg] Rollup patch for ipath and OFED > > Betsy wrote, > >OK, that's good news - we'll definitely go for getting the changes to > >you before then. -Betsy > > >On Wed, 2006-08-16 at 17:02 +0300, Tziporet Koren wrote: > >> > > >> RC2 will be available only on Monday Aug-21. If we can get the > >> patches on Sunday we can include them in RC2. > >> > >> Tziporet > > Could you guys also update the svn trunk with the latest driver > and provide backport patches for redhat 2.6.9-EL kernels. We plan to do that eventually, although I would be happier if all kernel developers finally moved over to use git. > I have people that would like to try to use your cards but > cannot get the driver that is in SVN to work. > > Thanks in advance > > woody Why can't they take the code and patches from OFED git tree, or OFED RC tarballs? svn is really bleeding edge, I don't think it's right for first-time users. -- MST From mst at mellanox.co.il Wed Aug 16 11:49:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Aug 2006 21:49:10 +0300 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: <20060816183329.GB5796@mellanox.co.il> References: <20060816183329.GB5796@mellanox.co.il> Message-ID: <20060816184910.GD5796@mellanox.co.il> Quoting r. Michael S. Tsirkin : > > Could you guys also update the svn trunk with the latest driver > > and provide backport patches for redhat 2.6.9-EL kernels. > > We plan to do that eventually Woops, only now noticed that this was wrt the ipath driver, not mthca as I thought. Of course I didn't mean it - I don't edit ipath code in SVN, pathscale guys do that. Sorry. -- MST From mst at mellanox.co.il Wed Aug 16 11:55:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Aug 2006 21:55:21 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060816143357.48475.qmail@web36904.mail.mud.yahoo.com> References: <20060816143357.48475.qmail@web36904.mail.mud.yahoo.com> Message-ID: <20060816185521.GB6336@mellanox.co.il> Quoting r. zhu shi song : > Subject: Re: why sdp connections cost so much memory > > I have changed SDP_RX_SIZE from 0x40 to 1 and rebuilt > ib_sdp.ko. But kernel always crashed. Weird. How about this patch: diff --git a/drivers/infiniband/ulp/sdp/sdp_bcopy.c b/drivers/infiniband/ulp/sdp/sdp_bcopy.c index c35a4da..2fd79a0 100644 --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c @@ -234,7 +234,7 @@ void sdp_post_recvs(struct sdp_sock *ssk while ((likely(ssk->rx_head - ssk->rx_tail < SDP_RX_SIZE) && (ssk->rx_head - ssk->rx_tail - SDP_MIN_BUFS) * SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE + rmem < - ssk->isk.sk.sk_rcvbuf * 0x10) || + ssk->isk.sk.sk_rcvbuf) || unlikely(ssk->rx_head - ssk->rx_tail < SDP_MIN_BUFS)) sdp_post_recv(ssk); } -- MST From mst at mellanox.co.il Wed Aug 16 13:06:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Aug 2006 23:06:53 +0300 Subject: [openib-general] mem->page_size (was Re: RFC: [PATCH untested] IB/uverbs: optimize registration for huge pages) In-Reply-To: References: Message-ID: <20060816200653.GA6534@mellanox.co.il> Quoting r. Roland Dreier : > maybe we should change mem->page_size to an unsigned long 2^64 still won't fit though. 
Maybe better to change mem->page_size to mem->page_shift? Want a patch like that? -- MST From rdreier at cisco.com Wed Aug 16 13:14:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 13:14:09 -0700 Subject: [openib-general] mem->page_size In-Reply-To: <20060816200653.GA6534@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 16 Aug 2006 23:06:53 +0300") References: <20060816200653.GA6534@mellanox.co.il> Message-ID: Michael> 2^64 still won't fit though. Maybe better to change Michael> mem->page_size to mem->page_shift? Want a patch like Michael> that? I'm not sure it's worth going that far -- it's not very likely that anyone will be running on a machine with > 2^64 bytes of memory any time soon, let alone registering 2^64 bytes of physically contiguous memory in a memory region. But on the other hand, if we're going to touch that code, there's no reason to impose an arbitrary limit. So I guess converting to page_shift does make sense. - R. From pw at osc.edu Wed Aug 16 13:51:04 2006 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 16 Aug 2006 16:51:04 -0400 Subject: [openib-general] return error when rdma_listen fails In-Reply-To: <44DA4EA1.7000902@ichips.intel.com> References: <20060808181928.GA15075@osc.edu> <44DA4EA1.7000902@ichips.intel.com> Message-ID: <20060816205104.GA13031@osc.edu> mshefty at ichips.intel.com wrote on Wed, 09 Aug 2006 14:07 -0700: > Pete Wyckoff wrote: > >Calling rdma_listen() on a cm_id bound to INADDR_ANY can fail, e.g. > >with EADDRINUSE, but report no error back to the user. This patch > >fixes that by propagating the error. Success occurs only if at > >least one of the possibly multiple devices in the system was able to > >listen. In the case of multiple devices reporting errors on listen, > >only the first error value is returned. iwarp branch. > > There's a problem if the listen is done before any devices have been added > to the system. In this case, the listen should succeed. That would be easy enough to fix for that one special case. But tell me: 1) When a device gets added to the system, is there code that applies existing INADDR_ANY listens to the new device? Where? 2) If so, then I would expect you to extend your logic to argue against my patch in its entirety. There's always the chance that even if no device could satisfy the listen immediately, one might be added in the future. This all strikes me as rather unexpected from an application-writer point of view. Tom raised some good points. What I want in my application is that rdma_listen should cause at least one device to listen. I do not want the program to wait for a device to appear. Should I grab the device list and bind to a particular source IP rather than trying to use INADDR_ANY? By the way, shouldn't the rdma_bind_addr call that preceeded rdma_listen have failed when I tried to bind to INADDR_ANY with a specified port, but that port was already in use by a device? This could be just another failure to keep the amso NIC state consistent with the host state. 
-- Pete From rdreier at cisco.com Wed Aug 16 13:56:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 13:56:38 -0700 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <55611.71.131.54.100.1155575677.squirrel@rocky.pathscale.com> (ralphc@pathscale.com's message of "Mon, 14 Aug 2006 10:14:37 -0700 (PDT)") References: <1155333276.20325.422.camel@brick.pathscale.com> <55611.71.131.54.100.1155575677.squirrel@rocky.pathscale.com> Message-ID: ralphc> A #define won't help the plug-in know what parameters to ralphc> pass, only a function name change will work if the ralphc> semantics change. I don't follow: #ifdef OLD_FUNCTION_HAS_NEW_PARAMS old_function(new_params); #else old_function(old_params); #endif - R. From ralphc at pathscale.com Wed Aug 16 14:10:42 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 16 Aug 2006 14:10:42 -0700 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: References: <1155333276.20325.422.camel@brick.pathscale.com> <55611.71.131.54.100.1155575677.squirrel@rocky.pathscale.com> Message-ID: <1155762642.20325.555.camel@brick.pathscale.com> On Wed, 2006-08-16 at 13:56 -0700, Roland Dreier wrote: > ralphc> A #define won't help the plug-in know what parameters to > ralphc> pass, only a function name change will work if the > ralphc> semantics change. > > I don't follow: > > #ifdef OLD_FUNCTION_HAS_NEW_PARAMS > old_function(new_params); > #else > old_function(old_params); > #endif > > - R. I thought your goal was to be able to have one binary libipathverbs or libmthca which could be dlopen()'ed by libibverbs.so.1.0 or libibverbs.so.2.0. The dlopen()'ed library needs to dynamically check which libibverbs opened it and call old_function() with the appropriate arguments. If it is compiled in, it will be wrong for the other case. Was I misunderstanding your goal? The above works if you are only trying to have one source code which can be compiled to work with either version of libibverbs. From rdreier at cisco.com Wed Aug 16 14:18:01 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 Aug 2006 14:18:01 -0700 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155762642.20325.555.camel@brick.pathscale.com> (Ralph Campbell's message of "Wed, 16 Aug 2006 14:10:42 -0700") References: <1155333276.20325.422.camel@brick.pathscale.com> <55611.71.131.54.100.1155575677.squirrel@rocky.pathscale.com> <1155762642.20325.555.camel@brick.pathscale.com> Message-ID: Ralph> Was I misunderstanding your goal? The above works if you Ralph> are only trying to have one source code which can be Ralph> compiled to work with either version of libibverbs. Yes, that's right. I just want to be able to build against either version of libibverbs. There are too many differences in structure layouts etc for things to work at runtime. - R. From ralphc at pathscale.com Wed Aug 16 14:28:02 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 16 Aug 2006 14:28:02 -0700 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: References: <1155333276.20325.422.camel@brick.pathscale.com> <55611.71.131.54.100.1155575677.squirrel@rocky.pathscale.com> <1155762642.20325.555.camel@brick.pathscale.com> Message-ID: <1155763683.20325.557.camel@brick.pathscale.com> On Wed, 2006-08-16 at 14:18 -0700, Roland Dreier wrote: > Ralph> Was I misunderstanding your goal? 
The above works if you > Ralph> are only trying to have one source code which can be > Ralph> compiled to work with either version of libibverbs. > > Yes, that's right. I just want to be able to build against either > version of libibverbs. There are too many differences in structure > layouts etc for things to work at runtime. > > - R. OK. I guess we are in agreement. A #define is fine with me. From mshefty at ichips.intel.com Wed Aug 16 14:29:00 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Aug 2006 14:29:00 -0700 Subject: [openib-general] return error when rdma_listen fails In-Reply-To: <20060816205104.GA13031@osc.edu> References: <20060808181928.GA15075@osc.edu> <44DA4EA1.7000902@ichips.intel.com> <20060816205104.GA13031@osc.edu> Message-ID: <44E38E1C.1060805@ichips.intel.com> Pete Wyckoff wrote: > 1) When a device gets added to the system, is there code that applies > existing INADDR_ANY listens to the new device? Where? Yes - see cma_add_one() where cma_listen_on_dev() is called. > By the way, shouldn't the rdma_bind_addr call that preceeded > rdma_listen have failed when I tried to bind to INADDR_ANY with a > specified port, but that port was already in use by a device? This > could be just another failure to keep the amso NIC state consistent > with the host state. My expectation is that an iWarp device would fail the bind. In general, I do not believe that the correct approach is for the RDMA CM to use the same port space as TCP. Underlying transports should map addresses in whatever way works best for them. For IB, the IP addresses are mapped to GIDs and ports to service IDs. iWarp should map into the correct port space or fail the binding. - Sean From ardavis at ichips.intel.com Wed Aug 16 14:42:36 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 16 Aug 2006 14:42:36 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects Message-ID: <44E3914C.4010903@ichips.intel.com> We are running into connection reject issues (IB_CM_REJ_STALE_CONN) with our application under heavy load and lots of connections. We occassionally get a reject based on the QP being in timewait state leftover from a prior connection. It appears that the CM keeps track of the QP's in timewait state on both sides of the connection, independently of the verbs layer, even after the QP has been destroyed at the verbs level. I can actually create a new QP via verbs and it could still be on the CM timewait queue waiting for the timer to pop and be removed. If this is the case, my attempts to connect using this QP will fail with a reject. How can a consumer know for sure that the new QP will not be in a timewait state according to the CM? Does it make sense to push the timewait functionality down into verbs? If not, is there a way for the CM to hold a reference to the QP until the timewait expires? -arlin From sashak at voltaire.com Wed Aug 16 15:24:54 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 17 Aug 2006 01:24:54 +0300 Subject: [openib-general] madeye backport patch for OFED 1.1 In-Reply-To: <20060816110628.GA2863@mellanox.co.il> References: <20060815194648.GX24920@sashak.voltaire.com> <20060816110628.GA2863@mellanox.co.il> Message-ID: <20060816222454.GD18411@sashak.voltaire.com> Hi Michael, On 14:06 Wed 16 Aug , Michael S. Tsirkin wrote: > Quoting r. Sasha Khapyorsky : > > > > git-checkout -b madeye-for-ofed v2.6.18-rc4 > > > > git-applymbox /path/to/hals_message > > > > > > Yea, will do. > > > > Thanks. 
> > By the way, I think that longer-term we really want to be > able to let userspace expose a madeye-like functionality, > snooping on all packets. Maybe, but then instead of simple logger we may find us developing full-featured sniffer app... :) Sasha > It seems this could be a trivial extension to the umad module - > any objections to this approach? > > -- > MST From halr at voltaire.com Wed Aug 16 15:43:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Aug 2006 18:43:34 -0400 Subject: [openib-general] madeye backport patch for OFED 1.1 In-Reply-To: <20060816222454.GD18411@sashak.voltaire.com> References: <20060815194648.GX24920@sashak.voltaire.com> <20060816110628.GA2863@mellanox.co.il> <20060816222454.GD18411@sashak.voltaire.com> Message-ID: <1155768213.9855.13329.camel@hal.voltaire.com> On Wed, 2006-08-16 at 18:24, Sasha Khapyorsky wrote: > Hi Michael, > > On 14:06 Wed 16 Aug , Michael S. Tsirkin wrote: > > Quoting r. Sasha Khapyorsky : > > > > > git-checkout -b madeye-for-ofed v2.6.18-rc4 > > > > > git-applymbox /path/to/hals_message > > > > > > > > Yea, will do. > > > > > > Thanks. > > > > By the way, I think that longer-term we really want to be > > able to let userspace expose a madeye-like functionality, > > snooping on all packets. > > Maybe, but then instead of simple logger we may find us developing > full-featured sniffer app... :) Plug IB decode into ethereal ? -- Hal > Sasha > > > It seems this could be a trivial extension to the umad module - > > any objections to this approach? > > > > -- > > MST From tom at opengridcomputing.com Wed Aug 16 16:18:02 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 16 Aug 2006 18:18:02 -0500 Subject: [openib-general] return error when rdma_listen fails In-Reply-To: <44E38E1C.1060805@ichips.intel.com> References: <20060808181928.GA15075@osc.edu> <44DA4EA1.7000902@ichips.intel.com> <20060816205104.GA13031@osc.edu> <44E38E1C.1060805@ichips.intel.com> Message-ID: <1155770282.10135.52.camel@trinity.ogc.int> On Wed, 2006-08-16 at 14:29 -0700, Sean Hefty wrote: > Pete Wyckoff wrote: > > 1) When a device gets added to the system, is there code that applies > > existing INADDR_ANY listens to the new device? Where? > > Yes - see cma_add_one() where cma_listen_on_dev() is called. > > > By the way, shouldn't the rdma_bind_addr call that preceeded > > rdma_listen have failed when I tried to bind to INADDR_ANY with a > > specified port, but that port was already in use by a device? This > > could be just another failure to keep the amso NIC state consistent > > with the host state. > > My expectation is that an iWarp device would fail the bind. > > In general, I do not believe that the correct approach is for the RDMA CM to use > the same port space as TCP. Underlying transports should map addresses in > whatever way works best for them. For IB, the IP addresses are mapped to GIDs > and ports to service IDs. iWarp should map into the correct port space or fail > the binding. I think this makes sense for IB, however, for TCP based transports, we should share the port space with TCP. The reason is that most RNIC's are "converged nic" implementations. That is, the RDMA side and non-rdma side both share the same IP address (Ammasso had mac addresses which made it look like two physical ports to the native stack). In the current model, if an application does a listen over the native stack and then does an rdma_listen, BOTH will succeed, however, after the rdma_listen, the native side will never see the connection requests! 
This is clearly busted. Back to the IB side, the port space management services that I have in mind would allow you to create your own IB port space, SDP port space, etc... that is separate from TCP, just like UDP and TCP are different. Plus it would handle reuseaddr consistently, integrate with netstat, etc... Even if we never do it for IB, I think we need it for iWARP...
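To make the failure mode above concrete, here is a minimal sketch (not taken from any patch in this thread; the port number is arbitrary and error handling is omitted) that performs a native TCP listen() and an rdma_listen() in RDMA_PS_TCP on the same port. On a converged NIC both calls can succeed, yet only the RDMA CM listener will see the offloaded connection requests:

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <rdma/rdma_cma.h>

int main(void)
{
	struct rdma_event_channel *ch;
	struct rdma_cm_id *id;
	struct sockaddr_in sin;
	int fd;

	memset(&sin, 0, sizeof sin);
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = INADDR_ANY;
	sin.sin_port = htons(7174);	/* arbitrary example port */

	/* native TCP listener */
	fd = socket(AF_INET, SOCK_STREAM, 0);
	bind(fd, (struct sockaddr *) &sin, sizeof sin);
	listen(fd, 1);

	/* RDMA CM listener on the same port in the RDMA_PS_TCP port space */
	ch = rdma_create_event_channel();
	rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
	rdma_bind_addr(id, (struct sockaddr *) &sin);
	rdma_listen(id, 1);

	/* at this point both listens are active on port 7174 */
	printf("TCP and RDMA CM listens both succeeded\n");
	return 0;
}

With a shared port space of the kind discussed here, the second of the two listens would be expected to fail with an address-in-use error instead of silently shadowing the other.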
> > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Wed Aug 16 16:33:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Aug 2006 16:33:44 -0700 Subject: [openib-general] return error when rdma_listen fails In-Reply-To: <1155770282.10135.52.camel@trinity.ogc.int> References: <20060808181928.GA15075@osc.edu> <44DA4EA1.7000902@ichips.intel.com> <20060816205104.GA13031@osc.edu> <44E38E1C.1060805@ichips.intel.com> <1155770282.10135.52.camel@trinity.ogc.int> Message-ID: <44E3AB58.8070109@ichips.intel.com> Tom Tucker wrote: > I think this makes sense for IB, however, for TCP based transports, we > should share the port space with TCP. My view is that the iWarp transport needs to provide the mapping from an RDMA_PS_TCP to the actual TCP port space, RDMA_PS_UDP to UDP, etc. This is a function that should be part of the transport specific code, and not the general RDMA CM code. How the RDMA CM handles a failure when one transport fails an operation, while another one succeeds is unclear. - Sean From mshefty at ichips.intel.com Wed Aug 16 16:41:24 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Aug 2006 16:41:24 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E3914C.4010903@ichips.intel.com> References: <44E3914C.4010903@ichips.intel.com> Message-ID: <44E3AD24.4070200@ichips.intel.com> Arlin Davis wrote: > How can a consumer know for sure that the new QP will not be in a > timewait state according to the CM? Given that the QP may have been in use by another process, I don't think that there's any way for the new owner to know. > Does it make sense to push the timewait functionality down into verbs? This may be a clean way of handling the issue, but... see below. > If not, is there a way for the > CM to hold a reference to the QP until the timewait expires? For userspace QPs, the CM doesn't have access to the QP, so some sort of special call into verbs would be needed. Even if we pushed timewait handling under verbs, a user could always get a QP that the remote side thinks is connected. The original connection could fail to disconnect because of lost DREQs. So, locally, the QP could have exited timewait, while the remote side still thinks that it's connected. - Sean From sashak at voltaire.com Wed Aug 16 17:41:29 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 17 Aug 2006 03:41:29 +0300 Subject: [openib-general] madeye backport patch for OFED 1.1 In-Reply-To: <1155768213.9855.13329.camel@hal.voltaire.com> References: <20060815194648.GX24920@sashak.voltaire.com> <20060816110628.GA2863@mellanox.co.il> <20060816222454.GD18411@sashak.voltaire.com> <1155768213.9855.13329.camel@hal.voltaire.com> Message-ID: <1155775289.3379.2.camel@localhost> On Wed, 2006-08-16 at 18:43 -0400, Hal Rosenstock wrote: > On Wed, 2006-08-16 at 18:24, Sasha Khapyorsky wrote: > > Hi Michael, > > > > On 14:06 Wed 16 Aug , Michael S. Tsirkin wrote: > > > Quoting r. Sasha Khapyorsky : > > > > > > git-checkout -b madeye-for-ofed v2.6.18-rc4 > > > > > > git-applymbox /path/to/hals_message > > > > > > > > > > Yea, will do. > > > > > > > > Thanks. 
> > > > > > By the way, I think that longer-term we really want to be > > > able to let userspace expose a madeye-like functionality, > > > snooping on all packets. > > > > Maybe, but then instead of simple logger we may find us developing > > full-featured sniffer app... :) > > Plug IB decode into ethereal ? Yes, like this. (You see - you are starting already :)) Sasha > > -- Hal > > > Sasha > > > > > It seems this could be a trivial extension to the umad module - > > > any objections to this approach? > > > > > > -- > > > MST > From surs at cse.ohio-state.edu Wed Aug 16 19:07:58 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 16 Aug 2006 21:07:58 -0500 Subject: [openib-general] vendor_id field in device attributes Message-ID: <44E3CF7E.20405@cse.ohio-state.edu> Hi, I have a quick question. If I use ibv_query_device() to find out the IB device properties, does the `vendor_id' field correspond to a unique HCA vendor? For example, I get the value 713 for Mellanox HCAs. Can I expect this to remain the same across various Gen2 installations? Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From bchang at atipa.com Wed Aug 16 21:43:20 2006 From: bchang at atipa.com (Brady Chang) Date: Wed, 16 Aug 2006 23:43:20 -0500 Subject: [openib-general] ib traffic Message-ID: Hi all, I'm using open mpi with ofed 1.0. I want to see the traffic over IB but not sure if there is a simple utility to display bytes received and bytes transmitted. thanks -Brady -------------- next part -------------- An HTML attachment was scrubbed... URL: From bpradip at in.ibm.com Wed Aug 16 22:30:22 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Thu, 17 Aug 2006 11:00:22 +0530 Subject: [openib-general] [PATCH] perftest: enhancement to rdma_lat to allow use of RDMA CM Message-ID: <20060817053013.GA16205@harry-potter.in.ibm.com> Hi Michael, This patch contains changes to the rdma_lat.c to allow use of RDMA CM. This has been successfully tested with Ammasso iWARP cards, IBM eHCA and mthca IB cards. Summary of changes # Added an option (-c|--cma) to enable use of RDMA CM # Added a new structure (struct pp_data) containing the user parameters as well as other data required by most of the routines. This makes it convenient to pass the parameters between various routines. # Outputs to stdout/stderr are prefixed with the process-id. 
This helps to sort the output when multiple servers/clients are run from the same machine Signed-off-by: Pradipta Kumar Banerjee --- Index: perftest/rdma_lat.c ============================================================================= --- rdma_lat.c.org 2006-08-07 16:12:18.000000000 +0530 +++ rdma_lat.c 2006-08-16 16:09:45.000000000 +0530 @@ -53,12 +53,14 @@ #include #include +#include #include "get_clock.h" #define PINGPONG_RDMA_WRID 3 static int page_size; +static pid_t pid; struct report_options { int unsorted; @@ -71,15 +73,16 @@ struct pingpong_context { struct ibv_context *context; struct ibv_pd *pd; struct ibv_mr *mr; - struct ibv_cq *cq; + struct ibv_cq *rcq; + struct ibv_cq *scq; struct ibv_qp *qp; void *buf; volatile char *post_buf; volatile char *poll_buf; int size; int tx_depth; - struct ibv_sge list; - struct ibv_send_wr wr; + struct ibv_sge list; + struct ibv_send_wr wr; }; struct pingpong_dest { @@ -90,6 +93,30 @@ struct pingpong_dest { unsigned long long vaddr; }; +struct pp_data { + int port; + int ib_port; + unsigned size; + int tx_depth; + int use_cma; + int sockfd; + char *servername; + struct pingpong_dest my_dest; + struct pingpong_dest *rem_dest; + struct ibv_device *ib_dev; + struct rdma_event_channel *cm_channel; + struct rdma_cm_id *cm_id; + +}; + +static void pp_post_recv(struct pingpong_context *); +static void pp_wait_for_done(struct pingpong_context *); +static void pp_send_done(struct pingpong_context *); +static void pp_wait_for_start(struct pingpong_context *); +static void pp_send_start(struct pingpong_context *); +static void pp_close_cma(struct pp_data ); +static struct pingpong_context *pp_init_ctx(void *, struct pp_data *); + static uint16_t pp_get_local_lid(struct pingpong_context *ctx, int port) { @@ -166,7 +193,7 @@ static int pp_read_keys(int sockfd, cons return 0; } -static int pp_client_connect(const char *servername, int port) +static struct pingpong_context *pp_client_connect(struct pp_data *data) { struct addrinfo *res, *t; struct addrinfo hints = { @@ -176,44 +203,156 @@ static int pp_client_connect(const char char *service; int n; int sockfd = -1; + struct rdma_cm_event *event; + struct sockaddr_in sin; + struct pingpong_context *ctx = NULL; + struct rdma_conn_param conn_param; - asprintf(&service, "%d", port); - n = getaddrinfo(servername, service, &hints, &res); + asprintf(&service, "%d", data->port); + n = getaddrinfo(data->servername, service, &hints, &res); if (n < 0) { - fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); - return n; + fprintf(stderr, "%d:%s: %s for %s:%d\n", + pid, __func__, gai_strerror(n), + data->servername, data->port); + goto err4; } - for (t = res; t; t = t->ai_next) { - sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); - if (sockfd >= 0) { - if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) - break; - close(sockfd); - sockfd = -1; + if (data->use_cma) { + sin.sin_addr.s_addr = ((struct sockaddr_in*)res->ai_addr)->sin_addr.s_addr; + sin.sin_family = AF_INET; + sin.sin_port = htons(data->port); + if (rdma_resolve_addr(data->cm_id, NULL, + (struct sockaddr *)&sin, 2000)) { + fprintf(stderr, "%d:%s: rdma_resolve_addr failed\n", + pid, __func__ ); + goto err2; + } + + if (rdma_get_cm_event(data->cm_channel, &event)) + goto err2; + + if (event->event != RDMA_CM_EVENT_ADDR_RESOLVED) { + fprintf(stderr, "%d:%s: unexpected CM event %d\n", + pid, __func__, event->event); + goto err1; + } + rdma_ack_cm_event(event); + + if (rdma_resolve_route(data->cm_id, 2000)) { + fprintf(stderr, "%d:%s: 
rdma_resolve_route failed\n", + pid, __func__); + goto err2; + } + + if (rdma_get_cm_event(data->cm_channel, &event)) + goto err2; + + if (event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) { + fprintf(stderr, "%d:%s: unexpected CM event %d\n", + pid, __func__, event->event); + rdma_ack_cm_event(event); + goto err1; + } + rdma_ack_cm_event(event); + ctx = pp_init_ctx(data->cm_id, data); + if (!ctx) { + fprintf(stderr, "%d:%s: pp_init_ctx failed\n", pid, __func__); + goto err2; + } + data->my_dest.psn = lrand48() & 0xffffff; + data->my_dest.qpn = 0; + data->my_dest.rkey = ctx->mr->rkey; + data->my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 1; + conn_param.initiator_depth = 1; + conn_param.retry_count = 5; + conn_param.private_data = &data->my_dest; + conn_param.private_data_len = sizeof(data->my_dest); + + if (rdma_connect(data->cm_id, &conn_param)) { + fprintf(stderr, "%d:%s: rdma_connect failure\n", pid, __func__); + goto err2; + } + + if (rdma_get_cm_event(data->cm_channel, &event)) + goto err2; + + if (event->event != RDMA_CM_EVENT_ESTABLISHED) { + fprintf(stderr, "%d:%s: unexpected CM event %d\n", + pid, __func__, event->event); + goto err1; + } + if (!event->private_data || + (event->private_data_len < sizeof(*data->rem_dest))) { + fprintf(stderr, "%d:%s: bad private data ptr %p len %d\n", + pid, __func__, event->private_data, + event->private_data_len); + goto err1; + } + data->rem_dest = malloc(sizeof *data->rem_dest); + if (!data->rem_dest) + goto err1; + + memcpy(data->rem_dest, event->private_data, + sizeof(*data->rem_dest)); + rdma_ack_cm_event(event); + } else { + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, + t->ai_protocol); + if (sockfd >= 0) { + if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + if (sockfd < 0) { + fprintf(stderr, "%d:%s: Couldn't connect to %s:%d\n", + pid, __func__, data->servername, data->port); + goto err3; } + ctx = pp_init_ctx(data->ib_dev, data); + if (!ctx) + goto err3; + data->sockfd = sockfd; } freeaddrinfo(res); + return ctx; + +err1: + rdma_ack_cm_event(event); +err2: + rdma_destroy_id(data->cm_id); + rdma_destroy_event_channel(data->cm_channel); +err3: + freeaddrinfo(res); +err4: + return NULL; - if (sockfd < 0) { - fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); - return sockfd; - } - return sockfd; } -static int pp_client_exch_dest(int sockfd, const struct pingpong_dest *my_dest, - struct pingpong_dest *rem_dest) + +static int pp_client_exch_dest(struct pp_data *data) { - if (pp_write_keys(sockfd, my_dest)) + if (data->rem_dest != NULL) + free(data->rem_dest); + + data->rem_dest = malloc(sizeof *data->rem_dest); + if (!data->rem_dest) return -1; - return pp_read_keys(sockfd, my_dest, rem_dest); + if (pp_write_keys(data->sockfd, &data->my_dest)) + return -1; + + return pp_read_keys(data->sockfd, &data->my_dest, data->rem_dest); } -static int pp_server_connect(int port) +static struct pingpong_context *pp_server_connect(struct pp_data *data) { struct addrinfo *res, *t; struct addrinfo hints = { @@ -224,177 +363,297 @@ static int pp_server_connect(int port) char *service; int sockfd = -1, connfd; int n; + struct rdma_cm_event *event; + struct sockaddr_in sin; + struct pingpong_context *ctx = NULL; + struct rdma_cm_id *child_cm_id; + struct rdma_conn_param conn_param; + + asprintf(&service, "%d", data->port); + if ( (n = getaddrinfo(NULL, service, &hints, &res)) < 
0 ) { + fprintf(stderr, "%d:%s: %s for port %d\n", pid, __func__, + gai_strerror(n), data->port); + goto err5; + } + + if (data->use_cma) { + sin.sin_addr.s_addr = 0; + sin.sin_family = AF_INET; + sin.sin_port = htons(data->port); + if (rdma_bind_addr(data->cm_id, (struct sockaddr *)&sin)) { + fprintf(stderr, "%d:%s: rdma_bind_addr failed\n", pid, __func__); + goto err3; + } + + if (rdma_listen(data->cm_id, 0)) { + fprintf(stderr, "%d:%s: rdma_listen failed\n", pid, __func__); + goto err3; + } + + if (rdma_get_cm_event(data->cm_channel, &event)) + goto err3; - asprintf(&service, "%d", port); - n = getaddrinfo(NULL, service, &hints, &res); - - if (n < 0) { - fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); - return n; - } - - for (t = res; t; t = t->ai_next) { - sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); - if (sockfd >= 0) { - n = 1; - - setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); - - if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) - break; + if (event->event != RDMA_CM_EVENT_CONNECT_REQUEST) { + fprintf(stderr, "%d:%s: bad event waiting for connect request %d\n", + pid, __func__, event->event); + goto err2; + } + + if (!event->private_data || + (event->private_data_len < sizeof(*data->rem_dest))) { + fprintf(stderr, "%d:%s: bad private data len %d\n", pid, + __func__, event->private_data_len); + goto err2; + } + + data->rem_dest = malloc(sizeof *data->rem_dest); + if (!data->rem_dest) + goto err2; + + memcpy(data->rem_dest, event->private_data, sizeof(*data->rem_dest)); + + child_cm_id = (struct rdma_cm_id *)event->id; + ctx = pp_init_ctx(child_cm_id, data); + if (!ctx) { + free(data->rem_dest); + goto err1; + } + data->my_dest.psn = lrand48() & 0xffffff; + data->my_dest.qpn = 0; + data->my_dest.rkey = ctx->mr->rkey; + data->my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 1; + conn_param.initiator_depth = 1; + conn_param.private_data = &data->my_dest; + conn_param.private_data_len = sizeof(data->my_dest); + if (rdma_accept(child_cm_id, &conn_param)) { + fprintf(stderr, "%d:%s: rdma_accept failed\n", pid, __func__); + goto err1; + } + rdma_ack_cm_event(event); + if (rdma_get_cm_event(data->cm_channel, &event)) { + fprintf(stderr, "%d:%s: rdma_get_cm_event error\n", pid, __func__); + rdma_destroy_id(child_cm_id); + goto err3; + } + if (event->event != RDMA_CM_EVENT_ESTABLISHED) { + fprintf(stderr, "%d:%s: bad event waiting for established %d\n", + pid, __func__, event->event); + goto err1; + } + rdma_ack_cm_event(event); + } else { + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + n = 1; + + setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); + + if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + if (sockfd < 0) { + fprintf(stderr, "%d:%s: Couldn't listen to port %d\n", pid, + __func__, data->port); + goto err4; + } + + listen(sockfd, 1); + connfd = accept(sockfd, NULL, 0); + if (connfd < 0) { + perror("server accept"); + fprintf(stderr, "%d:%s: accept() failed\n", pid, __func__); close(sockfd); - sockfd = -1; + goto err4; } - } - - freeaddrinfo(res); + + close(sockfd); - if (sockfd < 0) { - fprintf(stderr, "Couldn't listen to port %d\n", port); - return sockfd; + ctx = pp_init_ctx(data->ib_dev, data); + if (!ctx) + goto err4; + data->sockfd = connfd; } + freeaddrinfo(res); + return ctx; - listen(sockfd, 1); - connfd = accept(sockfd, 
NULL, 0); - if (connfd < 0) { - perror("server accept"); - fprintf(stderr, "accept() failed\n"); - close(sockfd); - return connfd; - } +err1: + rdma_destroy_id(child_cm_id); +err2: + rdma_ack_cm_event(event); +err3: + rdma_destroy_id(data->cm_id); + rdma_destroy_event_channel(data->cm_channel); +err4: + freeaddrinfo(res); +err5: + return NULL; - close(sockfd); - return connfd; } -static int pp_server_exch_dest(int sockfd, const struct pingpong_dest *my_dest, - struct pingpong_dest* rem_dest) +static int pp_server_exch_dest(struct pp_data *data) { + if (data->rem_dest != NULL) + free(data->rem_dest); + data->rem_dest = malloc(sizeof *data->rem_dest); - if (pp_read_keys(sockfd, my_dest, rem_dest)) + if (!data->rem_dest) return -1; - return pp_write_keys(sockfd, my_dest); + if (pp_read_keys(data->sockfd, &data->my_dest, data->rem_dest)) + return -1; + + return pp_write_keys(data->sockfd, &data->my_dest); } -static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, - int tx_depth, int port) +static struct pingpong_context *pp_init_ctx(void *ptr, struct pp_data *data) { struct pingpong_context *ctx; + struct ibv_device *ib_dev; + struct rdma_cm_id *cm_id; ctx = malloc(sizeof *ctx); if (!ctx) return NULL; - ctx->size = size; - ctx->tx_depth = tx_depth; + ctx->size = data->size; + ctx->tx_depth = data->tx_depth; - ctx->buf = memalign(page_size, size * 2); + ctx->buf = memalign(page_size, ctx->size * 2); if (!ctx->buf) { - fprintf(stderr, "Couldn't allocate work buf.\n"); + fprintf(stderr, "%d:%s: Couldn't allocate work buf.\n", + pid, __func__); return NULL; } - memset(ctx->buf, 0, size * 2); + memset(ctx->buf, 0, ctx->size * 2); - ctx->post_buf = (char*)ctx->buf + (size - 1); - ctx->poll_buf = (char*)ctx->buf + (2 * size - 1); + ctx->post_buf = (char *)ctx->buf + (ctx->size -1); + ctx->poll_buf = (char *)ctx->buf + (2 * ctx->size -1); + - ctx->context = ibv_open_device(ib_dev); - if (!ctx->context) { - fprintf(stderr, "Couldn't get context for %s\n", - ibv_get_device_name(ib_dev)); - return NULL; + if (data->use_cma) { + cm_id = (struct rdma_cm_id *)ptr; + ctx->context = cm_id->verbs; + if (!ctx->context) { + fprintf(stderr, "%d:%s: Unbound cm_id!!\n", pid, + __func__); + return NULL; + } + + } else { + ib_dev = (struct ibv_device *)ptr; + ctx->context = ibv_open_device(ib_dev); + if (!ctx->context) { + fprintf(stderr, "%d:%s: Couldn't get context for %s\n", + pid, __func__, ibv_get_device_name(ib_dev)); + return NULL; + } } ctx->pd = ibv_alloc_pd(ctx->context); if (!ctx->pd) { - fprintf(stderr, "Couldn't allocate PD\n"); + fprintf(stderr, "%d:%s: Couldn't allocate PD\n", pid, __func__); return NULL; } - /* We dont really want IBV_ACCESS_LOCAL_WRITE, but IB spec says: - * The Consumer is not allowed to assign Remote Write or Remote Atomic to - * a Memory Region that has not been assigned Local Write. */ - ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, + /* We dont really want IBV_ACCESS_LOCAL_WRITE, but IB spec says: + * The Consumer is not allowed to assign Remote Write or Remote Atomic to + * a Memory Region that has not been assigned Local Write. 
*/ + ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, ctx->size * 2, IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_LOCAL_WRITE); if (!ctx->mr) { - fprintf(stderr, "Couldn't allocate MR\n"); + fprintf(stderr, "%d:%s: Couldn't allocate MR\n", pid, __func__); return NULL; } - ctx->cq = ibv_create_cq(ctx->context, tx_depth, NULL, NULL, 0); - if (!ctx->cq) { - fprintf(stderr, "Couldn't create CQ\n"); + ctx->rcq = ibv_create_cq(ctx->context, 1, NULL, NULL, 0); + if (!ctx->rcq) { + fprintf(stderr, "%d:%s: Couldn't create recv CQ\n", pid, + __func__); return NULL; } - { - struct ibv_qp_init_attr attr = { - .send_cq = ctx->cq, - .recv_cq = ctx->cq, - .cap = { - .max_send_wr = tx_depth, - /* Work around: driver doesnt support - * recv_wr = 0 */ - .max_recv_wr = 1, - .max_send_sge = 1, - .max_recv_sge = 1, - .max_inline_data = size - }, - .qp_type = IBV_QPT_RC - }; + ctx->scq = ibv_create_cq(ctx->context, ctx->tx_depth, ctx, NULL, 0); + if (!ctx->scq) { + fprintf(stderr, "%d:%s: Couldn't create send CQ\n", pid, + __func__); + return NULL; + } + + struct ibv_qp_init_attr attr = { + .send_cq = ctx->scq, + .recv_cq = ctx->rcq, + .cap = { + .max_send_wr = ctx->tx_depth, + /* Work around: driver doesnt support + * recv_wr = 0 */ + .max_recv_wr = 1, + .max_send_sge = 1, + .max_recv_sge = 1, + .max_inline_data = 0 + }, + .qp_type = IBV_QPT_RC + }; + + if (data->use_cma) { + if (rdma_create_qp(cm_id, ctx->pd, &attr)) { + fprintf(stderr, "%d:%s: Couldn't create QP\n", pid, __func__); + return NULL; + } + ctx->qp = cm_id->qp; + pp_post_recv(ctx); + } else { ctx->qp = ibv_create_qp(ctx->pd, &attr); if (!ctx->qp) { - fprintf(stderr, "Couldn't create QP\n"); + fprintf(stderr, "%d:%s: Couldn't create QP\n", pid, __func__); return NULL; } - } - - { - struct ibv_qp_attr attr = { - .qp_state = IBV_QPS_INIT, - .pkey_index = 0, - .port_num = port, - .qp_access_flags = IBV_ACCESS_REMOTE_WRITE - }; - - if (ibv_modify_qp(ctx->qp, &attr, - IBV_QP_STATE | - IBV_QP_PKEY_INDEX | - IBV_QP_PORT | - IBV_QP_ACCESS_FLAGS)) { - fprintf(stderr, "Failed to modify QP to INIT\n"); - return NULL; + { + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_INIT; + attr.pkey_index = 0; + attr.port_num = data->ib_port; + attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; + + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS)) { + fprintf(stderr, "%d:%s: Failed to modify QP to INIT\n", + pid, __func__); + return NULL; + } } } - ctx->wr.wr_id = PINGPONG_RDMA_WRID; - ctx->wr.sg_list = &ctx->list; - ctx->wr.num_sge = 1; - ctx->wr.opcode = IBV_WR_RDMA_WRITE; - ctx->wr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE; - ctx->wr.next = NULL; - - return ctx; + return ctx; } -static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, - struct pingpong_dest *dest) +static int pp_connect_ctx(struct pingpong_context *ctx, struct pp_data *data) { struct ibv_qp_attr attr = { .qp_state = IBV_QPS_RTR, .path_mtu = IBV_MTU_256, - .dest_qp_num = dest->qpn, - .rq_psn = dest->psn, - .max_dest_rd_atomic = 1, - .min_rnr_timer = 12, - .ah_attr.is_global = 0, - .ah_attr.dlid = dest->lid, - .ah_attr.sl = 0, - .ah_attr.src_path_bits = 0, - .ah_attr.port_num = port, + .dest_qp_num = data->rem_dest->qpn, + .rq_psn = data->rem_dest->psn, + .max_dest_rd_atomic = 1, + .min_rnr_timer = 12, + .ah_attr.is_global = 0, + .ah_attr.dlid = data->rem_dest->lid, + .ah_attr.sl = 0, + .ah_attr.src_path_bits = 0, + .ah_attr.port_num = data->ib_port }; if (ibv_modify_qp(ctx->qp, &attr, @@ -405,7 +664,7 @@ static int 
pp_connect_ctx(struct pingpon IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER)) { - fprintf(stderr, "Failed to modify QP to RTR\n"); + fprintf(stderr, "%s: Failed to modify QP to RTR\n", __func__); return 1; } @@ -413,7 +672,7 @@ static int pp_connect_ctx(struct pingpon attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; - attr.sq_psn = my_psn; + attr.sq_psn = data->my_dest.psn; attr.max_rd_atomic = 1; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | @@ -422,71 +681,221 @@ static int pp_connect_ctx(struct pingpon IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) { - fprintf(stderr, "Failed to modify QP to RTS\n"); + fprintf(stderr, "%s: Failed to modify QP to RTS\n", __func__); return 1; } return 0; } -static int pp_open_port(struct pingpong_context *ctx, const char * servername, - int ib_port, int port, struct pingpong_dest *rem_dest) +static int pp_open_port(struct pingpong_context *ctx, struct pp_data *data ) { char addr_fmt[] = "%8s address: LID %#04x QPN %#06x PSN %#06x RKey %#08x VAddr %#016Lx\n"; - struct pingpong_dest my_dest; - int sockfd; - int rc; - /* Create connection between client and server. * We do it by exchanging data over a TCP socket connection. */ - my_dest.lid = pp_get_local_lid(ctx, ib_port); - my_dest.qpn = ctx->qp->qp_num; - my_dest.psn = lrand48() & 0xffffff; - if (!my_dest.lid) { + data->my_dest.lid = pp_get_local_lid(ctx, data->ib_port); + data->my_dest.qpn = ctx->qp->qp_num; + data->my_dest.psn = lrand48() & 0xffffff; + if (!data->my_dest.lid) { fprintf(stderr, "Local lid 0x0 detected. Is an SM running?\n"); return -1; } - my_dest.rkey = ctx->mr->rkey; - my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; + data->my_dest.rkey = ctx->mr->rkey; + data->my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; - printf(addr_fmt, "local", my_dest.lid, my_dest.qpn, my_dest.psn, - my_dest.rkey, my_dest.vaddr); - - sockfd = servername ? pp_client_connect(servername, port) : - pp_server_connect(port); + printf(addr_fmt, "local", data->my_dest.lid, data->my_dest.qpn, data->my_dest.psn, + data->my_dest.rkey, data->my_dest.vaddr); - if (sockfd < 0) { - printf("pp_connect_sock(%s,%d) failed (%d)!\n", - servername, port, sockfd); - return sockfd; + if (data->servername) { + if (pp_client_exch_dest(data)) + return 1; + } else { + if (pp_server_exch_dest(data)) + return 1; } - rc = servername ? pp_client_exch_dest(sockfd, &my_dest, rem_dest) : - pp_server_exch_dest(sockfd, &my_dest, rem_dest); - if (rc) - return rc; + printf(addr_fmt, "remote", data->rem_dest->lid, data->rem_dest->qpn, + data->rem_dest->psn, data->rem_dest->rkey, + data->rem_dest->vaddr); - printf(addr_fmt, "remote", rem_dest->lid, rem_dest->qpn, rem_dest->psn, - rem_dest->rkey, rem_dest->vaddr); - - if ((rc = pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest))) - return rc; + if (pp_connect_ctx(ctx, data)) + return 1; /* An additional handshake is required *after* moving qp to RTR. - * Arbitrarily reuse exch_dest for this purpose. - */ + Arbitrarily reuse exch_dest for this purpose. */ + if (data->servername) { + if (pp_client_exch_dest(data)) + return -1; + } else { + if (pp_server_exch_dest(data)) + return -1; + } - rc = servername ? 
pp_client_exch_dest(sockfd, &my_dest, rem_dest) : - pp_server_exch_dest(sockfd, &my_dest, rem_dest); + write(data->sockfd, "done", sizeof "done"); + close(data->sockfd); + + return 0; +} - if (rc) - return rc; +static void pp_post_recv(struct pingpong_context *ctx) +{ + struct ibv_sge list; + struct ibv_recv_wr wr, *bad_wr; + int rc; + + list.addr = (uintptr_t) ctx->buf; + list.length = 1; + list.lkey = ctx->mr->lkey; + wr.next = NULL; + wr.wr_id = 0xdeadbeef; + wr.sg_list = &list; + wr.num_sge = 1; + + rc = ibv_post_recv(ctx->qp, &wr, &bad_wr); + if (rc) { + perror("ibv_post_recv"); + fprintf(stderr, "%d:%s: ibv_post_recv failed %d\n", pid, + __func__, rc); + } +} - write(sockfd, "done", sizeof "done"); - close(sockfd); - return 0; +static void pp_wait_for_done(struct pingpong_context *ctx) +{ + struct ibv_wc wc; + int ne; + + do { + usleep(500); + ne = ibv_poll_cq(ctx->rcq, 1, &wc); + } while (ne == 0); + + if (wc.status) + fprintf(stderr, "%d:%s: bad wc status %d\n", pid, __func__, + wc.status); + if (!(wc.opcode & IBV_WC_RECV)) + fprintf(stderr, "%d:%s: bad wc opcode %d\n", pid, __func__, + wc.opcode); + if (wc.wr_id != 0xdeadbeef) + fprintf(stderr, "%d:%s: bad wc wr_id 0x%x\n", pid, __func__, + (int)wc.wr_id); +} + +static void pp_send_done(struct pingpong_context *ctx) +{ + struct ibv_send_wr *bad_wr; + struct ibv_wc wc; + int ne; + + ctx->list.addr = (uintptr_t) ctx->buf; + ctx->list.length = 1; + ctx->list.lkey = ctx->mr->lkey; + ctx->wr.wr_id = 0xcafebabe; + ctx->wr.sg_list = &ctx->list; + ctx->wr.num_sge = 1; + ctx->wr.opcode = IBV_WR_SEND; + ctx->wr.send_flags = IBV_SEND_SIGNALED; + ctx->wr.next = NULL; + if (ibv_post_send(ctx->qp, &ctx->wr, &bad_wr)) { + fprintf(stderr, "%d:%s: ibv_post_send failed\n", pid, __func__); + return; + } + do { + usleep(500); + ne = ibv_poll_cq(ctx->scq, 1, &wc); + } while (ne == 0); + + if (wc.status) + fprintf(stderr, "%d:%s: bad wc status %d\n", pid, __func__, + wc.status); + if (wc.opcode != IBV_WC_SEND) + fprintf(stderr, "%d:%s: bad wc opcode %d\n", pid, __func__, + wc.opcode); + if (wc.wr_id != 0xcafebabe) + fprintf(stderr, "%d:%s: bad wc wr_id 0x%x\n", pid, __func__, + (int)wc.wr_id); +} + +static void pp_wait_for_start(struct pingpong_context *ctx) +{ + struct ibv_wc wc; + int ne; + + do { + usleep(500); + ne = ibv_poll_cq(ctx->rcq, 1, &wc); + } while (ne == 0); + + if (wc.status) + fprintf(stderr, "%d:%s: bad wc status %d\n", pid, __func__, + wc.status); + if (!(wc.opcode & IBV_WC_RECV)) + fprintf(stderr, "%d:%s: bad wc opcode %d\n", pid, __func__, + wc.opcode); + if (wc.wr_id != 0xdeadbeef) + fprintf(stderr, "%d:%s: bad wc wr_id 0x%x\n", pid, __func__, + (int)wc.wr_id); + pp_post_recv(ctx); +} + +static void pp_send_start(struct pingpong_context *ctx) +{ + struct ibv_send_wr *bad_wr; + struct ibv_wc wc; + int ne; + + ctx->list.addr = (uintptr_t) ctx->buf; + ctx->list.length = 1; + ctx->list.lkey = ctx->mr->lkey; + ctx->wr.wr_id = 0xabbaabba; + ctx->wr.sg_list = &ctx->list; + ctx->wr.num_sge = 1; + ctx->wr.opcode = IBV_WR_SEND; + ctx->wr.send_flags = IBV_SEND_SIGNALED; + ctx->wr.next = NULL; + if (ibv_post_send(ctx->qp, &ctx->wr, &bad_wr)) { + fprintf(stderr, "%d:%s: ibv_post_send failed\n", pid, __func__); + return; + } + do { + usleep(500); + ne = ibv_poll_cq(ctx->scq, 1, &wc); + } while (ne == 0); + + if (wc.status) + fprintf(stderr, "%d:%s: bad wc status %d\n", pid, __func__, + wc.status); + if (wc.opcode != IBV_WC_SEND) + fprintf(stderr, "%d:%s: bad wc opcode %d\n", pid, __func__, + wc.opcode); + if (wc.wr_id != 0xabbaabba) + 
fprintf(stderr, "%d:%s: bad wc wr_id 0x%x\n", pid, __func__, + (int)wc.wr_id); +} + +static void pp_close_cma(struct pp_data data) +{ + struct rdma_cm_event *event; + int rc; + + if (data.servername) { + rc = rdma_disconnect(data.cm_id); + if (rc) { + perror("rdma_disconnect"); + fprintf(stderr, "%d:%s: rdma disconnect error\n", pid, + __func__); + return; + } + } + + rdma_get_cm_event(data.cm_channel, &event); + if (event->event != RDMA_CM_EVENT_DISCONNECTED) + fprintf(stderr, "%d:%s: unexpected event during disconnect %d\n", + pid, __func__, event->event); + rdma_ack_cm_event(event); + rdma_destroy_id(data.cm_id); + rdma_destroy_event_channel(data.cm_channel); } static void usage(const char *argv0) @@ -505,6 +914,7 @@ static void usage(const char *argv0) printf(" -C, --report-cycles report times in cpu cycle units (default microseconds)\n"); printf(" -H, --report-histogram print out all results (default print summary only)\n"); printf(" -U, --report-unsorted (implies -H) print out unsorted results (default sorted)\n"); + printf(" -c, --cma Use the RDMA CMA to setup the RDMA connection\n"); } /* @@ -584,16 +994,10 @@ int main(int argc, char *argv[]) { const char *ib_devname = NULL; const char *servername = NULL; - int port = 18515; - int ib_port = 1; - int size = 1; int iters = 1000; - int tx_depth = 50; struct report_options report = {}; struct pingpong_context *ctx; - struct pingpong_dest rem_dest; - struct ibv_device *ib_dev; struct ibv_qp *qp; struct ibv_send_wr *wr; @@ -604,6 +1008,19 @@ int main(int argc, char *argv[]) cycles_t *tstamp; + struct pp_data data = { + .port = 18515, + .ib_port = 1, + .size = 1, + .tx_depth = 50, + .use_cma = 0, + .servername = NULL, + .rem_dest = NULL, + .ib_dev = NULL, + .cm_channel = NULL, + .cm_id = NULL + }; + /* Parameter parsing. 
*/ while (1) { int c; @@ -618,73 +1035,78 @@ int main(int argc, char *argv[]) { .name = "report-cycles", .has_arg = 0, .val = 'C' }, { .name = "report-histogram",.has_arg = 0, .val = 'H' }, { .name = "report-unsorted",.has_arg = 0, .val = 'U' }, + { .name = "cma", .has_arg = 0, .val = 'c' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:n:t:CHU", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:n:t:CHUc", long_options, NULL); if (c == -1) break; switch (c) { - case 'p': - port = strtol(optarg, NULL, 0); - if (port < 0 || port > 65535) { - usage(argv[0]); - return 1; - } - break; + case 'p': + data.port = strtol(optarg, NULL, 0); + if (data.port < 0 || data.port > 65535) { + usage(argv[0]); + return 1; + } + break; - case 'd': - ib_devname = strdupa(optarg); - break; + case 'd': + ib_devname = strdupa(optarg); + break; - case 'i': - ib_port = strtol(optarg, NULL, 0); - if (ib_port < 0) { - usage(argv[0]); - return 2; - } - break; + case 'i': + data.ib_port = strtol(optarg, NULL, 0); + if (data.ib_port < 0) { + usage(argv[0]); + return 2; + } + break; - case 's': - size = strtol(optarg, NULL, 0); - if (size < 1) { usage(argv[0]); return 3; } - break; + case 's': + data.size = strtol(optarg, NULL, 0); + if (data.size < 1) { usage(argv[0]); return 3; } + break; - case 't': - tx_depth = strtol(optarg, NULL, 0); - if (tx_depth < 1) { usage(argv[0]); return 4; } - break; + case 't': + data.tx_depth = strtol(optarg, NULL, 0); + if (data.tx_depth < 1) { usage(argv[0]); return 4; } + break; - case 'n': - iters = strtol(optarg, NULL, 0); - if (iters < 2) { - usage(argv[0]); - return 5; - } + case 'n': + iters = strtol(optarg, NULL, 0); + if (iters < 2) { + usage(argv[0]); + return 5; + } - break; + break; - case 'C': - report.cycles = 1; - break; + case 'C': + report.cycles = 1; + break; - case 'H': - report.histogram = 1; - break; + case 'H': + report.histogram = 1; + break; - case 'U': - report.unsorted = 1; - break; + case 'U': + report.unsorted = 1; + break; + + case 'c': + data.use_cma = 1; + break; - default: - usage(argv[0]); - return 5; + default: + usage(argv[0]); + return 5; } } if (optind == argc - 1) - servername = strdupa(argv[optind]); + data.servername = strdupa(argv[optind]); else if (optind < argc) { usage(argv[0]); return 6; @@ -693,27 +1115,81 @@ int main(int argc, char *argv[]) /* * Done with parameter parsing. Perform setup. 
*/ + pid = getpid(); - srand48(getpid() * time(NULL)); + srand48(pid * time(NULL)); page_size = sysconf(_SC_PAGESIZE); - ib_dev = pp_find_dev(ib_devname); - if (!ib_dev) - return 7; - ctx = pp_init_ctx(ib_dev, size, tx_depth, ib_port); - if (!ctx) - return 8; + if (data.use_cma) { + data.cm_channel = rdma_create_event_channel(); + if (!data.cm_channel) { + fprintf(stderr, "%d:%s: rdma_create_event_channel failed\n", + pid, __func__); + return 1; + } + if (rdma_create_id(data.cm_channel, &data.cm_id, NULL, RDMA_PS_TCP)) { + fprintf(stderr, "%d:%s: rdma_create_id failed\n", + pid, __func__); + return 1; + } + + if (data.servername) { + ctx = pp_client_connect(&data); + if (!ctx) + return 1; + } else { + ctx = pp_server_connect(&data); + if (!ctx) + return 1; + } - if (pp_open_port(ctx, servername, ib_port, port, &rem_dest)) - return 9; + printf("%d: Local address: LID %#04x, QPN %#06x, PSN %#06x " + "RKey %#08x VAddr %#016Lx\n", pid, + data.my_dest.lid, data.my_dest.qpn, data.my_dest.psn, + data.my_dest.rkey, data.my_dest.vaddr); + + printf("%d: Remote address: LID %#04x, QPN %#06x, PSN %#06x, " + "RKey %#08x VAddr %#016Lx\n\n", pid, + data.rem_dest->lid, data.rem_dest->qpn, data.rem_dest->psn, + data.rem_dest->rkey, data.rem_dest->vaddr); + + if (data.servername) { + pp_send_start(ctx); + } else { + pp_wait_for_start(ctx); + } + } else { + data.ib_dev = pp_find_dev(ib_devname); + if (!data.ib_dev) + return 7; + + if (data.servername) { + ctx = pp_client_connect(&data); + if (!ctx) + return 8; + } else { + ctx = pp_server_connect(&data); + if (!ctx) + return 8; + } + + if (pp_open_port(ctx, &data)) + return 9; + } wr = &ctx->wr; ctx->list.addr = (uintptr_t) ctx->buf; ctx->list.length = ctx->size; ctx->list.lkey = ctx->mr->lkey; - wr->wr.rdma.remote_addr = rem_dest.vaddr; - wr->wr.rdma.rkey = rem_dest.rkey; + wr->wr.rdma.remote_addr = data.rem_dest->vaddr; + wr->wr.rdma.rkey = data.rem_dest->rkey; + ctx->wr.wr_id = PINGPONG_RDMA_WRID; + ctx->wr.sg_list = &ctx->list; + ctx->wr.num_sge = 1; + ctx->wr.opcode = IBV_WR_RDMA_WRITE; + ctx->wr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE; + ctx->wr.next = NULL; scnt = 0; rcnt = 0; @@ -733,7 +1209,7 @@ int main(int argc, char *argv[]) while (scnt < iters || ccnt < iters || rcnt < iters) { /* Wait till buffer changes. */ - if (rcnt < iters && !(scnt < 1 && servername)) { + if (rcnt < iters && !(scnt < 1 && data.servername)) { ++rcnt; while (*poll_buf != (char)rcnt) ; @@ -759,7 +1235,7 @@ int main(int argc, char *argv[]) int ne; ++ccnt; do { - ne = ibv_poll_cq(ctx->cq, 1, &wc); + ne = ibv_poll_cq(ctx->scq, 1, &wc); } while (ne == 0); if (ne < 0) { @@ -777,6 +1253,11 @@ int main(int argc, char *argv[]) } } } + if (data.use_cma) { + pp_send_done(ctx); + pp_wait_for_done(ctx); + pp_close_cma(data); + } print_report(&report, iters, tstamp); return 0; From mst at mellanox.co.il Wed Aug 16 22:49:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Aug 2006 08:49:55 +0300 Subject: [openib-general] madeye backport patch for OFED 1.1 In-Reply-To: <1155768213.9855.13329.camel@hal.voltaire.com> References: <1155768213.9855.13329.camel@hal.voltaire.com> Message-ID: <20060817054955.GA2206@mellanox.co.il> Quoting r. Hal Rosenstock : > > > By the way, I think that longer-term we really want to be > > > able to let userspace expose a madeye-like functionality, > > > snooping on all packets. > > > > Maybe, but then instead of simple logger we may find us developing > > full-featured sniffer app... :) > > Plug IB decode into ethereal ? 
Exactly, only it's wireshark now :) But let's start with the kernel part enabling this, and porting madeye to userspace. -- MST From zhushisongzhu at yahoo.com Wed Aug 16 22:58:32 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Wed, 16 Aug 2006 22:58:32 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060816185521.GB6336@mellanox.co.il> Message-ID: <20060817055832.73826.qmail@web36909.mail.mud.yahoo.com> (1) my change: SDP_TX_SIZE=0x4 SDP_RX_SIZE=0x4 use your patch (2) memory consumed by SDP connection is ok But ab always only complete 256 requests successfully. Results are showed as following: # SIMPLE_LIBSDP=1 LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so ab -c 300 -n 300 -X 193.12.10.14:3129 http://http://www.sse.com.cn/sseportal/ps/zhs/home.shtml # This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0 Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/ Benchmarking www.sse.com.cn [through 193.12.10.14:3129] (be patient) Completed 100 requests Completed 200 requests apr_recv: Connection refused (111) Total of 256 requests completed (3) one time linux kernel on the client crashed. I copy the output from the screen. Process sdp (pid:4059, threadinfo 0000010036384000 task 000001003ea10030) Call Trace:{:ib_sdp:sdp_destroy_workto} {:ib_sdp:sdp_destroy_qp+77} {:ib_sdp:sdp_destruct+279}{sk_free+28} {worker_thread+419}{default_wake_function+0} {default_wake_function+0}{keventd_create_kthread+0} {worker_thread+0}{keventd_create_kthread+0} {kthread+200}{child_rip+8} {keventd_create_kthread+0}{kthread+0}{child_rip+0} Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 31 ff 45 31 ed 4c 89 RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> CR2:0000000000000004 <0>kernel panic-not syncing:Oops zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > Subject: Re: why sdp connections cost so much > memory > > > > I have changed SDP_RX_SIZE from 0x40 to 1 and > rebuilt > > ib_sdp.ko. But kernel always crashed. > > Weird. How about this patch: > > diff --git a/drivers/infiniband/ulp/sdp/sdp_bcopy.c > b/drivers/infiniband/ulp/sdp/sdp_bcopy.c > index c35a4da..2fd79a0 100644 > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c > @@ -234,7 +234,7 @@ void sdp_post_recvs(struct > sdp_sock *ssk > while ((likely(ssk->rx_head - ssk->rx_tail < > SDP_RX_SIZE) && > (ssk->rx_head - ssk->rx_tail - SDP_MIN_BUFS) * > SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE + rmem < > - ssk->isk.sk.sk_rcvbuf * 0x10) || > + ssk->isk.sk.sk_rcvbuf) || > unlikely(ssk->rx_head - ssk->rx_tail < > SDP_MIN_BUFS)) > sdp_post_recv(ssk); > } > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Wed Aug 16 23:02:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Aug 2006 09:02:02 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817055832.73826.qmail@web36909.mail.mud.yahoo.com> References: <20060817055832.73826.qmail@web36909.mail.mud.yahoo.com> Message-ID: <20060817060202.GA2380@mellanox.co.il> Quoting r. 
zhu shi song : > Subject: Re: why sdp connections cost so much memory > > (1) my change: > SDP_TX_SIZE=0x4 > SDP_RX_SIZE=0x4 > use your patch OK, thanks, I'll try to reproduce and debug. Which kernel are you using this on? -- MST From zhushisongzhu at yahoo.com Wed Aug 16 23:07:18 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Wed, 16 Aug 2006 23:07:18 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817060202.GA2380@mellanox.co.il> Message-ID: <20060817060718.91326.qmail@web36903.mail.mud.yahoo.com> OS: RHEL-4.3 Kernel: 2.6.9 about detail info see below: [root at ib_test1 squid.test]# uname -a Linux ib_test1 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux [root at ib_test1 squid.test]# zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > Subject: Re: why sdp connections cost so much > memory > > > > (1) my change: > > SDP_TX_SIZE=0x4 > > SDP_RX_SIZE=0x4 > > use your patch > > OK, thanks, I'll try to reproduce and debug. Which > kernel > are you using this on? > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Wed Aug 16 23:05:09 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Aug 2006 09:05:09 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817055832.73826.qmail@web36909.mail.mud.yahoo.com> References: <20060817055832.73826.qmail@web36909.mail.mud.yahoo.com> Message-ID: <20060817060509.GB2380@mellanox.co.il> Quoting r. zhu shi song : > (2) memory consumed by SDP connection is ok > But ab always only complete 256 requests > successfully. Results are showed as following: > # SIMPLE_LIBSDP=1 > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so ab -c 300 > -n 300 -X 193.12.10.14:3129 > http://http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > # This is ApacheBench, Version 2.0.41-dev > <$Revision: 1.141 $> apache-2.0 > Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, > http://www.zeustech.net/ > Copyright (c) 1998-2002 The Apache Software > Foundation, http://www.apache.org/ > > Benchmarking www.sse.com.cn [through > 193.12.10.14:3129] (be patient) > Completed 100 requests > Completed 200 requests > apr_recv: Connection refused (111) > Total of 256 requests completed Hmm. This one looks quite weird. Could you please enable debug (run with debug_level=1) on the server side, and post the messsages you get in dmesg? -- MST From eitan at mellanox.co.il Thu Aug 17 00:24:30 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 17 Aug 2006 10:24:30 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302B99DD9@mtlexch01.mtl.com> Hi Hal, It is common practice to base "distributions" of Autotools based code on "make dist". It is also common to avoid including the configure.in Makefile.am and other stuff that is considered "input" source code. Sources are made available under the SVN tree. On many GNU projects the sources are also distributed in source tar gzip archives. If this is a strong requirement (to be able to do "maintainer mode" work) from the distribution, we need to change our EXTRA_DIST assignments in the entire tree. Can be done if is really required. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Wednesday, August 16, 2006 8:32 PM > To: Vladimir Sokolovsky > Cc: OpenFabricsEWG; openib-general at openib.org > Subject: Re: [openib-general] [openfabrics-ewg] OFED 1.1-rc1 is available > > On Tue, 2006-08-15 at 09:28, Vladimir Sokolovsky wrote: > > The OFED-1.1-rc1 source tar ball (openib-1.1.tgz ) created by > > build_ofed.sh script (from > > https://openib.org/svn/gen2/branches/1.1/ofed/build) > > > > build_ofed.sh script takes userspace libraries/binaries after executing: > > > > autogen.sh > > configure > > make dist > > > > Therefor, autogen.sh is not a part of it and also it is the reason > > that you see Makefiles there. > > Why shouldn't they be included ? Weren't they included in OFED 1.0 ? > > This makes life difficult for those who want to rebuild based on the OFED > released sources. > > -- Hal > > > Regards, > > Vladimir > > > > > > > > Ira Weiny wrote: > > > Why is the OFED 1.1-rc1 source tar ball missing files when compared with > the 1.1 branch? > > > > > > Of specific question is the absence of autogen.sh in libibverbs. > > > > > > Ira > > > > > > On Sun, 13 Aug 2006 16:14:10 +0300 > > > "Tziporet Koren" wrote: > > > > > > > > >> Hal Rosenstock wrote: > > >> > > >>>> Target release date: 12-Sep > > >>>> > > >>>> Intermediate milestones: > > >>>> 1. Create 1.1 branch of user level: 27-Jul - done 2. RC1: 8-Aug - > > >>>> done 3. Feature freeze (RC2): 17-Aug > > >>>> > > >>>> > > >>> What is the start build date for RC2 ? When do developers need to > > >>> have their code in by to make RC2 ? > > >>> > > >>> > > >> We will start on Tue 15-Aug. Is this OK with you? > > >> > > >>> > > >>> > > >>> > > >>>> 4. Code freeze (rc-x): 6-Sep > > >>>> > > >>>> > > >>> Is this 1 or 2 RCs beyond RC2 in order to make this ? > > >>> > > >>> > > >>> > > >> I hope one but I guess it will be two more RCs. > > >> > > >> Tziporet > > >> > > >> _______________________________________________ > > >> openib-general mailing list > > >> openib-general at openib.org > > >> http://openib.org/mailman/listinfo/openib-general > > >> > > >> To unsubscribe, please visit > > >> http://openib.org/mailman/listinfo/openib-general > > >> > > >> > > > > > > _______________________________________________ > > > openfabrics-ewg mailing list > > > openfabrics-ewg at openib.org > > > http://openib.org/mailman/listinfo/openfabrics-ewg > > > > > > > > > > > > _______________________________________________ > > openfabrics-ewg mailing list > > openfabrics-ewg at openib.org > > http://openib.org/mailman/listinfo/openfabrics-ewg > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Thu Aug 17 00:19:29 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 17 Aug 2006 10:19:29 +0300 Subject: [openib-general] [PATCH] [MINOR] OpenSM/osm_sm_mad_ctrl.c: Properly handle status based on whether direct routed or LID routed SMP In-Reply-To: <1155741019.9855.689.camel@hal.voltaire.com> References: <1155741019.9855.689.camel@hal.voltaire.com> Message-ID: <44E41881.3060808@mellanox.co.il> Looks good. 
EZ Hal Rosenstock wrote: > OpenSM/osm_sm_mad_ctrl.c: Properly handle status based on whether direct > routed or LID routed SMP > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_sm_mad_ctrl.c > =================================================================== > --- opensm/osm_sm_mad_ctrl.c (revision 8934) > +++ opensm/osm_sm_mad_ctrl.c (working copy) > @@ -34,7 +34,6 @@ > * $Id$ > */ > > - > /* > * Abstract: > * Implementation of osm_sm_mad_ctrl_t. > @@ -253,12 +252,15 @@ __osm_sm_mad_ctrl_process_get_resp( > > p_smp = osm_madw_get_smp_ptr( p_madw ); > > - if( !ib_smp_is_d( p_smp ) ) > + if( p_smp->mgmt_class == IB_MCLASS_SUBN_DIR ) > { > - osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > - "__osm_sm_mad_ctrl_process_get_resp: ERR 3102: " > - "'D' bit not set in returned SMP\n" ); > - osm_dump_dr_smp( p_ctrl->p_log, p_smp, OSM_LOG_ERROR ); > + if( !ib_smp_is_d( p_smp ) ) > + { > + osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > + "__osm_sm_mad_ctrl_process_get_resp: ERR 3102: " > + "'D' bit not set in returned SMP\n" ); > + osm_dump_dr_smp( p_ctrl->p_log, p_smp, OSM_LOG_ERROR ); > + } > } > > p_old_madw = (osm_madw_t*)transaction_context; > @@ -667,6 +669,7 @@ __osm_sm_mad_ctrl_rcv_callback( > { > osm_sm_mad_ctrl_t* p_ctrl = (osm_sm_mad_ctrl_t*)bind_context; > ib_smp_t* p_smp; > + ib_net16_t status; > > OSM_LOG_ENTER( p_ctrl->p_log, __osm_sm_mad_ctrl_rcv_callback ); > > @@ -717,11 +720,20 @@ __osm_sm_mad_ctrl_rcv_callback( > if( osm_log_is_active( p_ctrl->p_log, OSM_LOG_FRAMES ) ) > osm_dump_dr_smp( p_ctrl->p_log, p_smp, OSM_LOG_FRAMES ); > > - if( ib_smp_get_status( p_smp ) != 0 ) > + if( p_smp->mgmt_class == IB_MCLASS_SUBN_DIR ) > + { > + status = ib_smp_get_status( p_smp ); > + } > + else > + { > + status = p_smp->status; > + } > + > + if( status != 0 ) > { > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > "__osm_sm_mad_ctrl_rcv_callback: ERR 3111: " > - "Error status = 0x%X\n", ib_smp_get_status( p_smp ) ); > + "Error status = 0x%X\n", status ); > osm_dump_dr_smp( p_ctrl->p_log, p_smp, OSM_LOG_ERROR ); > } > > > From rkuchimanchi at silverstorm.com Thu Aug 17 00:26:52 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Thu, 17 Aug 2006 12:56:52 +0530 Subject: [openib-general] Reference counting for IPoIB kernel module Message-ID: <44E41A3C.6070509@silverstorm.com> Hi, I have noticed that even though the IPoIB interface is up and some application (ping, ib_read_bw etc) is using this interface, I can remove the ib_ipoib module. While ping was running, I did a "rmmod ib_ipoib" which successfully removes the module and ping gives an error. (see below) Shouldn't there be some kind of reference counting that would not allow the ib_ipoib module to be removed while the IPoIB interfaces are being used ? Regards, Ram [root at sst21 b]# ping -Iib0 172.21.50.220 PING 172.21.50.220 (172.21.50.220) from 172.21.50.221 ib0: 56(84) bytes of data. 
64 bytes from 172.21.50.220: icmp_seq=0 ttl=64 time=0.125 ms 64 bytes from 172.21.50.220: icmp_seq=1 ttl=64 time=0.063 ms 64 bytes from 172.21.50.220: icmp_seq=2 ttl=64 time=0.045 ms 64 bytes from 172.21.50.220: icmp_seq=3 ttl=64 time=0.048 ms 64 bytes from 172.21.50.220: icmp_seq=4 ttl=64 time=0.047 ms 64 bytes from 172.21.50.220: icmp_seq=5 ttl=64 time=0.058 ms 64 bytes from 172.21.50.220: icmp_seq=6 ttl=64 time=0.049 ms ping: sendmsg: No such device ping: sendmsg: No such device ping: sendmsg: No such device From zhushisongzhu at yahoo.com Thu Aug 17 00:40:56 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Thu, 17 Aug 2006 00:40:56 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817060509.GB2380@mellanox.co.il> Message-ID: <20060817074056.25982.qmail@web36903.mail.mud.yahoo.com> is "debug_level=1" for squid or for sdp? zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > (2) memory consumed by SDP connection is ok > > But ab always only complete 256 requests > > successfully. Results are showed as following: > > # SIMPLE_LIBSDP=1 > > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so ab -c > 300 > > -n 300 -X 193.12.10.14:3129 > > > http://http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > > # This is ApacheBench, Version 2.0.41-dev > > <$Revision: 1.141 $> apache-2.0 > > Copyright (c) 1996 Adam Twiss, Zeus Technology > Ltd, > > http://www.zeustech.net/ > > Copyright (c) 1998-2002 The Apache Software > > Foundation, http://www.apache.org/ > > > > Benchmarking www.sse.com.cn [through > > 193.12.10.14:3129] (be patient) > > Completed 100 requests > > Completed 200 requests > > apr_recv: Connection refused (111) > > Total of 256 requests completed > > Hmm. This one looks quite weird. Could you please > enable debug > (run with debug_level=1) on the server side, and > post the messsages > you get in dmesg? > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: dmesg.txt URL: From mst at mellanox.co.il Thu Aug 17 00:44:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Aug 2006 10:44:42 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817074056.25982.qmail@web36903.mail.mud.yahoo.com> References: <20060817074056.25982.qmail@web36903.mail.mud.yahoo.com> Message-ID: <20060817074442.GA2630@mellanox.co.il> > > Quoting r. zhu shi song : > > > (2) memory consumed by SDP connection is ok > > > But ab always only complete 256 requests > > > successfully. Results are showed as following: > > > # SIMPLE_LIBSDP=1 > > > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so ab -c 300 > > > -n 300 -X 193.12.10.14:3129 > > > > > > http://http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > > > # This is ApacheBench, Version 2.0.41-dev > > > <$Revision: 1.141 $> apache-2.0 > > > Copyright (c) 1996 Adam Twiss, Zeus Technology > > Ltd, > > > http://www.zeustech.net/ > > > Copyright (c) 1998-2002 The Apache Software > > > Foundation, http://www.apache.org/ > > > > > > Benchmarking www.sse.com.cn [through > > > 193.12.10.14:3129] (be patient) > > > Completed 100 requests > > > Completed 200 requests > > > apr_recv: Connection refused (111) > > > Total of 256 requests completed > > > > Hmm. This one looks quite weird. 
Could you please > > enable debug > > (run with debug_level=1) on the server side, and > > post the messsages > > you get in dmesg? > > Quoting zhu shi song : > > is "debug_level=1" for squid or for sdp? > > zhu That's for SDP. -- MST From ogerlitz at voltaire.com Thu Aug 17 00:59:10 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 17 Aug 2006 10:59:10 +0300 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E3914C.4010903@ichips.intel.com> References: <44E3914C.4010903@ichips.intel.com> Message-ID: <44E421CE.1050404@voltaire.com> Arlin Davis wrote: > We are running into connection reject issues (IB_CM_REJ_STALE_CONN) with > our application under heavy load and lots of connections. > > We occassionally get a reject based on the QP being in timewait state > leftover from a prior connection. It appears that the CM keeps track of > the QP's in timewait state on both sides of the connection, How did you verify that? the CM generated REJ with IB_CM_REJ_STALE_CONN in two flows for the passive side (ie rejecting a REQ) and one flow for the active side (ie rejecting a REP). > How can a consumer know for sure that the new QP will not be in a > timewait state according to the CM? Does it make sense to push the > timewait functionality down into verbs? If not, is there a way for the > CM to hold a reference to the QP until the timewait expires? Just to emphasize what Sean has pointed out, you are asking how can a CM consumer know that a **local** QPN is not in the timewait state according to the **remote** CM. Since the issue is with the remote CM, it seems to me that pushing down timewait into verbs is not the correct direction to look at. Or. From ogerlitz at voltaire.com Thu Aug 17 01:18:56 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 17 Aug 2006 11:18:56 +0300 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E3AD24.4070200@ichips.intel.com> References: <44E3914C.4010903@ichips.intel.com> <44E3AD24.4070200@ichips.intel.com> Message-ID: <44E42670.5040308@voltaire.com> Sean Hefty wrote: > Even if we pushed timewait handling under verbs, a user could always get a QP > that the remote side thinks is connected. The original connection could fail to > disconnect because of lost DREQs. So, locally, the QP could have exited > timewait, while the remote side still thinks that it's connected. Sean, If you don't mind (also related to the patch you have sent Eric of randomizing the initial local cm id) to get into this deeper, can we do here a quick code review of the REQ matching logic? I wrote what i understand below. > static struct cm_id_private * cm_match_req(struct cm_work *work, > + struct cm_id_private *cm_id_priv) > +{ > + struct cm_id_private *listen_cm_id_priv, *cur_cm_id_priv; > + struct cm_timewait_info *timewait_info; > + struct cm_req_msg *req_msg; > + unsigned long flags; > + > + req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; > + > + /* Check for duplicate REQ and stale connections. 
*/ > + spin_lock_irqsave(&cm.lock, flags); > + timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); > + if (!timewait_info) > + timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); This if() holds when entry is present in remote_id_table OR entry is present in remote_qpn_table > + if (timewait_info) { > + cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, > + timewait_info->work.remote_id); > + spin_unlock_irqrestore(&cm.lock, flags); > + if (cur_cm_id_priv) { > + cm_dup_req_handler(work, cur_cm_id_priv); > + cm_deref_id(cur_cm_id_priv); entry exists in local_id_table, looking on dup_req_handler() i see it sends REP when the id is in "MRA sent" and sends a STALE_CONN REJ when the id is in timewait state, else it does nothing. > + } else > + cm_issue_rej(work->port, work->mad_recv_wc, > + IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, > + NULL, 0); what is this case? there is no entry but there is remote or entries??? > + goto error; > + } Or. From zhushisongzhu at yahoo.com Thu Aug 17 02:00:31 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Thu, 17 Aug 2006 02:00:31 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817074442.GA2630@mellanox.co.il> Message-ID: <20060817090031.39435.qmail@web36909.mail.mud.yahoo.com> I see many lines scrolling the screen quickly. there are many lines as below: sdp_sock(3129:0):sdp_accept state 10 expected 10 *err -22 sdp_accept:ib_req_notify_cq sdp_accept:status -22 sk 0000010034298100 newsk 00000xxxxx(changeable) sdp_accept:state 10 exptected 10 *err -22 sdp_accept:error -11 sdp_accept:status -11 sk 000001003429800 new sk 00000000000000 zhu --- "Michael S. Tsirkin" wrote: > > > Quoting r. zhu shi song > : > > > > (2) memory consumed by SDP connection is ok > > > > But ab always only complete 256 requests > > > > successfully. Results are showed as > following: > > > > # SIMPLE_LIBSDP=1 > > > > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so > ab -c 300 > > > > -n 300 -X 193.12.10.14:3129 > > > > > > > > > > http://http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > > > > # This is ApacheBench, Version 2.0.41-dev > > > > <$Revision: 1.141 $> apache-2.0 > > > > Copyright (c) 1996 Adam Twiss, Zeus Technology > > > Ltd, > > > > http://www.zeustech.net/ > > > > Copyright (c) 1998-2002 The Apache Software > > > > Foundation, http://www.apache.org/ > > > > > > > > Benchmarking www.sse.com.cn [through > > > > 193.12.10.14:3129] (be patient) > > > > Completed 100 requests > > > > Completed 200 requests > > > > apr_recv: Connection refused (111) > > > > Total of 256 requests completed > > > > > > Hmm. This one looks quite weird. Could you > please > > > enable debug > > > (run with debug_level=1) on the server side, and > > > post the messsages > > > you get in dmesg? > > > > Quoting zhu shi song : > > > > is "debug_level=1" for squid or for sdp? > > > > zhu > > That's for SDP. > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Thu Aug 17 02:10:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Aug 2006 12:10:12 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817090031.39435.qmail@web36909.mail.mud.yahoo.com> References: <20060817090031.39435.qmail@web36909.mail.mud.yahoo.com> Message-ID: <20060817091012.GE2630@mellanox.co.il> Quoting r. zhu shi song : > --- "Michael S. 
Tsirkin" wrote: > > > > > Quoting r. zhu shi song > > : > > > > > (2) memory consumed by SDP connection is ok > > > > > But ab always only complete 256 requests > > > > > successfully. Results are showed as > > following: > > > > > # SIMPLE_LIBSDP=1 > > > > > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so > > ab -c 300 > > > > > -n 300 -X 193.12.10.14:3129 > > > > > > > > > > > > > > > http://http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > > > > > # This is ApacheBench, Version 2.0.41-dev > > > > > <$Revision: 1.141 $> apache-2.0 > > > > > Copyright (c) 1996 Adam Twiss, Zeus Technology > > > > Ltd, > > > > > http://www.zeustech.net/ > > > > > Copyright (c) 1998-2002 The Apache Software > > > > > Foundation, http://www.apache.org/ > > > > > > > > > > Benchmarking www.sse.com.cn [through > > > > > 193.12.10.14:3129] (be patient) > > > > > Completed 100 requests > > > > > Completed 200 requests > > > > > apr_recv: Connection refused (111) > > > > > Total of 256 requests completed > > > > > > > > Hmm. This one looks quite weird. Could you > > please > > > > enable debug > > > > (run with debug_level=1) on the server side, and > > > > post the messsages > > > > you get in dmesg? > > > > > > Quoting zhu shi song : > > > > > > is "debug_level=1" for squid or for sdp? > > > > > > zhu > > > > That's for SDP. > > > I see many lines scrolling the screen quickly. there > are many lines as below: > > sdp_sock(3129:0):sdp_accept state 10 expected 10 *err > -22 > sdp_accept:ib_req_notify_cq > sdp_accept:status -22 sk 0000010034298100 newsk > 00000xxxxx(changeable) > sdp_accept:state 10 exptected 10 *err -22 > sdp_accept:error -11 > sdp_accept:status -11 sk 000001003429800 new sk > 00000000000000 No, these are benign. This is server timing out waiting for connection request. I'm interested in sdp_cma_ messages - conn/disconnect requests being handled. -- MST From zhushisongzhu at yahoo.com Thu Aug 17 02:15:54 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Thu, 17 Aug 2006 02:15:54 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817091012.GE2630@mellanox.co.il> Message-ID: <20060817091554.88725.qmail@web36911.mail.mud.yahoo.com> After testing several times, the client kernel crashed with the same message on the screen as what I sent to you. zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > --- "Michael S. Tsirkin" > wrote: > > > > > > > Quoting r. zhu shi song > > > : > > > > > > (2) memory consumed by SDP connection is > ok > > > > > > But ab always only complete 256 > requests > > > > > > successfully. Results are showed as > > > following: > > > > > > # SIMPLE_LIBSDP=1 > > > > > > > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so > > > ab -c 300 > > > > > > -n 300 -X 193.12.10.14:3129 > > > > > > > > > > > > > > > > > > > > > http://http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > > > > > > # This is ApacheBench, Version > 2.0.41-dev > > > > > > <$Revision: 1.141 $> apache-2.0 > > > > > > Copyright (c) 1996 Adam Twiss, Zeus > Technology > > > > > Ltd, > > > > > > http://www.zeustech.net/ > > > > > > Copyright (c) 1998-2002 The Apache > Software > > > > > > Foundation, http://www.apache.org/ > > > > > > > > > > > > Benchmarking www.sse.com.cn [through > > > > > > 193.12.10.14:3129] (be patient) > > > > > > Completed 100 requests > > > > > > Completed 200 requests > > > > > > apr_recv: Connection refused (111) > > > > > > Total of 256 requests completed > > > > > > > > > > Hmm. This one looks quite weird. 
Could you > > > please > > > > > enable debug > > > > > (run with debug_level=1) on the server side, > and > > > > > post the messsages > > > > > you get in dmesg? > > > > > > > > Quoting zhu shi song > : > > > > > > > > is "debug_level=1" for squid or for sdp? > > > > > > > > zhu > > > > > > That's for SDP. > > > > > > I see many lines scrolling the screen quickly. > there > > are many lines as below: > > > > sdp_sock(3129:0):sdp_accept state 10 expected 10 > *err > > -22 > > sdp_accept:ib_req_notify_cq > > sdp_accept:status -22 sk 0000010034298100 newsk > > 00000xxxxx(changeable) > > sdp_accept:state 10 exptected 10 *err -22 > > sdp_accept:error -11 > > sdp_accept:status -11 sk 000001003429800 new sk > > 00000000000000 > > No, these are benign. This is server timing out > waiting > for connection request. > > I'm interested in sdp_cma_ messages - > conn/disconnect requests > being handled. > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From zhushisongzhu at yahoo.com Thu Aug 17 02:39:03 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Thu, 17 Aug 2006 02:39:03 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817091012.GE2630@mellanox.co.il> Message-ID: <20060817093903.70236.qmail@web36910.mail.mud.yahoo.com> (1) the client kernel often crashed. So it cost me much time on rebooting server. (2) the debug message scrolling too fast. I can't see some sdp_cma_* messages. I see many sdp_close messages. Can I just output sdp_cmd_* messages? zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > --- "Michael S. Tsirkin" > wrote: > > > > > > > Quoting r. zhu shi song > > > : > > > > > > (2) memory consumed by SDP connection is > ok > > > > > > But ab always only complete 256 > requests > > > > > > successfully. Results are showed as > > > following: > > > > > > # SIMPLE_LIBSDP=1 > > > > > > > LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so > > > ab -c 300 > > > > > > -n 300 -X 193.12.10.14:3129 > > > > > > > > > > > > > > > > > > > > > http://http://www.sse.com.cn/sseportal/ps/zhs/home.shtml > > > > > > # This is ApacheBench, Version > 2.0.41-dev > > > > > > <$Revision: 1.141 $> apache-2.0 > > > > > > Copyright (c) 1996 Adam Twiss, Zeus > Technology > > > > > Ltd, > > > > > > http://www.zeustech.net/ > > > > > > Copyright (c) 1998-2002 The Apache > Software > > > > > > Foundation, http://www.apache.org/ > > > > > > > > > > > > Benchmarking www.sse.com.cn [through > > > > > > 193.12.10.14:3129] (be patient) > > > > > > Completed 100 requests > > > > > > Completed 200 requests > > > > > > apr_recv: Connection refused (111) > > > > > > Total of 256 requests completed > > > > > > > > > > Hmm. This one looks quite weird. Could you > > > please > > > > > enable debug > > > > > (run with debug_level=1) on the server side, > and > > > > > post the messsages > > > > > you get in dmesg? > > > > > > > > Quoting zhu shi song > : > > > > > > > > is "debug_level=1" for squid or for sdp? > > > > > > > > zhu > > > > > > That's for SDP. > > > > > > I see many lines scrolling the screen quickly. 
> there > > are many lines as below: > > > > sdp_sock(3129:0):sdp_accept state 10 expected 10 > *err > > -22 > > sdp_accept:ib_req_notify_cq > > sdp_accept:status -22 sk 0000010034298100 newsk > > 00000xxxxx(changeable) > > sdp_accept:state 10 exptected 10 *err -22 > > sdp_accept:error -11 > > sdp_accept:status -11 sk 000001003429800 new sk > > 00000000000000 > > No, these are benign. This is server timing out > waiting > for connection request. > > I'm interested in sdp_cma_ messages - > conn/disconnect requests > being handled. > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From ogerlitz at voltaire.com Thu Aug 17 02:55:17 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 17 Aug 2006 12:55:17 +0300 (IDT) Subject: [openib-general] void-ness of struct netdevice::set_multicast_list() problematic with IPoIB Message-ID: Roland, If i understand correct someone can attempt (*) setting IFF_ALLMULTI or IFF_PROMISC for an IPoIB device, and there is no very to return -EINVAL (or whatever) on that. This is since they (eg net/ipv4/ipmr.c and friends) just set the flags and later call the device set_multicast_list() function, which is void... I don't think its a big deal, maybe we can just print an unsupported warning from ipoib_set_mcast_list() if either of the flags is set. All of this is based on my understanding that IB does not support those flags, please correct me if i am wrong here... Or. (*) eg by calling packet(7) with PACKET_ADD_MEMBERSHIP and setting PACKET_MR_PROMISC or PACKET_MR_ALLMULTI From mst at mellanox.co.il Thu Aug 17 04:39:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Aug 2006 14:39:21 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817055832.73826.qmail@web36909.mail.mud.yahoo.com> References: <20060817055832.73826.qmail@web36909.mail.mud.yahoo.com> Message-ID: <20060817113921.GH2630@mellanox.co.il> Quoting r. zhu shi song : > (3) one time linux kernel on the client crashed. I > copy the output from the screen. > Process sdp (pid:4059, threadinfo 0000010036384000 > task 000001003ea10030) > Call > Trace:{:ib_sdp:sdp_destroy_workto} > {:ib_sdp:sdp_destroy_qp+77} > {:ib_sdp:sdp_destruct+279}{sk_free+28} > {worker_thread+419}{default_wake_function+0} > {default_wake_function+0}{keventd_create_kthread+0} > {worker_thread+0}{keventd_create_kthread+0} > {kthread+200}{child_rip+8} > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 31 ff 45 > 31 ed 4c 89 > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > CR2:0000000000000004 > <0>kernel panic-not syncing:Oops > > zhu Hmm, the stack dump does not match my sources. Is this OFED rc1? Could you send me the sdp_main.o and sdp_main.c files from your system please? -- MST From thomas.bub at thomson.net Thu Aug 17 04:46:00 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Thu, 17 Aug 2006 13:46:00 +0200 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM Message-ID: Sean, I'm getting a little puzzled. For me it seems as if we are moving in the wrong direction. I don't have a RDMA CM on the Gen1 counterpart that my gen2 application is talking too. 
Or does the RDMA CM run also connect with a Gen1 libcm counterpart? If no I have to get it out of the cmpost example. If yes you have to explain me what the two different versions: rdma_cm.h and rdma_cma.h are good for. The cmpost.c was using rdma_cma.h up to now but the missing defines are located in rdma_cm.h Using rdma_cm.h does open another can of worms. Thomas -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Wednesday, August 16, 2006 6:33 PM To: Bub Thomas Cc: Sean Hefty; openib-general at openib.org; Erez Cohen Subject: Re: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM Bub Thomas wrote: > cmpost.c:65: error: field `path_rec' has incomplete type > cmpost.c: In function `int modify_to_rtr(cmtest_node*)': > cmpost.c:130: error: invalid conversion from `int' to `ibv_qp_attr_mask' > cmpost.c:130: error: initializing argument 3 of `int > ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' > cmpost.c:142: error: invalid conversion from `int' to `ibv_qp_attr_mask' > cmpost.c:142: error: initializing argument 3 of `int > ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' > cmpost.c: In function `int modify_to_rts(cmtest_node*)': > cmpost.c:161: error: invalid conversion from `int' to `ibv_qp_attr_mask' > cmpost.c:161: error: initializing argument 3 of `int > ibv_modify_qp(ibv_qp*, ibv_qp_attr*, ibv_qp_attr_mask)' > cmpost.c: In function `void cm_handler(ib_cm_id*, ib_cm_event*)': > cmpost.c:267: error: invalid conversion from `void*' to `cmtest_node*' > cmpost.c: In function `int create_nodes()': > cmpost.c:353: error: invalid conversion from `void*' to `cmtest_node*' > cmpost.c: In function `int query_for_path(char*)': > cmpost.c:631: error: `RDMA_PS_TCP' undeclared (first use this function) > cmpost.c:631: error: (Each undeclared identifier is reported only once > for each function it appears in.) > cmpost.c:657: error: 'struct cmtest' has no member named 'path_rec' > cmpost.c:657: error: invalid use of undefined type `struct > ibv_sa_path_rec' > /vob/pkgs/sys/linux-include/rdma/rdma_cma.h:84: error: forward > declaration of `struct ibv_sa_path_rec' > cmpost.c: In function `void run_client(char*)': > cmpost.c:678: error: 'struct cmtest' has no member named 'path_rec' > *** Error code 1 > clearmake: Error: Build script failed for "cmpost.o" Some of these look like missing include files. Are libibverbs and librdmacm installed? - Sean From kliteyn at mellanox.co.il Thu Aug 17 06:28:49 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 17 Aug 2006 16:28:49 +0300 Subject: [openib-general] [PATCHv2] osm: OSM crash TRIVIAL bug fix In-Reply-To: <1155560727.9532.39151.camel@hal.voltaire.com> References: <1155560727.9532.39151.camel@hal.voltaire.com> Message-ID: <1155821329.13896.37.camel@kliteynik.yok.mtl.com> Hi Hal. > This line wrapped so there is something wrong with your mailer. I'm using a different mailer now, so I hope that it's ok now. > > + m->v = NULL; /* just make sure we do not point tofree'd madw */ > > Also, is this line really needed (and if so why) ? I know you did say > "it cleans up old pointers to retrieved madw" but this shouldn't be > accessed, right ? You're right, it shouldn't be accessed. But generally, it's a good practice to assign a null to any pointer that points to a freed memory, and should not be in use any more. > Also, if this is added here, there are other places where the same > thing should be done ? I just examined this area of code, so this is what I saw. 
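For illustration, a minimal, generic C sketch of the clear-on-free practice described above; the match_slot structure and take_match() function are hypothetical names for this example and are not taken from the OpenSM sources:

#include <stddef.h>
#include <stdint.h>

struct match_slot {
	uint64_t tid;	/* 0 marks the slot as free */
	void *madw;	/* ownership passes to the caller when taken */
};

/* Take the value out of a matched slot and clear both fields, so a
 * later lookup can never hand back a pointer to memory that the
 * caller has already freed. */
static void *take_match(struct match_slot *slot)
{
	void *val = slot->madw;

	slot->tid = 0;
	slot->madw = NULL;
	return val;
}

This is the same idea behind the one-line m->v = NULL change discussed above: once the madw has been handed back, the match table keeps no dangling reference to it.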
-- Regards, Yevgeny Kliteynik Mellanox Technologies LTD Tel: +972-4-909-7200 ext: 394 Fax: +972-4-959-3245 P.O. Box 586 Yokneam 20692 ISRAEL On Mon, 2006-08-14 at 16:05 +0300, Hal Rosenstock wrote: > Hi Yevgeny, > > On Mon, 2006-08-14 at 04:13, Yevgeny Kliteynik wrote: > > Hi Hal. > > > > This patch fixes an OSM crash when working with Cisco's stack. > > Cisco's doesn't follow the same TID convention when generating > > transaction id which in some bad flow revealed this bug in the > > get_madw lookup. > > The bug is in get_madw which does not cleanup old pointers to > > retrieved madw and also does not detect lookup of its reserved > "free" > > entry of key==0. > > > > (This better text replaces my previous patch: > > "OSM crash when working with Cisco's TopSpin stack") > > > > Yevgeny > > Thanks. Good find. > > > Signed-off-by: Yevgeny Kliteynik < kliteyn at mellanox.co.il> > > > > > > Index: osm/libvendor/osm_vendor_ibumad.c > > =================================================================== > > --- osm/libvendor/osm_vendor_ibumad.c (revision 8614) > > +++ osm/libvendor/osm_vendor_ibumad.c (working copy) > > @@ -141,12 +141,20 @@ get_madw(osm_vendor_t *p_vend, ib_net64_ > > ib_net64_t mtid = (*tid & > cl_ntoh64(0x00000000ffffffffllu)); > > osm_madw_t *res; > > > > + /* > > + * Since mtid == 0 is the empty key we should not > > + * waste time looking for it > > + */ > > + if (mtid == 0) > > + return 0; > > + > > cl_spinlock_acquire( &p_vend->match_tbl_lock ); > > for (m = p_vend->mtbl.tbl, e = m + p_vend->mtbl.max; m < e; > > m++) { > > if (m->tid == mtid) { > > m->tid = 0; > > *tid = mtid; > > res = m->v; > > + m->v = NULL; /* just make sure we do not > point > > to free'd madw */ > > This line wrapped so there is something wrong with your mailer. > > Also, is this line really needed (and if so why) ? I know you did say > "it cleans up old pointers to retrieved madw" but this shouldn't be > accessed, right ? Also, if this is added here, there are other places > where the same thing should be done ? > > > > cl_spinlock_release( &p_vend->match_tbl_lock > > ); > > return res; > > } > > > > Applied to trunk and 1.1 with the exception noted above. > > -- Hal > From halr at voltaire.com Thu Aug 17 06:35:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Aug 2006 09:35:31 -0400 Subject: [openib-general] [PATCHv2] osm: OSM crash TRIVIAL bug fix In-Reply-To: <1155821329.13896.37.camel@kliteynik.yok.mtl.com> References: <1155560727.9532.39151.camel@hal.voltaire.com> <1155821329.13896.37.camel@kliteynik.yok.mtl.com> Message-ID: <1155821729.9855.38600.camel@hal.voltaire.com> Hi Yevgeny, On Thu, 2006-08-17 at 09:28, Yevgeny Kliteynik wrote: > Hi Hal. > > > This line wrapped so there is something wrong with your mailer. > > I'm using a different mailer now, so I hope that it's ok now. Guess we'll see with your next patch with a long line... > > > + m->v = NULL; /* just make sure we do not point tofree'd madw */ > > > > Also, is this line really needed (and if so why) ? I know you did say > > "it cleans up old pointers to retrieved madw" but this shouldn't be > > accessed, right ? > > You're right, it shouldn't be accessed. Does the fix checked in work as is now ? Did you reverify ? > But generally, it's a good practice to assign a null to > any pointer that points to a freed memory, and should not > be in use any more. It's also good practice that when an issue is found in one place to look for other occurences of the same issue. 
I'm also not sure this is the general approach that OpenSM takes. -- Hal > > Also, if this is added here, there are other places where the same > > thing should be done ? > > I just examined this area of code, so this is what I saw. From eli at mellanox.co.il Thu Aug 17 07:20:46 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 17 Aug 2006 17:20:46 +0300 Subject: [openib-general] [PATCH] huge pages support Message-ID: <1155824446.11238.8.camel@localhost> The following patch adds support for registering huge pages memory regions while passing to the low level driver the correct page size thus preserving mtt entries and kernel memory used to store page descriptors memory. Signed-off-by: Eli Cohen Index: last_stable/drivers/infiniband/core/uverbs_mem.c =================================================================== --- last_stable.orig/drivers/infiniband/core/uverbs_mem.c +++ last_stable/drivers/infiniband/core/uverbs_mem.c @@ -35,6 +35,7 @@ */ #include +#include #include #include "uverbs.h" @@ -49,7 +50,7 @@ struct ib_umem_account_work { static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) { struct ib_umem_chunk *chunk, *tmp; - int i; + int i, j; list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { dma_unmap_sg(dev->dma_device, chunk->page_list, @@ -57,13 +58,43 @@ static void __ib_umem_release(struct ib_ for (i = 0; i < chunk->nents; ++i) { if (umem->writable && dirty) set_page_dirty_lock(chunk->page_list[i].page); - put_page(chunk->page_list[i].page); + for (j = 0; j < umem->page_size / PAGE_SIZE; ++j) + put_page(chunk->page_list[i].page); } kfree(chunk); } } +static int get_page_shift(void *addr, size_t size) +{ + struct vm_area_struct *vma; + unsigned long va = (unsigned long)addr; + int hpage = -1; + +next: + vma = find_vma(current->mm, va); + if (!vma) + return -ENOMEM; + + if (va < vma->vm_start) + return -ENOMEM; + + if (hpage == -1) + hpage = is_vm_hugetlb_page(vma); + else + if (hpage != is_vm_hugetlb_page(vma)) + return -ENOMEM; + + if ((va + size) > vma->vm_end) { + size -= (vma->vm_end - va); + va = vma->vm_end; + goto next; + } + + return hpage ? 
HPAGE_SHIFT : PAGE_SHIFT; +} + int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, void *addr, size_t size, int write) { @@ -73,9 +104,12 @@ int ib_umem_get(struct ib_device *dev, s unsigned long lock_limit; unsigned long cur_base; unsigned long npages; + int page_shift, shift_diff, nreg_pages, tmp; + int max_page_chunk; + unsigned long page_mask, page_size; int ret = 0; int off; - int i; + int i, j, nents; if (!can_do_mlock()) return -EPERM; @@ -84,17 +118,36 @@ int ib_umem_get(struct ib_device *dev, s if (!page_list) return -ENOMEM; + down_write(¤t->mm->mmap_sem); + + page_shift = get_page_shift(addr, size); + if (IS_ERR_VALUE(page_shift)) { + ret = page_shift; + goto exit_up; + } + + /* make sure enough pointers get into PAGE_SIZE to + contain at least one huge page */ + if (!(PAGE_SIZE / (sizeof (struct page *) << shift_diff))) { + ret = -ENOMEM; + goto exit_up; + } + + page_size = 1 << page_shift; + page_mask = ~(page_size - 1); + shift_diff = page_shift - PAGE_SHIFT; + mem->user_base = (unsigned long) addr; mem->length = size; - mem->offset = (unsigned long) addr & ~PAGE_MASK; - mem->page_size = PAGE_SIZE; + mem->offset = (unsigned long) addr & ~page_mask; + mem->page_size = page_size; mem->writable = write; INIT_LIST_HEAD(&mem->chunk_list); - npages = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT; - down_write(¤t->mm->mmap_sem); + nreg_pages = ALIGN(size + mem->offset, page_size) >> page_shift; + npages = nreg_pages << shift_diff; locked = npages + current->mm->locked_vm; lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; @@ -104,14 +157,15 @@ int ib_umem_get(struct ib_device *dev, s goto out; } - cur_base = (unsigned long) addr & PAGE_MASK; + cur_base = (unsigned long) addr & page_mask; while (npages) { + tmp = min_t(int, npages, + (PAGE_SIZE / (sizeof (struct page *) << shift_diff)) + << shift_diff); ret = get_user_pages(current, current->mm, cur_base, - min_t(int, npages, - PAGE_SIZE / sizeof (struct page *)), + tmp, 1, !write, page_list, NULL); - if (ret < 0) goto out; @@ -121,19 +175,27 @@ int ib_umem_get(struct ib_device *dev, s off = 0; while (ret) { - chunk = kmalloc(sizeof *chunk + sizeof (struct scatterlist) * - min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK), - GFP_KERNEL); + if (!shift_diff) { + nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); + tmp = sizeof *chunk + sizeof (struct scatterlist) * nents; + } + else { + nents = ret >> shift_diff; + tmp = sizeof *chunk + + sizeof (struct scatterlist) * nents; + } + + chunk = kmalloc(tmp, GFP_KERNEL); if (!chunk) { ret = -ENOMEM; goto out; } + chunk->nents = nents; - chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); for (i = 0; i < chunk->nents; ++i) { - chunk->page_list[i].page = page_list[i + off]; + chunk->page_list[i].page = page_list[(i << shift_diff) + off]; chunk->page_list[i].offset = 0; - chunk->page_list[i].length = PAGE_SIZE; + chunk->page_list[i].length = page_size; } chunk->nmap = dma_map_sg(dev->dma_device, @@ -142,15 +204,17 @@ int ib_umem_get(struct ib_device *dev, s DMA_BIDIRECTIONAL); if (chunk->nmap <= 0) { for (i = 0; i < chunk->nents; ++i) - put_page(chunk->page_list[i].page); + for (j = 0; j < (1 << shift_diff); ++j) + put_page(chunk->page_list[i].page); + kfree(chunk); ret = -ENOMEM; goto out; } - ret -= chunk->nents; - off += chunk->nents; + ret -= (chunk->nents << shift_diff); + off += (chunk->nents << shift_diff); list_add_tail(&chunk->list, &mem->chunk_list); } @@ -163,6 +227,7 @@ out: else current->mm->locked_vm = locked; +exit_up: up_write(¤t->mm->mmap_sem); 
free_page((unsigned long) page_list); From rkuchimanchi at silverstorm.com Thu Aug 17 07:25:54 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Thu, 17 Aug 2006 19:55:54 +0530 Subject: [openib-general] SRP - order of wait_for_completion() and completion() in srp_remove_one() Message-ID: <44E47C72.4090609@silverstorm.com> Hi Roland, In the following code snippet from srp_remove_one() in ib_srp.c: list_for_each_entry_safe(host, tmp_host, &srp_dev->dev_list, list) { class_device_unregister(&host->class_dev); /* * Wait for the sysfs entry to go away, so that no new * target ports can be created. */ wait_for_completion(&host->released); Is it guaranteed that wait_for_completion() will be called before the srp_class release function srp_release_class_dev() which signals the completion ? I was wondering if it is not possible for the release function to run and signal the completion before we call wait_for_completion(). Regards, Ram From tziporet at mellanox.co.il Thu Aug 17 07:53:36 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 17 Aug 2006 17:53:36 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060816143357.48475.qmail@web36904.mail.mud.yahoo.com> References: <20060816143357.48475.qmail@web36904.mail.mud.yahoo.com> Message-ID: <44E482F0.6010203@mellanox.co.il> zhu shi song wrote: > I have changed SDP_RX_SIZE from 0x40 to 1 and rebuilt > ib_sdp.ko. But kernel always crashed. > zhu > Hi Zhu, Can you send us instructions of the test/application you are running so we can try to reproduce it here too? We also need to know the system & kernel you are using Thanks, Tziporet From rdreier at cisco.com Thu Aug 17 08:23:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 08:23:11 -0700 Subject: [openib-general] SRP - order of wait_for_completion() and completion() in srp_remove_one() In-Reply-To: <44E47C72.4090609@silverstorm.com> (Ramachandra K.'s message of "Thu, 17 Aug 2006 19:55:54 +0530") References: <44E47C72.4090609@silverstorm.com> Message-ID: Ramachandra> I was wondering if it is not possible for the release Ramachandra> function to run and signal the completion before we Ramachandra> call wait_for_completion(). Sure, that would seem to be possible. Why is that a problem? - R. From tziporet at mellanox.co.il Thu Aug 17 08:45:14 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 17 Aug 2006 18:45:14 +0300 Subject: [openib-general] OFED 1.1-rc2 is delayed for next Monday (Aug-21) Message-ID: <44E48F0A.8050701@mellanox.co.il> Tziporet From rdreier at cisco.com Thu Aug 17 09:06:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 09:06:36 -0700 Subject: [openib-general] [PATCH] huge pages support In-Reply-To: <1155824446.11238.8.camel@localhost> (Eli Cohen's message of "Thu, 17 Aug 2006 17:20:46 +0300") References: <1155824446.11238.8.camel@localhost> Message-ID: My first reaction on reading this is that it looks like we need some more help from mm/ to make this cleaner. > + for (j = 0; j < umem->page_size / PAGE_SIZE; ++j) > + put_page(chunk->page_list[i].page); This just looks really bad... you're assuming that calling put_page() on a different page than the one get_user_pages() gave you is OK. Does that actually work? 
(I don't remember the details of how hugetlb pages work) > + if (hpage == -1) > + hpage = is_vm_hugetlb_page(vma); > + else > + if (hpage != is_vm_hugetlb_page(vma)) > + return -ENOMEM; Users aren't allowed to register regions that span both hugetlb and non-hugetlb pages?? Why not? > + if (!(PAGE_SIZE / (sizeof (struct page *) << shift_diff))) { shift_diff is used here... > + ret = -ENOMEM; > + goto exit_up; > + } > + > + page_size = 1 << page_shift; > + page_mask = ~(page_size - 1); > + shift_diff = page_shift - PAGE_SHIFT; and set down here... don't you get a warning about "is used uninitialized"? Also, if someone tries to register a hugetlb region and a regular page isn't big enough to hold all the pointers, then they just lose?? This would be a real problem if/when GB pages are implemented on amd64 -- there would be no way to register such a region. Basically this patch seems to be trying to undo what follow_hugetlb_pages() does to create normal pages for the pieces of a hugetlb page. It seems like it would be better to have a new function like get_user_pages() that returned its result in a struct scatterlist so that it could give back huge pages directly. - R. From caitlinb at broadcom.com Thu Aug 17 09:11:30 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 17 Aug 2006 09:11:30 -0700 Subject: [openib-general] return error when rdma_listen fails In-Reply-To: <44E3AB58.8070109@ichips.intel.com> Message-ID: <54AD0F12E08D1541B826BE97C98F99F189E3D5@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Tom Tucker wrote: >> I think this makes sense for IB, however, for TCP based transports, >> we should share the port space with TCP. > > My view is that the iWarp transport needs to provide the > mapping from an RDMA_PS_TCP to the actual TCP port space, > RDMA_PS_UDP to UDP, etc. This is a function that should be > part of the transport specific code, and not the general RDMA CM code. > I really don't see any benefit in having each iWARP device "map" from RDMA_PS_TCP to actual TCP. It translates to a TCP port for an IP address that is assigned to this host; the kernel should be the final arbiter of the usage of that TCP port. So centralizing that co-ordination with the host stack is best done at the IWCM. I also believe that the above statement is true for IP addresses used by IPoIB, SDP and when IB connections are established using IP addresses. But the need to co-ordinate IP addresses over an InfiniBand network is admittedly a bit more theoretical. Still, the IB network should not be using IP addresses in any way that conflicts with other uses of IP addresses on other devices, and it would be better if the host stack could actually enforce that. UDP is a totally different issue. To the best of my knowledge all iWARP devices already support an unreliable datagram service that is already fully integrated with the Linux network stack: SOCK_DGRAM sockets and UDP. If there is an actual need to support UDP via a QP/CQ interface when using iWARP then that should be documented, and tradeoffs considered. For example, a software "QP" that merely used SOCK_DGRAMS is extremely easy to use as long as it does not have to share a CQ with reliable connections.
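To make the port space question above concrete, here is a minimal sketch of a consumer asking for a listen in the RDMA_PS_TCP port space, written against the userspace librdmacm (rdma_cma.h) mentioned elsewhere in this archive; the port number is arbitrary, error handling is trimmed, and the exact rdma_create_id() signature is an assumption rather than something taken from the tree under discussion:

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

int main(void)
{
	struct rdma_event_channel *ch;
	struct rdma_cm_id *id;
	struct sockaddr_in addr;

	ch = rdma_create_event_channel();
	if (!ch)
		return 1;

	/* RDMA_PS_TCP asks the RDMA CM for the TCP port space */
	if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
		return 1;

	memset(&addr, 0, sizeof addr);
	addr.sin_family = AF_INET;
	addr.sin_port = htons(18515);	/* arbitrary example port */

	if (rdma_bind_addr(id, (struct sockaddr *) &addr) || rdma_listen(id, 1))
		fprintf(stderr, "bind/listen failed\n");

	rdma_destroy_id(id);
	rdma_destroy_event_channel(ch);
	return 0;
}

Whether the port claimed here is also reserved against the normal kernel TCP port space is exactly the point being debated above; the sketch only shows where the consumer states its intent.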
From mshefty at ichips.intel.com Thu Aug 17 09:50:10 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 17 Aug 2006 09:50:10 -0700 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM In-Reply-To: References: Message-ID: <44E49E42.5090807@ichips.intel.com> Bub Thomas wrote: > I'm getting a little puzzled. > For me it seems as if we are moving in the wrong direction. > I don't have a RDMA CM on the Gen1 counterpart that my gen2 application > is talking too. The RDMA CM is only used on the local (active or client) side to obtain a path record, which is needed by the libibcm. Using the librdmacm allows cmpost to get a path record given only the remote IP address or host name. The connection is established using the IB CM through libibcm. > If yes you have to explain me what the two different versions: > rdma_cm.h > and > rdma_cma.h rdma_cm.h defines the kernel interface to the RDMA CM. rdma_cma.h defines the userspace interface. > The cmpost.c was using rdma_cma.h up to now but the missing defines are > located in rdma_cm.h Can you verify that you have the latest version of rdma_cma.h? - Sean From eli at mellanox.co.il Thu Aug 17 10:01:49 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 17 Aug 2006 20:01:49 +0300 Subject: [openib-general] [PATCH] huge pages support In-Reply-To: References: <1155824446.11238.8.camel@localhost> Message-ID: <1155834110.13477.10.camel@localhost> On Thu, 2006-08-17 at 09:06 -0700, Roland Dreier wrote: > My first reaction on reading this is that it looks like we need some > more help from mm/ to make this cleaner. > > > + for (j = 0; j < umem->page_size / PAGE_SIZE; ++j) > > + put_page(chunk->page_list[i].page); > > This just looks really bad... you're assuming that calling put_page() > on a different page than the one get_user_pages() gave you is OK. > Does that actually work? (I don't remember the details of how hugetlb > pages work) A huge page is a compound page that is comprised of a number of pages and the first one represents all the others so it is safe to do this. > > > + if (hpage == -1) > > + hpage = is_vm_hugetlb_page(vma); > > + else > > + if (hpage != is_vm_hugetlb_page(vma)) > > + return -ENOMEM; > > Users aren't allowed to register regions that span both hugetlb and > non-hugetlb pages?? Why not? It is not that easy anyway to obtain huge pages and even more unlikely that a user will want to register a region with different page sizes so I just don't allow that. But I think these cases can be allowed by using PAGE_SIZE registration. > > > + if (!(PAGE_SIZE / (sizeof (struct page *) << shift_diff))) { > > shift_diff is used here... > > > + ret = -ENOMEM; > > + goto exit_up; > > + } > > + > > + page_size = 1 << page_shift; > > + page_mask = ~(page_size - 1); > > + shift_diff = page_shift - PAGE_SHIFT; > > and set down here... don't you get a warning about "is used uninitialized"? Oops. I didn't see any warning. This if should moved a few lines down. > > Also, if someone tries to register a hugetlb region and a regular page > isn't big enough to hold all the pointers, then they just lose?? This > would be a real problem if/when GB pages are implemented on amd64 -- > there would be no way to register such a region. > > > Basically this patch seems to be trying to undo what follow_hugetlb_pages() > does to create normal pages for the pieces of a hugetlb page. 
It > seems like it would be better to have a new function like get_user_pages() > that returned its result in a struct scatterlist so that it could give > back huge pages directly. > I agree about that. From mshefty at ichips.intel.com Thu Aug 17 10:18:20 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 17 Aug 2006 10:18:20 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E42670.5040308@voltaire.com> References: <44E3914C.4010903@ichips.intel.com> <44E3AD24.4070200@ichips.intel.com> <44E42670.5040308@voltaire.com> Message-ID: <44E4A4DC.2040200@ichips.intel.com> Or Gerlitz wrote: > If you don't mind (also related to the patch you have sent Eric of > randomizing the initial local cm id) to get into this deeper, can we do There's an issue trying to randomize the initial local CM ID. The way the IDR works, if you start at a high value, then the IDR size grows up to the size of the first value, which can result in memory allocation failures. In my tests, using a random value would frequently result in connection failures because of low memory. My conclusion is that the local ID assignment in the IB CM needs to be reworked, or we will run into a condition that after X number of connections have been established, we will be unable to create any new connections, even if the previous connections have all been destroyed. >> static struct cm_id_private * cm_match_req(struct cm_work *work, >> + struct cm_id_private >> *cm_id_priv) >> +{ >> + struct cm_id_private *listen_cm_id_priv, *cur_cm_id_priv; >> + struct cm_timewait_info *timewait_info; >> + struct cm_req_msg *req_msg; >> + unsigned long flags; >> + >> + req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; >> + >> + /* Check for duplicate REQ and stale connections. */ >> + spin_lock_irqsave(&cm.lock, flags); >> + timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); >> + if (!timewait_info) >> + timewait_info = >> cm_insert_remote_qpn(cm_id_priv->timewait_info); > > > This if() holds when entry is present in > remote_id_table OR entry is present in > remote_qpn_table correct > >> + if (timewait_info) { >> + cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, >> + >> timewait_info->work.remote_id); > > > + spin_unlock_irqrestore(&cm.lock, flags); > >> + if (cur_cm_id_priv) { >> + cm_dup_req_handler(work, cur_cm_id_priv); >> + cm_deref_id(cur_cm_id_priv); > > > entry exists in local_id_table, looking on > dup_req_handler() i see it sends REP when the id is in "MRA sent" and > sends a STALE_CONN REJ when the id is in timewait state, else it does > nothing. It sends an MRA if in the MRA sent state, or a reject as indicated. >> + } else >> + cm_issue_rej(work->port, work->mad_recv_wc, >> + IB_CM_REJ_STALE_CONN, >> CM_MSG_RESPONSE_REQ, >> + NULL, 0); > > > what is this case? there is no entry but there is > remote or entries??? If we get here, this means that the REQ was a new REQ and not a duplicate, but the remote_id or remote_qpn is already in use. We need to reject the new REQ as containing stale data. - Sean From halr at voltaire.com Thu Aug 17 10:30:44 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Aug 2006 13:30:44 -0400 Subject: [openib-general] vendor_id field in device attributes In-Reply-To: <44E3CF7E.20405@cse.ohio-state.edu> References: <44E3CF7E.20405@cse.ohio-state.edu> Message-ID: <1155835843.9855.45264.camel@hal.voltaire.com> On Wed, 2006-08-16 at 22:07, Sayantan Sur wrote: > Hi, > > I have a quick question. 
If I use ibv_query_device() to find out the IB > device properties, does the `vendor_id' field correspond to a unique HCA > vendor? For example, I get the value 713 for Mellanox HCAs. Can I expect > this to remain the same across various Gen2 installations? VendorID is IEEE assigned per vendor. A vendor could obtain more than 1 ID from the IEEE if it uses up its assigned space (for GUIDs/MACs). Some vendors do have more than one VendorID (aka OUI). -- Hal > Thanks, > Sayantan. From ttelford at linuxnetworx.com Thu Aug 17 10:56:44 2006 From: ttelford at linuxnetworx.com (Troy Telford) Date: Thu, 17 Aug 2006 11:56:44 -0600 Subject: [openib-general] OFED 1.1-rc2 is delayed for next Monday (Aug-21) In-Reply-To: <44E48F0A.8050701@mellanox.co.il> References: <44E48F0A.8050701@mellanox.co.il> Message-ID: Quick Question: https://docs.mellanox.com/dm/ibg2/OFED_release_notes_Mellanox.txt states that "SLES 9 SP3. is planned for OFED rev 1.1." Is this still something I can look forward to, or has it been pushed back? Thanks, -- Troy Telford From rep.nop at aon.at Thu Aug 17 11:27:13 2006 From: rep.nop at aon.at (Bernhard Fischer) Date: Thu, 17 Aug 2006 20:27:13 +0200 Subject: [openib-general] [patch] libsdp typo in config_parser Message-ID: <20060817182713.GA4744@aon.at> Hi, The attached trivial patch fixes a typo in the debugging output of libsdp's config parser. Please apply. Signed-off-by: Bernhard Fischer -------------- next part -------------- Index: libsdp/src/config_parser.c =================================================================== --- libsdp/src/config_parser.c (revision 9003) +++ libsdp/src/config_parser.c (working copy) @@ -198,7 +198,7 @@ extern int __sdp_min_level; /* dump the current state in readable format */ static void __sdp_dump_config_state() { char buf[1024]; - sprintf(buf, "CONIFG: use %s %s %s", + sprintf(buf, "CONFIG: use %s %s %s", __sdp_get_family_str(__sdp_rule.target_family), __sdp_get_role_str( current_role ), __sdp_rule.prog_name_expr); Index: libsdp/src/config_parser.y =================================================================== --- libsdp/src/config_parser.y (revision 9003) +++ libsdp/src/config_parser.y (working copy) @@ -143,7 +143,7 @@ extern int __sdp_min_level; /* dump the current state in readable format */ static void __sdp_dump_config_state() { char buf[1024]; - sprintf(buf, "CONIFG: use %s %s %s", + sprintf(buf, "CONFIG: use %s %s %s", __sdp_get_family_str(__sdp_rule.target_family), __sdp_get_role_str( current_role ), __sdp_rule.prog_name_expr); From rolandd at cisco.com Thu Aug 17 13:09:27 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:09:27 -0700 Subject: [openib-general] [PATCH 00/16] IB/ehca: introduction Message-ID: <2006817139.43eVtRoa2IK8yOPl@cisco.com> Here's a series of patches (split up rather arbitrarily to avoid too-big emails) which adds a driver for the IBM eHCA InfiniBand adapter. The driver has been around for a while, and my feeling is that it is good enough to merge, even though it could certainly use some cleaning up. However, my feeling is that we don't need to wait for this driver to be perfect before merging it, and that it would be better for everyone if it gets into mainline (eg coordination with Anton's hcall cleanup becomes simpler). Please review and comment, and do let me know if you disagree with my decision to merge this for 2.6.19. 
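Referring back to the ibv_query_device() question earlier in this archive, a minimal sketch of reading the vendor_id field through libibverbs; device selection and error handling are trimmed, and the comment follows the IEEE OUI explanation given above:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **list = ibv_get_device_list(NULL);
	struct ibv_context *ctx;
	struct ibv_device_attr attr;

	if (!list || !list[0])
		return 1;

	ctx = ibv_open_device(list[0]);	/* first device only, for brevity */
	if (ctx && !ibv_query_device(ctx, &attr))
		/* vendor_id carries an IEEE-assigned OUI, e.g. 0x2c9 (713 decimal)
		 * for Mellanox; a vendor may own more than one OUI */
		printf("vendor_id 0x%x vendor_part_id %u\n",
		       attr.vendor_id, attr.vendor_part_id);

	if (ctx)
		ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}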
(BTW, just to be clear -- I'll collapse this driver into a single git commit with full changelog and Signed-off-by: lines before actually merging it -- the bare patches are just for review) The driver is also available in git for your reviewing pleasure at git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ehca The developers of the driver are cc'ed on this thread and should respond to any comments. Thanks, Roland From rolandd at cisco.com Thu Aug 17 13:09:28 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:09:28 -0700 Subject: [openib-general] [PATCH 02/16] IB/ehca: classes In-Reply-To: <2006817139.pLkgJggYXy2PkqBH@cisco.com> Message-ID: <2006817139.e1epJYk9xVvFdTao@cisco.com> drivers/infiniband/hw/ehca/ehca_classes.h | 343 +++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_classes_pSeries.h | 236 ++++++++++++++ 2 files changed, 579 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h new file mode 100644 index 0000000..1a87bee --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -0,0 +1,343 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Struct definition for eHCA internal structures + * + * Authors: Heiko J Schick + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef __EHCA_CLASSES_H__ +#define __EHCA_CLASSES_H__ + +#include "ehca_classes.h" +#include "ipz_pt_fn.h" + +struct ehca_module; +struct ehca_qp; +struct ehca_cq; +struct ehca_eq; +struct ehca_mr; +struct ehca_mw; +struct ehca_pd; +struct ehca_av; + +#ifdef CONFIG_PPC64 +#include "ehca_classes_pSeries.h" +#endif + +#include +#include + +#include "ehca_irq.h" + +struct ehca_module { + struct list_head shca_list; + spinlock_t shca_lock; + struct timer_list timer; + kmem_cache_t *cache_pd; + kmem_cache_t *cache_cq; + kmem_cache_t *cache_qp; + kmem_cache_t *cache_av; + kmem_cache_t *cache_mr; + kmem_cache_t *cache_mw; +}; + +struct ehca_eq { + u32 length; + struct ipz_queue ipz_queue; + struct ipz_eq_handle ipz_eq_handle; + struct work_struct work; + struct h_galpas galpas; + int is_initialized; + struct ehca_pfeq pf; + spinlock_t spinlock; + struct tasklet_struct interrupt_task; + u32 ist; +}; + +struct ehca_sport { + struct ib_cq *ibcq_aqp1; + struct ib_qp *ibqp_aqp1; + enum ib_rate rate; + enum ib_port_state port_state; +}; + +struct ehca_shca { + struct ib_device ib_device; + struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; + struct ipz_adapter_handle ipz_hca_handle; + struct ehca_sport sport[2]; + struct ehca_eq eq; + struct ehca_eq neq; + struct ehca_mr *maxmr; + struct ehca_pd *pd; + struct h_galpas galpas; +}; + +struct ehca_pd { + struct ib_pd ib_pd; + struct ipz_pd fw_pd; + u32 ownpid; +}; + +struct ehca_qp { + struct ib_qp ib_qp; + u32 qp_type; + struct ipz_queue ipz_squeue; + struct ipz_queue ipz_rqueue; + struct h_galpas galpas; + u32 qkey; + u32 real_qp_num; + u32 token; + spinlock_t spinlock_s; + spinlock_t spinlock_r; + u32 sq_max_inline_data_size; + struct ipz_qp_handle ipz_qp_handle; + struct ehca_pfqp pf; + struct ib_qp_init_attr init_attr; + u64 uspace_squeue; + u64 uspace_rqueue; + u64 uspace_fwh; + struct ehca_cq *send_cq; + struct ehca_cq *recv_cq; + unsigned int sqerr_purgeflag; + struct hlist_node list_entries; +}; + +/* must be power of 2 */ +#define QP_HASHTAB_LEN 8 + +struct ehca_cq { + struct ib_cq ib_cq; + struct ipz_queue ipz_queue; + struct h_galpas galpas; + spinlock_t spinlock; + u32 cq_number; + u32 token; + u32 nr_of_entries; + struct ipz_cq_handle ipz_cq_handle; + struct ehca_pfcq pf; + spinlock_t cb_lock; + u64 uspace_queue; + u64 uspace_fwh; + struct hlist_head qp_hashtab[QP_HASHTAB_LEN]; + struct list_head entry; + u32 nr_callbacks; + spinlock_t task_lock; + u32 ownpid; +}; + +enum ehca_mr_flag { + EHCA_MR_FLAG_FMR = 0x80000000, /* FMR, created with ehca_alloc_fmr */ + EHCA_MR_FLAG_MAXMR = 0x40000000, /* max-MR */ +}; + +struct ehca_mr { + union { + struct ib_mr ib_mr; /* must always be first in ehca_mr */ + struct ib_fmr ib_fmr; /* must always be first in ehca_mr */ + } ib; + spinlock_t mrlock; + + enum ehca_mr_flag flags; + u32 num_pages; /* number of MR pages */ + u32 num_4k; /* number of 4k "page" portions to form MR */ + int acl; /* ACL (stored here for usage in reregister) */ + u64 *start; /* virtual start address (stored here for */ + /* usage in reregister) */ + u64 size; /* size (stored here for usage in reregister) */ + u32 fmr_page_size; /* page size for FMR */ + u32 fmr_max_pages; /* max pages for FMR */ + u32 fmr_max_maps; /* max outstanding maps for FMR */ + u32 fmr_map_cnt; /* map counter for FMR */ + /* fw specific data */ + struct ipz_mrmw_handle ipz_mr_handle; /* MR handle for h-calls */ + struct h_galpas galpas; + /* data for userspace bridge */ + u32 nr_of_pages; + void 
*pagearray; +}; + +struct ehca_mw { + struct ib_mw ib_mw; /* gen2 mw, must always be first in ehca_mw */ + spinlock_t mwlock; + + u8 never_bound; /* indication MW was never bound */ + struct ipz_mrmw_handle ipz_mw_handle; /* MW handle for h-calls */ + struct h_galpas galpas; +}; + +enum ehca_mr_pgi_type { + EHCA_MR_PGI_PHYS = 1, /* type of ehca_reg_phys_mr, + * ehca_rereg_phys_mr, + * ehca_reg_internal_maxmr */ + EHCA_MR_PGI_USER = 2, /* type of ehca_reg_user_mr */ + EHCA_MR_PGI_FMR = 3 /* type of ehca_map_phys_fmr */ +}; + +struct ehca_mr_pginfo { + enum ehca_mr_pgi_type type; + u64 num_pages; + u64 page_cnt; + u64 num_4k; /* number of 4k "page" portions */ + u64 page_4k_cnt; /* counter for 4k "page" portions */ + u64 next_4k; /* next 4k "page" portion in buffer/chunk/listelem */ + + /* type EHCA_MR_PGI_PHYS section */ + int num_phys_buf; + struct ib_phys_buf *phys_buf_array; + u64 next_buf; + + /* type EHCA_MR_PGI_USER section */ + struct ib_umem *region; + struct ib_umem_chunk *next_chunk; + u64 next_nmap; + + /* type EHCA_MR_PGI_FMR section */ + u64 *page_list; + u64 next_listelem; + /* next_4k also used within EHCA_MR_PGI_FMR */ +}; + +/* output parameters for MR/FMR hipz calls */ +struct ehca_mr_hipzout_parms { + struct ipz_mrmw_handle handle; + u32 lkey; + u32 rkey; + u64 len; + u64 vaddr; + u32 acl; +}; + +/* output parameters for MW hipz calls */ +struct ehca_mw_hipzout_parms { + struct ipz_mrmw_handle handle; + u32 rkey; +}; + +struct ehca_av { + struct ib_ah ib_ah; + struct ehca_ud_av av; +}; + +struct ehca_ucontext { + struct ib_ucontext ib_ucontext; +}; + +struct ehca_module *ehca_module_new(void); + +int ehca_module_delete(struct ehca_module *me); + +int ehca_eq_ctor(struct ehca_eq *eq); + +int ehca_eq_dtor(struct ehca_eq *eq); + +struct ehca_shca *ehca_shca_new(void); + +int ehca_shca_delete(struct ehca_shca *me); + +struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor); + +extern spinlock_t ehca_qp_idr_lock; +extern spinlock_t ehca_cq_idr_lock; +extern struct idr ehca_qp_idr; +extern struct idr ehca_cq_idr; + +struct ipzu_queue_resp { + u64 queue; /* points to first queue entry */ + u32 qe_size; /* queue entry size */ + u32 act_nr_of_sg; + u32 queue_length; /* queue length allocated in bytes */ + u32 pagesize; + u32 toggle_state; + u32 dummy; /* padding for 8 byte alignment */ +}; + +struct ehca_create_cq_resp { + u32 cq_number; + u32 token; + struct ipzu_queue_resp ipz_queue; + struct h_galpas galpas; +}; + +struct ehca_create_qp_resp { + u32 qp_num; + u32 token; + u32 qp_type; + u32 qkey; + /* qp_num assigned by ehca: sqp0/1 may have got different numbers */ + u32 real_qp_num; + u32 dummy; /* padding for 8 byte alignment */ + struct ipzu_queue_resp ipz_squeue; + struct ipzu_queue_resp ipz_rqueue; + struct h_galpas galpas; +}; + +struct ehca_alloc_cq_parms { + u32 nr_cqe; + u32 act_nr_of_entries; + u32 act_pages; + struct ipz_eq_handle eq_handle; +}; + +struct ehca_alloc_qp_parms { + int servicetype; + int sigtype; + int daqp_ctrl; + int max_send_sge; + int max_recv_sge; + int ud_av_l_key_ctl; + + u16 act_nr_send_wqes; + u16 act_nr_recv_wqes; + u8 act_nr_recv_sges; + u8 act_nr_send_sges; + + u32 nr_rq_pages; + u32 nr_sq_pages; + + struct ipz_eq_handle ipz_eq_handle; + struct ipz_pd pd; +}; + +int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp); +int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int qp_num); +struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int qp_num); + +#endif diff --git a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h 
b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h new file mode 100644 index 0000000..5665f21 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h @@ -0,0 +1,236 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * pSeries interface definitions + * + * Authors: Waleri Fomin + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef __EHCA_CLASSES_PSERIES_H__ +#define __EHCA_CLASSES_PSERIES_H__ + +#include "hcp_phyp.h" +#include "ipz_pt_fn.h" + + +struct ehca_pfqp { + struct ipz_qpt sqpt; + struct ipz_qpt rqpt; +}; + +struct ehca_pfcq { + struct ipz_qpt qpt; + u32 cqnr; +}; + +struct ehca_pfeq { + struct ipz_qpt qpt; + struct h_galpa galpa; + u32 eqnr; +}; + +struct ipz_adapter_handle { + u64 handle; +}; + +struct ipz_cq_handle { + u64 handle; +}; + +struct ipz_eq_handle { + u64 handle; +}; + +struct ipz_qp_handle { + u64 handle; +}; +struct ipz_mrmw_handle { + u64 handle; +}; + +struct ipz_pd { + u32 value; +}; + +struct hcp_modify_qp_control_block { + u32 qkey; /* 00 */ + u32 rdd; /* reliable datagram domain */ + u32 send_psn; /* 02 */ + u32 receive_psn; /* 03 */ + u32 prim_phys_port; /* 04 */ + u32 alt_phys_port; /* 05 */ + u32 prim_p_key_idx; /* 06 */ + u32 alt_p_key_idx; /* 07 */ + u32 rdma_atomic_ctrl; /* 08 */ + u32 qp_state; /* 09 */ + u32 reserved_10; /* 10 */ + u32 rdma_nr_atomic_resp_res; /* 11 */ + u32 path_migration_state; /* 12 */ + u32 rdma_atomic_outst_dest_qp; /* 13 */ + u32 dest_qp_nr; /* 14 */ + u32 min_rnr_nak_timer_field; /* 15 */ + u32 service_level; /* 16 */ + u32 send_grh_flag; /* 17 */ + u32 retry_count; /* 18 */ + u32 timeout; /* 19 */ + u32 path_mtu; /* 20 */ + u32 max_static_rate; /* 21 */ + u32 dlid; /* 22 */ + u32 rnr_retry_count; /* 23 */ + u32 source_path_bits; /* 24 */ + u32 traffic_class; /* 25 */ + u32 hop_limit; /* 26 */ + u32 source_gid_idx; /* 27 */ + u32 flow_label; /* 28 */ + u32 reserved_29; /* 29 */ + union { /* 30 */ + u64 dw[2]; + u8 byte[16]; + } dest_gid; + u32 service_level_al; /* 34 */ + u32 send_grh_flag_al; /* 35 */ + u32 retry_count_al; /* 36 */ + u32 timeout_al; /* 37 */ + u32 max_static_rate_al; /* 38 */ + u32 dlid_al; /* 39 */ + u32 rnr_retry_count_al; /* 40 */ + u32 source_path_bits_al; /* 41 */ + u32 traffic_class_al; /* 42 */ + u32 hop_limit_al; /* 43 */ + u32 source_gid_idx_al; /* 44 */ + u32 flow_label_al; /* 45 */ + u32 reserved_46; /* 46 */ + u32 reserved_47; /* 47 */ + union { /* 48 */ + u64 dw[2]; + u8 byte[16]; + } dest_gid_al; + u32 max_nr_outst_send_wr; /* 52 */ + u32 max_nr_outst_recv_wr; /* 53 */ + u32 disable_ete_credit_check; /* 54 */ + u32 qp_number; /* 55 */ + u64 send_queue_handle; /* 56 */ + u64 recv_queue_handle; /* 58 */ + u32 actual_nr_sges_in_sq_wqe; /* 60 */ + u32 actual_nr_sges_in_rq_wqe; /* 61 */ + u32 qp_enable; /* 62 */ + u32 curr_srq_limit; /* 63 */ + u64 qp_aff_asyn_ev_log_reg; /* 64 */ + u64 shared_rq_hndl; /* 66 */ + u64 trigg_doorbell_qp_hndl; /* 68 */ + u32 reserved_70_127[58]; /* 70 */ +}; + +#define MQPCB_MASK_QKEY EHCA_BMASK_IBM(0,0) +#define MQPCB_MASK_SEND_PSN EHCA_BMASK_IBM(2,2) +#define MQPCB_MASK_RECEIVE_PSN EHCA_BMASK_IBM(3,3) +#define MQPCB_MASK_PRIM_PHYS_PORT EHCA_BMASK_IBM(4,4) +#define MQPCB_PRIM_PHYS_PORT EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_ALT_PHYS_PORT EHCA_BMASK_IBM(5,5) +#define MQPCB_MASK_PRIM_P_KEY_IDX EHCA_BMASK_IBM(6,6) +#define MQPCB_PRIM_P_KEY_IDX EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_ALT_P_KEY_IDX EHCA_BMASK_IBM(7,7) +#define MQPCB_MASK_RDMA_ATOMIC_CTRL EHCA_BMASK_IBM(8,8) +#define MQPCB_MASK_QP_STATE EHCA_BMASK_IBM(9,9) +#define MQPCB_QP_STATE EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES EHCA_BMASK_IBM(11,11) +#define MQPCB_MASK_PATH_MIGRATION_STATE EHCA_BMASK_IBM(12,12) +#define MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP EHCA_BMASK_IBM(13,13) +#define MQPCB_MASK_DEST_QP_NR EHCA_BMASK_IBM(14,14) +#define MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD EHCA_BMASK_IBM(15,15) 
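/*
 * Illustrative sketch only -- not part of the ehca patch above.  The
 * MQPCB_* definitions use IBM (MSB-first) bit numbering via
 * EHCA_BMASK_IBM(from, to).  One plausible way such a range can be
 * turned into a mask and a field extraction on a 64-bit word, assuming
 * bit 0 is the most significant bit; the driver's real helpers
 * (EHCA_BMASK_IBM / EHCA_BMASK_GET in ehca_tools.h) may encode the
 * range differently, and the helper names below are hypothetical.
 */
static inline u64 ibm_range_mask(unsigned int from, unsigned int to)
{
	return (~0ULL >> from) & (~0ULL << (63 - to));
}

static inline u64 ibm_range_get(u64 reg, unsigned int from, unsigned int to)
{
	return (reg & ibm_range_mask(from, to)) >> (63 - to);
}

/* Example: a version field at IBM bits 32..39 of hw_ver, in the style of
 * EHCA_BMASK_GET(EHCA_HCAAVER, rblock->hw_ver) used further down in this
 * series:
 *	u32 hcaaver = ibm_range_get(rblock->hw_ver, 32, 39);
 */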
+#define MQPCB_MASK_SERVICE_LEVEL EHCA_BMASK_IBM(16,16) +#define MQPCB_MASK_SEND_GRH_FLAG EHCA_BMASK_IBM(17,17) +#define MQPCB_MASK_RETRY_COUNT EHCA_BMASK_IBM(18,18) +#define MQPCB_MASK_TIMEOUT EHCA_BMASK_IBM(19,19) +#define MQPCB_MASK_PATH_MTU EHCA_BMASK_IBM(20,20) +#define MQPCB_PATH_MTU EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_MAX_STATIC_RATE EHCA_BMASK_IBM(21,21) +#define MQPCB_MAX_STATIC_RATE EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_DLID EHCA_BMASK_IBM(22,22) +#define MQPCB_DLID EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_RNR_RETRY_COUNT EHCA_BMASK_IBM(23,23) +#define MQPCB_RNR_RETRY_COUNT EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_SOURCE_PATH_BITS EHCA_BMASK_IBM(24,24) +#define MQPCB_SOURCE_PATH_BITS EHCA_BMASK_IBM(25,31) +#define MQPCB_MASK_TRAFFIC_CLASS EHCA_BMASK_IBM(25,25) +#define MQPCB_TRAFFIC_CLASS EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_HOP_LIMIT EHCA_BMASK_IBM(26,26) +#define MQPCB_HOP_LIMIT EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_SOURCE_GID_IDX EHCA_BMASK_IBM(27,27) +#define MQPCB_SOURCE_GID_IDX EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_FLOW_LABEL EHCA_BMASK_IBM(28,28) +#define MQPCB_FLOW_LABEL EHCA_BMASK_IBM(12,31) +#define MQPCB_MASK_DEST_GID EHCA_BMASK_IBM(30,30) +#define MQPCB_MASK_SERVICE_LEVEL_AL EHCA_BMASK_IBM(31,31) +#define MQPCB_SERVICE_LEVEL_AL EHCA_BMASK_IBM(28,31) +#define MQPCB_MASK_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(32,32) +#define MQPCB_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(31,31) +#define MQPCB_MASK_RETRY_COUNT_AL EHCA_BMASK_IBM(33,33) +#define MQPCB_RETRY_COUNT_AL EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_TIMEOUT_AL EHCA_BMASK_IBM(34,34) +#define MQPCB_TIMEOUT_AL EHCA_BMASK_IBM(27,31) +#define MQPCB_MASK_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(35,35) +#define MQPCB_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_DLID_AL EHCA_BMASK_IBM(36,36) +#define MQPCB_DLID_AL EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(37,37) +#define MQPCB_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(38,38) +#define MQPCB_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(25,31) +#define MQPCB_MASK_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(39,39) +#define MQPCB_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_HOP_LIMIT_AL EHCA_BMASK_IBM(40,40) +#define MQPCB_HOP_LIMIT_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(41,41) +#define MQPCB_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_FLOW_LABEL_AL EHCA_BMASK_IBM(42,42) +#define MQPCB_FLOW_LABEL_AL EHCA_BMASK_IBM(12,31) +#define MQPCB_MASK_DEST_GID_AL EHCA_BMASK_IBM(44,44) +#define MQPCB_MASK_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(45,45) +#define MQPCB_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(46,46) +#define MQPCB_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(47,47) +#define MQPCB_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(31,31) +#define MQPCB_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define MQPCB_MASK_QP_ENABLE EHCA_BMASK_IBM(48,48) +#define MQPCB_QP_ENABLE EHCA_BMASK_IBM(31,31) +#define MQPCB_MASK_CURR_SQR_LIMIT EHCA_BMASK_IBM(49,49) +#define MQPCB_CURR_SQR_LIMIT EHCA_BMASK_IBM(15,31) +#define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG EHCA_BMASK_IBM(50,50) +#define MQPCB_MASK_SHARED_RQ_HNDL EHCA_BMASK_IBM(51,51) + +#endif /* __EHCA_CLASSES_PSERIES_H__ */ -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:09:28 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:09:28 -0700 Subject: [openib-general] [PATCH 01/16] 
IB/ehca: main In-Reply-To: <2006817139.43eVtRoa2IK8yOPl@cisco.com> Message-ID: <2006817139.pLkgJggYXy2PkqBH@cisco.com> drivers/infiniband/hw/ehca/ehca_main.c | 958 ++++++++++++++++++++++++++++++++ 1 files changed, 958 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c new file mode 100644 index 0000000..229ee9c --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -0,0 +1,958 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * module start stop, hca detection + * + * Authors: Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#define DEB_PREFIX "shca" + +#include "ehca_classes.h" +#include "ehca_iverbs.h" +#include "ehca_mrmw.h" +#include "ehca_tools.h" +#include "hcp_if.h" + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Christoph Raisch "); +MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); +MODULE_VERSION("SVNEHCA_0012"); + +int ehca_open_aqp1 = 0; +int ehca_debug_level = -1; +int ehca_hw_level = 0; +int ehca_nr_ports = 2; +int ehca_use_hp_mr = 0; +int ehca_port_act_time = 30; +int ehca_poll_all_eqs = 1; +int ehca_static_rate = -1; + +module_param_named(open_aqp1, ehca_open_aqp1, int, 0); +module_param_named(debug_level, ehca_debug_level, int, 0); +module_param_named(hw_level, ehca_hw_level, int, 0); +module_param_named(nr_ports, ehca_nr_ports, int, 0); +module_param_named(use_hp_mr, ehca_use_hp_mr, int, 0); +module_param_named(port_act_time, ehca_port_act_time, int, 0); +module_param_named(poll_all_eqs, ehca_poll_all_eqs, int, 0); +module_param_named(static_rate, ehca_static_rate, int, 0); + +MODULE_PARM_DESC(open_aqp1, + "AQP1 on startup (0: no (default), 1: yes)"); +MODULE_PARM_DESC(debug_level, + "debug level" + " (0: node, 6: only errors (default), 9: all)"); +MODULE_PARM_DESC(hw_level, + "hardware level" + " (0: autosensing (default), 1: v. 0.20, 2: v. 
0.21)"); +MODULE_PARM_DESC(nr_ports, + "number of connected ports (default: 2)"); +MODULE_PARM_DESC(use_hp_mr, + "high performance MRs (0: no (default), 1: yes)"); +MODULE_PARM_DESC(port_act_time, + "time to wait for port activation (default: 30 sec)"); +MODULE_PARM_DESC(poll_all_eqs, + "polls all event queues periodically" + " (0: no, 1: yes (default))"); +MODULE_PARM_DESC(static_rate, + "set permanent static rate (default: disabled)"); + +/* + * This external trace mask controls what will end up in the + * kernel ring buffer. Number 6 means, that everything between + * 0 and 5 will be stored. + */ +u8 ehca_edeb_mask[EHCA_EDEB_TRACE_MASK_SIZE]={6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 0, 0}; + +spinlock_t ehca_qp_idr_lock; +spinlock_t ehca_cq_idr_lock; +DEFINE_IDR(ehca_qp_idr); +DEFINE_IDR(ehca_cq_idr); + +struct ehca_module ehca_module; + +void ehca_init_trace(void) +{ + EDEB_EN(7, ""); + + if (ehca_debug_level != -1) { + int i; + for (i = 0; i < EHCA_EDEB_TRACE_MASK_SIZE; i++) + ehca_edeb_mask[i] = ehca_debug_level; + } + + EDEB_EX(7, ""); +} + +int ehca_create_slab_caches(struct ehca_module *ehca_module) +{ + int ret = 0; + + EDEB_EN(7, ""); + + ehca_module->cache_pd = + kmem_cache_create("ehca_cache_pd", + sizeof(struct ehca_pd), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ehca_module->cache_pd) { + EDEB_ERR(4, "Cannot create PD SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches1; + } + + ehca_module->cache_cq = + kmem_cache_create("ehca_cache_cq", + sizeof(struct ehca_cq), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ehca_module->cache_cq) { + EDEB_ERR(4, "Cannot create CQ SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches2; + } + + ehca_module->cache_qp = + kmem_cache_create("ehca_cache_qp", + sizeof(struct ehca_qp), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ehca_module->cache_qp) { + EDEB_ERR(4, "Cannot create QP SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches3; + } + + ehca_module->cache_av = + kmem_cache_create("ehca_cache_av", + sizeof(struct ehca_av), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ehca_module->cache_av) { + EDEB_ERR(4, "Cannot create AV SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches4; + } + + ehca_module->cache_mw = + kmem_cache_create("ehca_cache_mw", + sizeof(struct ehca_mw), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ehca_module->cache_mw) { + EDEB_ERR(4, "Cannot create MW SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches5; + } + + ehca_module->cache_mr = + kmem_cache_create("ehca_cache_mr", + sizeof(struct ehca_mr), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ehca_module->cache_mr) { + EDEB_ERR(4, "Cannot create MR SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches6; + } + + EDEB_EX(7, "ret=%x", ret); + + return ret; + +create_slab_caches6: + kmem_cache_destroy(ehca_module->cache_mw); + +create_slab_caches5: + kmem_cache_destroy(ehca_module->cache_av); + +create_slab_caches4: + kmem_cache_destroy(ehca_module->cache_qp); + +create_slab_caches3: + kmem_cache_destroy(ehca_module->cache_cq); + +create_slab_caches2: + kmem_cache_destroy(ehca_module->cache_pd); + +create_slab_caches1: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_destroy_slab_caches(struct ehca_module *ehca_module) +{ + int ret; + + EDEB_EN(7, ""); + + ret = kmem_cache_destroy(ehca_module->cache_pd); + if (ret) + EDEB_ERR(4, "Cannot destroy PD SLAB cache. 
ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_cq); + if (ret) + EDEB_ERR(4, "Cannot destroy CQ SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_qp); + if (ret) + EDEB_ERR(4, "Cannot destroy QP SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_av); + if (ret) + EDEB_ERR(4, "Cannot destroy AV SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_mw); + if (ret) + EDEB_ERR(4, "Cannot destroy MW SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_mr); + if (ret) + EDEB_ERR(4, "Cannot destroy MR SLAB cache. ret=%x", ret); + + EDEB_EX(7, ""); + + return 0; +} + +#define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) +#define EHCA_REVID EHCA_BMASK_IBM(40,63) + +int ehca_sense_attributes(struct ehca_shca *shca) +{ + int ret = -EINVAL; + u64 h_ret = H_SUCCESS; + struct hipz_query_hca *rblock; + + EDEB_EN(7, "shca=%p", shca); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Cannot allocate rblock memory."); + ret = -ENOMEM; + goto num_ports0; + } + + h_ret = hipz_h_query_hca(shca->ipz_hca_handle, rblock); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "Cannot query device properties. h_ret=%lx", h_ret); + ret = -EPERM; + goto num_ports1; + } + + if (ehca_nr_ports == 1) + shca->num_ports = 1; + else + shca->num_ports = (u8)rblock->num_ports; + + EDEB(6, " ... found %x ports", rblock->num_ports); + + if (ehca_hw_level == 0) { + u32 hcaaver; + u32 revid; + + hcaaver = EHCA_BMASK_GET(EHCA_HCAAVER, rblock->hw_ver); + revid = EHCA_BMASK_GET(EHCA_REVID, rblock->hw_ver); + + EDEB(6, " ... hardware version=%x:%x", + hcaaver, revid); + + if ((hcaaver == 1) && (revid == 0)) + shca->hw_level = 0; + else if ((hcaaver == 1) && (revid == 1)) + shca->hw_level = 1; + else if ((hcaaver == 1) && (revid == 2)) + shca->hw_level = 2; + } + EDEB(6, " ... 
hardware level=%x", shca->hw_level); + + shca->sport[0].rate = IB_RATE_30_GBPS; + shca->sport[1].rate = IB_RATE_30_GBPS; + + ret = 0; + +num_ports1: + kfree(rblock); + +num_ports0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static int init_node_guid(struct ehca_shca* shca) +{ + int ret = 0; + struct hipz_query_hca *rblock; + + EDEB_EN(7, ""); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto init_node_guid0; + } + + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query device properties"); + ret = -EINVAL; + goto init_node_guid1; + } + + memcpy(&shca->ib_device.node_guid, &rblock->node_guid, (sizeof(u64))); + +init_node_guid1: + kfree(rblock); + +init_node_guid0: + EDEB_EX(7, "node_guid=%lx ret=%x", shca->ib_device.node_guid, ret); + + return ret; +} + +int ehca_register_device(struct ehca_shca *shca) +{ + int ret = 0; + + EDEB_EN(7, "shca=%p", shca); + + ret = init_node_guid(shca); + if (ret) + return ret; + + strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX); + shca->ib_device.owner = THIS_MODULE; + + shca->ib_device.uverbs_abi_ver = 5; + shca->ib_device.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_QUERY_QP) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST); + + shca->ib_device.node_type = IB_NODE_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; + shca->ib_device.query_pkey = ehca_query_pkey; + /* shca->in_device.modify_device = ehca_modify_device */ + shca->ib_device.modify_port = ehca_modify_port; + shca->ib_device.alloc_ucontext = ehca_alloc_ucontext; + shca->ib_device.dealloc_ucontext = ehca_dealloc_ucontext; + shca->ib_device.alloc_pd = ehca_alloc_pd; + shca->ib_device.dealloc_pd = ehca_dealloc_pd; + shca->ib_device.create_ah = ehca_create_ah; + /* shca->ib_device.modify_ah = ehca_modify_ah; */ + shca->ib_device.query_ah = ehca_query_ah; + shca->ib_device.destroy_ah = ehca_destroy_ah; + shca->ib_device.create_qp = ehca_create_qp; + shca->ib_device.modify_qp = ehca_modify_qp; + shca->ib_device.query_qp = ehca_query_qp; + shca->ib_device.destroy_qp = ehca_destroy_qp; + shca->ib_device.post_send = ehca_post_send; + shca->ib_device.post_recv = ehca_post_recv; + shca->ib_device.create_cq = ehca_create_cq; + shca->ib_device.destroy_cq = ehca_destroy_cq; + shca->ib_device.resize_cq = ehca_resize_cq; + shca->ib_device.poll_cq = ehca_poll_cq; + /* shca->ib_device.peek_cq = ehca_peek_cq; */ + shca->ib_device.req_notify_cq = ehca_req_notify_cq; + /* shca->ib_device.req_ncomp_notif = ehca_req_ncomp_notif; */ + shca->ib_device.get_dma_mr = ehca_get_dma_mr; + shca->ib_device.reg_phys_mr = ehca_reg_phys_mr; + shca->ib_device.reg_user_mr = 
ehca_reg_user_mr; + shca->ib_device.query_mr = ehca_query_mr; + shca->ib_device.dereg_mr = ehca_dereg_mr; + shca->ib_device.rereg_phys_mr = ehca_rereg_phys_mr; + shca->ib_device.alloc_mw = ehca_alloc_mw; + shca->ib_device.bind_mw = ehca_bind_mw; + shca->ib_device.dealloc_mw = ehca_dealloc_mw; + shca->ib_device.alloc_fmr = ehca_alloc_fmr; + shca->ib_device.map_phys_fmr = ehca_map_phys_fmr; + shca->ib_device.unmap_fmr = ehca_unmap_fmr; + shca->ib_device.dealloc_fmr = ehca_dealloc_fmr; + shca->ib_device.attach_mcast = ehca_attach_mcast; + shca->ib_device.detach_mcast = ehca_detach_mcast; + /* shca->ib_device.process_mad = ehca_process_mad; */ + shca->ib_device.mmap = ehca_mmap; + + ret = ib_register_device(&shca->ib_device); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static int ehca_create_aqp1(struct ehca_shca *shca, u32 port) +{ + struct ehca_sport *sport; + struct ib_cq *ibcq; + struct ib_qp *ibqp; + struct ib_qp_init_attr qp_init_attr; + int ret = 0; + + EDEB_EN(7, "shca=%p port=%x", shca, port); + + sport = &shca->sport[port - 1]; + + if (sport->ibcq_aqp1) { + EDEB_ERR(4, "AQP1 CQ is already created."); + return -EPERM; + } + + ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10); + if (IS_ERR(ibcq)) { + EDEB_ERR(4, "Cannot create AQP1 CQ."); + return PTR_ERR(ibcq); + } + sport->ibcq_aqp1 = ibcq; + + if (sport->ibqp_aqp1) { + EDEB_ERR(4, "AQP1 QP is already created."); + ret = -EPERM; + goto create_aqp1; + } + + memset(&qp_init_attr, 0, sizeof(struct ib_qp_init_attr)); + qp_init_attr.send_cq = ibcq; + qp_init_attr.recv_cq = ibcq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = 100; + qp_init_attr.cap.max_recv_wr = 100; + qp_init_attr.cap.max_send_sge = 2; + qp_init_attr.cap.max_recv_sge = 1; + qp_init_attr.qp_type = IB_QPT_GSI; + qp_init_attr.port_num = port; + qp_init_attr.qp_context = NULL; + qp_init_attr.event_handler = NULL; + qp_init_attr.srq = NULL; + + ibqp = ib_create_qp(&shca->pd->ib_pd, &qp_init_attr); + if (IS_ERR(ibqp)) { + EDEB_ERR(4, "Cannot create AQP1 QP."); + ret = PTR_ERR(ibqp); + goto create_aqp1; + } + sport->ibqp_aqp1 = ibqp; + + goto create_aqp0; + +create_aqp1: + ib_destroy_cq(sport->ibcq_aqp1); + +create_aqp0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static int ehca_destroy_aqp1(struct ehca_sport *sport) +{ + int ret = 0; + + EDEB_EN(7, "sport=%p", sport); + + ret = ib_destroy_qp(sport->ibqp_aqp1); + if (ret) { + EDEB_ERR(4, "Cannot destroy AQP1 QP. ret=%x", ret); + goto destroy_aqp1; + } + + ret = ib_destroy_cq(sport->ibcq_aqp1); + if (ret) + EDEB_ERR(4, "Cannot destroy AQP1 CQ. 
ret=%x", ret); + +destroy_aqp1: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static ssize_t ehca_show_debug_mask(struct device_driver *ddp, char *buf) +{ + int i; + int total = 0; + total += snprintf(buf + total, PAGE_SIZE - total, "%d", + ehca_edeb_mask[0]); + for (i = 1; i < EHCA_EDEB_TRACE_MASK_SIZE; i++) { + total += snprintf(buf + total, PAGE_SIZE - total, "%d", + ehca_edeb_mask[i]); + } + + total += snprintf(buf + total, PAGE_SIZE - total, "\n"); + + return total; +} + +static ssize_t ehca_store_debug_mask(struct device_driver *ddp, + const char *buf, size_t count) +{ + int i; + for (i = 0; i < EHCA_EDEB_TRACE_MASK_SIZE; i++) { + char value = buf[i] - '0'; + if ((value <= 9) && (count >= i)) { + ehca_edeb_mask[i] = value; + } + } + return count; +} +DRIVER_ATTR(debug_mask, S_IRUSR | S_IWUSR, + ehca_show_debug_mask, ehca_store_debug_mask); + +void ehca_create_driver_sysfs(struct ibmebus_driver *drv) +{ + driver_create_file(&drv->driver, &driver_attr_debug_mask); +} + +void ehca_remove_driver_sysfs(struct ibmebus_driver *drv) +{ + driver_remove_file(&drv->driver, &driver_attr_debug_mask); +} + +#define EHCA_RESOURCE_ATTR(name) \ +static ssize_t ehca_show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + struct ehca_shca *shca; \ + struct hipz_query_hca *rblock; \ + int data; \ + \ + shca = dev->driver_data; \ + \ + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); \ + if (!rblock) { \ + EDEB_ERR(4, "Can't allocate rblock memory."); \ + return 0; \ + } \ + \ + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { \ + EDEB_ERR(4, "Can't query device properties"); \ + kfree(rblock); \ + return 0; \ + } \ + \ + data = rblock->name; \ + kfree(rblock); \ + \ + if ((strcmp(#name, "num_ports") == 0) && (ehca_nr_ports == 1)) \ + return snprintf(buf, 256, "1\n"); \ + else \ + return snprintf(buf, 256, "%d\n", data); \ + \ +} \ +static DEVICE_ATTR(name, S_IRUGO, ehca_show_##name, NULL); + +EHCA_RESOURCE_ATTR(num_ports); +EHCA_RESOURCE_ATTR(hw_ver); +EHCA_RESOURCE_ATTR(max_eq); +EHCA_RESOURCE_ATTR(cur_eq); +EHCA_RESOURCE_ATTR(max_cq); +EHCA_RESOURCE_ATTR(cur_cq); +EHCA_RESOURCE_ATTR(max_qp); +EHCA_RESOURCE_ATTR(cur_qp); +EHCA_RESOURCE_ATTR(max_mr); +EHCA_RESOURCE_ATTR(cur_mr); +EHCA_RESOURCE_ATTR(max_mw); +EHCA_RESOURCE_ATTR(cur_mw); +EHCA_RESOURCE_ATTR(max_pd); +EHCA_RESOURCE_ATTR(max_ah); + +static ssize_t ehca_show_adapter_handle(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ehca_shca *shca = dev->driver_data; + + return sprintf(buf, "%lx\n", shca->ipz_hca_handle.handle); + +} +static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL); + + +void ehca_create_device_sysfs(struct ibmebus_dev *dev) +{ + device_create_file(&dev->ofdev.dev, &dev_attr_adapter_handle); + device_create_file(&dev->ofdev.dev, &dev_attr_num_ports); + device_create_file(&dev->ofdev.dev, &dev_attr_hw_ver); + device_create_file(&dev->ofdev.dev, &dev_attr_max_eq); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_eq); + device_create_file(&dev->ofdev.dev, &dev_attr_max_cq); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_cq); + device_create_file(&dev->ofdev.dev, &dev_attr_max_qp); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_qp); + device_create_file(&dev->ofdev.dev, &dev_attr_max_mr); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_mr); + device_create_file(&dev->ofdev.dev, &dev_attr_max_mw); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_mw); + device_create_file(&dev->ofdev.dev, &dev_attr_max_pd); + 
device_create_file(&dev->ofdev.dev, &dev_attr_max_ah); +} + +void ehca_remove_device_sysfs(struct ibmebus_dev *dev) +{ + device_remove_file(&dev->ofdev.dev, &dev_attr_adapter_handle); + device_remove_file(&dev->ofdev.dev, &dev_attr_num_ports); + device_remove_file(&dev->ofdev.dev, &dev_attr_hw_ver); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_eq); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_eq); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_cq); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_cq); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_qp); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_qp); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_mr); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mr); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_mw); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mw); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_pd); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_ah); +} + +static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) +{ + struct ehca_shca *shca; + u64 *handle; + struct ib_pd *ibpd; + int ret = 0; + + EDEB_EN(7, ""); + + handle = (u64 *)get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + EDEB_ERR(4, "Cannot get eHCA handle for adapter: %s.", + dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + EDEB_ERR(4, "Wrong eHCA handle for adapter: %s.", + dev->ofdev.node->full_name); + return -ENODEV; + } + + shca = (struct ehca_shca *)ib_alloc_device(sizeof(*shca)); + if (shca == NULL) { + EDEB_ERR(4, "Cannot allocate shca memory."); + return -ENOMEM; + } + + shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; + dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { + EDEB_ERR(4, "Cannot sense eHCA attributes."); + goto probe1; + } + + /* create event queues */ + ret = ehca_create_eq(shca, &shca->eq, EHCA_EQ, 2048); + if (ret) { + EDEB_ERR(4, "Cannot create EQ."); + goto probe1; + } + + ret = ehca_create_eq(shca, &shca->neq, EHCA_NEQ, 513); + if (ret) { + EDEB_ERR(4, "Cannot create NEQ."); + goto probe2; + } + + /* create internal protection domain */ + ibpd = ehca_alloc_pd(&shca->ib_device, (void*)(-1), NULL); + if (IS_ERR(ibpd)) { + EDEB_ERR(4, "Cannot create internal PD."); + ret = PTR_ERR(ibpd); + goto probe3; + } + + shca->pd = container_of(ibpd, struct ehca_pd, ib_pd); + shca->pd->ib_pd.device = &shca->ib_device; + + /* create internal max MR */ + ret = ehca_reg_internal_maxmr(shca, shca->pd, &shca->maxmr); + + if (ret) { + EDEB_ERR(4, "Cannot create internal MR. 
ret=%x", ret); + goto probe4; + } + + ret = ehca_register_device(shca); + if (ret) { + EDEB_ERR(4, "Cannot register Infiniband device."); + goto probe5; + } + + /* create AQP1 for port 1 */ + if (ehca_open_aqp1 == 1) { + shca->sport[0].port_state = IB_PORT_DOWN; + ret = ehca_create_aqp1(shca, 1); + if (ret) { + EDEB_ERR(4, "Cannot create AQP1 for port 1."); + goto probe6; + } + } + + /* create AQP1 for port 2 */ + if ((ehca_open_aqp1 == 1) && (shca->num_ports == 2)) { + shca->sport[1].port_state = IB_PORT_DOWN; + ret = ehca_create_aqp1(shca, 2); + if (ret) { + EDEB_ERR(4, "Cannot create AQP1 for port 2."); + goto probe7; + } + } + + ehca_create_device_sysfs(dev); + + spin_lock(&ehca_module.shca_lock); + list_add(&shca->shca_list, &ehca_module.shca_list); + spin_unlock(&ehca_module.shca_lock); + + EDEB_EX(7, "ret=%x", ret); + + return 0; + +probe7: + ret = ehca_destroy_aqp1(&shca->sport[0]); + if (ret) + EDEB_ERR(4, "Cannot destroy AQP1 for port 1. ret=%x", ret); + +probe6: + ib_unregister_device(&shca->ib_device); + +probe5: + ret = ehca_dereg_internal_maxmr(shca); + if (ret) + EDEB_ERR(4, "Cannot destroy internal MR. ret=%x", ret); + +probe4: + ret = ehca_dealloc_pd(&shca->pd->ib_pd); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy internal PD. ret=%x", ret); + +probe3: + ret = ehca_destroy_eq(shca, &shca->neq); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy NEQ. ret=%x", ret); + +probe2: + ret = ehca_destroy_eq(shca, &shca->eq); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy EQ. ret=%x", ret); + +probe1: + ib_dealloc_device(&shca->ib_device); + + EDEB_EX(4, "ret=%x", ret); + + return -EINVAL; +} + +static int __devexit ehca_remove(struct ibmebus_dev *dev) +{ + struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + + EDEB_EN(7, "shca=%p", shca); + + ehca_remove_device_sysfs(dev); + + if (ehca_open_aqp1 == 1) { + int i; + + for (i = 0; i < shca->num_ports; i++) { + ret = ehca_destroy_aqp1(&shca->sport[i]); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy AQP1 for port %x." + " ret=%x", ret, i); + } + } + + ib_unregister_device(&shca->ib_device); + + ret = ehca_dereg_internal_maxmr(shca); + if (ret) + EDEB_ERR(4, "Cannot destroy internal MR. ret=%x", ret); + + ret = ehca_dealloc_pd(&shca->pd->ib_pd); + if (ret) + EDEB_ERR(4, "Cannot destroy internal PD. ret=%x", ret); + + ret = ehca_destroy_eq(shca, &shca->eq); + if (ret) + EDEB_ERR(4, "Cannot destroy EQ. ret=%x", ret); + + ret = ehca_destroy_eq(shca, &shca->neq); + if (ret) + EDEB_ERR(4, "Canot destroy NEQ. 
ret=%x", ret); + + ib_dealloc_device(&shca->ib_device); + + spin_lock(&ehca_module.shca_lock); + list_del(&shca->shca_list); + spin_unlock(&ehca_module.shca_lock); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static struct of_device_id ehca_device_table[] = +{ + { + .name = "lhca", + .compatible = "IBM,lhca", + }, + {}, +}; + +static struct ibmebus_driver ehca_driver = { + .name = "ehca", + .id_table = ehca_device_table, + .probe = ehca_probe, + .remove = ehca_remove, +}; + +int __init ehca_module_init(void) +{ + int ret = 0; + + printk(KERN_INFO "eHCA Infiniband Device Driver " + "(Rel.: SVNEHCA_0012)\n"); + EDEB_EN(7, ""); + + idr_init(&ehca_qp_idr); + idr_init(&ehca_cq_idr); + spin_lock_init(&ehca_qp_idr_lock); + spin_lock_init(&ehca_cq_idr_lock); + + INIT_LIST_HEAD(&ehca_module.shca_list); + spin_lock_init(&ehca_module.shca_lock); + + ehca_init_trace(); + + if ((ret = ehca_create_comp_pool())) { + EDEB_ERR(4, "Cannot create comp pool."); + goto module_init0; + } + + if ((ret = ehca_create_slab_caches(&ehca_module))) { + EDEB_ERR(4, "Cannot create SLAB caches"); + ret = -ENOMEM; + goto module_init1; + } + + if ((ret = ibmebus_register_driver(&ehca_driver))) { + EDEB_ERR(4, "Cannot register eHCA device driver"); + ret = -EINVAL; + goto module_init2; + } + + ehca_create_driver_sysfs(&ehca_driver); + + if (ehca_poll_all_eqs != 1) { + EDEB_ERR(4, "WARNING!!!"); + EDEB_ERR(4, "It is possible to lose interrupts."); + } else { + init_timer(&ehca_module.timer); + ehca_module.timer.function = ehca_poll_eqs; + ehca_module.timer.data = (unsigned long)&ehca_module; + ehca_module.timer.expires = jiffies + HZ; + add_timer(&ehca_module.timer); + } + + goto module_init0; + +module_init2: + ehca_destroy_slab_caches(&ehca_module); + +module_init1: + ehca_destroy_comp_pool(); + +module_init0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +}; + +void __exit ehca_module_exit(void) +{ + EDEB_EN(7, ""); + + if (ehca_poll_all_eqs == 1) + del_timer_sync(&ehca_module.timer); + + ehca_remove_driver_sysfs(&ehca_driver); + ibmebus_unregister_driver(&ehca_driver); + + if (ehca_destroy_slab_caches(&ehca_module) != 0) + EDEB_ERR(4, "Cannot destroy SLAB caches"); + + ehca_destroy_comp_pool(); + + idr_destroy(&ehca_cq_idr); + idr_destroy(&ehca_qp_idr); + + EDEB_EX(7, ""); +}; + +module_init(ehca_module_init); +module_exit(ehca_module_exit); -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:09:28 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:09:28 -0700 Subject: [openib-general] [PATCH 03/16] IB/ehca: uverbs In-Reply-To: <2006817139.e1epJYk9xVvFdTao@cisco.com> Message-ID: <2006817139.ved6VXBVqBUhDDU0@cisco.com> drivers/infiniband/hw/ehca/ehca_uverbs.c | 400 ++++++++++++++++++++++++++++++ 1 files changed, 400 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c new file mode 100644 index 0000000..c148c23 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -0,0 +1,400 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * userspace support verbs + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "uver" + +#include + +#include "ehca_classes.h" +#include "ehca_iverbs.h" +#include "ehca_mrmw.h" +#include "ehca_tools.h" +#include "hcp_if.h" + +struct ib_ucontext *ehca_alloc_ucontext(struct ib_device *device, + struct ib_udata *udata) +{ + struct ehca_ucontext *my_context = NULL; + + EHCA_CHECK_ADR_P(device); + EDEB_EN(7, "device=%p name=%s", device, device->name); + + my_context = kzalloc(sizeof *my_context, GFP_KERNEL); + if (!my_context) { + EDEB_ERR(4, "Out of memory device=%p", device); + return ERR_PTR(-ENOMEM); + } + + EDEB_EX(7, "device=%p ucontext=%p", device, my_context); + + return &my_context->ib_ucontext; +} + +int ehca_dealloc_ucontext(struct ib_ucontext *context) +{ + struct ehca_ucontext *my_context = NULL; + EHCA_CHECK_ADR(context); + EDEB_EN(7, "ucontext=%p", context); + my_context = container_of(context, struct ehca_ucontext, ib_ucontext); + kfree(my_context); + EDEB_EN(7, "ucontext=%p", context); + return 0; +} + +struct page *ehca_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct page *mypage = NULL; + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... 
*/ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 cur_pid = current->tgid; + unsigned long flags; + + EDEB_EN(7, "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx " + "address=%lx", + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset, + address); + + if (q_type == 1) { /* CQ */ + struct ehca_cq *cq = NULL; + u64 offset; + void *vaddr = NULL; + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, idr_handle); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + if (cq->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, cq->ownpid); + return NOPAGE_SIGBUS; + } + + /* make sure this mmap really belongs to the authorized user */ + if (!cq) { + EDEB_ERR(4, "cq is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + if (rsrc_type == 2) { + EDEB(6, "cq=%p cq queuearea", cq); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&cq->ipz_queue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } + } else if (q_type == 2) { /* QP */ + struct ehca_qp *qp = NULL; + struct ehca_pd *pd = NULL; + u64 offset; + void *vaddr = NULL; + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, idr_handle); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + + pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, pd->ownpid); + return NOPAGE_SIGBUS; + } + + /* make sure this mmap really belongs to the authorized user */ + if (!qp) { + EDEB_ERR(4, "qp is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + if (rsrc_type == 2) { /* rqueue */ + EDEB(6, "qp=%p qp rqueuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } else if (rsrc_type == 3) { /* squeue */ + EDEB(6, "qp=%p qp squeuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } + } + + if (!mypage) { + EDEB_ERR(4, "Invalid page adr==NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + get_page(mypage); + EDEB_EX(7, "page adr=%p", mypage); + return mypage; +} + +static struct vm_operations_struct ehcau_vm_ops = { + .nopage = ehca_nopage, +}; + +int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... 
*/ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 ret = -EFAULT; /* assume the worst */ + u64 vsize = 0; /* must be calculated/set below */ + u64 physical = 0; /* must be calculated/set below */ + u32 cur_pid = current->tgid; + unsigned long flags; + + EDEB_EN(7, "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx", + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset); + + if (q_type == 1) { /* CQ */ + struct ehca_cq *cq; + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, idr_handle); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + if (cq->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, cq->ownpid); + return -ENOMEM; + } + + /* make sure this mmap really belongs to the authorized user */ + if (!cq) + return -EINVAL; + if (!cq->ib_cq.uobject) + return -EINVAL; + if (cq->ib_cq.uobject->context != context) + return -EINVAL; + if (rsrc_type == 1) { /* galpa fw handle */ + EDEB(6, "cq=%p cq triggerarea", cq); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != EHCA_PAGESIZE) { + EDEB_ERR(4, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + ret = -EINVAL; + goto mmap_exit0; + } + + physical = cq->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, physical); + ret = remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret) { + EDEB_ERR(4, "remap_pfn_range() failed ret=%x", + ret); + ret = -ENOMEM; + } + goto mmap_exit0; + } else if (rsrc_type == 2) { /* cq queue_addr */ + EDEB(6, "cq=%p cq q_addr", cq); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else { + EDEB_ERR(6, "bad resource type %x", rsrc_type); + ret = -EINVAL; + goto mmap_exit0; + } + } else if (q_type == 2) { /* QP */ + struct ehca_qp *qp = NULL; + struct ehca_pd *pd = NULL; + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, idr_handle); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, pd->ownpid); + return -ENOMEM; + } + + /* make sure this mmap really belongs to the authorized user */ + if (!qp || !qp->ib_qp.uobject || + qp->ib_qp.uobject->context != context) { + EDEB(6, "qp=%p, uobject=%p, context=%p", + qp, qp->ib_qp.uobject, qp->ib_qp.uobject->context); + ret = -EINVAL; + goto mmap_exit0; + } + if (rsrc_type == 1) { /* galpa fw handle */ + EDEB(6, "qp=%p qp triggerarea", qp); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != EHCA_PAGESIZE) { + EDEB_ERR(4, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + ret = -EINVAL; + goto mmap_exit0; + } + + physical = qp->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, physical); + ret = remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret) { + EDEB_ERR(4, "remap_pfn_range() failed ret=%x", + ret); + ret = -ENOMEM; + } + goto mmap_exit0; + } else if (rsrc_type == 2) { /* qp rqueue_addr */ + EDEB(6, "qp=%p qp rqueue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else if (rsrc_type == 3) { /* qp 
squeue_addr */ + EDEB(6, "qp=%p qp squeue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else { + EDEB_ERR(4, "bad resource type %x", rsrc_type); + ret = -EINVAL; + goto mmap_exit0; + } + } else { + EDEB_ERR(4, "bad queue type %x", q_type); + ret = -EINVAL; + goto mmap_exit0; + } + +mmap_exit0: + EDEB_EX(7, "ret=%x", ret); + return ret; +} + +int ehca_mmap_nopage(u64 foffset, u64 length, void ** mapped, + struct vm_area_struct ** vma) +{ + EDEB_EN(7, "foffset=%lx length=%lx", foffset, length); + down_write(¤t->mm->mmap_sem); + *mapped = (void*)do_mmap(NULL,0, length, PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + foffset); + up_write(¤t->mm->mmap_sem); + if (!(*mapped)) { + EDEB_ERR(4, "couldn't mmap foffset=%lx length=%lx", + foffset, length); + return -EINVAL; + } + + *vma = find_vma(current->mm, (u64)*mapped); + if (!(*vma)) { + down_write(¤t->mm->mmap_sem); + do_munmap(current->mm, 0, length); + up_write(¤t->mm->mmap_sem); + EDEB_ERR(4, "couldn't find vma queue=%p", *mapped); + return -EINVAL; + } + (*vma)->vm_flags |= VM_RESERVED; + (*vma)->vm_ops = &ehcau_vm_ops; + + EDEB_EX(7, "mapped=%p", *mapped); + return 0; +} + +int ehca_mmap_register(u64 physical, void ** mapped, + struct vm_area_struct ** vma) +{ + int ret = 0; + unsigned long vsize; + /* ehca hw supports only 4k page */ + ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma); + if (ret) { + EDEB(4, "could'nt mmap physical=%lx", physical); + return ret; + } + + (*vma)->vm_flags |= VM_RESERVED; + vsize = (*vma)->vm_end - (*vma)->vm_start; + if (vsize != EHCA_PAGESIZE) { + EDEB_ERR(4, "invalid vsize=%lx", + (*vma)->vm_end - (*vma)->vm_start); + ret = -EINVAL; + return ret; + } + + (*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot); + (*vma)->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, physical); + ret = remap_pfn_range((*vma), (*vma)->vm_start, + physical >> PAGE_SHIFT, vsize, + (*vma)->vm_page_prot); + if (ret) { + EDEB_ERR(4, "remap_pfn_range() failed ret=%x", ret); + ret = -ENOMEM; + } + return ret; + +} + +int ehca_munmap(unsigned long addr, size_t len) { + int ret = 0; + struct mm_struct *mm = current->mm; + if (mm) { + down_write(&mm->mmap_sem); + ret = do_munmap(mm, addr, len); + up_write(&mm->mmap_sem); + } + return ret; +} -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:02 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:02 -0700 Subject: [openib-general] [PATCH 13/13] IB/ehca: makefiles/kconfig In-Reply-To: <20068171311.P1OwgyzMAlKlrkeW@cisco.com> Message-ID: <20068171311.WDFBWw0F6z9B3Qes@cisco.com> drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/ehca/Kconfig | 12 ++++++++++++ drivers/infiniband/hw/ehca/Makefile | 18 ++++++++++++++++++ 4 files changed, 32 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 69a53d4..fd2d528 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -36,6 +36,7 @@ config INFINIBAND_ADDR_TRANS source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/hw/ipath/Kconfig" +source "drivers/infiniband/hw/ehca/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index c7ff58c..893bee0 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -1,6 +1,7 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ 
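/*
 * Illustrative sketch, not part of any patch in this series: the 64-bit
 * mmap offset decoded by ehca_mmap()/ehca_nopage() in the uverbs patch
 * above appears to be laid out as idr token | queue type | resource
 * type, judging from the shifts used there.  The helper name is
 * hypothetical.
 */
static inline u64 ehca_make_mmap_offset(u32 idr_token, u32 q_type, u32 rsrc_type)
{
	/* bits 63..32: idr token of the CQ/QP,
	 * bits 31..28: queue type (1 = CQ, 2 = QP),
	 * bits 27..24: resource   (1 = galpa fw handle, 2 = (r)queue, 3 = squeue)
	 */
	return ((u64)idr_token << 32) |
	       ((u64)(q_type & 0xf) << 28) |
	       ((u64)(rsrc_type & 0xf) << 24);
}

/* The CQ patch later in this series passes
 *	((u64)my_cq->token << 32) | 0x12000000
 * to ehca_mmap_nopage(), i.e. queue type 1 (CQ), resource 2 (queue pages).
 */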
obj-$(CONFIG_IPATH_CORE) += hw/ipath/ +obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ diff --git a/drivers/infiniband/hw/ehca/Kconfig b/drivers/infiniband/hw/ehca/Kconfig new file mode 100644 index 0000000..12285d0 --- /dev/null +++ b/drivers/infiniband/hw/ehca/Kconfig @@ -0,0 +1,12 @@ +config INFINIBAND_EHCA + tristate "eHCA support" + depends on IBMEBUS && INFINIBAND + ---help--- + This is a low level device driver for the IBM GX based Host channel + adapters (HCAs). + +config INFINIBAND_EHCA_SCALING + bool "Scaling support (EXPERIMENTAL)" + depends on IBMEBUS && INFINIBAND_EHCA && HOTPLUG_CPU && EXPERIMENTAL + ---help--- + eHCA scaling support schedules the CQ callbacks to different CPUs. diff --git a/drivers/infiniband/hw/ehca/Makefile b/drivers/infiniband/hw/ehca/Makefile new file mode 100644 index 0000000..70032cf --- /dev/null +++ b/drivers/infiniband/hw/ehca/Makefile @@ -0,0 +1,18 @@ +# Authors: Heiko J Schick +# Christoph Raisch +# Joachim Fenkes +# +# Copyright (c) 2005 IBM Corporation +# +# All rights reserved. +# +# This source code is distributed under a dual license of GPL v2.0 and OpenIB BSD. + +obj-$(CONFIG_INFINIBAND_EHCA) += hcad_mod.o + + +hcad_mod-objs = ehca_main.o ehca_hca.o ehca_mcast.o ehca_pd.o ehca_av.o ehca_eq.o \ + ehca_cq.o ehca_qp.o ehca_sqp.o ehca_mrmw.o ehca_reqs.o ehca_irq.o \ + ehca_uverbs.o ipz_pt_fn.o hcp_if.o hcp_phyp.o + +CFLAGS += -DEHCA_USE_HCALL -DEHCA_USE_HCALL_KERNEL -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:01 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:01 -0700 Subject: [openib-general] [PATCH 07/13] IB/ehca: cq In-Reply-To: <20068171311.L4phXwdeU9u1VjBq@cisco.com> Message-ID: <20068171311.4HUmviC1Ip8J2EpE@cisco.com> drivers/infiniband/hw/ehca/ehca_cq.c | 449 ++++++++++++++++++++++++++++++++++ 1 files changed, 449 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c new file mode 100644 index 0000000..c52d1c3 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -0,0 +1,449 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Completion queue handling + * + * Authors: Waleri Fomin + * Khadija Souissi + * Reinhard Ernst + * Heiko J Schick + * Hoang-Nam Nguyen + * + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#define DEB_PREFIX "e_cq" + +#include + +#include "ehca_iverbs.h" +#include "ehca_classes.h" +#include "ehca_irq.h" +#include "hcp_if.h" + +int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp) +{ + unsigned int qp_num = qp->real_qp_num; + unsigned int key = qp_num & (QP_HASHTAB_LEN-1); + unsigned long spl_flags = 0; + + spin_lock_irqsave(&cq->spinlock, spl_flags); + hlist_add_head(&qp->list_entries, &cq->qp_hashtab[key]); + spin_unlock_irqrestore(&cq->spinlock, spl_flags); + + EDEB(7, "cq_num=%x real_qp_num=%x", cq->cq_number, qp_num); + + return 0; +} + +int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num) +{ + int ret = -EINVAL; + unsigned int key = real_qp_num & (QP_HASHTAB_LEN-1); + struct hlist_node *iter = NULL; + struct ehca_qp *qp = NULL; + unsigned long spl_flags = 0; + + spin_lock_irqsave(&cq->spinlock, spl_flags); + hlist_for_each(iter, &cq->qp_hashtab[key]) { + qp = hlist_entry(iter, struct ehca_qp, list_entries); + if (qp->real_qp_num == real_qp_num) { + hlist_del(iter); + EDEB(7, "removed qp from cq .cq_num=%x real_qp_num=%x", + cq->cq_number, real_qp_num); + ret = 0; + break; + } + } + spin_unlock_irqrestore(&cq->spinlock, spl_flags); + if (ret) { + EDEB_ERR(4, "qp not found cq_num=%x real_qp_num=%x", + cq->cq_number, real_qp_num); + } + + return ret; +} + +struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num) +{ + struct ehca_qp *ret = NULL; + unsigned int key = real_qp_num & (QP_HASHTAB_LEN-1); + struct hlist_node *iter = NULL; + struct ehca_qp *qp = NULL; + hlist_for_each(iter, &cq->qp_hashtab[key]) { + qp = hlist_entry(iter, struct ehca_qp, list_entries); + if (qp->real_qp_num == real_qp_num) { + ret = qp; + break; + } + } + return ret; +} + +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + extern struct ehca_module ehca_module; + struct ib_cq *cq = NULL; + struct ehca_cq *my_cq = NULL; + struct ehca_shca *shca = NULL; + struct ipz_adapter_handle adapter_handle; + /* h_call's out parameters */ + struct ehca_alloc_cq_parms param; + u32 counter = 0; + void *vpage = NULL; + u64 rpage = 0; + struct h_galpa gal; + u64 cqx_fec = 0; + u64 h_ret = 0; + int ipz_rc = 0; + int ret = 0; + const u32 additional_cqe=20; + int i= 0; + unsigned long flags; + + EHCA_CHECK_DEVICE_P(device); + EDEB_EN(7, "device=%p cqe=%x context=%p", device, cqe, context); + + if (cqe >= 0xFFFFFFFF - 64 - additional_cqe) + return ERR_PTR(-EINVAL); + + my_cq = kmem_cache_alloc(ehca_module.cache_cq, SLAB_KERNEL); + if (!my_cq) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, "Out of memory for ehca_cq struct device=%p", + device); + goto create_cq_exit0; + } + + memset(my_cq, 0, sizeof(struct ehca_cq)); + memset(¶m, 0, sizeof(struct ehca_alloc_cq_parms)); + + spin_lock_init(&my_cq->spinlock); + spin_lock_init(&my_cq->cb_lock); + spin_lock_init(&my_cq->task_lock); + my_cq->ownpid = current->tgid; + + cq = &my_cq->ib_cq; + + shca = container_of(device, struct 
ehca_shca, ib_device); + adapter_handle = shca->ipz_hca_handle; + param.eq_handle = shca->eq.ipz_eq_handle; + + + do { + if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, + "Can't reserve idr resources. " + "device=%p", device); + goto create_cq_exit1; + } + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + } while (ret == -EAGAIN); + + if (ret) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, + "Can't allocate new idr entry. " + "device=%p", device); + goto create_cq_exit1; + } + + /* + * CQs maximum depth is 4GB-64, but we need additional 20 as buffer + * for receiving errors CQEs. + */ + param.nr_cqe = cqe + additional_cqe; + h_ret = hipz_h_alloc_resource_cq(adapter_handle, my_cq, ¶m); + + if (h_ret != H_SUCCESS) { + EDEB_ERR(4,"hipz_h_alloc_resource_cq() failed " + "h_ret=%lx device=%p", h_ret, device); + cq = ERR_PTR(ehca2ib_return_code(h_ret)); + goto create_cq_exit2; + } + + ipz_rc = ipz_queue_ctor(&my_cq->ipz_queue, param.act_pages, + EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0); + if (!ipz_rc) { + EDEB_ERR(4, + "ipz_queue_ctor() failed " + "ipz_rc=%x device=%p", ipz_rc, device); + cq = ERR_PTR(-EINVAL); + goto create_cq_exit3; + } + + for (counter = 0; counter < param.act_pages; counter++) { + vpage = ipz_qpageit_get_inc(&my_cq->ipz_queue); + if (!vpage) { + EDEB_ERR(4, "ipz_qpageit_get_inc() " + "returns NULL device=%p", device); + cq = ERR_PTR(-EAGAIN); + goto create_cq_exit4; + } + rpage = virt_to_abs(vpage); + + h_ret = hipz_h_register_rpage_cq(adapter_handle, + my_cq->ipz_cq_handle, + &my_cq->pf, + 0, + 0, + rpage, + 1, + my_cq->galpas. + kernel); + + if (h_ret < H_SUCCESS) { + EDEB_ERR(4, "hipz_h_register_rpage_cq() failed " + "ehca_cq=%p cq_num=%x h_ret=%lx " + "counter=%i act_pages=%i", + my_cq, my_cq->cq_number, + h_ret, counter, param.act_pages); + cq = ERR_PTR(-EINVAL); + goto create_cq_exit4; + } + + if (counter == (param.act_pages - 1)) { + vpage = ipz_qpageit_get_inc(&my_cq->ipz_queue); + if ((h_ret != H_SUCCESS) || vpage) { + EDEB_ERR(4, "Registration of pages not " + "complete ehca_cq=%p cq_num=%x " + "h_ret=%lx", + my_cq, my_cq->cq_number, h_ret); + cq = ERR_PTR(-EAGAIN); + goto create_cq_exit4; + } + } else { + if (h_ret != H_PAGE_REGISTERED) { + EDEB_ERR(4, "Registration of page failed " + "ehca_cq=%p cq_num=%x h_ret=%lx" + "counter=%i act_pages=%i", + my_cq, my_cq->cq_number, + h_ret, counter, param.act_pages); + cq = ERR_PTR(-ENOMEM); + goto create_cq_exit4; + } + } + } + + ipz_qeit_reset(&my_cq->ipz_queue); + + gal = my_cq->galpas.kernel; + cqx_fec = hipz_galpa_load(gal, CQTEMM_OFFSET(cqx_fec)); + EDEB(8, "ehca_cq=%p cq_num=%x CQX_FEC=%lx", + my_cq, my_cq->cq_number, cqx_fec); + + my_cq->ib_cq.cqe = my_cq->nr_of_entries = + param.act_nr_of_entries - additional_cqe; + my_cq->cq_number = (my_cq->ipz_cq_handle.handle) & 0xffff; + + for (i = 0; i < QP_HASHTAB_LEN; i++) + INIT_HLIST_HEAD(&my_cq->qp_hashtab[i]); + + if (context) { + struct ipz_queue *ipz_queue = &my_cq->ipz_queue; + struct ehca_create_cq_resp resp; + struct vm_area_struct *vma = NULL; + memset(&resp, 0, sizeof(resp)); + resp.cq_number = my_cq->cq_number; + resp.token = my_cq->token; + resp.ipz_queue.qe_size = ipz_queue->qe_size; + resp.ipz_queue.act_nr_of_sg = ipz_queue->act_nr_of_sg; + resp.ipz_queue.queue_length = ipz_queue->queue_length; + resp.ipz_queue.pagesize = ipz_queue->pagesize; + resp.ipz_queue.toggle_state = ipz_queue->toggle_state; + ret = 
ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000, + ipz_queue->queue_length, + (void**)&resp.ipz_queue.queue, + &vma); + if (ret) { + EDEB_ERR(4, "Could not mmap queue pages"); + cq = ERR_PTR(ret); + goto create_cq_exit4; + } + my_cq->uspace_queue = resp.ipz_queue.queue; + resp.galpas = my_cq->galpas; + ret = ehca_mmap_register(my_cq->galpas.user.fw_handle, + (void**)&resp.galpas.kernel.fw_handle, + &vma); + if (ret) { + EDEB_ERR(4, "Could not mmap fw_handle"); + cq = ERR_PTR(ret); + goto create_cq_exit5; + } + my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; + if (ib_copy_to_udata(udata, &resp, sizeof(resp))) { + EDEB_ERR(4, "Copy to udata failed."); + goto create_cq_exit6; + } + } + + EDEB_EX(7,"retcode=%p ehca_cq=%p cq_num=%x cq_size=%x", + cq, my_cq, my_cq->cq_number, param.act_nr_of_entries); + return cq; + +create_cq_exit6: + ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE); + +create_cq_exit5: + ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length); + +create_cq_exit4: + ipz_queue_dtor(&my_cq->ipz_queue); + +create_cq_exit3: + h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 1); + if (h_ret != H_SUCCESS) + EDEB(4, "hipz_h_destroy_cq() failed ehca_cq=%p cq_num=%x " + "h_ret=%lx", my_cq, my_cq->cq_number, h_ret); + +create_cq_exit2: + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + idr_remove(&ehca_cq_idr, my_cq->token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + +create_cq_exit1: + kmem_cache_free(ehca_module.cache_cq, my_cq); + +create_cq_exit0: + EDEB_EX(4, "An error has occured retcode=%p", cq); + return cq; +} + +int ehca_destroy_cq(struct ib_cq *cq) +{ + extern struct ehca_module ehca_module; + u64 h_ret = 0; + int ret = 0; + struct ehca_cq *my_cq = NULL; + int cq_num = 0; + struct ib_device *device = NULL; + struct ehca_shca *shca = NULL; + struct ipz_adapter_handle adapter_handle; + u32 cur_pid = current->tgid; + unsigned long flags; + + EHCA_CHECK_CQ(cq); + my_cq = container_of(cq, struct ehca_cq, ib_cq); + cq_num = my_cq->cq_number; + device = cq->device; + EHCA_CHECK_DEVICE(device); + shca = container_of(device, struct ehca_shca, ib_device); + adapter_handle = shca->ipz_hca_handle; + EDEB_EN(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + while (my_cq->nr_callbacks) + yield(); + + idr_remove(&ehca_cq_idr, my_cq->token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_cq->ownpid); + return -EINVAL; + } + + /* un-mmap if vma alloc */ + if (my_cq->uspace_queue ) { + ret = ehca_munmap(my_cq->uspace_queue, + my_cq->ipz_queue.queue_length); + ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE); + } + + h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0); + if (h_ret == H_R_STATE) { + /* cq in err: read err data and destroy it forcibly */ + EDEB(4, "ehca_cq=%p cq_num=%x ressource=%lx in err state. 
" + "Try to delete it forcibly.", + my_cq, my_cq->cq_number, my_cq->ipz_cq_handle.handle); + ehca_error_data(shca, my_cq, my_cq->ipz_cq_handle.handle); + h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 1); + if (h_ret == H_SUCCESS) + EDEB(4, "ehca_cq=%p cq_num=%x deleted successfully.", + my_cq, my_cq->cq_number); + } + if (h_ret != H_SUCCESS) { + EDEB_ERR(4,"hipz_h_destroy_cq() failed " + "h_ret=%lx ehca_cq=%p cq_num=%x", + h_ret, my_cq, my_cq->cq_number); + ret = ehca2ib_return_code(h_ret); + goto destroy_cq_exit0; + } + ipz_queue_dtor(&my_cq->ipz_queue); + kmem_cache_free(ehca_module.cache_cq, my_cq); + +destroy_cq_exit0: + EDEB_EX(7, "ehca_cq=%p cq_num=%x ret=%x ", + my_cq, cq_num, ret); + return ret; +} + +int ehca_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + int ret = 0; + struct ehca_cq *my_cq = NULL; + u32 cur_pid = current->tgid; + + if (unlikely(!cq)) { + EDEB_ERR(4, "cq is NULL"); + return -EFAULT; + } + + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EDEB_EN(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + + if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_cq->ownpid); + return -EINVAL; + } + + /* TODO: proper resize needs to be done */ + ret = -EFAULT; + EDEB_ERR(4, "not implemented yet"); + + EDEB_EX(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + return ret; +} -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:02 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:02 -0700 Subject: [openib-general] [PATCH 12/13] IB/ehca: phyp In-Reply-To: <20068171311.NYGfAW00YTmK6YKh@cisco.com> Message-ID: <20068171311.P1OwgyzMAlKlrkeW@cisco.com> drivers/infiniband/hw/ehca/hcp_phyp.c | 92 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/hcp_phyp.h | 96 +++++++++++++++++++++++++++++++++ 2 files changed, 188 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_phyp.c b/drivers/infiniband/hw/ehca/hcp_phyp.c new file mode 100644 index 0000000..d522d50 --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_phyp.c @@ -0,0 +1,92 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * load store abstraction for ehca register access with tracing + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#define DEB_PREFIX "PHYP" + +#include "ehca_classes.h" +#include "hipz_hw.h" + +int hcall_map_page(u64 physaddr, u64 *mapaddr) +{ + *mapaddr = (u64)(ioremap(physaddr, EHCA_PAGESIZE)); + + EDEB(7, "ioremap physaddr=%lx mapaddr=%lx", physaddr, *mapaddr); + return 0; +} + +int hcall_unmap_page(u64 mapaddr) +{ + EDEB(7, "mapaddr=%lx", mapaddr); + iounmap((volatile void __iomem*)mapaddr); + return 0; +} + +int hcp_galpas_ctor(struct h_galpas *galpas, + u64 paddr_kernel, u64 paddr_user) +{ + int ret = hcall_map_page(paddr_kernel, &galpas->kernel.fw_handle); + if (ret) + return ret; + + galpas->user.fw_handle = paddr_user; + + EDEB(7, "paddr_kernel=%lx paddr_user=%lx galpas->kernel=%lx" + " galpas->user=%lx", + paddr_kernel, paddr_user, galpas->kernel.fw_handle, + galpas->user.fw_handle); + + return ret; +} + +int hcp_galpas_dtor(struct h_galpas *galpas) +{ + int ret = 0; + + if (galpas->kernel.fw_handle) + ret = hcall_unmap_page(galpas->kernel.fw_handle); + + if (ret) + return ret; + + galpas->user.fw_handle = galpas->kernel.fw_handle = 0; + + return ret; +} diff --git a/drivers/infiniband/hw/ehca/hcp_phyp.h b/drivers/infiniband/hw/ehca/hcp_phyp.h new file mode 100644 index 0000000..ecb1117 --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_phyp.h @@ -0,0 +1,96 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Firmware calls + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Waleri Fomin + * Gerd Bayer + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef __HCP_PHYP_H__ +#define __HCP_PHYP_H__ + + +/* + * eHCA page (mapped into memory) + * resource to access eHCA register pages in CPU address space +*/ +struct h_galpa { + u64 fw_handle; + /* for pSeries this is a 64bit memory address where + I/O memory is mapped into CPU address space (kv) */ +}; + +/* + * resource to access eHCA address space registers, all types + */ +struct h_galpas { + u32 pid; /*PID of userspace galpa checking */ + struct h_galpa user; /* user space accessible resource, + set to 0 if unused */ + struct h_galpa kernel; /* kernel space accessible resource, + set to 0 if unused */ +}; + +static inline u64 hipz_galpa_load(struct h_galpa galpa, u32 offset) +{ + u64 addr = galpa.fw_handle + offset; + u64 out; + EDEB_EN(7, "addr=%lx offset=%x ", addr, offset); + out = *(u64 *) addr; + EDEB_EX(7, "addr=%lx value=%lx", addr, out); + return out; +} + +static inline void hipz_galpa_store(struct h_galpa galpa, u32 offset, u64 value) +{ + u64 addr = galpa.fw_handle + offset; + EDEB(7, "addr=%lx offset=%x value=%lx", addr, + offset, value); + *(u64 *) addr = value; +} + +int hcp_galpas_ctor(struct h_galpas *galpas, + u64 paddr_kernel, u64 paddr_user); + +int hcp_galpas_dtor(struct h_galpas *galpas); + +int hcall_map_page(u64 physaddr, u64 * mapaddr); + +int hcall_unmap_page(u64 mapaddr); + +#endif -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:01 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:01 -0700 Subject: [openib-general] [PATCH 06/13] IB/ehca: eq In-Reply-To: <20068171311.8D49tRUe7xsVtB0H@cisco.com> Message-ID: <20068171311.L4phXwdeU9u1VjBq@cisco.com> drivers/infiniband/hw/ehca/ehca_eq.c | 222 ++++++++++++++++++++++++++++++++++ 1 files changed, 222 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_eq.c b/drivers/infiniband/hw/ehca/ehca_eq.c new file mode 100644 index 0000000..080ed02 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_eq.c @@ -0,0 +1,222 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Event queue handling + * + * Authors: Waleri Fomin + * Khadija Souissi + * Reinhard Ernst + * Heiko J Schick + * Hoang-Nam Nguyen + * + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#define DEB_PREFIX "e_eq" + +#include "ehca_classes.h" +#include "ehca_irq.h" +#include "ehca_iverbs.h" +#include "ehca_qes.h" +#include "hcp_if.h" +#include "ipz_pt_fn.h" + +int ehca_create_eq(struct ehca_shca *shca, + struct ehca_eq *eq, + const enum ehca_eq_type type, const u32 length) +{ + u64 ret = H_SUCCESS; + u32 nr_pages = 0; + u32 i; + void *vpage = NULL; + + EDEB_EN(7, "shca=%p eq=%p length=%x", shca, eq, length); + EHCA_CHECK_ADR(shca); + EHCA_CHECK_ADR(eq); + + spin_lock_init(&eq->spinlock); + eq->is_initialized = 0; + + if (type != EHCA_EQ && type != EHCA_NEQ) { + EDEB_ERR(4, "Invalid EQ type %x. eq=%p", type, eq); + return -EINVAL; + } + if (length == 0) { + EDEB_ERR(4, "EQ length must not be zero. eq=%p", eq); + return -EINVAL; + } + + ret = hipz_h_alloc_resource_eq(shca->ipz_hca_handle, + &eq->pf, + type, + length, + &eq->ipz_eq_handle, + &eq->length, + &nr_pages, &eq->ist); + + if (ret != H_SUCCESS) { + EDEB_ERR(4, "Can't allocate EQ / NEQ. eq=%p", eq); + return -EINVAL; + } + + ret = ipz_queue_ctor(&eq->ipz_queue, nr_pages, + EHCA_PAGESIZE, sizeof(struct ehca_eqe), 0); + if (!ret) { + EDEB_ERR(4, "Can't allocate EQ pages. eq=%p", eq); + goto create_eq_exit1; + } + + for (i = 0; i < nr_pages; i++) { + u64 rpage; + + if (!(vpage = ipz_qpageit_get_inc(&eq->ipz_queue))) { + ret = H_RESOURCE; + goto create_eq_exit2; + } + + rpage = virt_to_abs(vpage); + ret = hipz_h_register_rpage_eq(shca->ipz_hca_handle, + eq->ipz_eq_handle, + &eq->pf, + 0, 0, rpage, 1); + + if (i == (nr_pages - 1)) { + /* last page */ + vpage = ipz_qpageit_get_inc(&eq->ipz_queue); + if (ret != H_SUCCESS || vpage) + goto create_eq_exit2; + } else { + if (ret != H_PAGE_REGISTERED || !vpage) + goto create_eq_exit2; + } + } + + ipz_qeit_reset(&eq->ipz_queue); + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { + ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + SA_INTERRUPT, "ehca_eq", + (void *)shca); + if (ret < 0) + EDEB_ERR(4, "Can't map interrupt handler."); + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { + ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + SA_INTERRUPT, "ehca_neq", + (void *)shca); + if (ret < 0) + EDEB_ERR(4, "Can't map interrupt handler."); + + tasklet_init(&eq->interrupt_task, ehca_tasklet_neq, (long)shca); + } + + eq->is_initialized = 1; + + EDEB_EX(7, "ret=%lx", ret); + + return 0; + +create_eq_exit2: + ipz_queue_dtor(&eq->ipz_queue); + +create_eq_exit1: + hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + + EDEB_EX(7, "ret=%lx", ret); + + return -EINVAL; +} + +void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq) +{ + unsigned long flags = 0; + void *eqe = NULL; + + EDEB_EN(7, "shca=%p eq=%p", shca, eq); + EHCA_CHECK_ADR_P(shca); + EHCA_CHECK_EQ_P(eq); + + spin_lock_irqsave(&eq->spinlock, flags); + eqe = ipz_eqit_eq_get_inc_valid(&eq->ipz_queue); + spin_unlock_irqrestore(&eq->spinlock, flags); + + EDEB_EX(7, "eq=%p eqe=%p", 
eq, eqe); + + return eqe; +} + +void ehca_poll_eqs(unsigned long data) +{ + struct ehca_shca *shca; + struct ehca_module *module = (struct ehca_module*)data; + + spin_lock(&module->shca_lock); + list_for_each_entry(shca, &module->shca_list, shca_list) { + if (shca->eq.is_initialized) + ehca_tasklet_eq((unsigned long)(void*)shca); + } + mod_timer(&module->timer, jiffies + HZ); + spin_unlock(&module->shca_lock); + + return; +} + +int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq) +{ + unsigned long flags = 0; + u64 h_ret = H_SUCCESS; + + EDEB_EN(7, "shca=%p eq=%p", shca, eq); + EHCA_CHECK_ADR(shca); + EHCA_CHECK_EQ(eq); + + spin_lock_irqsave(&eq->spinlock, flags); + ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + + spin_unlock_irqrestore(&eq->spinlock, flags); + + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "Can't free EQ resources."); + return -EINVAL; + } + ipz_queue_dtor(&eq->ipz_queue); + + EDEB_EX(7, "h_ret=%lx", h_ret); + + return h_ret; +} -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:02 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:02 -0700 Subject: [openib-general] [PATCH 09/13] IB/ehca: fwif In-Reply-To: <20068171311.7Z4EtLP0ZYtya78R@cisco.com> Message-ID: <20068171311.sHGelL4wOwoc17UG@cisco.com> drivers/infiniband/hw/ehca/hcp_if.c | 1473 +++++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/hcp_if.h | 261 ++++++ 2 files changed, 1734 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c new file mode 100644 index 0000000..2407eb6 --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -0,0 +1,1473 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Firmware Infiniband Interface code for POWER + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Gerd Bayer + * Waleri Fomin + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#define DEB_PREFIX "hcpi" + +#include +#include "ehca_tools.h" +#include "hcp_if.h" +#include "hcp_phyp.h" +#include "hipz_fns.h" + +#define H_ALL_RES_QP_ENHANCED_OPS EHCA_BMASK_IBM(9,11) +#define H_ALL_RES_QP_PTE_PIN EHCA_BMASK_IBM(12,12) +#define H_ALL_RES_QP_SERVICE_TYPE EHCA_BMASK_IBM(13,15) +#define H_ALL_RES_QP_LL_RQ_CQE_POSTING EHCA_BMASK_IBM(18,18) +#define H_ALL_RES_QP_LL_SQ_CQE_POSTING EHCA_BMASK_IBM(19,21) +#define H_ALL_RES_QP_SIGNALING_TYPE EHCA_BMASK_IBM(22,23) +#define H_ALL_RES_QP_UD_AV_LKEY_CTRL EHCA_BMASK_IBM(31,31) +#define H_ALL_RES_QP_RESOURCE_TYPE EHCA_BMASK_IBM(56,63) + +#define H_ALL_RES_QP_MAX_OUTST_SEND_WR EHCA_BMASK_IBM(0,15) +#define H_ALL_RES_QP_MAX_OUTST_RECV_WR EHCA_BMASK_IBM(16,31) +#define H_ALL_RES_QP_MAX_SEND_SGE EHCA_BMASK_IBM(32,39) +#define H_ALL_RES_QP_MAX_RECV_SGE EHCA_BMASK_IBM(40,47) + +#define H_ALL_RES_QP_ACT_OUTST_SEND_WR EHCA_BMASK_IBM(16,31) +#define H_ALL_RES_QP_ACT_OUTST_RECV_WR EHCA_BMASK_IBM(48,63) +#define H_ALL_RES_QP_ACT_SEND_SGE EHCA_BMASK_IBM(8,15) +#define H_ALL_RES_QP_ACT_RECV_SGE EHCA_BMASK_IBM(24,31) + +#define H_ALL_RES_QP_SQUEUE_SIZE_PAGES EHCA_BMASK_IBM(0,31) +#define H_ALL_RES_QP_RQUEUE_SIZE_PAGES EHCA_BMASK_IBM(32,63) + +/* direct access qp controls */ +#define DAQP_CTRL_ENABLE 0x01 +#define DAQP_CTRL_SEND_COMP 0x20 +#define DAQP_CTRL_RECV_COMP 0x40 + +static u32 get_longbusy_msecs(int longbusy_rc) +{ + switch (longbusy_rc) { + case H_LONG_BUSY_ORDER_1_MSEC: + return 1; + case H_LONG_BUSY_ORDER_10_MSEC: + return 10; + case H_LONG_BUSY_ORDER_100_MSEC: + return 100; + case H_LONG_BUSY_ORDER_1_SEC: + return 1000; + case H_LONG_BUSY_ORDER_10_SEC: + return 10000; + case H_LONG_BUSY_ORDER_100_SEC: + return 100000; + default: + return 1; + } +} + +static long ehca_hcall_7arg_7ret(unsigned long opcode, + unsigned long arg1, + unsigned long arg2, + unsigned long arg3, + unsigned long arg4, + unsigned long arg5, + unsigned long arg6, + unsigned long arg7, + unsigned long *out1, + unsigned long *out2, + unsigned long *out3, + unsigned long *out4, + unsigned long *out5, + unsigned long *out6, + unsigned long *out7) +{ + long ret = H_SUCCESS; + int i, sleep_msecs; + + EDEB_EN(7, "opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx arg5=%lx" + " arg6=%lx arg7=%lx", opcode, arg1, arg2, arg3, arg4, arg5, + arg6, arg7); + + for (i = 0; i < 5; i++) { + ret = plpar_hcall_7arg_7ret(opcode, + arg1, arg2, arg3, arg4, + arg5, arg6, arg7, + out1, out2, out3, out4, + out5, out6,out7); + + if (H_IS_LONG_BUSY(ret)) { + sleep_msecs = get_longbusy_msecs(ret); + msleep_interruptible(sleep_msecs); + continue; + } + + if (ret < H_SUCCESS) + EDEB_ERR(4, "opcode=%lx ret=%lx" + " arg1=%lx arg2=%lx arg3=%lx arg4=%lx" + " arg5=%lx arg6=%lx arg7=%lx" + " out1=%lx out2=%lx out3=%lx out4=%lx" + " out5=%lx out6=%lx out7=%lx", + opcode, ret, + arg1, arg2, arg3, arg4, + arg5, arg6, arg7, + *out1, *out2, *out3, *out4, + *out5, *out6, *out7); + + EDEB_EX(7, "opcode=%lx ret=%lx out1=%lx out2=%lx out3=%lx " + "out4=%lx out5=%lx out6=%lx out7=%lx", + opcode, ret, *out1, *out2, *out3, *out4, *out5, + *out6, *out7); + return ret; + } + + EDEB_EX(7, "opcode=%lx ret=H_BUSY", opcode); + + return H_BUSY; +} + +static long ehca_hcall_9arg_9ret(unsigned long opcode, + unsigned long arg1, + unsigned long arg2, + unsigned long arg3, + unsigned long arg4, + unsigned long arg5, + unsigned long arg6, + unsigned long arg7, + unsigned long arg8, + unsigned long arg9, + unsigned long *out1, + unsigned long *out2, + unsigned long *out3, + unsigned long *out4, + unsigned long *out5, + 
unsigned long *out6, + unsigned long *out7, + unsigned long *out8, + unsigned long *out9) +{ + long ret = H_SUCCESS; + int i, sleep_msecs; + + EDEB_EN(7, "opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx " + "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx", + opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7, + arg8, arg9); + + + for (i = 0; i < 5; i++) { + ret = plpar_hcall_9arg_9ret(opcode, + arg1, arg2, arg3, arg4, + arg5, arg6, arg7, arg8, + arg9, + out1, out2, out3, out4, + out5, out6, out7, out8, + out9); + + if (H_IS_LONG_BUSY(ret)) { + sleep_msecs = get_longbusy_msecs(ret); + msleep_interruptible(sleep_msecs); + continue; + } + + if (ret < H_SUCCESS) + EDEB_ERR(4, "opcode=%lx ret=%lx" + " arg1=%lx arg2=%lx arg3=%lx arg4=%lx" + " arg5=%lx arg6=%lx arg7=%lx arg8=%lx" + " arg9=%lx" + " out1=%lx out2=%lx out3=%lx out4=%lx" + " out5=%lx out6=%lx out7=%lx out8=%lx" + " out9=%lx", + opcode, ret, + arg1, arg2, arg3, arg4, + arg5, arg6, arg7, arg8, + arg9, + *out1, *out2, *out3, *out4, + *out5, *out6, *out7, *out8, + *out9); + + EDEB_EX(7, "opcode=%lx ret=%lx out1=%lx out2=%lx out3=%lx " + "out4=%lx out5=%lx out6=%lx out7=%lx out8=%lx out9=%lx", + opcode, ret,*out1, *out2, *out3, *out4, *out5, *out6, + *out7, *out8, *out9); + return ret; + + } + + EDEB_EX(7, "opcode=%lx ret=H_BUSY", opcode); + return H_BUSY; +} +u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfeq *pfeq, + const u32 neq_control, + const u32 number_of_entries, + struct ipz_eq_handle *eq_handle, + u32 * act_nr_of_entries, + u32 * act_pages, + u32 * eq_ist) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 act_nr_of_entries_out = 0; + u64 act_pages_out = 0; + u64 eq_ist_out = 0; + u64 allocate_controls = 0; + u32 x = (u64)(&x); + + EDEB_EN(7, "pfeq=%p adapter_handle=%lx new_control=%x" + " number_of_entries=%x", + pfeq, adapter_handle.handle, neq_control, + number_of_entries); + + /* resource type */ + allocate_controls = 3ULL; + + /* ISN is associated */ + if (neq_control != 1) + allocate_controls = (1ULL << (63 - 7)) | allocate_controls; + else /* notification event queue */ + allocate_controls = (1ULL << 63) | allocate_controls; + + ret = ehca_hcall_7arg_7ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + allocate_controls, /* r5 */ + number_of_entries, /* r6 */ + 0, 0, 0, 0, + &eq_handle->handle, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &act_nr_of_entries_out, /* r7 */ + &act_pages_out, /* r8 */ + &eq_ist_out, /* r8 */ + &dummy); + + *act_nr_of_entries = (u32)act_nr_of_entries_out; + *act_pages = (u32)act_pages_out; + *eq_ist = (u32)eq_ist_out; + + if (ret == H_NOT_ENOUGH_RESOURCES) + EDEB_ERR(4, "Not enough resource - ret=%lx ", ret); + + EDEB_EX(7, "act_nr_of_entries=%x act_pages=%x eq_ist=%x", + *act_nr_of_entries, *act_pages, *eq_ist); + + return ret; +} + +u64 hipz_h_reset_event(const struct ipz_adapter_handle adapter_handle, + struct ipz_eq_handle eq_handle, + const u64 event_mask) +{ + u64 ret = H_SUCCESS; + u64 dummy; + + EDEB_EN(7, "eq_handle=%lx, adapter_handle=%lx event_mask=%lx", + eq_handle.handle, adapter_handle.handle, event_mask); + + ret = ehca_hcall_7arg_7ret(H_RESET_EVENTS, + adapter_handle.handle, /* r4 */ + eq_handle.handle, /* r5 */ + event_mask, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle, + struct ehca_cq *cq, + struct ehca_alloc_cq_parms *param) +{ + u64 ret = H_SUCCESS; + u64 
dummy; + u64 act_nr_of_entries_out; + u64 act_pages_out; + u64 g_la_privileged_out; + u64 g_la_user_out; + + EDEB_EN(7, "Adapter_handle=%lx eq_handle=%lx cq_token=%x" + " cq_number_of_entries=%x", + adapter_handle.handle, param->eq_handle.handle, + cq->token, param->nr_cqe); + + ret = ehca_hcall_7arg_7ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + 2, /* r5 */ + param->eq_handle.handle, /* r6 */ + cq->token, /* r7 */ + param->nr_cqe, /* r8 */ + 0, 0, + &cq->ipz_cq_handle.handle, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &act_nr_of_entries_out, /* r7 */ + &act_pages_out, /* r8 */ + &g_la_privileged_out, /* r9 */ + &g_la_user_out); /* r10 */ + + param->act_nr_of_entries = (u32)act_nr_of_entries_out; + param->act_pages = (u32)act_pages_out; + + if (ret == H_SUCCESS) + hcp_galpas_ctor(&cq->galpas, g_la_privileged_out, g_la_user_out); + + if (ret == H_NOT_ENOUGH_RESOURCES) + EDEB_ERR(4, "Not enough resources. ret=%lx", ret); + + EDEB_EX(7, "cq_handle=%lx act_nr_of_entries=%x act_pages=%x", + cq->ipz_cq_handle.handle, param->act_nr_of_entries, param->act_pages); + + return ret; +} + +u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, + struct ehca_qp *qp, + struct ehca_alloc_qp_parms *parms) +{ + u64 ret = H_SUCCESS; + u64 allocate_controls; + u64 max_r10_reg; + u64 dummy = 0; + u64 qp_nr_out = 0; + u64 r6_out = 0; + u64 r7_out = 0; + u64 r8_out = 0; + u64 g_la_user_out = 0; + u64 r11_out = 0; + u16 max_nr_receive_wqes = qp->init_attr.cap.max_recv_wr + 1; + u16 max_nr_send_wqes = qp->init_attr.cap.max_send_wr + 1; + int daqp_ctrl = parms->daqp_ctrl; + + EDEB_EN(7, "Adapter_handle=%lx servicetype=%x signalingtype=%x" + " ud_av_l_key=%x send_cq_handle=%lx receive_cq_handle=%lx" + " async_eq_handle=%lx qp_token=%x pd=%x max_nr_send_wqes=%x" + " max_nr_receive_wqes=%x max_nr_send_sges=%x" + " max_nr_receive_sges=%x ud_av_l_key=%x galpa.pid=%x", + adapter_handle.handle, parms->servicetype, parms->sigtype, + parms->ud_av_l_key_ctl, qp->send_cq->ipz_cq_handle.handle, + qp->recv_cq->ipz_cq_handle.handle, parms->ipz_eq_handle.handle, + qp->token, parms->pd.value, max_nr_send_wqes, + max_nr_receive_wqes, parms->max_send_sge, parms->max_recv_sge, + parms->ud_av_l_key_ctl, qp->galpas.pid); + + allocate_controls = + EHCA_BMASK_SET(H_ALL_RES_QP_ENHANCED_OPS, + (daqp_ctrl & DAQP_CTRL_ENABLE) ? 1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_PTE_PIN, 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_SERVICE_TYPE, parms->servicetype) + | EHCA_BMASK_SET(H_ALL_RES_QP_SIGNALING_TYPE, parms->sigtype) + | EHCA_BMASK_SET(H_ALL_RES_QP_LL_RQ_CQE_POSTING, + (daqp_ctrl & DAQP_CTRL_RECV_COMP) ? 1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_LL_SQ_CQE_POSTING, + (daqp_ctrl & DAQP_CTRL_SEND_COMP) ? 
1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_UD_AV_LKEY_CTRL, + parms->ud_av_l_key_ctl) + | EHCA_BMASK_SET(H_ALL_RES_QP_RESOURCE_TYPE, 1); + + max_r10_reg = + EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_SEND_WR, + max_nr_send_wqes) + | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_RECV_WR, + max_nr_receive_wqes) + | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_SEND_SGE, + parms->max_send_sge) + | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE, + parms->max_recv_sge); + + + ret = ehca_hcall_9arg_9ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + allocate_controls, /* r5 */ + qp->send_cq->ipz_cq_handle.handle, + qp->recv_cq->ipz_cq_handle.handle, + parms->ipz_eq_handle.handle, + ((u64)qp->token << 32) | parms->pd.value, + max_r10_reg, /* r10 */ + parms->ud_av_l_key_ctl, /* r11 */ + 0, + &qp->ipz_qp_handle.handle, + &qp_nr_out, /* r5 */ + &r6_out, /* r6 */ + &r7_out, /* r7 */ + &r8_out, /* r8 */ + &dummy, /* r9 */ + &g_la_user_out, /* r10 */ + &r11_out, + &dummy); + + /* extract outputs */ + qp->real_qp_num = (u32)qp_nr_out; + + parms->act_nr_send_sges = + (u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, r6_out); + parms->act_nr_recv_wqes = + (u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_RECV_WR, r6_out); + parms->act_nr_send_sges = + (u8)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_SEND_SGE, r7_out); + parms->act_nr_recv_sges = + (u8)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_RECV_SGE, r7_out); + parms->nr_sq_pages = + (u32)EHCA_BMASK_GET(H_ALL_RES_QP_SQUEUE_SIZE_PAGES, r8_out); + parms->nr_rq_pages = + (u32)EHCA_BMASK_GET(H_ALL_RES_QP_RQUEUE_SIZE_PAGES, r8_out); + + if (ret == H_SUCCESS) + hcp_galpas_ctor(&qp->galpas, g_la_user_out, g_la_user_out); + + if (ret == H_NOT_ENOUGH_RESOURCES) + EDEB_ERR(4, "Not enough resources. ret=%lx",ret); + + EDEB_EX(7, "qp_nr=%x act_nr_send_wqes=%x" + " act_nr_receive_wqes=%x act_nr_send_sges=%x" + " act_nr_receive_sges=%x nr_sq_pages=%x" + " nr_rq_pages=%x galpa.user=%lx galpa.kernel=%lx", + qp->real_qp_num, parms->act_nr_send_wqes, + parms->act_nr_recv_wqes, parms->act_nr_send_sges, + parms->act_nr_recv_sges, parms->nr_sq_pages, + parms->nr_rq_pages, qp->galpas.user.fw_handle, + qp->galpas.kernel.fw_handle); + + return ret; +} + +u64 hipz_h_query_port(const struct ipz_adapter_handle adapter_handle, + const u8 port_id, + struct hipz_query_port *query_port_response_block) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 r_cb; + + EDEB_EN(7, "adapter_handle=%lx port_id %x", + adapter_handle.handle, port_id); + + if (((u64)query_port_response_block) & 0xfff) { + EDEB_ERR(4, "response block not page aligned"); + return H_PARAMETER; + } + + r_cb = virt_to_abs(query_port_response_block); + + ret = ehca_hcall_7arg_7ret(H_QUERY_PORT, + adapter_handle.handle, /* r4 */ + port_id, /* r5 */ + r_cb, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_DMP(7, query_port_response_block, 64, "query_port_response_block"); + EDEB(7, "offset31=%x offset35=%x offset36=%x", + ((u32*)query_port_response_block)[32], + ((u32*)query_port_response_block)[36], + ((u32*)query_port_response_block)[37]); + EDEB(7, "offset200=%x offset201=%x offset202=%x " + "offset203=%x", + ((u32*)query_port_response_block)[0x200], + ((u32*)query_port_response_block)[0x201], + ((u32*)query_port_response_block)[0x202], + ((u32*)query_port_response_block)[0x203]); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle, + struct hipz_query_hca *query_hca_rblock) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "adapter_handle=%lx", 
adapter_handle.handle); + + if (((u64)query_hca_rblock) & 0xfff) { + EDEB_ERR(4, "response_block=%p not page aligned", + query_hca_rblock); + return H_PARAMETER; + } + + r_cb = virt_to_abs(query_hca_rblock); + + ret = ehca_hcall_7arg_7ret(H_QUERY_HCA, + adapter_handle.handle, /* r4 */ + r_cb, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB(7, "offset0=%x offset1=%x offset2=%x offset3=%x", + ((u32*)query_hca_rblock)[0], + ((u32*)query_hca_rblock)[1], + ((u32*)query_hca_rblock)[2], ((u32*)query_hca_rblock)[3]); + EDEB(7, "offset4=%x offset5=%x offset6=%x offset7=%x", + ((u32*)query_hca_rblock)[4], + ((u32*)query_hca_rblock)[5], + ((u32*)query_hca_rblock)[6], ((u32*)query_hca_rblock)[7]); + EDEB(7, "offset8=%x offset9=%x offseta=%x offsetb=%x", + ((u32*)query_hca_rblock)[8], + ((u32*)query_hca_rblock)[9], + ((u32*)query_hca_rblock)[10], ((u32*)query_hca_rblock)[11]); + EDEB(7, "offsetc=%x offsetd=%x offsete=%x offsetf=%x", + ((u32*)query_hca_rblock)[12], + ((u32*)query_hca_rblock)[13], + ((u32*)query_hca_rblock)[14], ((u32*)query_hca_rblock)[15]); + EDEB(7, "offset136=%x offset192=%x offset204=%x", + ((u32*)query_hca_rblock)[32], + ((u32*)query_hca_rblock)[48], ((u32*)query_hca_rblock)[51]); + EDEB(7, "offset231=%x offset235=%x", + ((u32*)query_hca_rblock)[57], ((u32*)query_hca_rblock)[58]); + EDEB(7, "offset200=%x offset201=%x offset202=%x offset203=%x", + ((u32*)query_hca_rblock)[0x201], + ((u32*)query_hca_rblock)[0x202], + ((u32*)query_hca_rblock)[0x203], + ((u32*)query_hca_rblock)[0x204]); + + EDEB_EX(7, "ret=%lx adapter_handle=%lx", + ret, adapter_handle.handle); + + return ret; +} + +u64 hipz_h_register_rpage(const struct ipz_adapter_handle adapter_handle, + const u8 pagesize, + const u8 queue_type, + const u64 resource_handle, + const u64 logical_address_of_page, + u64 count) +{ + u64 ret = H_SUCCESS; + u64 dummy; + + EDEB_EN(7, "adapter_handle=%lx pagesize=%x queue_type=%x" + " resource_handle=%lx logical_address_of_page=%lx count=%lx", + adapter_handle.handle, pagesize, queue_type, + resource_handle, logical_address_of_page, count); + + ret = ehca_hcall_7arg_7ret(H_REGISTER_RPAGES, + adapter_handle.handle, /* r4 */ + queue_type | pagesize << 8, /* r5 */ + resource_handle, /* r6 */ + logical_address_of_page, /* r7 */ + count, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_register_rpage_eq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_eq_handle eq_handle, + struct ehca_pfeq *pfeq, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count) +{ + u64 ret = H_SUCCESS; + + EDEB_EN(7, "pfeq=%p adapter_handle=%lx eq_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfeq, adapter_handle.handle, eq_handle.handle, pagesize, + queue_type,logical_address_of_page, count); + + if (count != 1) { + EDEB_ERR(4, "Ppage counter=%lx", count); + return H_PARAMETER; + } + ret = hipz_h_register_rpage(adapter_handle, + pagesize, + queue_type, + eq_handle.handle, + logical_address_of_page, count); + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u32 hipz_h_query_int_state(const struct ipz_adapter_handle adapter_handle, + u32 ist) +{ + u32 ret = H_SUCCESS; + u64 dummy = 0; + + EDEB_EN(7, "ist=%x", ist); + + ret = ehca_hcall_7arg_7ret(H_QUERY_INT_STATE, + adapter_handle.handle, /* r4 */ + ist, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + 
&dummy, + &dummy, + &dummy, + &dummy); + + if (ret != H_SUCCESS && ret != H_BUSY) + EDEB_ERR(4, "Could not query interrupt state."); + + EDEB_EX(7, "interrupt state: %x", ret); + + return ret; +} + +u64 hipz_h_register_rpage_cq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_cq_handle cq_handle, + struct ehca_pfcq *pfcq, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count, + const struct h_galpa gal) +{ + u64 ret = H_SUCCESS; + + EDEB_EN(7, "pfcq=%p adapter_handle=%lx cq_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfcq, adapter_handle.handle, cq_handle.handle, pagesize, + queue_type, logical_address_of_page, count); + + if (count != 1) { + EDEB_ERR(4, "Page counter=%lx", count); + return H_PARAMETER; + } + + ret = hipz_h_register_rpage(adapter_handle, pagesize, queue_type, + cq_handle.handle, logical_address_of_page, + count); + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count, + const struct h_galpa galpa) +{ + u64 ret = H_SUCCESS; + + EDEB_EN(7, "pfqp=%p adapter_handle=%lx qp_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfqp, adapter_handle.handle, qp_handle.handle, pagesize, + queue_type, logical_address_of_page, count); + + if (count != 1) { + EDEB_ERR(4, "Page counter=%lx", count); + return H_PARAMETER; + } + + ret = hipz_h_register_rpage(adapter_handle,pagesize,queue_type, + qp_handle.handle,logical_address_of_page, + count); + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_disable_and_get_wqe(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + void **log_addr_next_sq_wqe2processed, + void **log_addr_next_rq_wqe2processed, + int dis_and_get_function_code) +{ + u64 ret = H_SUCCESS; + u8 function_code = 1; + u64 dummy, dummy1, dummy2; + + EDEB_EN(7, "pfqp=%p adapter_handle=%lx function=%x qp_handle=%lx", + pfqp, adapter_handle.handle, function_code, qp_handle.handle); + + if (!log_addr_next_sq_wqe2processed) + log_addr_next_sq_wqe2processed = (void**)&dummy1; + if (!log_addr_next_rq_wqe2processed) + log_addr_next_rq_wqe2processed = (void**)&dummy2; + + ret = ehca_hcall_7arg_7ret(H_DISABLE_AND_GETC, + adapter_handle.handle, /* r4 */ + dis_and_get_function_code, /* r5 */ + qp_handle.handle, /* r6 */ + 0, 0, 0, 0, + (void*)log_addr_next_sq_wqe2processed, + (void*)log_addr_next_rq_wqe2processed, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + EDEB_EX(7, "ret=%lx ladr_next_rq_wqe_out=%p" + " ladr_next_sq_wqe_out=%p", ret, + *log_addr_next_sq_wqe2processed, + *log_addr_next_rq_wqe2processed); + + return ret; +} + +u64 hipz_h_modify_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + const u64 update_mask, + struct hcp_modify_qp_control_block *mqpcb, + struct h_galpa gal) +{ + u64 ret = H_SUCCESS; + u64 invalid_attribute_identifier = 0; + u64 rc_attrib_mask = 0; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "pfqp=%p adapter_handle=%lx qp_handle=%lx" + " update_mask=%lx qp_state=%x mqpcb=%p", + pfqp, adapter_handle.handle, qp_handle.handle, + update_mask, mqpcb->qp_state, mqpcb); + + r_cb = virt_to_abs(mqpcb); + ret = ehca_hcall_7arg_7ret(H_MODIFY_QP, + 
adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + update_mask, /* r6 */ + r_cb, /* r7 */ + 0, 0, 0, + &invalid_attribute_identifier, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &dummy, /* r7 */ + &dummy, /* r8 */ + &rc_attrib_mask, /* r9 */ + &dummy); + if (ret == H_NOT_ENOUGH_RESOURCES) + EDEB_ERR(4, "Insufficient resources ret=%lx", ret); + + EDEB_EX(7, "ret=%lx invalid_attribute_identifier=%lx" + " invalid_attribute_MASK=%lx", ret, + invalid_attribute_identifier, rc_attrib_mask); + + return ret; +} + +u64 hipz_h_query_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + struct hcp_modify_qp_control_block *qqpcb, + struct h_galpa gal) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "adapter_handle=%lx qp_handle=%lx", + adapter_handle.handle, qp_handle.handle); + + r_cb = virt_to_abs(qqpcb); + EDEB(7, "r_cb=%lx", r_cb); + + ret = ehca_hcall_7arg_7ret(H_QUERY_QP, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + r_cb, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_destroy_qp(const struct ipz_adapter_handle adapter_handle, + struct ehca_qp *qp) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 ladr_next_sq_wqe_out; + u64 ladr_next_rq_wqe_out; + + EDEB_EN(7, "qp=%p ipz_qp_handle=%lx adapter_handle=%lx", + qp, qp->ipz_qp_handle.handle, adapter_handle.handle); + + ret = hcp_galpas_dtor(&qp->galpas); + if (ret) { + EDEB_ERR(4, "Could not destruct qp->galpas"); + return H_RESOURCE; + } + ret = ehca_hcall_7arg_7ret(H_DISABLE_AND_GETC, + adapter_handle.handle, /* r4 */ + /* function code */ + 1, /* r5 */ + qp->ipz_qp_handle.handle, /* r6 */ + 0, 0, 0, 0, + &ladr_next_sq_wqe_out, /* r4 */ + &ladr_next_rq_wqe_out, /* r5 */ + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + if (ret == H_HARDWARE) + EDEB_ERR(4, "HCA not operational. ret=%lx", ret); + + ret = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + qp->ipz_qp_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + if (ret == H_RESOURCE) + EDEB_ERR(4, "Resource still in use. 
ret=%lx", ret); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_define_aqp0(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u32 port) +{ + u64 ret = H_SUCCESS; + u64 dummy; + + EDEB_EN(7, "port=%x ipz_qp_handle=%lx adapter_handle=%lx", + port, qp_handle.handle, adapter_handle.handle); + + ret = ehca_hcall_7arg_7ret(H_DEFINE_AQP0, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + port, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_define_aqp1(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u32 port, u32 * pma_qp_nr, + u32 * bma_qp_nr) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 pma_qp_nr_out; + u64 bma_qp_nr_out; + + EDEB_EN(7, "port=%x qp_handle=%lx adapter_handle=%lx", + port, qp_handle.handle, adapter_handle.handle); + + ret = ehca_hcall_7arg_7ret(H_DEFINE_AQP1, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + port, /* r6 */ + 0, 0, 0, 0, + &pma_qp_nr_out, /* r4 */ + &bma_qp_nr_out, /* r5 */ + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + *pma_qp_nr = (u32)pma_qp_nr_out; + *bma_qp_nr = (u32)bma_qp_nr_out; + + if (ret == H_ALIAS_EXIST) + EDEB_ERR(4, "AQP1 already exists. ret=%lx", ret); + + EDEB_EX(7, "ret=%lx pma_qp_nr=%i bma_qp_nr=%i", + ret, (int)*pma_qp_nr, (int)*bma_qp_nr); + + return ret; +} + +u64 hipz_h_attach_mcqp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u16 mcg_dlid, + u64 subnet_prefix, u64 interface_id) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u8 *dgid_sp = (u8*)&subnet_prefix; + u8 *dgid_ii = (u8*)&interface_id; + + EDEB_EN(7, "qp_handle=%lx adapter_handle=%lx\nMCG_DGID =" + " %d.%d.%d.%d.%d.%d.%d.%d." + " %d.%d.%d.%d.%d.%d.%d.%d", + qp_handle.handle, adapter_handle.handle, + dgid_sp[0], dgid_sp[1], + dgid_sp[2], dgid_sp[3], + dgid_sp[4], dgid_sp[5], + dgid_sp[6], dgid_sp[7], + dgid_ii[0], dgid_ii[1], + dgid_ii[2], dgid_ii[3], + dgid_ii[4], dgid_ii[5], + dgid_ii[6], dgid_ii[7]); + + ret = ehca_hcall_7arg_7ret(H_ATTACH_MCQP, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + mcg_dlid, /* r6 */ + interface_id, /* r7 */ + subnet_prefix, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + if (ret == H_NOT_ENOUGH_RESOURCES) + EDEB_ERR(4, "Not enough resources. ret=%lx", ret); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_detach_mcqp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u16 mcg_dlid, + u64 subnet_prefix, u64 interface_id) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u8 *dgid_sp = (u8*)&subnet_prefix; + u8 *dgid_ii = (u8*)&interface_id; + + EDEB_EN(7, "qp_handle=%lx adapter_handle=%lx\nMCG_DGID =" + " %d.%d.%d.%d.%d.%d.%d.%d." 
+ " %d.%d.%d.%d.%d.%d.%d.%d", + qp_handle.handle, adapter_handle.handle, + dgid_sp[0], dgid_sp[1], + dgid_sp[2], dgid_sp[3], + dgid_sp[4], dgid_sp[5], + dgid_sp[6], dgid_sp[7], + dgid_ii[0], dgid_ii[1], + dgid_ii[2], dgid_ii[3], + dgid_ii[4], dgid_ii[5], + dgid_ii[6], dgid_ii[7]); + ret = ehca_hcall_7arg_7ret(H_DETACH_MCQP, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + mcg_dlid, /* r6 */ + interface_id, /* r7 */ + subnet_prefix, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_destroy_cq(const struct ipz_adapter_handle adapter_handle, + struct ehca_cq *cq, + u8 force_flag) +{ + u64 ret = H_SUCCESS; + u64 dummy; + + EDEB_EN(7, "cq->pf=%p cq=.%p ipz_cq_handle=%lx adapter_handle=%lx", + &cq->pf, cq, cq->ipz_cq_handle.handle, adapter_handle.handle); + + ret = hcp_galpas_dtor(&cq->galpas); + if (ret) { + EDEB_ERR(4, "Could not destruct cp->galpas"); + return H_RESOURCE; + } + + ret = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + cq->ipz_cq_handle.handle, /* r5 */ + force_flag != 0 ? 1L : 0L, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + if (ret == H_RESOURCE) + EDEB(4, "ret=%lx ", ret); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_destroy_eq(const struct ipz_adapter_handle adapter_handle, + struct ehca_eq *eq) +{ + u64 ret = H_SUCCESS; + u64 dummy; + + EDEB_EN(7, "eq->pf=%p eq=%p ipz_eq_handle=%lx adapter_handle=%lx", + &eq->pf, eq, eq->ipz_eq_handle.handle, + adapter_handle.handle); + + ret = hcp_galpas_dtor(&eq->galpas); + if (ret) { + EDEB_ERR(4, "Could not destruct eq->galpas"); + return H_RESOURCE; + } + + ret = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + eq->ipz_eq_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + + if (ret == H_RESOURCE) + EDEB_ERR(4, "Resource in use. 
ret=%lx ", ret); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + const u64 vaddr, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ehca_mr_hipzout_parms *outparms) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + + EDEB_EN(7, "adapter_handle=%lx mr=%p vaddr=%lx length=%lx" + " access_ctrl=%x pd=%x", + adapter_handle.handle, mr, vaddr, length, access_ctrl, + pd.value); + + ret = ehca_hcall_7arg_7ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + 5, /* r5 */ + vaddr, /* r6 */ + length, /* r7 */ + (((u64)access_ctrl) << 32ULL), /* r8 */ + pd.value, /* r9 */ + 0, + &(outparms->handle.handle), /* r4 */ + &dummy, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + outparms->lkey = (u32)lkey_out; + outparms->rkey = (u32)rkey_out; + + EDEB_EX(7, "ret=%lx mr_handle=%lx lkey=%x rkey=%x", + ret, outparms->handle.handle, outparms->lkey, outparms->rkey); + + return ret; +} + +u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count) +{ + u64 ret = H_SUCCESS; + + EDEB_EN(7, "adapter_handle=%lx mr=%p mr_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + adapter_handle.handle, mr, mr->ipz_mr_handle.handle, pagesize, + queue_type, logical_address_of_page, count); + + if ((count > 1) && (logical_address_of_page & 0xfff)) { + EDEB_ERR(4, "logical_address_of_page not on a 4k boundary " + "adapter_handle=%lx mr=%p mr_handle=%lx " + "pagesize=%x queue_type=%x logical_address_of_page=%lx" + " count=%lx", + adapter_handle.handle, mr, mr->ipz_mr_handle.handle, + pagesize, queue_type, logical_address_of_page, count); + ret = H_PARAMETER; + } else + ret = hipz_h_register_rpage(adapter_handle, pagesize, + queue_type, + mr->ipz_mr_handle.handle, + logical_address_of_page, count); + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_query_mr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + struct ehca_mr_hipzout_parms *outparms) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 remote_len_out; + u64 remote_vaddr_out; + u64 acc_ctrl_pd_out; + u64 r9_out; + + EDEB_EN(7, "adapter_handle=%lx mr=%p mr_handle=%lx", + adapter_handle.handle, mr, mr->ipz_mr_handle.handle); + + ret = ehca_hcall_7arg_7ret(H_QUERY_MR, + adapter_handle.handle, /* r4 */ + mr->ipz_mr_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &outparms->len, /* r4 */ + &outparms->vaddr, /* r5 */ + &remote_len_out, /* r6 */ + &remote_vaddr_out, /* r7 */ + &acc_ctrl_pd_out, /* r8 */ + &r9_out, + &dummy); + + outparms->acl = acc_ctrl_pd_out >> 32; + outparms->lkey = (u32)(r9_out >> 32); + outparms->rkey = (u32)(r9_out & (0xffffffff)); + + EDEB_EX(7, "ret=%lx mr_local_length=%lx mr_local_vaddr=%lx " + "mr_remote_length=%lx mr_remote_vaddr=%lx access_ctrl=%x " + "pd=%x lkey=%x rkey=%x", ret, outparms->len, + outparms->vaddr, remote_len_out, remote_vaddr_out, + outparms->acl, outparms->acl, outparms->lkey, outparms->rkey); + + return ret; +} + +u64 hipz_h_free_resource_mr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr) +{ + u64 ret = H_SUCCESS; + u64 dummy; + + EDEB_EN(7, "adapter_handle=%lx mr=%p mr_handle=%lx", + adapter_handle.handle, mr, mr->ipz_mr_handle.handle); + + ret = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + 
adapter_handle.handle, /* r4 */ + mr->ipz_mr_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_reregister_pmr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + const u64 vaddr_in, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + const u64 mr_addr_cb, + struct ehca_mr_hipzout_parms *outparms) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + + EDEB_EN(7, "adapter_handle=%lx mr=%p mr_handle=%lx vaddr_in=%lx " + "length=%lx access_ctrl=%x pd=%x mr_addr_cb=%lx", + adapter_handle.handle, mr, mr->ipz_mr_handle.handle, vaddr_in, + length, access_ctrl, pd.value, mr_addr_cb); + + ret = ehca_hcall_7arg_7ret(H_REREGISTER_PMR, + adapter_handle.handle, /* r4 */ + mr->ipz_mr_handle.handle, /* r5 */ + vaddr_in, /* r6 */ + length, /* r7 */ + /* r8 */ + ((((u64)access_ctrl) << 32ULL) | pd.value), + mr_addr_cb, /* r9 */ + 0, + &dummy, /* r4 */ + &outparms->vaddr, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + + outparms->lkey = (u32)lkey_out; + outparms->rkey = (u32)rkey_out; + + EDEB_EX(7, "ret=%lx vaddr=%lx lkey=%x rkey=%x", + ret, outparms->vaddr, outparms->lkey, outparms->rkey); + return ret; +} + +u64 hipz_h_register_smr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + const struct ehca_mr *orig_mr, + const u64 vaddr_in, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ehca_mr_hipzout_parms *outparms) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + + EDEB_EN(7, "adapter_handle=%lx orig_mr=%p orig_mr_handle=%lx " + "vaddr_in=%lx access_ctrl=%x pd=%x", adapter_handle.handle, + orig_mr, orig_mr->ipz_mr_handle.handle, vaddr_in, access_ctrl, + pd.value); + + + ret = ehca_hcall_7arg_7ret(H_REGISTER_SMR, + adapter_handle.handle, /* r4 */ + orig_mr->ipz_mr_handle.handle, /* r5 */ + vaddr_in, /* r6 */ + (((u64)access_ctrl) << 32ULL), /* r7 */ + pd.value, /* r8 */ + 0, 0, + &(outparms->handle.handle), /* r4 */ + &dummy, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + outparms->lkey = (u32)lkey_out; + outparms->rkey = (u32)rkey_out; + + EDEB_EX(7, "ret=%lx mr_handle=%lx lkey=%x rkey=%x", + ret, outparms->handle.handle, outparms->lkey, outparms->rkey); + + return ret; +} + +u64 hipz_h_alloc_resource_mw(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mw *mw, + const struct ipz_pd pd, + struct ehca_mw_hipzout_parms *outparms) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 rkey_out; + + EDEB_EN(7, "adapter_handle=%lx mw=%p pd=%x", + adapter_handle.handle, mw, pd.value); + + ret = ehca_hcall_7arg_7ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + 6, /* r5 */ + pd.value, /* r6 */ + 0, 0, 0, 0, + &(outparms->handle.handle), /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + + outparms->rkey = (u32)rkey_out; + + EDEB_EX(7, "ret=%lx mw_handle=%lx rkey=%x", + ret, outparms->handle.handle, outparms->rkey); + return ret; +} + +u64 hipz_h_query_mw(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mw *mw, + struct ehca_mw_hipzout_parms *outparms) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 pd_out; + u64 rkey_out; + + EDEB_EN(7, "adapter_handle=%lx mw=%p mw_handle=%lx", + adapter_handle.handle, mw, mw->ipz_mw_handle.handle); + + ret = ehca_hcall_7arg_7ret(H_QUERY_MW, + adapter_handle.handle, /* r4 
*/ + mw->ipz_mw_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &rkey_out, /* r7 */ + &pd_out, /* r8 */ + &dummy, + &dummy); + outparms->rkey = (u32)rkey_out; + + EDEB_EX(7, "ret=%lx rkey=%x pd=%lx", ret, outparms->rkey, pd_out); + + return ret; +} + +u64 hipz_h_free_resource_mw(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mw *mw) +{ + u64 ret = H_SUCCESS; + u64 dummy; + + EDEB_EN(7, "adapter_handle=%lx mw=%p mw_handle=%lx", + adapter_handle.handle, mw, mw->ipz_mw_handle.handle); + + ret = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + mw->ipz_mw_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} + +u64 hipz_h_error_data(const struct ipz_adapter_handle adapter_handle, + const u64 ressource_handle, + void *rblock, + unsigned long *byte_count) +{ + u64 ret = H_SUCCESS; + u64 dummy; + u64 r_cb; + + EDEB_EN(7, "adapter_handle=%lx ressource_handle=%lx rblock=%p", + adapter_handle.handle, ressource_handle, rblock); + + if (((u64)rblock) & 0xfff) { + EDEB_ERR(4, "rblock not page aligned."); + return H_PARAMETER; + } + + r_cb = virt_to_abs(rblock); + + ret = ehca_hcall_7arg_7ret(H_ERROR_DATA, + adapter_handle.handle, + ressource_handle, + r_cb, + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} diff --git a/drivers/infiniband/hw/ehca/hcp_if.h b/drivers/infiniband/hw/ehca/hcp_if.h new file mode 100644 index 0000000..39956d8 --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_if.h @@ -0,0 +1,261 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Firmware Infiniband Interface code for POWER + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Gerd Bayer + * Waleri Fomin + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef __HCP_IF_H__ +#define __HCP_IF_H__ + +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "hipz_hw.h" + +/* + * hipz_h_alloc_resource_eq allocates EQ resources in HW and FW, initalize + * resources, create the empty EQPT (ring). + */ +u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfeq *pfeq, + const u32 neq_control, + const u32 number_of_entries, + struct ipz_eq_handle *eq_handle, + u32 * act_nr_of_entries, + u32 * act_pages, + u32 * eq_ist); + +u64 hipz_h_reset_event(const struct ipz_adapter_handle adapter_handle, + struct ipz_eq_handle eq_handle, + const u64 event_mask); +/* + * hipz_h_allocate_resource_cq allocates CQ resources in HW and FW, initialize + * resources, create the empty CQPT (ring). + */ +u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle, + struct ehca_cq *cq, + struct ehca_alloc_cq_parms *param); + + +/* + * hipz_h_alloc_resource_qp allocates QP resources in HW and FW, + * initialize resources, create empty QPPTs (2 rings). + */ +u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, + struct ehca_qp *qp, + struct ehca_alloc_qp_parms *parms); + +u64 hipz_h_query_port(const struct ipz_adapter_handle adapter_handle, + const u8 port_id, + struct hipz_query_port *query_port_response_block); + +u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle, + struct hipz_query_hca *query_hca_rblock); + +/* + * hipz_h_register_rpage internal function in hcp_if.h for all + * hcp_H_REGISTER_RPAGE calls. + */ +u64 hipz_h_register_rpage(const struct ipz_adapter_handle adapter_handle, + const u8 pagesize, + const u8 queue_type, + const u64 resource_handle, + const u64 logical_address_of_page, + u64 count); + +u64 hipz_h_register_rpage_eq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_eq_handle eq_handle, + struct ehca_pfeq *pfeq, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count); + +u32 hipz_h_query_int_state(const struct ipz_adapter_handle + hcp_adapter_handle, + u32 ist); + +u64 hipz_h_register_rpage_cq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_cq_handle cq_handle, + struct ehca_pfcq *pfcq, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count, + const struct h_galpa gal); + +u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count, + const struct h_galpa galpa); + +u64 hipz_h_disable_and_get_wqe(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + void **log_addr_next_sq_wqe_tb_processed, + void **log_addr_next_rq_wqe_tb_processed, + int dis_and_get_function_code); +enum hcall_sigt { + HCALL_SIGT_NO_CQE = 0, + HCALL_SIGT_BY_WQE = 1, + HCALL_SIGT_EVERY = 2 +}; + +u64 hipz_h_modify_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + const u64 update_mask, + struct hcp_modify_qp_control_block *mqpcb, + struct h_galpa gal); + +u64 hipz_h_query_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + struct hcp_modify_qp_control_block *qqpcb, + struct h_galpa gal); + +u64 hipz_h_destroy_qp(const struct ipz_adapter_handle adapter_handle, + struct ehca_qp 
*qp); + +u64 hipz_h_define_aqp0(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u32 port); + +u64 hipz_h_define_aqp1(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u32 port, u32 * pma_qp_nr, + u32 * bma_qp_nr); + +u64 hipz_h_attach_mcqp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u16 mcg_dlid, + u64 subnet_prefix, u64 interface_id); + +u64 hipz_h_detach_mcqp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u16 mcg_dlid, + u64 subnet_prefix, u64 interface_id); + +u64 hipz_h_destroy_cq(const struct ipz_adapter_handle adapter_handle, + struct ehca_cq *cq, + u8 force_flag); + +u64 hipz_h_destroy_eq(const struct ipz_adapter_handle adapter_handle, + struct ehca_eq *eq); + +/* + * hipz_h_alloc_resource_mr allocates MR resources in HW and FW, initialize + * resources. + */ +u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + const u64 vaddr, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ehca_mr_hipzout_parms *outparms); + +/* hipz_h_register_rpage_mr registers MR resource pages in HW and FW */ +u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count); + +/* hipz_h_query_mr queries MR in HW and FW */ +u64 hipz_h_query_mr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + struct ehca_mr_hipzout_parms *outparms); + +/* hipz_h_free_resource_mr frees MR resources in HW and FW */ +u64 hipz_h_free_resource_mr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr); + +/* hipz_h_reregister_pmr reregisters MR in HW and FW */ +u64 hipz_h_reregister_pmr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + const u64 vaddr_in, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + const u64 mr_addr_cb, + struct ehca_mr_hipzout_parms *outparms); + +/* hipz_h_register_smr register shared MR in HW and FW */ +u64 hipz_h_register_smr(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mr *mr, + const struct ehca_mr *orig_mr, + const u64 vaddr_in, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ehca_mr_hipzout_parms *outparms); + +/* + * hipz_h_alloc_resource_mw allocates MW resources in HW and FW, initialize + * resources. 
+ */ +u64 hipz_h_alloc_resource_mw(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mw *mw, + const struct ipz_pd pd, + struct ehca_mw_hipzout_parms *outparms); + +/* hipz_h_query_mw queries MW in HW and FW */ +u64 hipz_h_query_mw(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mw *mw, + struct ehca_mw_hipzout_parms *outparms); + +/* hipz_h_free_resource_mw frees MW resources in HW and FW */ +u64 hipz_h_free_resource_mw(const struct ipz_adapter_handle adapter_handle, + const struct ehca_mw *mw); + +u64 hipz_h_error_data(const struct ipz_adapter_handle adapter_handle, + const u64 ressource_handle, + void *rblock, + unsigned long *byte_count); + +#endif /* __HCP_IF_H__ */ -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:01 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:01 -0700 Subject: [openib-general] [PATCH 03/13] IB/ehca: irq In-Reply-To: <20068171311.X1v1Q4Gk1v3wd7qJ@cisco.com> Message-ID: <20068171311.VUo6fig31aLNQqvN@cisco.com> drivers/infiniband/hw/ehca/ehca_irq.c | 847 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_irq.h | 77 +++ 2 files changed, 924 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c new file mode 100644 index 0000000..c66c6aa --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -0,0 +1,847 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Functions for EQs, NEQs and interrupts + * + * Authors: Heiko J Schick + * Khadija Souissi + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#define DEB_PREFIX "eirq" + +#include "ehca_classes.h" +#include "ehca_irq.h" +#include "ehca_iverbs.h" +#include "ehca_tools.h" +#include "hcp_if.h" +#include "hipz_fns.h" + +#define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) +#define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_EE_IDENTIFIER EHCA_BMASK_IBM(2,7) +#define EQE_CQ_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_QP_TOKEN EHCA_BMASK_IBM(32,63) +#define EQE_CQ_TOKEN EHCA_BMASK_IBM(32,63) + +#define NEQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) +#define NEQE_EVENT_CODE EHCA_BMASK_IBM(2,7) +#define NEQE_PORT_NUMBER EHCA_BMASK_IBM(8,15) +#define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16,16) + +#define ERROR_DATA_LENGTH EHCA_BMASK_IBM(52,63) +#define ERROR_DATA_TYPE EHCA_BMASK_IBM(0,7) + +static void queue_comp_task(struct ehca_cq *__cq); + +static struct ehca_comp_pool* pool; +static struct notifier_block comp_pool_callback_nb; + +static inline void comp_event_callback(struct ehca_cq *cq) +{ + EDEB_EN(7, "cq=%p", cq); + + if (!cq->ib_cq.comp_handler) + return; + + spin_lock(&cq->cb_lock); + cq->ib_cq.comp_handler(&cq->ib_cq, cq->ib_cq.cq_context); + spin_unlock(&cq->cb_lock); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +static void print_error_data(struct ehca_shca * shca, void* data, + u64* rblock, int length) +{ + u64 type = EHCA_BMASK_GET(ERROR_DATA_TYPE, rblock[2]); + u64 resource = rblock[1]; + + EDEB_EN(7, "shca=%p data=%p rblock=%p length=%x", + shca, data, rblock, length); + + switch (type) { + case 0x1: /* Queue Pair */ + { + struct ehca_qp *qp = (struct ehca_qp*)data; + + /* only print error data if AER is set */ + if (rblock[6] == 0) + return; + + EDEB_ERR(4, "QP 0x%x (resource=%lx) has errors.", + qp->ib_qp.qp_num, resource); + break; + } + case 0x4: /* Completion Queue */ + { + struct ehca_cq *cq = (struct ehca_cq*)data; + + EDEB_ERR(4, "CQ 0x%x (resource=%lx) has errors.", + cq->cq_number, resource); + break; + } + default: + EDEB_ERR(4, "Unknown errror type: %lx on %s.", + type, shca->ib_device.name); + break; + } + + EDEB_ERR(4, "Error data is available: %lx.", resource); + EDEB_ERR(4, "EHCA ----- error data begin " + "---------------------------------------------------"); + EDEB_DMP(4, rblock, length, "resource=%lx", resource); + EDEB_ERR(4, "EHCA ----- error data end " + "----------------------------------------------------"); + + EDEB_EX(7, ""); + + return; +} + +int ehca_error_data(struct ehca_shca *shca, void *data, + u64 resource) +{ + + unsigned long ret = 0; + u64 *rblock; + unsigned long block_count; + + EDEB_EN(7, "shca=%p data=%p resource=%lx", shca, data, resource); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Cannot allocate rblock memory."); + ret = -ENOMEM; + goto error_data1; + } + + ret = hipz_h_error_data(shca->ipz_hca_handle, + resource, + rblock, + &block_count); + + if (ret == H_R_STATE) { + EDEB_ERR(4, "No error data is available: %lx.", resource); + } + else if (ret == H_SUCCESS) { + int length; + + length = EHCA_BMASK_GET(ERROR_DATA_LENGTH, rblock[0]); + + if (length > PAGE_SIZE) + length = PAGE_SIZE; + + print_error_data(shca, data, rblock, length); + } + else { + EDEB_ERR(4, "Error data could not be fetched: %lx", resource); + } + + kfree(rblock); + +error_data1: + return ret; + +} + +static void qp_event_callback(struct ehca_shca *shca, + u64 eqe, + enum ib_event_type event_type) +{ + struct ib_event event; + struct ehca_qp *qp; + unsigned long flags; + u32 token = EHCA_BMASK_GET(EQE_QP_TOKEN, eqe); + + 
EDEB_EN(7, "eqe=%lx", eqe); + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + + if (!qp) + return; + + ehca_error_data(shca, qp, qp->ipz_qp_handle.handle); + + if (!qp->ib_qp.event_handler) + return; + + event.device = &shca->ib_device; + event.event = event_type; + event.element.qp = &qp->ib_qp; + + qp->ib_qp.event_handler(&event, qp->ib_qp.qp_context); + + EDEB_EX(7, "qp=%p", qp); + + return; +} + +static void cq_event_callback(struct ehca_shca *shca, + u64 eqe) +{ + struct ehca_cq *cq; + unsigned long flags; + u32 token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe); + + EDEB_EN(7, "eqe=%lx", eqe); + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + if (!cq) + return; + + ehca_error_data(shca, cq, cq->ipz_cq_handle.handle); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +static void parse_identifier(struct ehca_shca *shca, u64 eqe) +{ + u8 identifier = EHCA_BMASK_GET(EQE_EE_IDENTIFIER, eqe); + + EDEB_EN(7, "shca=%p eqe=%lx", shca, eqe); + + switch (identifier) { + case 0x02: /* path migrated */ + qp_event_callback(shca, eqe, IB_EVENT_PATH_MIG); + break; + case 0x03: /* communication established */ + qp_event_callback(shca, eqe, IB_EVENT_COMM_EST); + break; + case 0x04: /* send queue drained */ + qp_event_callback(shca, eqe, IB_EVENT_SQ_DRAINED); + break; + case 0x05: /* QP error */ + case 0x06: /* QP error */ + qp_event_callback(shca, eqe, IB_EVENT_QP_FATAL); + break; + case 0x07: /* CQ error */ + case 0x08: /* CQ error */ + cq_event_callback(shca, eqe); + break; + case 0x09: /* MRMWPTE error */ + EDEB_ERR(4, "MRMWPTE error."); + break; + case 0x0A: /* port event */ + EDEB_ERR(4, "Port event."); + break; + case 0x0B: /* MR access error */ + EDEB_ERR(4, "MR access error."); + break; + case 0x0C: /* EQ error */ + EDEB_ERR(4, "EQ error."); + break; + case 0x0D: /* P/Q_Key mismatch */ + EDEB_ERR(4, "P/Q_Key mismatch."); + break; + case 0x10: /* sampling complete */ + EDEB_ERR(4, "Sampling complete."); + break; + case 0x11: /* unaffiliated access error */ + EDEB_ERR(4, "Unaffiliated access error."); + break; + case 0x12: /* path migrating error */ + EDEB_ERR(4, "Path migration error."); + break; + case 0x13: /* interface trace stopped */ + EDEB_ERR(4, "Interface trace stopped."); + break; + case 0x14: /* first error capture info available */ + default: + EDEB_ERR(4, "Unknown identifier: %x on %s.", + identifier, shca->ib_device.name); + break; + } + + EDEB_EX(7, "eqe=%lx identifier=%x", eqe, identifier); + + return; +} + +static void parse_ec(struct ehca_shca *shca, u64 eqe) +{ + struct ib_event event; + u8 ec = EHCA_BMASK_GET(NEQE_EVENT_CODE, eqe); + u8 port = EHCA_BMASK_GET(NEQE_PORT_NUMBER, eqe); + + EDEB_EN(7, "shca=%p eqe=%lx", shca, eqe); + + switch (ec) { + case 0x30: /* port availability change */ + if (EHCA_BMASK_GET(NEQE_PORT_AVAILABILITY, eqe)) { + EDEB(4, "%s: port %x is active.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ACTIVE; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_ACTIVE; + ib_dispatch_event(&event); + } else { + EDEB(4, "%s: port %x is inactive.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + } + break; + case 0x31: + /* port configuration change + 
* disruptive change is caused by + * LID, PKEY or SM change + */ + EDEB(4, "EHCA disruptive port %x " + "configuration change.", port); + + EDEB(4, "%s: port %x is inactive.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + + EDEB(4, "%s: port %x is active.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ACTIVE; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_ACTIVE; + ib_dispatch_event(&event); + break; + case 0x32: /* adapter malfunction */ + EDEB_ERR(4, "Adapter malfunction."); + break; + case 0x33: /* trace stopped */ + EDEB_ERR(4, "Traced stopped."); + break; + default: + EDEB_ERR(4, "Unknown event code: %x on %s.", + ec, shca->ib_device.name); + break; + } + + EDEB_EN(7, "eqe=%lx ec=%x", eqe, ec); + + return; +} + +static inline void reset_eq_pending(struct ehca_cq *cq) +{ + u64 CQx_EP = 0; + struct h_galpa gal = cq->galpas.kernel; + + EDEB_EN(7, "cq=%p", cq); + + hipz_galpa_store_cq(gal, cqx_ep, 0x0); + CQx_EP = hipz_galpa_load(gal, CQTEMM_OFFSET(cqx_ep)); + EDEB(7, "CQx_EP=%lx", CQx_EP); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +irqreturn_t ehca_interrupt_neq(int irq, void *dev_id, struct pt_regs *regs) +{ + struct ehca_shca *shca = (struct ehca_shca*)dev_id; + + EDEB_EN(7, "dev_id=%p", dev_id); + + tasklet_hi_schedule(&shca->neq.interrupt_task); + + EDEB_EX(7, ""); + + return IRQ_HANDLED; +} + +void ehca_tasklet_neq(unsigned long data) +{ + struct ehca_shca *shca = (struct ehca_shca*)data; + struct ehca_eqe *eqe; + u64 ret = H_SUCCESS; + + EDEB_EN(7, "shca=%p", shca); + + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq); + + while (eqe) { + if (!EHCA_BMASK_GET(NEQE_COMPLETION_EVENT, eqe->entry)) + parse_ec(shca, eqe->entry); + + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq); + } + + ret = hipz_h_reset_event(shca->ipz_hca_handle, + shca->neq.ipz_eq_handle, 0xFFFFFFFFFFFFFFFFL); + + if (ret != H_SUCCESS) + EDEB_ERR(4, "Can't clear notification events."); + + EDEB_EX(7, "shca=%p", shca); + + return; +} + +irqreturn_t ehca_interrupt_eq(int irq, void *dev_id, struct pt_regs *regs) +{ + struct ehca_shca *shca = (struct ehca_shca*)dev_id; + + EDEB_EN(7, "dev_id=%p", dev_id); + + tasklet_hi_schedule(&shca->eq.interrupt_task); + + EDEB_EX(7, ""); + + return IRQ_HANDLED; +} + +void ehca_tasklet_eq(unsigned long data) +{ + struct ehca_shca *shca = (struct ehca_shca*)data; + struct ehca_eqe *eqe; + int int_state; + int query_cnt = 0; + + EDEB_EN(7, "shca=%p", shca); + + do { + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->eq); + + if ((shca->hw_level >= 2) && eqe) + int_state = 1; + else + int_state = 0; + + while ((int_state == 1) || eqe) { + while (eqe) { + u64 eqe_value = eqe->entry; + + EDEB(7, "eqe_value=%lx", eqe_value); + + /* TODO: better structure */ + if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, + eqe_value)) { + extern struct idr ehca_cq_idr; + unsigned long flags; + u32 token; + struct ehca_cq *cq; + + EDEB(6, "... 
completion event"); + token = + EHCA_BMASK_GET(EQE_CQ_TOKEN, + eqe_value); + spin_lock_irqsave(&ehca_cq_idr_lock, + flags); + cq = idr_find(&ehca_cq_idr, token); + + if (cq == NULL) { + spin_unlock(&ehca_cq_idr_lock); + break; + } + + reset_eq_pending(cq); +#ifdef CONFIG_INFINIBAND_EHCA_SCALING + queue_comp_task(cq); + spin_unlock_irqrestore(&ehca_cq_idr_lock, + flags); +#else + spin_unlock_irqrestore(&ehca_cq_idr_lock, + flags); + comp_event_callback(cq); +#endif + } else { + EDEB(6, "... non completion event"); + parse_identifier(shca, eqe_value); + } + eqe = + (struct ehca_eqe *)ehca_poll_eq(shca, + &shca->eq); + } + + if (shca->hw_level >= 2) { + int_state = + hipz_h_query_int_state(shca->ipz_hca_handle, + shca->eq.ist); + query_cnt++; + iosync(); + if (query_cnt >= 100) { + query_cnt = 0; + int_state = 0; + } + } + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->eq); + + } + } while (int_state != 0); + + EDEB_EX(7, "shca=%p", shca); + + return; +} + +static inline int find_next_online_cpu(struct ehca_comp_pool* pool) +{ + unsigned long flags_last_cpu; + + EDEB_DMP(7, &cpu_online_map, sizeof(cpumask_t), ""); + + spin_lock_irqsave(&pool->last_cpu_lock, flags_last_cpu); + pool->last_cpu = next_cpu(pool->last_cpu, cpu_online_map); + + if (pool->last_cpu == NR_CPUS) + pool->last_cpu = 0; + if (!cpu_online(pool->last_cpu)) + pool->last_cpu = next_cpu(pool->last_cpu, cpu_online_map); + + spin_unlock_irqrestore(&pool->last_cpu_lock, flags_last_cpu); + + // return pool->last_cpu; + return 1; +} + +static void __queue_comp_task(struct ehca_cq *__cq, + struct ehca_cpu_comp_task *cct) +{ + unsigned long flags_cct; + unsigned long flags_cq; + + EDEB_EN(7, "__cq=%p cct=%p", __cq, cct); + + spin_lock_irqsave(&cct->task_lock, flags_cct); + spin_lock_irqsave(&__cq->task_lock, flags_cq); + + if (__cq->nr_callbacks == 0) { + __cq->nr_callbacks++; + list_add_tail(&__cq->entry, &cct->cq_list); + cct->cq_jobs++; + wake_up(&cct->wait_queue); + } + else + __cq->nr_callbacks++; + + spin_unlock_irqrestore(&__cq->task_lock, flags_cq); + spin_unlock_irqrestore(&cct->task_lock, flags_cct); + + + EDEB_EX(7, ""); + +} + +static void queue_comp_task(struct ehca_cq *__cq) +{ + int cpu; + int cpu_id; + struct ehca_cpu_comp_task *cct; + + cpu = get_cpu(); + cpu_id = find_next_online_cpu(pool); + + EDEB_EN(7, "pool=%p cq=%p cq_nr=%x CPU=%x:%x:%x:%x", + pool, __cq, __cq->cq_number, + cpu, cpu_id, num_online_cpus(), num_possible_cpus()); + + BUG_ON(!cpu_online(cpu_id)); + + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id); + + if (cct->cq_jobs > 0) { + cpu_id = find_next_online_cpu(pool); + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id); + } + + __queue_comp_task(__cq, cct); + + put_cpu(); + + EDEB_EX(7, "cct=%p", cct); + + return; +} + +static void run_comp_task(struct ehca_cpu_comp_task* cct) +{ + struct ehca_cq *cq = NULL; + unsigned long flags_cct; + unsigned long flags_cq; + + + EDEB_EN(7, "cct=%p", cct); + + spin_lock_irqsave(&cct->task_lock, flags_cct); + + while (!list_empty(&cct->cq_list)) { + cq = list_entry(cct->cq_list.next, struct ehca_cq, entry); + spin_unlock_irqrestore(&cct->task_lock, flags_cct); + comp_event_callback(cq); + spin_lock_irqsave(&cct->task_lock, flags_cct); + + spin_lock_irqsave(&cq->task_lock, flags_cq); + cq->nr_callbacks--; + if (cq->nr_callbacks == 0) { + list_del_init(cct->cq_list.next); + cct->cq_jobs--; + } + spin_unlock_irqrestore(&cq->task_lock, flags_cq); + + } + + spin_unlock_irqrestore(&cct->task_lock, flags_cct); + + EDEB_EX(7, "cct=%p cq=%p", cct, cq); + + return; +} + 
+static int comp_task(void *__cct) +{ + struct ehca_cpu_comp_task* cct = __cct; + DECLARE_WAITQUEUE(wait, current); + + EDEB_EN(7, "cct=%p", cct); + + set_current_state(TASK_INTERRUPTIBLE); + while(!kthread_should_stop()) { + add_wait_queue(&cct->wait_queue, &wait); + + if (list_empty(&cct->cq_list)) + schedule(); + else + __set_current_state(TASK_RUNNING); + + remove_wait_queue(&cct->wait_queue, &wait); + + if (!list_empty(&cct->cq_list)) + run_comp_task(__cct); + + set_current_state(TASK_INTERRUPTIBLE); + } + __set_current_state(TASK_RUNNING); + + EDEB_EX(7, ""); + + return 0; +} + +static struct task_struct *create_comp_task(struct ehca_comp_pool *pool, + int cpu) +{ + struct ehca_cpu_comp_task *cct; + + EDEB_EN(7, "cpu=%d:%d", cpu, NR_CPUS); + + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu); + spin_lock_init(&cct->task_lock); + INIT_LIST_HEAD(&cct->cq_list); + init_waitqueue_head(&cct->wait_queue); + cct->task = kthread_create(comp_task, cct, "ehca_comp/%d", cpu); + + EDEB_EX(7, "cct/%d=%p", cpu, cct); + + return cct->task; +} + +static void destroy_comp_task(struct ehca_comp_pool *pool, + int cpu) +{ + struct ehca_cpu_comp_task *cct; + struct task_struct *task; + unsigned long flags_cct; + + EDEB_EN(7, "pool=%p cpu=%d:%d", pool, cpu, NR_CPUS); + + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu); + + spin_lock_irqsave(&cct->task_lock, flags_cct); + + task = cct->task; + cct->task = NULL; + cct->cq_jobs = 0; + + spin_unlock_irqrestore(&cct->task_lock, flags_cct); + + if (task) + kthread_stop(task); + + EDEB_EX(7, ""); + + return; +} + +static void take_over_work(struct ehca_comp_pool *pool, + int cpu) +{ + struct ehca_cpu_comp_task *cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu); + LIST_HEAD(list); + struct ehca_cq *cq; + unsigned long flags_cct; + + EDEB_EN(7, "cpu=%x", cpu); + + spin_lock_irqsave(&cct->task_lock, flags_cct); + + list_splice_init(&cct->cq_list, &list); + + while(!list_empty(&list)) { + cq = list_entry(cct->cq_list.next, struct ehca_cq, entry); + + list_del(&cq->entry); + __queue_comp_task(cq, per_cpu_ptr(pool->cpu_comp_tasks, + smp_processor_id())); + } + + spin_unlock_irqrestore(&cct->task_lock, flags_cct); + + EDEB_EX(7, ""); + +} + +static int comp_pool_callback(struct notifier_block *nfb, + unsigned long action, + void *hcpu) +{ + unsigned int cpu = (unsigned long)hcpu; + struct ehca_cpu_comp_task *cct; + + EDEB_EN(7, "CPU number changed (action=%lx)", action); + + switch (action) { + case CPU_UP_PREPARE: + EDEB(4, "CPU: %x (CPU_PREPARE)", cpu); + if(!create_comp_task(pool, cpu)) { + EDEB_ERR(4, "Can't create comp_task for cpu: %x", cpu); + return NOTIFY_BAD; + } + break; + case CPU_UP_CANCELED: + EDEB(4, "CPU: %x (CPU_CANCELED)", cpu); + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu); + kthread_bind(cct->task, any_online_cpu(cpu_online_map)); + destroy_comp_task(pool, cpu); + break; + case CPU_ONLINE: + EDEB(4, "CPU: %x (CPU_ONLINE)", cpu); + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu); + kthread_bind(cct->task, cpu); + wake_up_process(cct->task); + break; + case CPU_DOWN_PREPARE: + EDEB(4, "CPU: %x (CPU_DOWN_PREPARE)", cpu); + break; + case CPU_DOWN_FAILED: + EDEB(4, "CPU: %x (CPU_DOWN_FAILED)", cpu); + break; + case CPU_DEAD: + EDEB(4, "CPU: %x (CPU_DEAD)", cpu); + destroy_comp_task(pool, cpu); + take_over_work(pool, cpu); + break; + } + + EDEB_EX(7, "CPU number changed"); + + return NOTIFY_OK; +} + +int ehca_create_comp_pool(void) +{ +#ifdef CONFIG_INFINIBAND_EHCA_SCALING + int cpu; + struct task_struct *task; + + EDEB_EN(7, ""); + + + pool = kzalloc(sizeof(struct 
ehca_comp_pool), GFP_KERNEL); + if (pool == NULL) + return -ENOMEM; + + spin_lock_init(&pool->last_cpu_lock); + pool->last_cpu = any_online_cpu(cpu_online_map); + + pool->cpu_comp_tasks = alloc_percpu(struct ehca_cpu_comp_task); + if (pool->cpu_comp_tasks == NULL) { + kfree(pool); + return -EINVAL; + } + + for_each_online_cpu(cpu) { + task = create_comp_task(pool, cpu); + if (task) { + kthread_bind(task, cpu); + wake_up_process(task); + } + } + + comp_pool_callback_nb.notifier_call = comp_pool_callback; + comp_pool_callback_nb.priority =0; + register_cpu_notifier(&comp_pool_callback_nb); + + EDEB_EX(7, "pool=%p", pool); +#endif + + return 0; +} + +void ehca_destroy_comp_pool(void) +{ +#ifdef CONFIG_INFINIBAND_EHCA_SCALING + int i; + + EDEB_EN(7, "pool=%p", pool); + + unregister_cpu_notifier(&comp_pool_callback_nb); + + for (i = 0; i < NR_CPUS; i++) { + if (cpu_online(i)) + destroy_comp_task(pool, i); + } + + EDEB_EN(7, ""); +#endif + + return; +} diff --git a/drivers/infiniband/hw/ehca/ehca_irq.h b/drivers/infiniband/hw/ehca/ehca_irq.h new file mode 100644 index 0000000..85bf1fe --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_irq.h @@ -0,0 +1,77 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Function definitions and structs for EQs, NEQs and interrupts + * + * Authors: Heiko J Schick + * Khadija Souissi + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef __EHCA_IRQ_H +#define __EHCA_IRQ_H + + +struct ehca_shca; + +#include +#include +#include + +int ehca_error_data(struct ehca_shca *shca, void *data, u64 resource); + +irqreturn_t ehca_interrupt_neq(int irq, void *dev_id, struct pt_regs *regs); +void ehca_tasklet_neq(unsigned long data); + +irqreturn_t ehca_interrupt_eq(int irq, void *dev_id, struct pt_regs *regs); +void ehca_tasklet_eq(unsigned long data); + +struct ehca_cpu_comp_task { + wait_queue_head_t wait_queue; + struct list_head cq_list; + struct task_struct *task; + spinlock_t task_lock; + int cq_jobs; +}; + +struct ehca_comp_pool { + struct ehca_cpu_comp_task *cpu_comp_tasks; + int last_cpu; + spinlock_t last_cpu_lock; +}; + +int ehca_create_comp_pool(void); +void ehca_destroy_comp_pool(void); + +#endif -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:02 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:02 -0700 Subject: [openib-general] [PATCH 11/13] IB/ehca: ipz In-Reply-To: <20068171311.jebQ3TFd5jvynHCW@cisco.com> Message-ID: <20068171311.NYGfAW00YTmK6YKh@cisco.com> drivers/infiniband/hw/ehca/ipz_pt_fn.c | 166 +++++++++++++++++++++ drivers/infiniband/hw/ehca/ipz_pt_fn.h | 253 ++++++++++++++++++++++++++++++++ 2 files changed, 419 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c new file mode 100644 index 0000000..a14f957 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c @@ -0,0 +1,166 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * internal queue handling + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#define DEB_PREFIX "iptz" + +#include "ehca_tools.h" +#include "ipz_pt_fn.h" + +extern int ehca_hwlevel; + +void *ipz_qpageit_get_inc(struct ipz_queue *queue) +{ + void *ret = ipz_qeit_get(queue); + queue->current_q_offset += queue->pagesize; + if (queue->current_q_offset > queue->queue_length) { + queue->current_q_offset -= queue->pagesize; + ret = NULL; + } + if (((u64)ret) % EHCA_PAGESIZE) { + EDEB(4, "ERROR!! not at PAGE-Boundary"); + return NULL; + } + EDEB(7, "queue=%p ret=%p", queue, ret); + return ret; +} + +void *ipz_qeit_eq_get_inc(struct ipz_queue *queue) +{ + void *ret = ipz_qeit_get(queue); + u64 last_entry_in_q = queue->queue_length - queue->qe_size; + queue->current_q_offset += queue->qe_size; + if (queue->current_q_offset > last_entry_in_q) { + queue->current_q_offset = 0; + queue->toggle_state = (~queue->toggle_state) & 1; + } + + EDEB(7, "queue=%p ret=%p new current_q_offset=%lx qe_size=%x", + queue, ret, queue->current_q_offset, queue->qe_size); + + return ret; +} + +int ipz_queue_ctor(struct ipz_queue *queue, + const u32 nr_of_pages, + const u32 pagesize, const u32 qe_size, const u32 nr_of_sg) +{ + int pages_per_kpage = PAGE_SIZE >> EHCA_PAGESHIFT; + int f; + + EDEB_EN(7, "nr_of_pages=%x pagesize=%x qe_size=%x pages_per_kpage=%x", + nr_of_pages, pagesize, qe_size, pages_per_kpage); + if (pagesize > PAGE_SIZE) { + EDEB_ERR(4, "FATAL ERROR: pagesize=%x is greater than " + "kernel page size", pagesize); + return 0; + } + if (!pages_per_kpage) { + EDEB_ERR(4, "FATAL ERROR: invalid kernel page size. " + "pages_per_kpage=%x", pages_per_kpage); + return 0; + } + queue->queue_length = nr_of_pages * pagesize; + queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *)); + if (!queue->queue_pages) { + EDEB(4, "ERROR!! didn't get the memory"); + return 0; + } + memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *)); + /* + * allocate pages for queue: + * outer loop allocates whole kernel pages (page aligned) and + * inner loop divides a kernel page into smaller hca queue pages + */ + f = 0; + while (f < nr_of_pages) { + u8 *kpage = (u8*)get_zeroed_page(GFP_KERNEL); + int k; + if (!kpage) + goto ipz_queue_ctor_exit0; /*NOMEM*/ + for (k = 0; k < pages_per_kpage && f < nr_of_pages; k++) { + (queue->queue_pages)[f] = (struct ipz_page *)kpage; + kpage += EHCA_PAGESIZE; + f++; + } + } + + queue->current_q_offset = 0; + queue->qe_size = qe_size; + queue->act_nr_of_sg = nr_of_sg; + queue->pagesize = pagesize; + queue->toggle_state = 1; + EDEB_EX(7, "queue_length=%x queue_pages=%p qe_size=%x" + " act_nr_of_sg=%x", queue->queue_length, queue->queue_pages, + queue->qe_size, queue->act_nr_of_sg); + return 1; + + ipz_queue_ctor_exit0: + EDEB_ERR(4, "Couldn't get alloc pages queue=%p f=%x nr_of_pages=%x", + queue, f, nr_of_pages); + for (f = 0; f < nr_of_pages; f += pages_per_kpage) { + if (!(queue->queue_pages)[f]) + break; + free_page((unsigned long)(queue->queue_pages)[f]); + } + return 0; +} + +int ipz_queue_dtor(struct ipz_queue *queue) +{ + int pages_per_kpage = PAGE_SIZE >> EHCA_PAGESHIFT; + int g; + int nr_pages; + + EDEB_EN(7, "ipz_queue pointer=%p", queue); + if (!queue || !queue->queue_pages) { + EDEB_ERR(4, "queue or queue_pages is NULL"); + return 0; + } + EDEB(7, "destructing a queue with the following " + "properties:\n nr_of_pages=%x pagesize=%x qe_size=%x", + queue->act_nr_of_sg, queue->pagesize, queue->qe_size); + nr_pages = queue->queue_length / queue->pagesize; + for (g = 0; g < nr_pages; g += pages_per_kpage) + free_page((unsigned long)(queue->queue_pages)[g]); 
+ vfree(queue->queue_pages); + + EDEB_EX(7, "queue freed!"); + return 1; +} diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h new file mode 100644 index 0000000..fdd139b --- /dev/null +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -0,0 +1,253 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * internal queue handling + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef __IPZ_PT_FN_H__ +#define __IPZ_PT_FN_H__ + +#include "ehca_qes.h" +#define EHCA_PAGESHIFT 12 +#define EHCA_PAGESIZE 4096UL +#define EHCA_PAGEMASK (~(EHCA_PAGESIZE-1)) +#define EHCA_PT_ENTRIES 512UL + +#include "ehca_tools.h" +#include "ehca_qes.h" + +/* struct generic ehca page */ +struct ipz_page { + u8 entries[EHCA_PAGESIZE]; +}; + +/* struct generic queue in linux kernel virtual memory (kv) */ +struct ipz_queue { + u64 current_q_offset; /* current queue entry */ + + struct ipz_page **queue_pages; /* array of pages belonging to queue */ + u32 qe_size; /* queue entry size */ + u32 act_nr_of_sg; + u32 queue_length; /* queue length allocated in bytes */ + u32 pagesize; + u32 toggle_state; /* toggle flag - per page */ + u32 dummy3; /* 64 bit alignment */ +}; + +/* + * return current Queue Entry for a certain q_offset + * returns address (kv) of Queue Entry + */ +static inline void *ipz_qeit_calc(struct ipz_queue *queue, u64 q_offset) +{ + struct ipz_page *current_page = NULL; + if (q_offset >= queue->queue_length) + return NULL; + current_page = (queue->queue_pages)[q_offset >> EHCA_PAGESHIFT]; + return ¤t_page->entries[q_offset & (EHCA_PAGESIZE - 1)]; +} + +/* + * return current Queue Entry + * returns address (kv) of Queue Entry + */ +static inline void *ipz_qeit_get(struct ipz_queue *queue) +{ + return ipz_qeit_calc(queue, queue->current_q_offset); +} + +/* + * return current Queue Page , increment Queue Page iterator from + * page to page in struct ipz_queue, last increment will return 0! 
and + * NOT wrap + * returns address (kv) of Queue Page + * warning don't use in parallel with ipz_QE_get_inc() + */ +void *ipz_qpageit_get_inc(struct ipz_queue *queue); + +/* + * return current Queue Entry, increment Queue Entry iterator by one + * step in struct ipz_queue, will wrap in ringbuffer + * returns address (kv) of Queue Entry BEFORE increment + * warning don't use in parallel with ipz_qpageit_get_inc() + * warning unpredictable results may occur if steps>act_nr_of_queue_entries + */ +static inline void *ipz_qeit_get_inc(struct ipz_queue *queue) +{ + void *ret = ipz_qeit_get(queue); + queue->current_q_offset += queue->qe_size; + if (queue->current_q_offset >= queue->queue_length) { + queue->current_q_offset = 0; + /* toggle the valid flag */ + queue->toggle_state = (~queue->toggle_state) & 1; + } + + EDEB(7, "queue=%p ret=%p new current_q_addr=%lx qe_size=%x", + queue, ret, queue->current_q_offset, queue->qe_size); + + return ret; +} + +/* + * return current Queue Entry, increment Queue Entry iterator by one + * step in struct ipz_queue, will wrap in ringbuffer + * returns address (kv) of Queue Entry BEFORE increment + * returns 0 and does not increment, if wrong valid state + * warning don't use in parallel with ipz_qpageit_get_inc() + * warning unpredictable results may occur if steps>act_nr_of_queue_entries + */ +static inline void *ipz_qeit_get_inc_valid(struct ipz_queue *queue) +{ + struct ehca_cqe *cqe = ipz_qeit_get(queue); + u32 cqe_flags = cqe->cqe_flags; + + if ((cqe_flags >> 7) != (queue->toggle_state & 1)) + return NULL; + + ipz_qeit_get_inc(queue); + return cqe; +} + +/* + * returns and resets Queue Entry iterator + * returns address (kv) of first Queue Entry + */ +static inline void *ipz_qeit_reset(struct ipz_queue *queue) +{ + queue->current_q_offset = 0; + return ipz_qeit_get(queue); +} + +/* struct generic page table */ +struct ipz_pt { + u64 entries[EHCA_PT_ENTRIES]; +}; + +/* struct page table for a queue, only to be used in pf */ +struct ipz_qpt { + /* queue page tables (kv), use u64 because we know the element length */ + u64 *qpts; + u32 n_qpts; + u32 n_ptes; /* number of page table entries */ + u64 *current_pte_addr; +}; + +/* + * constructor for a ipz_queue_t, placement new for ipz_queue_t, + * new for all dependent datastructors + * all QP Tables are the same + * flow: + * allocate+pin queue + * see ipz_qpt_ctor() + * returns true if ok, false if out of memory + */ +int ipz_queue_ctor(struct ipz_queue *queue, const u32 nr_of_pages, + const u32 pagesize, const u32 qe_size, + const u32 nr_of_sg); + +/* + * destructor for a ipz_queue_t + * -# free queue + * see ipz_queue_ctor() + * returns true if ok, false if queue was NULL-ptr of free failed + */ +int ipz_queue_dtor(struct ipz_queue *queue); + +/* + * constructor for a ipz_qpt_t, + * placement new for struct ipz_queue, new for all dependent datastructors + * all QP Tables are the same, + * flow: + * -# allocate+pin queue + * -# initialise ptcb + * -# allocate+pin PTs + * -# link PTs to a ring, according to HCA Arch, set bit62 id needed + * -# the ring must have room for exactly nr_of_PTEs + * see ipz_qpt_ctor() + */ +void ipz_qpt_ctor(struct ipz_qpt *qpt, + const u32 nr_of_qes, + const u32 pagesize, + const u32 qe_size, + const u8 lowbyte, const u8 toggle, + u32 * act_nr_of_QEs, u32 * act_nr_of_pages); + +/* + * return current Queue Entry, increment Queue Entry iterator by one + * step in struct ipz_queue, will wrap in ringbuffer + * returns address (kv) of Queue Entry BEFORE increment + * warning don't 
use in parallel with ipz_qpageit_get_inc() + * warning unpredictable results may occur if steps>act_nr_of_queue_entries + * fix EQ page problems + */ +void *ipz_qeit_eq_get_inc(struct ipz_queue *queue); + +/* + * return current Event Queue Entry, increment Queue Entry iterator + * by one step in struct ipz_queue if valid, will wrap in ringbuffer + * returns address (kv) of Queue Entry BEFORE increment + * returns 0 and does not increment, if wrong valid state + * warning don't use in parallel with ipz_queue_QPageit_get_inc() + * warning unpredictable results may occur if steps>act_nr_of_queue_entries + */ +static inline void *ipz_eqit_eq_get_inc_valid(struct ipz_queue *queue) +{ + void *ret = ipz_qeit_get(queue); + u32 qe = *(u8 *) ret; + EDEB(7, "ipz_QEit_EQ_get_inc_valid qe=%x", qe); + if ((qe >> 7) == (queue->toggle_state & 1)) + ipz_qeit_eq_get_inc(queue); /* this is a good one */ + else + ret = NULL; + return ret; +} + +/* returns address (GX) of first queue entry */ +static inline u64 ipz_qpt_get_firstpage(struct ipz_qpt *qpt) +{ + return be64_to_cpu(qpt->qpts[0]); +} + +/* returns address (kv) of first page of queue page table */ +static inline void *ipz_qpt_get_qpt(struct ipz_qpt *qpt) +{ + return qpt->qpts; +} + +#endif /* __IPZ_PT_FN_H__ */ -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:01 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:01 -0700 Subject: [openib-general] [PATCH 05/13] IB/ehca: avpd In-Reply-To: <20068171311.Erm4R4ERt5Mpsgua@cisco.com> Message-ID: <20068171311.8D49tRUe7xsVtB0H@cisco.com> drivers/infiniband/hw/ehca/ehca_av.c | 303 ++++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_pd.c | 120 +++++++++++++ 2 files changed, 423 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c new file mode 100644 index 0000000..fd9fc6d --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_av.c @@ -0,0 +1,303 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * adress vector functions + * + * Authors: Hoang-Nam Nguyen + * Khadija Souissi + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + + +#define DEB_PREFIX "ehav" + +#include + +#include "ehca_tools.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" + +struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + extern struct ehca_module ehca_module; + extern int ehca_static_rate; + int ret = 0; + struct ehca_av *av = NULL; + struct ehca_shca *shca = NULL; + + EHCA_CHECK_PD_P(pd); + EHCA_CHECK_ADR_P(ah_attr); + + shca = container_of(pd->device, struct ehca_shca, ib_device); + + EDEB_EN(7, "pd=%p ah_attr=%p", pd, ah_attr); + + av = kmem_cache_alloc(ehca_module.cache_av, SLAB_KERNEL); + if (!av) { + EDEB_ERR(4, "Out of memory pd=%p ah_attr=%p", pd, ah_attr); + ret = -ENOMEM; + goto create_ah_exit0; + } + + av->av.sl = ah_attr->sl; + av->av.dlid = ah_attr->dlid; + av->av.slid_path_bits = ah_attr->src_path_bits; + + if (ehca_static_rate < 0) { + int ah_mult = ib_rate_to_mult(ah_attr->static_rate); + int ehca_mult = + ib_rate_to_mult(shca->sport[ah_attr->port_num].rate ); + + if (ah_mult >= ehca_mult) + av->av.ipd = 0; + else + av->av.ipd = (ah_mult > 0) ? + ((ehca_mult - 1) / ah_mult) : 0; + } else + av->av.ipd = ehca_static_rate; + + EDEB(7, "IPD av->av.ipd set =%x ah_attr->static_rate=%x " + "shca_ib_rate=%x ",av->av.ipd, ah_attr->static_rate, + shca->sport[ah_attr->port_num].rate); + + av->av.lnh = ah_attr->ah_flags; + av->av.grh.word_0 = EHCA_BMASK_SET(GRH_IPVERSION_MASK, 6); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_TCLASS_MASK, + ah_attr->grh.traffic_class); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_FLOWLABEL_MASK, + ah_attr->grh.flow_label); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_HOPLIMIT_MASK, + ah_attr->grh.hop_limit); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_NEXTHEADER_MASK, 0x1B); + /* set sgid in grh.word_1 */ + if (ah_attr->ah_flags & IB_AH_GRH) { + int rc = 0; + struct ib_port_attr port_attr; + union ib_gid gid; + memset(&port_attr, 0, sizeof(port_attr)); + rc = ehca_query_port(pd->device, ah_attr->port_num, + &port_attr); + if (rc) { /* invalid port number */ + ret = -EINVAL; + EDEB_ERR(4, "Invalid port number " + "ehca_query_port() returned %x " + "pd=%p ah_attr=%p", rc, pd, ah_attr); + goto create_ah_exit1; + } + memset(&gid, 0, sizeof(gid)); + rc = ehca_query_gid(pd->device, + ah_attr->port_num, + ah_attr->grh.sgid_index, &gid); + if (rc) { + ret = -EINVAL; + EDEB_ERR(4, "Failed to retrieve sgid " + "ehca_query_gid() returned %x " + "pd=%p ah_attr=%p", rc, pd, ah_attr); + goto create_ah_exit1; + } + memcpy(&av->av.grh.word_1, &gid, sizeof(gid)); + } + /* for the time being we use a hard coded PMTU of 2048 Bytes */ + av->av.pmtu = 4; + + /* dgid comes in grh.word_3 */ + memcpy(&av->av.grh.word_3, &ah_attr->grh.dgid, + sizeof(ah_attr->grh.dgid)); + + EHCA_REGISTER_AV(device, pd); + + EDEB_EX(7, "pd=%p ah_attr=%p av=%p", pd, ah_attr, av); + return &av->ib_ah; + +create_ah_exit1: + kmem_cache_free(ehca_module.cache_av, av); + +create_ah_exit0: + EDEB_EX(7, "ret=%x pd=%p ah_attr=%p", ret, pd, ah_attr); + + return ERR_PTR(ret); +} + +int 
ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + struct ehca_av *av = NULL; + struct ehca_ud_av new_ehca_av; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + int ret = 0; + + EHCA_CHECK_AV(ah); + EHCA_CHECK_ADR(ah_attr); + + EDEB_EN(7, "ah=%p ah_attr=%p", ah, ah_attr); + + my_pd = container_of(ah->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + memset(&new_ehca_av, 0, sizeof(new_ehca_av)); + new_ehca_av.sl = ah_attr->sl; + new_ehca_av.dlid = ah_attr->dlid; + new_ehca_av.slid_path_bits = ah_attr->src_path_bits; + new_ehca_av.ipd = ah_attr->static_rate; + new_ehca_av.lnh = EHCA_BMASK_SET(GRH_FLAG_MASK, + (ah_attr->ah_flags & IB_AH_GRH) > 0); + new_ehca_av.grh.word_0 = EHCA_BMASK_SET(GRH_TCLASS_MASK, + ah_attr->grh.traffic_class); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_FLOWLABEL_MASK, + ah_attr->grh.flow_label); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_HOPLIMIT_MASK, + ah_attr->grh.hop_limit); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_NEXTHEADER_MASK, 0x1b); + + /* set sgid in grh.word_1 */ + if (ah_attr->ah_flags & IB_AH_GRH) { + int rc = 0; + struct ib_port_attr port_attr; + union ib_gid gid; + memset(&port_attr, 0, sizeof(port_attr)); + rc = ehca_query_port(ah->device, ah_attr->port_num, + &port_attr); + if (rc) { /* invalid port number */ + ret = -EINVAL; + EDEB_ERR(4, "Invalid port number " + "ehca_query_port() returned %x " + "ah=%p ah_attr=%p port_num=%x", + rc, ah, ah_attr, ah_attr->port_num); + goto modify_ah_exit1; + } + memset(&gid, 0, sizeof(gid)); + rc = ehca_query_gid(ah->device, + ah_attr->port_num, + ah_attr->grh.sgid_index, &gid); + if (rc) { + ret = -EINVAL; + EDEB_ERR(4, "Failed to retrieve sgid " + "ehca_query_gid() returned %x " + "ah=%p ah_attr=%p port_num=%x " + "sgid_index=%x", + rc, ah, ah_attr, ah_attr->port_num, + ah_attr->grh.sgid_index); + goto modify_ah_exit1; + } + memcpy(&new_ehca_av.grh.word_1, &gid, sizeof(gid)); + } + + new_ehca_av.pmtu = 4; /* see also comment in create_ah() */ + + memcpy(&new_ehca_av.grh.word_3, &ah_attr->grh.dgid, + sizeof(ah_attr->grh.dgid)); + + av = container_of(ah, struct ehca_av, ib_ah); + av->av = new_ehca_av; + +modify_ah_exit1: + EDEB_EX(7, "ret=%x ah=%p ah_attr=%p", ret, ah, ah_attr); + + return ret; +} + +int ehca_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + int ret = 0; + struct ehca_av *av = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + EHCA_CHECK_AV(ah); + EHCA_CHECK_ADR(ah_attr); + + EDEB_EN(7, "ah=%p ah_attr=%p", ah, ah_attr); + + my_pd = container_of(ah->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + av = container_of(ah, struct ehca_av, ib_ah); + memcpy(&ah_attr->grh.dgid, &av->av.grh.word_3, + sizeof(ah_attr->grh.dgid)); + ah_attr->sl = av->av.sl; + + ah_attr->dlid = av->av.dlid; + + ah_attr->src_path_bits = av->av.slid_path_bits; + ah_attr->static_rate = av->av.ipd; + ah_attr->ah_flags = EHCA_BMASK_GET(GRH_FLAG_MASK, av->av.lnh); + ah_attr->grh.traffic_class = EHCA_BMASK_GET(GRH_TCLASS_MASK, + av->av.grh.word_0); + ah_attr->grh.hop_limit = EHCA_BMASK_GET(GRH_HOPLIMIT_MASK, + av->av.grh.word_0); + ah_attr->grh.flow_label = EHCA_BMASK_GET(GRH_FLOWLABEL_MASK, + av->av.grh.word_0); + + 
EDEB_EX(7, "ah=%p ah_attr=%p ret=%x", ah, ah_attr, ret); + return ret; +} + +int ehca_destroy_ah(struct ib_ah *ah) +{ + extern struct ehca_module ehca_module; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + int ret = 0; + + EHCA_CHECK_AV(ah); + EHCA_DEREGISTER_AV(ah); + + EDEB_EN(7, "ah=%p", ah); + + my_pd = container_of(ah->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + kmem_cache_free(ehca_module.cache_av, + container_of(ah, struct ehca_av, ib_ah)); + + EDEB_EX(7, "ret=%x ah=%p", ret, ah); + return ret; +} diff --git a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c new file mode 100644 index 0000000..afcbe59 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_pd.c @@ -0,0 +1,120 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * PD functions + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + + +#define DEB_PREFIX "vpd " + +#include + +#include "ehca_tools.h" +#include "ehca_iverbs.h" + +struct ib_pd *ehca_alloc_pd(struct ib_device *device, + struct ib_ucontext *context, struct ib_udata *udata) +{ + extern struct ehca_module ehca_module; + struct ib_pd *mypd = NULL; + struct ehca_pd *pd = NULL; + + EDEB_EN(7, "device=%p context=%p udata=%p", device, context, udata); + + EHCA_CHECK_DEVICE_P(device); + + pd = kmem_cache_alloc(ehca_module.cache_pd, SLAB_KERNEL); + if (!pd) { + EDEB_ERR(4, "ERROR device=%p context=%p pd=%p" + " out of memory", device, context, mypd); + return ERR_PTR(-ENOMEM); + } + + memset(pd, 0, sizeof(struct ehca_pd)); + pd->ownpid = current->tgid; + + /* + * Kernel PD: when device = -1, 0 + * User PD: when context != -1 + */ + if (!context) { + /* + * Kernel PDs after init reuses always + * the one created in ehca_shca_reopen() + */ + struct ehca_shca *shca = container_of(device, struct ehca_shca, + ib_device); + pd->fw_pd.value = shca->pd->fw_pd.value; + } else + pd->fw_pd.value = (u64)pd; + + mypd = &pd->ib_pd; + + EHCA_REGISTER_PD(device, pd); + + EDEB_EX(7, "device=%p context=%p pd=%p", device, context, mypd); + + return mypd; +} + +int ehca_dealloc_pd(struct ib_pd *pd) +{ + extern struct ehca_module ehca_module; + int ret = 0; + u32 cur_pid = current->tgid; + struct ehca_pd *my_pd = NULL; + + EDEB_EN(7, "pd=%p", pd); + + EHCA_CHECK_PD(pd); + my_pd = container_of(pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + EHCA_DEREGISTER_PD(pd); + + kmem_cache_free(ehca_module.cache_pd, + container_of(pd, struct ehca_pd, ib_pd)); + + EDEB_EX(7, "pd=%p", pd); + + return ret; +} -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:02 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:02 -0700 Subject: [openib-general] [PATCH 10/13] IB/ehca: hipz In-Reply-To: <20068171311.sHGelL4wOwoc17UG@cisco.com> Message-ID: <20068171311.jebQ3TFd5jvynHCW@cisco.com> drivers/infiniband/hw/ehca/hipz_fns.h | 68 +++++ drivers/infiniband/hw/ehca/hipz_fns_core.h | 122 +++++++++ drivers/infiniband/hw/ehca/hipz_hw.h | 390 ++++++++++++++++++++++++++++ 3 files changed, 580 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hipz_fns.h b/drivers/infiniband/hw/ehca/hipz_fns.h new file mode 100644 index 0000000..9dac93d --- /dev/null +++ b/drivers/infiniband/hw/ehca/hipz_fns.h @@ -0,0 +1,68 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * HW abstraction register functions + * + * Authors: Christoph Raisch + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef __HIPZ_FNS_H__ +#define __HIPZ_FNS_H__ + +#include "ehca_classes.h" +#include "hipz_hw.h" + +#include "hipz_fns_core.h" + +#define hipz_galpa_store_eq(gal, offset, value) \ + hipz_galpa_store(gal, EQTEMM_OFFSET(offset), value) + +#define hipz_galpa_load_eq(gal, offset) \ + hipz_galpa_load(gal, EQTEMM_OFFSET(offset)) + +#define hipz_galpa_store_qped(gal, offset, value) \ + hipz_galpa_store(gal, QPEDMM_OFFSET(offset), value) + +#define hipz_galpa_load_qped(gal, offset) \ + hipz_galpa_load(gal, QPEDMM_OFFSET(offset)) + +#define hipz_galpa_store_mrmw(gal, offset, value) \ + hipz_galpa_store(gal, MRMWMM_OFFSET(offset), value) + +#define hipz_galpa_load_mrmw(gal, offset) \ + hipz_galpa_load(gal, MRMWMM_OFFSET(offset)) + +#endif diff --git a/drivers/infiniband/hw/ehca/hipz_fns_core.h b/drivers/infiniband/hw/ehca/hipz_fns_core.h new file mode 100644 index 0000000..ff8d7ed --- /dev/null +++ b/drivers/infiniband/hw/ehca/hipz_fns_core.h @@ -0,0 +1,122 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * HW abstraction register functions + * + * Authors: Christoph Raisch + * Heiko J Schick + * Hoang-Nam Nguyen + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef __HIPZ_FNS_CORE_H__ +#define __HIPZ_FNS_CORE_H__ + +#include "hcp_phyp.h" +#include "hipz_hw.h" + +#define hipz_galpa_store_cq(gal, offset, value) \ + hipz_galpa_store(gal, CQTEMM_OFFSET(offset), value) + +#define hipz_galpa_load_cq(gal, offset) \ + hipz_galpa_load(gal, CQTEMM_OFFSET(offset)) + +#define hipz_galpa_store_qp(gal,offset, value) \ + hipz_galpa_store(gal, QPTEMM_OFFSET(offset), value) +#define hipz_galpa_load_qp(gal, offset) \ + hipz_galpa_load(gal,QPTEMM_OFFSET(offset)) + +static inline void hipz_update_sqa(struct ehca_qp *qp, u16 nr_wqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "qp=%p", qp); + gal = qp->galpas.kernel; + /* ringing doorbell :-) */ + hipz_galpa_store_qp(gal, qpx_sqa, EHCA_BMASK_SET(QPX_SQADDER, nr_wqes)); + EDEB_EX(7, "qp=%p QPx_SQA = %i", qp, nr_wqes); +} + +static inline void hipz_update_rqa(struct ehca_qp *qp, u16 nr_wqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "qp=%p", qp); + gal = qp->galpas.kernel; + /* ringing doorbell :-) */ + hipz_galpa_store_qp(gal, qpx_rqa, EHCA_BMASK_SET(QPX_RQADDER, nr_wqes)); + EDEB_EX(7, "qp=%p QPx_RQA = %i", qp, nr_wqes); +} + +static inline void hipz_update_feca(struct ehca_cq *cq, u32 nr_cqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "cq=%p", cq); + gal = cq->galpas.kernel; + hipz_galpa_store_cq(gal, cqx_feca, + EHCA_BMASK_SET(CQX_FECADDER, nr_cqes)); + EDEB_EX(7, "cq=%p CQx_FECA = %i", cq, nr_cqes); +} + +static inline void hipz_set_cqx_n0(struct ehca_cq *cq, u32 value) +{ + struct h_galpa gal; + u64 CQx_N0_reg = 0; + + EDEB_EN(7, "cq=%p event on solicited completion -- write CQx_N0", cq); + gal = cq->galpas.kernel; + hipz_galpa_store_cq(gal, cqx_n0, + EHCA_BMASK_SET(CQX_N0_GENERATE_SOLICITED_COMP_EVENT, + value)); + CQx_N0_reg = hipz_galpa_load_cq(gal, cqx_n0); + EDEB_EX(7, "cq=%p loaded CQx_N0=%lx", cq, (unsigned long)CQx_N0_reg); +} + +static inline void hipz_set_cqx_n1(struct ehca_cq *cq, u32 value) +{ + struct h_galpa gal; + u64 CQx_N1_reg = 0; + + EDEB_EN(7, "cq=%p event on completion -- write CQx_N1", + cq); + gal = cq->galpas.kernel; + hipz_galpa_store_cq(gal, cqx_n1, + EHCA_BMASK_SET(CQX_N1_GENERATE_COMP_EVENT, value)); + CQx_N1_reg = hipz_galpa_load_cq(gal, cqx_n1); + EDEB_EX(7, "cq=%p loaded CQx_N1=%lx", cq, (unsigned long)CQx_N1_reg); +} + +#endif /* __HIPZ_FNC_CORE_H__ */ diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h new file mode 100644 index 0000000..f5f4871 --- /dev/null +++ b/drivers/infiniband/hw/ehca/hipz_hw.h @@ -0,0 +1,390 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * eHCA register definitions + * + * Authors: Waleri Fomin + * Christoph Raisch + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef __HIPZ_HW_H__ +#define __HIPZ_HW_H__ + +#include "ehca_tools.h" + +/* QP Table Entry Memory Map */ +struct hipz_qptemm { + u64 qpx_hcr; + u64 qpx_c; + u64 qpx_herr; + u64 qpx_aer; +/* 0x20*/ + u64 qpx_sqa; + u64 qpx_sqc; + u64 qpx_rqa; + u64 qpx_rqc; +/* 0x40*/ + u64 qpx_st; + u64 qpx_pmstate; + u64 qpx_pmfa; + u64 qpx_pkey; +/* 0x60*/ + u64 qpx_pkeya; + u64 qpx_pkeyb; + u64 qpx_pkeyc; + u64 qpx_pkeyd; +/* 0x80*/ + u64 qpx_qkey; + u64 qpx_dqp; + u64 qpx_dlidp; + u64 qpx_portp; +/* 0xa0*/ + u64 qpx_slidp; + u64 qpx_slidpp; + u64 qpx_dlida; + u64 qpx_porta; +/* 0xc0*/ + u64 qpx_slida; + u64 qpx_slidpa; + u64 qpx_slvl; + u64 qpx_ipd; +/* 0xe0*/ + u64 qpx_mtu; + u64 qpx_lato; + u64 qpx_rlimit; + u64 qpx_rnrlimit; +/* 0x100*/ + u64 qpx_t; + u64 qpx_sqhp; + u64 qpx_sqptp; + u64 qpx_nspsn; +/* 0x120*/ + u64 qpx_nspsnhwm; + u64 reserved1; + u64 qpx_sdsi; + u64 qpx_sdsbc; +/* 0x140*/ + u64 qpx_sqwsize; + u64 qpx_sqwts; + u64 qpx_lsn; + u64 qpx_nssn; +/* 0x160 */ + u64 qpx_mor; + u64 qpx_cor; + u64 qpx_sqsize; + u64 qpx_erc; +/* 0x180*/ + u64 qpx_rnrrc; + u64 qpx_ernrwt; + u64 qpx_rnrresp; + u64 qpx_lmsna; +/* 0x1a0 */ + u64 qpx_sqhpc; + u64 qpx_sqcptp; + u64 qpx_sigt; + u64 qpx_wqecnt; +/* 0x1c0*/ + u64 qpx_rqhp; + u64 qpx_rqptp; + u64 qpx_rqsize; + u64 qpx_nrr; +/* 0x1e0*/ + u64 qpx_rdmac; + u64 qpx_nrpsn; + u64 qpx_lapsn; + u64 qpx_lcr; +/* 0x200*/ + u64 qpx_rwc; + u64 qpx_rwva; + u64 qpx_rdsi; + u64 qpx_rdsbc; +/* 0x220*/ + u64 qpx_rqwsize; + u64 qpx_crmsn; + u64 qpx_rdd; + u64 qpx_larpsn; +/* 0x240*/ + u64 qpx_pd; + u64 qpx_scqn; + u64 qpx_rcqn; + u64 qpx_aeqn; +/* 0x260*/ + u64 qpx_aaelog; + u64 qpx_ram; + u64 qpx_rdmaqe0; + u64 qpx_rdmaqe1; +/* 0x280*/ + u64 qpx_rdmaqe2; + u64 qpx_rdmaqe3; + u64 qpx_nrpsnhwm; +/* 0x298*/ + u64 reserved[(0x400 - 0x298) / 8]; +/* 0x400 extended data */ + u64 reserved_ext[(0x500 - 0x400) / 8]; +/* 0x500 */ + u64 reserved2[(0x1000 - 0x500) / 8]; +/* 0x1000 */ +}; + +#define QPX_SQADDER EHCA_BMASK_IBM(48,63) +#define QPX_RQADDER EHCA_BMASK_IBM(48,63) + +#define QPTEMM_OFFSET(x) offsetof(struct hipz_qptemm,x) + +/* MRMWPT Entry Memory Map */ +struct hipz_mrmwmm { + /* 0x00 */ + u64 mrx_hcr; + + u64 mrx_c; + u64 mrx_herr; + u64 mrx_aer; + /* 0x20 */ + u64 mrx_pp; + u64 reserved1; + u64 reserved2; + u64 reserved3; + /* 0x40 */ + u64 reserved4[(0x200 - 0x40) / 8]; + /* 0x200 */ + u64 mrx_ctl[64]; + +}; + +#define MRX_HCR_LPARID_VALID EHCA_BMASK_IBM(0,0) + +#define MRMWMM_OFFSET(x) offsetof(struct hipz_mrmwmm,x) + +struct hipz_qpedmm { + /* 0x00 */ + u64 reserved0[(0x400) / 8]; + /* 0x400 */ + u64 qpedx_phh; + u64 qpedx_ppsgp; + /* 0x410 */ + u64 qpedx_ppsgu; + u64 qpedx_ppdgp; + /* 0x420 */ + u64 qpedx_ppdgu; + u64 qpedx_aph; + /* 0x430 */ + u64 qpedx_apsgp; + u64 
qpedx_apsgu; + /* 0x440 */ + u64 qpedx_apdgp; + u64 qpedx_apdgu; + /* 0x450 */ + u64 qpedx_apav; + u64 qpedx_apsav; + /* 0x460 */ + u64 qpedx_hcr; + u64 reserved1[4]; + /* 0x488 */ + u64 qpedx_rrl0; + /* 0x490 */ + u64 qpedx_rrrkey0; + u64 qpedx_rrva0; + /* 0x4a0 */ + u64 reserved2; + u64 qpedx_rrl1; + /* 0x4b0 */ + u64 qpedx_rrrkey1; + u64 qpedx_rrva1; + /* 0x4c0 */ + u64 reserved3; + u64 qpedx_rrl2; + /* 0x4d0 */ + u64 qpedx_rrrkey2; + u64 qpedx_rrva2; + /* 0x4e0 */ + u64 reserved4; + u64 qpedx_rrl3; + /* 0x4f0 */ + u64 qpedx_rrrkey3; + u64 qpedx_rrva3; +}; + +#define QPEDMM_OFFSET(x) offsetof(struct hipz_qpedmm,x) + +/* CQ Table Entry Memory Map */ +struct hipz_cqtemm { + u64 cqx_hcr; + u64 cqx_c; + u64 cqx_herr; + u64 cqx_aer; +/* 0x20 */ + u64 cqx_ptp; + u64 cqx_tp; + u64 cqx_fec; + u64 cqx_feca; +/* 0x40 */ + u64 cqx_ep; + u64 cqx_eq; +/* 0x50 */ + u64 reserved1; + u64 cqx_n0; +/* 0x60 */ + u64 cqx_n1; + u64 reserved2[(0x1000 - 0x60) / 8]; +/* 0x1000 */ +}; + +#define CQX_FEC_CQE_CNT EHCA_BMASK_IBM(32,63) +#define CQX_FECADDER EHCA_BMASK_IBM(32,63) +#define CQX_N0_GENERATE_SOLICITED_COMP_EVENT EHCA_BMASK_IBM(0,0) +#define CQX_N1_GENERATE_COMP_EVENT EHCA_BMASK_IBM(0,0) + +#define CQTEMM_OFFSET(x) offsetof(struct hipz_cqtemm,x) + +/* EQ Table Entry Memory Map */ +struct hipz_eqtemm { + u64 eqx_hcr; + u64 eqx_c; + + u64 eqx_herr; + u64 eqx_aer; +/* 0x20 */ + u64 eqx_ptp; + u64 eqx_tp; + u64 eqx_ssba; + u64 eqx_psba; + +/* 0x40 */ + u64 eqx_cec; + u64 eqx_meql; + u64 eqx_xisbi; + u64 eqx_xisc; +/* 0x60 */ + u64 eqx_it; + +}; + +#define EQTEMM_OFFSET(x) offsetof(struct hipz_eqtemm,x) + +/* access control defines for MR/MW */ +#define HIPZ_ACCESSCTRL_L_WRITE 0x00800000 +#define HIPZ_ACCESSCTRL_R_WRITE 0x00400000 +#define HIPZ_ACCESSCTRL_R_READ 0x00200000 +#define HIPZ_ACCESSCTRL_R_ATOMIC 0x00100000 +#define HIPZ_ACCESSCTRL_MW_BIND 0x00080000 + +/* query hca response block */ +struct hipz_query_hca { + u32 cur_reliable_dg; + u32 cur_qp; + u32 cur_cq; + u32 cur_eq; + u32 cur_mr; + u32 cur_mw; + u32 cur_ee_context; + u32 cur_mcast_grp; + u32 cur_qp_attached_mcast_grp; + u32 reserved1; + u32 cur_ipv6_qp; + u32 cur_eth_qp; + u32 cur_hp_mr; + u32 reserved2[3]; + u32 max_rd_domain; + u32 max_qp; + u32 max_cq; + u32 max_eq; + u32 max_mr; + u32 max_hp_mr; + u32 max_mw; + u32 max_mrwpte; + u32 max_special_mrwpte; + u32 max_rd_ee_context; + u32 max_mcast_grp; + u32 max_total_mcast_qp_attach; + u32 max_mcast_qp_attach; + u32 max_raw_ipv6_qp; + u32 max_raw_ethy_qp; + u32 internal_clock_frequency; + u32 max_pd; + u32 max_ah; + u32 max_cqe; + u32 max_wqes_wq; + u32 max_partitions; + u32 max_rr_ee_context; + u32 max_rr_qp; + u32 max_rr_hca; + u32 max_act_wqs_ee_context; + u32 max_act_wqs_qp; + u32 max_sge; + u32 max_sge_rd; + u32 memory_page_size_supported; + u64 max_mr_size; + u32 local_ca_ack_delay; + u32 num_ports; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + u64 node_guid; + u64 hca_cap_indicators; + u32 data_counter_register_size; + u32 max_shared_rq; + u32 max_isns_eq; + u32 max_neq; +} __attribute__ ((packed)); + +/* query port response block */ +struct hipz_query_port { + u32 state; + u32 bad_pkey_cntr; + u32 lmc; + u32 lid; + u32 subnet_timeout; + u32 qkey_viol_cntr; + u32 sm_sl; + u32 sm_lid; + u32 capability_mask; + u32 init_type_reply; + u32 pkey_tbl_len; + u32 gid_tbl_len; + u64 gid_prefix; + u32 port_nr; + u16 pkey_entries[16]; + u8 reserved1[32]; + u32 trent_size; + u32 trbuf_size; + u64 max_msg_sz; + u32 max_mtu; + u32 vl_cap; + u8 reserved2[1900]; + u64 guid_entries[255]; +} 
__attribute__ ((packed)); + +#endif -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:00 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:00 -0700 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: <20068171311.qHSUlh5t6lpV4BeW@cisco.com> Message-ID: <20068171311.X1v1Q4Gk1v3wd7qJ@cisco.com> drivers/infiniband/hw/ehca/ehca_iverbs.h | 181 +++++++++++++ drivers/infiniband/hw/ehca/ehca_tools.h | 417 ++++++++++++++++++++++++++++++ 2 files changed, 598 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h new file mode 100644 index 0000000..bbdc437 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -0,0 +1,181 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Function definitions for internal functions + * + * Authors: Heiko J Schick + * Dietmar Decker + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef __EHCA_IVERBS_H__ +#define __EHCA_IVERBS_H__ + +#include "ehca_classes.h" + +int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props); + +int ehca_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props); + +int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 * pkey); + +int ehca_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid); + +int ehca_modify_port(struct ib_device *ibdev, u8 port, int port_modify_mask, + struct ib_port_modify *props); + +struct ib_pd *ehca_alloc_pd(struct ib_device *device, + struct ib_ucontext *context, + struct ib_udata *udata); + +int ehca_dealloc_pd(struct ib_pd *pd); + +struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); + +int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +int ehca_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +int ehca_destroy_ah(struct ib_ah *ah); + +struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, u64 *iova_start); + +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, + struct ib_umem *region, + int mr_access_flags, struct ib_udata *udata); + +int ehca_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, int mr_access_flags, u64 *iova_start); + +int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +int ehca_dereg_mr(struct ib_mr *mr); + +struct ib_mw *ehca_alloc_mw(struct ib_pd *pd); + +int ehca_bind_mw(struct ib_qp *qp, struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + +int ehca_dealloc_mw(struct ib_mw *mw); + +struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +int ehca_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, u64 iova); + +int ehca_unmap_fmr(struct list_head *fmr_list); + +int ehca_dealloc_fmr(struct ib_fmr *fmr); + +enum ehca_eq_type { + EHCA_EQ = 0, /* Event Queue */ + EHCA_NEQ /* Notification Event Queue */ +}; + +int ehca_create_eq(struct ehca_shca *shca, struct ehca_eq *eq, + enum ehca_eq_type type, const u32 length); + +int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq); + +void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq); + + +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, + struct ib_ucontext *context, + struct ib_udata *udata); + +int ehca_destroy_cq(struct ib_cq *cq); + +int ehca_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata); + +int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc); + +int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); + +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); + +struct ib_qp *ehca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata); + +int ehca_destroy_qp(struct ib_qp *qp); + +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); + +int ehca_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, + int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); + +int ehca_post_send(struct ib_qp *qp, struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +int ehca_post_recv(struct ib_qp *qp, struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + +u64 ehca_define_sqp(struct ehca_shca *shca, struct ehca_qp *ibqp, + struct ib_qp_init_attr *qp_init_attr); + +int 
ehca_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +int ehca_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +struct ib_ucontext *ehca_alloc_ucontext(struct ib_device *device, + struct ib_udata *udata); + +int ehca_dealloc_ucontext(struct ib_ucontext *context); + +int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); + +void ehca_poll_eqs(unsigned long data); + +int ehca_mmap_nopage(u64 foffset,u64 length,void **mapped, + struct vm_area_struct **vma); + +int ehca_mmap_register(u64 physical,void **mapped, + struct vm_area_struct **vma); + +int ehca_munmap(unsigned long addr, size_t len); + +#endif diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h new file mode 100644 index 0000000..783fbb3 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -0,0 +1,417 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * auxiliary functions + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Khadija Souissi + * Waleri Fomin + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + + +#ifndef EHCA_TOOLS_H +#define EHCA_TOOLS_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#define EHCA_EDEB_TRACE_MASK_SIZE 32 +extern u8 ehca_edeb_mask[EHCA_EDEB_TRACE_MASK_SIZE]; +#define EDEB_ID_TO_U32(str4) (str4[3] | (str4[2] << 8) | (str4[1] << 16) | \ + (str4[0] << 24)) + +static inline u64 ehca_edeb_filter(const u32 level, + const u32 id, const u32 line) +{ + u64 ret = 0; + u32 filenr = 0; + u32 filter_level = 9; + u32 dynamic_level = 0; + + /* + * This is code written for the gcc -O2 optimizer + * which should collapse to two single ints. + * Filter_level is the first level kicked out by + * compiler and means trace everything below 6. 
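+ * The value returned below packs filenr * 0x10000 plus the caller's line
+ * number into the low word; bit 32 (0x100000000L) set means "do not trace
+ * this call". IS_EDEB_ON() and EDEB_P_GENERIC() further down test exactly
+ * that bit, and the low word is what EDEB_P_GENERIC() prints as the %08x
+ * file/line tag in every trace line.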
+ */ + + if (id == EDEB_ID_TO_U32("ehav")) { + filenr = 0x01; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("clas")) { + filenr = 0x02; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("cqeq")) { + filenr = 0x03; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("shca")) { + filenr = 0x05; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("eirq")) { + filenr = 0x06; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("lMad")) { + filenr = 0x07; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("mcas")) { + filenr = 0x08; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("mrmw")) { + filenr = 0x09; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("vpd ")) { + filenr = 0x0a; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("e_qp")) { + filenr = 0x0b; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("uqes")) { + filenr = 0x0c; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("PHYP")) { + filenr = 0x0d; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("hcpi")) { + filenr = 0x0e; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("iptz")) { + filenr = 0x0f; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("spta")) { + filenr = 0x10; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("simp")) { + filenr = 0x11; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("reqs")) { + filenr = 0x12; + filter_level = 8; + } + + if ((filenr - 1) > sizeof(ehca_edeb_mask)) { + filenr = 0; + } + + if (filenr == 0) { + filter_level = 9; + } /* default */ + ret = filenr * 0x10000 + line; + if (filter_level <= level) { + return ret | 0x100000000L; /* this is the flag to not trace */ + } + dynamic_level = ehca_edeb_mask[filenr]; + if (likely(dynamic_level <= level)) { + ret = ret | 0x100000000L; + }; + return ret; +} + +#ifdef EHCA_USE_HCALL_KERNEL +#ifdef CONFIG_PPC_PSERIES + +#include + +/* + * IS_EDEB_ON - Checks if debug is on for the given level. + */ +#define IS_EDEB_ON(level) \ +((ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__) & \ + 0x100000000L) == 0) + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + u64 ehca_edeb_filterresult = \ + ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__);\ + if ((ehca_edeb_filterresult & 0x100000000L) == 0) \ + printk("PU%04x %08x:%s " idstring " "format "\n", \ + get_paca()->paca_index, (u32)(ehca_edeb_filterresult), \ + __func__, ##args); \ +} while (1 == 0) + +#elif REAL_HCALL + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + u64 ehca_edeb_filterresult = \ + ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__); \ + if ((ehca_edeb_filterresult & 0x100000000L) == 0) \ + printk("%08x:%s " idstring " "format "\n", \ + (u32)(ehca_edeb_filterresult), \ + __func__, ##args); \ +} while (1 == 0) + +#endif +#else + +#define IS_EDEB_ON(level) (1) + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + printk("%s " idstring " "format "\n", \ + __func__, ##args); \ +} while (1 == 0) + +#endif + +/** + * EDEB - Trace output macro. + * @level: tracelevel + * @format: optional format string, use "" if not desired + * @args: printf like arguments for trace + */ +#define EDEB(level,format,args...) \ + EDEB_P_GENERIC(level,"",format,##args) +#define EDEB_ERR(level,format,args...) \ + EDEB_P_GENERIC(level,"HCAD_ERROR ",format,##args) +#define EDEB_EN(level,format,args...) \ + EDEB_P_GENERIC(level,">>>",format,##args) +#define EDEB_EX(level,format,args...) \ + EDEB_P_GENERIC(level,"<<<",format,##args) + +/** + * EDEB_DMP - macro to dump a memory block, whose length is n*8 bytes. 
+ * Each line has the following layout: + * adr=X ofs=Y <8 bytes hex> <8 bytes hex> + */ +#define EDEB_DMP(level,adr,len,format,args...) \ + do { \ + unsigned int x; \ + unsigned int l = (unsigned int)(len); \ + unsigned char *deb = (unsigned char*)(adr); \ + for (x = 0; x < l; x += 16) { \ + EDEB(level, format " adr=%p ofs=%04x %016lx %016lx", \ + ##args, deb, x, \ + *((u64 *)&deb[0]), *((u64 *)&deb[8])); \ + deb += 16; \ + } \ + } while (0) + +/* define a bitmask, little endian version */ +#define EHCA_BMASK(pos,length) (((pos)<<16)+(length)) + +/* define a bitmask, the ibm way... */ +#define EHCA_BMASK_IBM(from,to) (((63-to)<<16)+((to)-(from)+1)) + +/* internal function, don't use */ +#define EHCA_BMASK_SHIFTPOS(mask) (((mask)>>16)&0xffff) + +/* internal function, don't use */ +#define EHCA_BMASK_MASK(mask) (0xffffffffffffffffULL >> ((64-(mask))&0xffff)) + +/** + * EHCA_BMASK_SET - return value shifted and masked by mask + * variable|=EHCA_BMASK_SET(MY_MASK,0x4711) ORs the bits in variable + * variable&=~EHCA_BMASK_SET(MY_MASK,-1) clears the bits from the mask + * in variable + */ +#define EHCA_BMASK_SET(mask,value) \ + ((EHCA_BMASK_MASK(mask) & ((u64)(value)))<<EHCA_BMASK_SHIFTPOS(mask)) + +#define EHCA_BMASK_GET(mask,value) \ + (EHCA_BMASK_MASK(mask) & (((u64)(value))>>EHCA_BMASK_SHIFTPOS(mask))) + +#define PARANOIA_MODE +#ifdef PARANOIA_MODE + +#define EHCA_CHECK_ADR_P(adr) \ + if (unlikely(adr == 0)) { \ + EDEB_ERR(4, "adr=%p check failed line %i", adr, \ + __LINE__); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_ADR(adr) \ + if (unlikely(adr == 0)) { \ + EDEB_ERR(4, "adr=%p check failed line %i", adr, \ + __LINE__); \ + return -EFAULT; } + +#define EHCA_CHECK_DEVICE_P(device) \ + if (unlikely(device == 0)) { \ + EDEB_ERR(4, "device=%p check failed", device); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_DEVICE(device) \ + if (unlikely(device == 0)) { \ + EDEB_ERR(4, "device=%p check failed", device); \ + return -EFAULT; } + +#define EHCA_CHECK_PD(pd) \ + if (unlikely(pd == 0)) { \ + EDEB_ERR(4, "pd=%p check failed", pd); \ + return -EFAULT; } + +#define EHCA_CHECK_PD_P(pd) \ + if (unlikely(pd == 0)) { \ + EDEB_ERR(4, "pd=%p check failed", pd); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_AV(av) \ + if (unlikely(av == 0)) { \ + EDEB_ERR(4, "av=%p check failed", av); \ + return -EFAULT; } + +#define EHCA_CHECK_AV_P(av) \ + if (unlikely(av == 0)) { \ + EDEB_ERR(4, "av=%p check failed", av); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_CQ(cq) \ + if (unlikely(cq == 0)) { \ + EDEB_ERR(4, "cq=%p check failed", cq); \ + return -EFAULT; } + +#define EHCA_CHECK_CQ_P(cq) \ + if (unlikely(cq == 0)) { \ + EDEB_ERR(4, "cq=%p check failed", cq); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_EQ(eq) \ + if (unlikely(eq == 0)) { \ + EDEB_ERR(4, "eq=%p check failed", eq); \ + return -EFAULT; } + +#define EHCA_CHECK_EQ_P(eq) \ + if (unlikely(eq == 0)) { \ + EDEB_ERR(4, "eq=%p check failed", eq); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_QP(qp) \ + if (unlikely(qp == 0)) { \ + EDEB_ERR(4, "qp=%p check failed", qp); \ + return -EFAULT; } + +#define EHCA_CHECK_QP_P(qp) \ + if (unlikely(qp == 0)) { \ + EDEB_ERR(4, "qp=%p check failed", qp); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_MR(mr) \ + if (unlikely(mr == 0)) { \ + EDEB_ERR(4, "mr=%p check failed", mr); \ + return -EFAULT; } + +#define EHCA_CHECK_MR_P(mr) \ + if (unlikely(mr == 0)) { \ + EDEB_ERR(4, "mr=%p check failed", mr); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_MW(mw) \ + if (unlikely(mw == 0)) { \ + EDEB_ERR(4, "mw=%p check failed", mw); \ + return -EFAULT; } + +#define
EHCA_CHECK_MW_P(mw) \ + if (unlikely(mw == 0)) { \ + EDEB_ERR(4, "mw=%p check failed", mw); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_FMR(fmr) \ + if (unlikely(fmr == 0)) { \ + EDEB_ERR(4, "fmr=%p check failed", fmr); \ + return -EFAULT; } + +#define EHCA_CHECK_FMR_P(fmr) \ + if (unlikely(fmr == 0)) { \ + EDEB_ERR(4, "fmr=%p check failed", fmr); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_REGISTER_PD(device,pd) +#define EHCA_REGISTER_AV(pd,av) +#define EHCA_DEREGISTER_PD(PD) +#define EHCA_DEREGISTER_AV(av) +#else +#define EHCA_CHECK_DEVICE_P(device) + +#define EHCA_CHECK_PD(pd) +#define EHCA_REGISTER_PD(device,pd) +#define EHCA_DEREGISTER_PD(PD) +#endif + +static inline int ehca_adr_bad(void *adr) +{ + return !adr; +} + +/* Converts ehca to ib return code */ +static inline int ehca2ib_return_code(u64 ehca_rc) +{ + switch (ehca_rc) { + case H_SUCCESS: + return 0; + case H_BUSY: + return -EBUSY; + case H_NO_MEM: + return -ENOMEM; + default: + return -EINVAL; + } +} + +#endif /* EHCA_TOOLS_H */ -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:01 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:01 -0700 Subject: [openib-general] [PATCH 08/13] IB/ehca: qp In-Reply-To: <20068171311.4HUmviC1Ip8J2EpE@cisco.com> Message-ID: <20068171311.7Z4EtLP0ZYtya78R@cisco.com> drivers/infiniband/hw/ehca/ehca_qes.h | 259 +++++ drivers/infiniband/hw/ehca/ehca_qp.c | 1594 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_reqs.c | 694 ++++++++++++++ drivers/infiniband/hw/ehca/ehca_sqp.c | 123 ++ 4 files changed, 2670 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qes.h b/drivers/infiniband/hw/ehca/ehca_qes.h new file mode 100644 index 0000000..8707d29 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_qes.h @@ -0,0 +1,259 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Hardware request structures + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
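Since the queue-entry definitions below (the GRH_* and WC_* masks in ehca_qes.h), like the QPX_/CQX_ register masks in hipz_hw.h earlier, are all written in terms of the EHCA_BMASK_* helpers from ehca_tools.h above, here is a small stand-alone sketch of what those helpers evaluate to. The macro bodies are copied from the header so the example builds outside the kernel tree; bit positions follow IBM numbering, i.e. bit 0 is the most significant bit of the 64-bit word:

#include <assert.h>
#include <stdio.h>

typedef unsigned long long u64;

/* copied from ehca_tools.h so this compiles on its own */
#define EHCA_BMASK_IBM(from, to)    (((63 - to) << 16) + ((to) - (from) + 1))
#define EHCA_BMASK_SHIFTPOS(mask)   (((mask) >> 16) & 0xffff)
#define EHCA_BMASK_MASK(mask)       (0xffffffffffffffffULL >> ((64 - (mask)) & 0xffff))
#define EHCA_BMASK_SET(mask, value) \
        ((EHCA_BMASK_MASK(mask) & ((u64)(value))) << EHCA_BMASK_SHIFTPOS(mask))
#define EHCA_BMASK_GET(mask, value) \
        (EHCA_BMASK_MASK(mask) & (((u64)(value)) >> EHCA_BMASK_SHIFTPOS(mask)))

/* two masks taken from the driver headers */
#define QPX_SQADDER   EHCA_BMASK_IBM(48, 63)  /* send-queue doorbell adder, low 16 bits */
#define GRH_FLAG_MASK EHCA_BMASK_IBM(7, 7)    /* single flag bit, IBM bit 7 */

int main(void)
{
        u64 doorbell = EHCA_BMASK_SET(QPX_SQADDER, 3);   /* as in hipz_update_sqa() */
        u64 grh      = EHCA_BMASK_SET(GRH_FLAG_MASK, 1);

        assert(doorbell == 0x3);                 /* IBM bits 48..63 are the low 16 bits */
        assert(grh == (1ULL << 56));             /* IBM bit 7 == little-endian bit 56 */
        assert(EHCA_BMASK_GET(QPX_SQADDER, doorbell) == 3);

        printf("doorbell=%#llx grh=%#llx\n", doorbell, grh);
        return 0;
}

In other words, EHCA_BMASK_IBM() packs the shift position (63 - to) into the upper half of the mask word and the field width into the lower half; SET() masks a value to that width and shifts it into place, and GET() pulls it back out.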
+ */ + + +#ifndef _EHCA_QES_H_ +#define _EHCA_QES_H_ + +#include "ehca_tools.h" + +/* virtual scatter gather entry to specify remote adresses with length */ +struct ehca_vsgentry { + u64 vaddr; + u32 lkey; + u32 length; +}; + +#define GRH_FLAG_MASK EHCA_BMASK_IBM(7,7) +#define GRH_IPVERSION_MASK EHCA_BMASK_IBM(0,3) +#define GRH_TCLASS_MASK EHCA_BMASK_IBM(4,12) +#define GRH_FLOWLABEL_MASK EHCA_BMASK_IBM(13,31) +#define GRH_PAYLEN_MASK EHCA_BMASK_IBM(32,47) +#define GRH_NEXTHEADER_MASK EHCA_BMASK_IBM(48,55) +#define GRH_HOPLIMIT_MASK EHCA_BMASK_IBM(56,63) + +/* + * Unreliable Datagram Address Vector Format + * see IBTA Vol1 chapter 8.3 Global Routing Header + */ +struct ehca_ud_av { + u8 sl; + u8 lnh; + u16 dlid; + u8 reserved1; + u8 reserved2; + u8 reserved3; + u8 slid_path_bits; + u8 reserved4; + u8 ipd; + u8 reserved5; + u8 pmtu; + u32 reserved6; + u64 reserved7; + union { + struct { + u64 word_0; /* always set to 6 */ + /*should be 0x1B for IB transport */ + u64 word_1; + u64 word_2; + u64 word_3; + u64 word_4; + } grh; + struct { + u32 wd_0; + u32 wd_1; + /* DWord_1 --> SGID */ + + u32 sgid_wd3; + u32 sgid_wd2; + + u32 sgid_wd1; + u32 sgid_wd0; + /* DWord_3 --> DGID */ + + u32 dgid_wd3; + u32 dgid_wd2; + + u32 dgid_wd1; + u32 dgid_wd0; + } grh_l; + }; +}; + +/* maximum number of sg entries allowed in a WQE */ +#define MAX_WQE_SG_ENTRIES 252 + +#define WQE_OPTYPE_SEND 0x80 +#define WQE_OPTYPE_RDMAREAD 0x40 +#define WQE_OPTYPE_RDMAWRITE 0x20 +#define WQE_OPTYPE_CMPSWAP 0x10 +#define WQE_OPTYPE_FETCHADD 0x08 +#define WQE_OPTYPE_BIND 0x04 + +#define WQE_WRFLAG_REQ_SIGNAL_COM 0x80 +#define WQE_WRFLAG_FENCE 0x40 +#define WQE_WRFLAG_IMM_DATA_PRESENT 0x20 +#define WQE_WRFLAG_SOLIC_EVENT 0x10 + +#define WQEF_CACHE_HINT 0x80 +#define WQEF_CACHE_HINT_RD_WR 0x40 +#define WQEF_TIMED_WQE 0x20 +#define WQEF_PURGE 0x08 +#define WQEF_HIGH_NIBBLE 0xF0 + +#define MW_BIND_ACCESSCTRL_R_WRITE 0x40 +#define MW_BIND_ACCESSCTRL_R_READ 0x20 +#define MW_BIND_ACCESSCTRL_R_ATOMIC 0x10 + +struct ehca_wqe { + u64 work_request_id; + u8 optype; + u8 wr_flag; + u16 pkeyi; + u8 wqef; + u8 nr_of_data_seg; + u16 wqe_provided_slid; + u32 destination_qp_number; + u32 resync_psn_sqp; + u32 local_ee_context_qkey; + u32 immediate_data; + union { + struct { + u64 remote_virtual_adress; + u32 rkey; + u32 reserved; + u64 atomic_1st_op_dma_len; + u64 atomic_2nd_op; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + + } nud; + struct { + u64 ehca_ud_av_ptr; + u64 reserved1; + u64 reserved2; + u64 reserved3; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + } ud_avp; + struct { + struct ehca_ud_av ud_av; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES - + 2]; + } ud_av; + struct { + u64 reserved0; + u64 reserved1; + u64 reserved2; + u64 reserved3; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + } all_rcv; + + struct { + u64 reserved; + u32 rkey; + u32 old_rkey; + u64 reserved1; + u64 reserved2; + u64 virtual_address; + u32 reserved3; + u32 length; + u32 reserved4; + u16 reserved5; + u8 reserved6; + u8 lr_ctl; + u32 lkey; + u32 reserved7; + u64 reserved8; + u64 reserved9; + u64 reserved10; + u64 reserved11; + } bind; + struct { + u64 reserved12; + u64 reserved13; + u32 size; + u32 start; + } inline_data; + } u; + +}; + +#define WC_SEND_RECEIVE EHCA_BMASK_IBM(0,0) +#define WC_IMM_DATA EHCA_BMASK_IBM(1,1) +#define WC_GRH_PRESENT EHCA_BMASK_IBM(2,2) +#define WC_SE_BIT EHCA_BMASK_IBM(3,3) +#define WC_STATUS_ERROR_BIT 0x80000000 +#define WC_STATUS_REMOTE_ERROR_FLAGS 0x0000F800 +#define WC_STATUS_PURGE_BIT 0x10 + +struct 
ehca_cqe { + u64 work_request_id; + u8 optype; + u8 w_completion_flags; + u16 reserved1; + u32 nr_bytes_transferred; + u32 immediate_data; + u32 local_qp_number; + u8 freed_resource_count; + u8 service_level; + u16 wqe_count; + u32 qp_token; + u32 qkey_ee_token; + u32 remote_qp_number; + u16 dlid; + u16 rlid; + u16 reserved2; + u16 pkey_index; + u32 cqe_timestamp; + u32 wqe_timestamp; + u8 wqe_timestamp_valid; + u8 reserved3; + u8 reserved4; + u8 cqe_flags; + u32 status; +}; + +struct ehca_eqe { + u64 entry; +}; + +struct ehca_mrte { + u64 starting_va; + u64 length; /* length of memory region in bytes*/ + u32 pd; + u8 key_instance; + u8 pagesize; + u8 mr_control; + u8 local_remote_access_ctrl; + u8 reserved[0x20 - 0x18]; + u64 at_pointer[4]; +}; +#endif /*_EHCA_QES_H_*/ diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c new file mode 100644 index 0000000..011afa6 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -0,0 +1,1594 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * QP functions + * + * Authors: Waleri Fomin + * Hoang-Nam Nguyen + * Reinhard Ernst + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + + +#define DEB_PREFIX "e_qp" + +#include + +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" +#include "hipz_fns.h" + +/* + * attributes not supported by query qp + */ +#define QP_ATTR_QUERY_NOT_SUPPORTED (IB_QP_MAX_DEST_RD_ATOMIC | \ + IB_QP_MAX_QP_RD_ATOMIC | \ + IB_QP_ACCESS_FLAGS | \ + IB_QP_EN_SQD_ASYNC_NOTIFY) + +/* + * ehca (internal) qp state values + */ +enum ehca_qp_state { + EHCA_QPS_RESET = 1, + EHCA_QPS_INIT = 2, + EHCA_QPS_RTR = 3, + EHCA_QPS_RTS = 5, + EHCA_QPS_SQD = 6, + EHCA_QPS_SQE = 8, + EHCA_QPS_ERR = 128 +}; + +/* + * qp state transitions as defined by IB Arch Rel 1.1 page 431 + */ +enum ib_qp_statetrans { + IB_QPST_ANY2RESET, + IB_QPST_ANY2ERR, + IB_QPST_RESET2INIT, + IB_QPST_INIT2RTR, + IB_QPST_INIT2INIT, + IB_QPST_RTR2RTS, + IB_QPST_RTS2SQD, + IB_QPST_RTS2RTS, + IB_QPST_SQD2RTS, + IB_QPST_SQE2RTS, + IB_QPST_SQD2SQD, + IB_QPST_MAX /* nr of transitions, this must be last!!! */ +}; + +/* + * ib2ehca_qp_state maps IB to ehca qp_state + * returns ehca qp state corresponding to given ib qp state + */ +static inline enum ehca_qp_state ib2ehca_qp_state(enum ib_qp_state ib_qp_state) +{ + switch (ib_qp_state) { + case IB_QPS_RESET: + return EHCA_QPS_RESET; + case IB_QPS_INIT: + return EHCA_QPS_INIT; + case IB_QPS_RTR: + return EHCA_QPS_RTR; + case IB_QPS_RTS: + return EHCA_QPS_RTS; + case IB_QPS_SQD: + return EHCA_QPS_SQD; + case IB_QPS_SQE: + return EHCA_QPS_SQE; + case IB_QPS_ERR: + return EHCA_QPS_ERR; + default: + EDEB_ERR(4, "invalid ib_qp_state=%x", ib_qp_state); + return -EINVAL; + } +} + +/* + * ehca2ib_qp_state maps ehca to IB qp_state + * returns ib qp state corresponding to given ehca qp state + */ +static inline enum ib_qp_state ehca2ib_qp_state(enum ehca_qp_state + ehca_qp_state) +{ + switch (ehca_qp_state) { + case EHCA_QPS_RESET: + return IB_QPS_RESET; + case EHCA_QPS_INIT: + return IB_QPS_INIT; + case EHCA_QPS_RTR: + return IB_QPS_RTR; + case EHCA_QPS_RTS: + return IB_QPS_RTS; + case EHCA_QPS_SQD: + return IB_QPS_SQD; + case EHCA_QPS_SQE: + return IB_QPS_SQE; + case EHCA_QPS_ERR: + return IB_QPS_ERR; + default: + EDEB_ERR(4,"invalid ehca_qp_state=%x",ehca_qp_state); + return -EINVAL; + } +} + +/* + * ehca_qp_type used as index for req_attr and opt_attr of + * struct ehca_modqp_statetrans + */ +enum ehca_qp_type { + QPT_RC = 0, + QPT_UC = 1, + QPT_UD = 2, + QPT_SQP = 3, + QPT_MAX +}; + +/* + * ib2ehcaqptype maps Ib to ehca qp_type + * returns ehca qp type corresponding to ib qp type + */ +static inline enum ehca_qp_type ib2ehcaqptype(enum ib_qp_type ibqptype) +{ + switch (ibqptype) { + case IB_QPT_SMI: + case IB_QPT_GSI: + return QPT_SQP; + case IB_QPT_RC: + return QPT_RC; + case IB_QPT_UC: + return QPT_UC; + case IB_QPT_UD: + return QPT_UD; + default: + EDEB_ERR(4,"Invalid ibqptype=%x", ibqptype); + return -EINVAL; + } +} + +static inline enum ib_qp_statetrans get_modqp_statetrans(int ib_fromstate, + int ib_tostate) +{ + int index = -EINVAL; + switch (ib_tostate) { + case IB_QPS_RESET: + index = IB_QPST_ANY2RESET; + break; + case IB_QPS_INIT: + if (ib_fromstate == IB_QPS_RESET) + index = IB_QPST_RESET2INIT; + else if (ib_fromstate == IB_QPS_INIT) + index = IB_QPST_INIT2INIT; + break; + case IB_QPS_RTR: + if (ib_fromstate == IB_QPS_INIT) + index = IB_QPST_INIT2RTR; + break; + case IB_QPS_RTS: + if (ib_fromstate == IB_QPS_RTR) + index = IB_QPST_RTR2RTS; + else if (ib_fromstate == IB_QPS_RTS) + index = IB_QPST_RTS2RTS; + else if (ib_fromstate == IB_QPS_SQD) + index = IB_QPST_SQD2RTS; + 
else if (ib_fromstate == IB_QPS_SQE) + index = IB_QPST_SQE2RTS; + break; + case IB_QPS_SQD: + if (ib_fromstate == IB_QPS_RTS) + index = IB_QPST_RTS2SQD; + break; + case IB_QPS_SQE: + break; + case IB_QPS_ERR: + index = IB_QPST_ANY2ERR; + break; + default: + break; + } + return index; +} + +enum ehca_service_type { + ST_RC = 0, + ST_UC = 1, + ST_RD = 2, + ST_UD = 3 +}; + +/* + * ibqptype2servicetype returns hcp service type corresponding to given + * ib qp type used by create_qp() + */ +static inline int ibqptype2servicetype(enum ib_qp_type ibqptype) +{ + switch (ibqptype) { + case IB_QPT_SMI: + case IB_QPT_GSI: + return ST_UD; + case IB_QPT_RC: + return ST_RC; + case IB_QPT_UC: + return ST_UC; + case IB_QPT_UD: + return ST_UD; + case IB_QPT_RAW_IPV6: + return -EINVAL; + case IB_QPT_RAW_ETY: + return -EINVAL; + default: + EDEB_ERR(4, "Invalid ibqptype=%x", ibqptype); + return -EINVAL; + } +} + +/* + * init_qp_queues initializes/constructs r/squeue and registers queue pages. + */ +static inline int init_qp_queues(struct ipz_adapter_handle ipz_hca_handle, + struct ehca_qp *my_qp, + int nr_sq_pages, + int nr_rq_pages, + int swqe_size, + int rwqe_size, + int nr_send_sges, int nr_receive_sges) +{ + int ret = -EINVAL; + int cnt = 0; + void *vpage = NULL; + u64 rpage = 0; + int ipz_rc = -1; + u64 h_ret = H_PARAMETER; + + ipz_rc = ipz_queue_ctor(&my_qp->ipz_squeue, + nr_sq_pages, + EHCA_PAGESIZE, swqe_size, nr_send_sges); + if (!ipz_rc) { + EDEB_ERR(4, "Cannot allocate page for squeue. ipz_rc=%x", + ipz_rc); + ret = -EBUSY; + return ret; + } + + ipz_rc = ipz_queue_ctor(&my_qp->ipz_rqueue, + nr_rq_pages, + EHCA_PAGESIZE, rwqe_size, nr_receive_sges); + if (!ipz_rc) { + EDEB_ERR(4, "Cannot allocate page for rqueue. ipz_rc=%x", + ipz_rc); + ret = -EBUSY; + goto init_qp_queues0; + } + /* register SQ pages */ + for (cnt = 0; cnt < nr_sq_pages; cnt++) { + vpage = ipz_qpageit_get_inc(&my_qp->ipz_squeue); + if (!vpage) { + EDEB_ERR(4, "SQ ipz_qpageit_get_inc() " + "failed p_vpage= %p", vpage); + ret = -EINVAL; + goto init_qp_queues1; + } + rpage = virt_to_abs(vpage); + + h_ret = hipz_h_register_rpage_qp(ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, 0, 0, + rpage, 1, + my_qp->galpas.kernel); + if (h_ret < H_SUCCESS) { + EDEB_ERR(4,"SQ hipz_qp_register_rpage() faield " + "rc=%lx", h_ret); + ret = ehca2ib_return_code(h_ret); + goto init_qp_queues1; + } + } + + ipz_qeit_reset(&my_qp->ipz_squeue); + + /* register RQ pages */ + for (cnt = 0; cnt < nr_rq_pages; cnt++) { + vpage = ipz_qpageit_get_inc(&my_qp->ipz_rqueue); + if (!vpage) { + EDEB_ERR(4,"RQ ipz_qpageit_get_inc() " + "failed p_vpage = %p", vpage); + h_ret = H_RESOURCE; + ret = -EINVAL; + goto init_qp_queues1; + } + + rpage = virt_to_abs(vpage); + + h_ret = hipz_h_register_rpage_qp(ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, 0, 1, + rpage, 1,my_qp->galpas.kernel); + if (h_ret < H_SUCCESS) { + EDEB_ERR(4, "RQ hipz_qp_register_rpage() failed " + "rc=%lx", h_ret); + ret = ehca2ib_return_code(h_ret); + goto init_qp_queues1; + } + if (cnt == (nr_rq_pages - 1)) { /* last page! 
*/ + if (h_ret != H_SUCCESS) { + EDEB_ERR(4,"RQ hipz_qp_register_rpage() " + "h_ret= %lx ", h_ret); + ret = ehca2ib_return_code(h_ret); + goto init_qp_queues1; + } + vpage = ipz_qpageit_get_inc(&my_qp->ipz_rqueue); + if (vpage) { + EDEB_ERR(4,"ipz_qpageit_get_inc() " + "should not succeed vpage=%p", + vpage); + ret = -EINVAL; + goto init_qp_queues1; + } + } else { + if (h_ret != H_PAGE_REGISTERED) { + EDEB_ERR(4,"RQ hipz_qp_register_rpage() " + "h_ret= %lx ", h_ret); + ret = ehca2ib_return_code(h_ret); + goto init_qp_queues1; + } + } + } + + ipz_qeit_reset(&my_qp->ipz_rqueue); + + return 0; + +init_qp_queues1: + ipz_queue_dtor(&my_qp->ipz_rqueue); +init_qp_queues0: + ipz_queue_dtor(&my_qp->ipz_squeue); + return ret; +} + + +struct ib_qp *ehca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + extern struct ehca_module ehca_module; + static int da_msg_size[]={ 128, 256, 512, 1024, 2048, 4096 }; + int ret = -EINVAL; + + struct ehca_qp *my_qp = NULL; + struct ehca_pd *my_pd = NULL; + struct ehca_shca *shca = NULL; + struct ib_ucontext *context = NULL; + u64 h_ret = H_PARAMETER; + int max_send_sge; + int max_recv_sge; + + /* h_call's out parameters */ + struct ehca_alloc_qp_parms parms; + u32 qp_nr = 0, swqe_size = 0, rwqe_size = 0; + u8 daqp_completion, isdaqp; + unsigned long flags; + + EDEB_EN(7,"pd=%p init_attr=%p", pd, init_attr); + EHCA_CHECK_PD_P(pd); + EHCA_CHECK_ADR_P(init_attr); + + if (init_attr->sq_sig_type != IB_SIGNAL_REQ_WR && + init_attr->sq_sig_type != IB_SIGNAL_ALL_WR) { + EDEB_ERR(4, "init_attr->sg_sig_type=%x not allowed", + init_attr->sq_sig_type); + return ERR_PTR(-EINVAL); + } + + /* save daqp completion bits */ + daqp_completion = init_attr->qp_type & 0x60; + /* save daqp bit */ + isdaqp = (init_attr->qp_type & 0x80) ? 
1 : 0; + init_attr->qp_type = init_attr->qp_type & 0x1F; + + if (init_attr->qp_type != IB_QPT_UD && + init_attr->qp_type != IB_QPT_SMI && + init_attr->qp_type != IB_QPT_GSI && + init_attr->qp_type != IB_QPT_UC && + init_attr->qp_type != IB_QPT_RC) { + EDEB_ERR(4,"wrong QP Type=%x",init_attr->qp_type); + return ERR_PTR(-EINVAL); + } + if (init_attr->qp_type != IB_QPT_RC && isdaqp != 0) { + EDEB_ERR(4,"unsupported LL QP Type=%x",init_attr->qp_type); + return ERR_PTR(-EINVAL); + } + + if (pd->uobject && udata) + context = pd->uobject->context; + + my_qp = kmem_cache_alloc(ehca_module.cache_qp, SLAB_KERNEL); + if (!my_qp) { + EDEB_ERR(4, "pd=%p not enough memory to alloc qp", pd); + return ERR_PTR(-ENOMEM); + } + + memset(my_qp, 0, sizeof(struct ehca_qp)); + memset (&parms, 0, sizeof(struct ehca_alloc_qp_parms)); + spin_lock_init(&my_qp->spinlock_s); + spin_lock_init(&my_qp->spinlock_r); + + my_pd = container_of(pd, struct ehca_pd, ib_pd); + + shca = container_of(pd->device, struct ehca_shca, ib_device); + my_qp->recv_cq = + container_of(init_attr->recv_cq, struct ehca_cq, ib_cq); + my_qp->send_cq = + container_of(init_attr->send_cq, struct ehca_cq, ib_cq); + + my_qp->init_attr = *init_attr; + + do { + if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) { + ret = -ENOMEM; + EDEB_ERR(4, "Can't reserve idr resources."); + goto create_qp_exit0; + } + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + } while (ret == -EAGAIN); + + if (ret) { + ret = -ENOMEM; + EDEB_ERR(4, "Can't allocate new idr entry."); + goto create_qp_exit0; + } + + parms.servicetype = ibqptype2servicetype(init_attr->qp_type); + if (parms.servicetype < 0) { + ret = -EINVAL; + EDEB_ERR(4, "Invalid qp_type=%x", init_attr->qp_type); + goto create_qp_exit0; + } + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + parms.sigtype = HCALL_SIGT_EVERY; + else + parms.sigtype = HCALL_SIGT_BY_WQE; + + /* UD_AV CIRCUMVENTION */ + max_send_sge = init_attr->cap.max_send_sge; + max_recv_sge = init_attr->cap.max_recv_sge; + if (IB_QPT_UD == init_attr->qp_type || + IB_QPT_GSI == init_attr->qp_type || + IB_QPT_SMI == init_attr->qp_type) { + max_send_sge += 2; + max_recv_sge += 2; + } + + EDEB(7, "isdaqp=%x daqp_completion=%x", isdaqp, daqp_completion); + + parms.ipz_eq_handle = shca->eq.ipz_eq_handle; + parms.daqp_ctrl = isdaqp | daqp_completion; + parms.pd = my_pd->fw_pd; + parms.max_recv_sge = max_recv_sge; + parms.max_send_sge = max_send_sge; + + h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, my_qp, &parms); + + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "h_alloc_resource_qp() failed h_ret=%lx", h_ret); + ret = ehca2ib_return_code(h_ret); + goto create_qp_exit1; + } + + switch (init_attr->qp_type) { + case IB_QPT_RC: + if (isdaqp == 0) { + swqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[ + (parms.act_nr_send_sges)]); + rwqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[ + (parms.act_nr_recv_sges)]); + } else { /* for daqp we need to use msg size, not wqe size */ + swqe_size = da_msg_size[max_send_sge]; + rwqe_size = da_msg_size[max_recv_sge]; + parms.act_nr_send_sges = 1; + parms.act_nr_recv_sges = 1; + } + break; + case IB_QPT_UC: + swqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[parms.act_nr_send_sges]); + rwqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[parms.act_nr_recv_sges]); + break; + + case IB_QPT_UD: + case IB_QPT_GSI: + case IB_QPT_SMI: + /* UD circumvention */ + parms.act_nr_recv_sges -= 2; + 
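+ /* take back the two extra SG entries added in the UD_AV CIRCUMVENTION
+ * block above, before the actual capabilities are copied back into
+ * init_attr->cap; judging from struct ehca_wqe, the inline address
+ * vector in u.ud_av occupies the space of two scatter/gather entries */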
parms.act_nr_send_sges -= 2; + swqe_size = offsetof(struct ehca_wqe, + u.ud_av.sg_list[parms.act_nr_send_sges]); + rwqe_size = offsetof(struct ehca_wqe, + u.ud_av.sg_list[parms.act_nr_recv_sges]); + + if (IB_QPT_GSI == init_attr->qp_type || + IB_QPT_SMI == init_attr->qp_type) { + parms.act_nr_send_wqes = init_attr->cap.max_send_wr; + parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr; + parms.act_nr_send_sges = init_attr->cap.max_send_sge; + parms.act_nr_recv_sges = init_attr->cap.max_recv_sge; + my_qp->real_qp_num = + (init_attr->qp_type == IB_QPT_SMI) ? 0 : 1; + } + + break; + + default: + break; + } + + /* initializes r/squeue and registers queue pages */ + ret = init_qp_queues(shca->ipz_hca_handle, my_qp, + parms.nr_sq_pages, parms.nr_rq_pages, + swqe_size, rwqe_size, + parms.act_nr_send_sges, parms.act_nr_recv_sges); + if (ret) { + EDEB_ERR(4,"Couldn't initialize r/squeue and pages ret=%x", + ret); + goto create_qp_exit2; + } + + my_qp->ib_qp.pd = &my_pd->ib_pd; + my_qp->ib_qp.device = my_pd->ib_pd.device; + + my_qp->ib_qp.recv_cq = init_attr->recv_cq; + my_qp->ib_qp.send_cq = init_attr->send_cq; + + my_qp->ib_qp.qp_num = my_qp->real_qp_num; + my_qp->ib_qp.qp_type = init_attr->qp_type; + + my_qp->qp_type = init_attr->qp_type; + my_qp->ib_qp.srq = init_attr->srq; + + my_qp->ib_qp.qp_context = init_attr->qp_context; + my_qp->ib_qp.event_handler = init_attr->event_handler; + + init_attr->cap.max_inline_data = 0; /* not supported yet */ + init_attr->cap.max_recv_sge = parms.act_nr_recv_sges; + init_attr->cap.max_recv_wr = parms.act_nr_recv_wqes; + init_attr->cap.max_send_sge = parms.act_nr_send_sges; + init_attr->cap.max_send_wr = parms.act_nr_send_wqes; + + /* NOTE: define_apq0() not supported yet */ + if (init_attr->qp_type == IB_QPT_GSI) { + h_ret = ehca_define_sqp(shca, my_qp, init_attr); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "ehca_define_sqp() failed rc=%lx",h_ret); + ret = ehca2ib_return_code(h_ret); + goto create_qp_exit3; + } + } + if (init_attr->send_cq) { + struct ehca_cq *cq = container_of(init_attr->send_cq, + struct ehca_cq, ib_cq); + ret = ehca_cq_assign_qp(cq, my_qp); + if (ret) { + EDEB_ERR(4, "Couldn't assign qp to send_cq ret=%x", + ret); + goto create_qp_exit3; + } + my_qp->send_cq = cq; + } + /* copy queues, galpa data to user space */ + if (context && udata) { + struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue; + struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue; + struct ehca_create_qp_resp resp; + struct vm_area_struct * vma; + memset(&resp, 0, sizeof(resp)); + + resp.qp_num = my_qp->real_qp_num; + resp.token = my_qp->token; + resp.qp_type = my_qp->qp_type; + resp.qkey = my_qp->qkey; + resp.real_qp_num = my_qp->real_qp_num; + /* rqueue properties */ + resp.ipz_rqueue.qe_size = ipz_rqueue->qe_size; + resp.ipz_rqueue.act_nr_of_sg = ipz_rqueue->act_nr_of_sg; + resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length; + resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize; + resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state; + ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x22000000, + ipz_rqueue->queue_length, + (void**)&resp.ipz_rqueue.queue, + &vma); + if (ret) { + EDEB_ERR(4, "Could not mmap rqueue pages"); + goto create_qp_exit3; + } + my_qp->uspace_rqueue = resp.ipz_rqueue.queue; + /* squeue properties */ + resp.ipz_squeue.qe_size = ipz_squeue->qe_size; + resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg; + resp.ipz_squeue.queue_length = ipz_squeue->queue_length; + resp.ipz_squeue.pagesize = ipz_squeue->pagesize; + resp.ipz_squeue.toggle_state = 
ipz_squeue->toggle_state; + ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x23000000, + ipz_squeue->queue_length, + (void**)&resp.ipz_squeue.queue, + &vma); + if (ret) { + EDEB_ERR(4, "Could not mmap squeue pages"); + goto create_qp_exit4; + } + my_qp->uspace_squeue = resp.ipz_squeue.queue; + /* fw_handle */ + resp.galpas = my_qp->galpas; + ret = ehca_mmap_register(my_qp->galpas.user.fw_handle, + (void**)&resp.galpas.kernel.fw_handle, + &vma); + if (ret) { + EDEB_ERR(4, "Could not mmap fw_handle"); + goto create_qp_exit5; + } + my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; + + if (ib_copy_to_udata(udata, &resp, sizeof resp)) { + EDEB_ERR(4, "Copy to udata failed"); + ret = -EINVAL; + goto create_qp_exit6; + } + } + + EDEB_EX(7, "ehca_qp=%p qp_num=%x, token=%x", + my_qp, qp_nr, my_qp->token); + return &my_qp->ib_qp; + +create_qp_exit6: + ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE); + +create_qp_exit5: + ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length); + +create_qp_exit4: + ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length); + +create_qp_exit3: + ipz_queue_dtor(&my_qp->ipz_rqueue); + ipz_queue_dtor(&my_qp->ipz_squeue); + +create_qp_exit2: + hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + +create_qp_exit1: + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + idr_remove(&ehca_qp_idr, my_qp->token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + +create_qp_exit0: + kmem_cache_free(ehca_module.cache_qp, my_qp); + EDEB_EX(4, "failed ret=%x", ret); + return ERR_PTR(ret); + +} + +/* + * prepare_sqe_rts called by internal_modify_qp() at trans sqe -> rts + * set purge bit of bad wqe and subsequent wqes to avoid reentering sqe + * returns total number of bad wqes in bad_wqe_cnt + */ +static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, + int *bad_wqe_cnt) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + struct ipz_queue *squeue = NULL; + void *bad_send_wqe_p = NULL; + void *bad_send_wqe_v = NULL; + void *squeue_start_p = NULL; + void *squeue_end_p = NULL; + void *squeue_start_v = NULL; + void *squeue_end_v = NULL; + struct ehca_wqe *wqe = NULL; + int qp_num = my_qp->ib_qp.qp_num; + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ", my_qp, qp_num); + + /* get send wqe pointer */ + h_ret = hipz_h_disable_and_get_wqe(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, &my_qp->pf, + &bad_send_wqe_p, NULL, 2); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_h_disable_and_get_wqe() failed " + "ehca_qp=%p qp_num=%x h_ret=%lx",my_qp, qp_num, h_ret); + ret = ehca2ib_return_code(h_ret); + goto prepare_sqe_rts_exit1; + } + bad_send_wqe_p = (void*)((u64)bad_send_wqe_p & (~(1L<<63))); + EDEB(7, "qp_num=%x bad_send_wqe_p=%p", qp_num, bad_send_wqe_p); + /* convert wqe pointer to vadr */ + bad_send_wqe_v = abs_to_virt((u64)bad_send_wqe_p); + EDEB_DMP(6, bad_send_wqe_v, 32, "qp_num=%x bad_wqe", qp_num); + squeue = &my_qp->ipz_squeue; + squeue_start_p = (void*)virt_to_abs(ipz_qeit_calc(squeue, 0L)); + squeue_end_p = squeue_start_p+squeue->queue_length; + squeue_start_v = abs_to_virt((u64)squeue_start_p); + squeue_end_v = abs_to_virt((u64)squeue_end_p); + EDEB(6, "qp_num=%x squeue_start_v=%p squeue_end_v=%p", + qp_num, squeue_start_v, squeue_end_v); + + /* loop sets wqe's purge bit */ + wqe = (struct ehca_wqe*)bad_send_wqe_v; + *bad_wqe_cnt = 0; + while (wqe->optype != 0xff && wqe->wqef != 0xff) { + EDEB_DMP(6, wqe, 32, "qp_num=%x wqe", qp_num); + wqe->nr_of_data_seg = 0; /* suppress data access */ + wqe->wqef = WQEF_PURGE; /* WQE to be purged */ + wqe = (struct 
ehca_wqe*)((u8*)wqe+squeue->qe_size); + *bad_wqe_cnt = (*bad_wqe_cnt)+1; + if ((void*)wqe >= squeue_end_v) { + wqe = squeue_start_v; + } + } + /* + * bad wqe will be reprocessed and ignored when pol_cq() is called, + * i.e. nr of wqes with flush error status is one less + */ + EDEB(6, "qp_num=%x flusherr_wqe_cnt=%x", qp_num, (*bad_wqe_cnt)-1); + wqe->wqef = 0; + +prepare_sqe_rts_exit1: + + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x", my_qp, qp_num, ret); + return ret; +} + +/* + * internal_modify_qp with circumvention to handle aqp0 properly + * smi_reset2init indicates if this is an internal reset-to-init-call for + * smi. This flag must always be zero if called from ehca_modify_qp()! + * This internal func was intorduced to avoid recursion of ehca_modify_qp()! + */ +static int internal_modify_qp(struct ib_qp *ibqp, + struct ib_qp_attr *attr, + int attr_mask, int smi_reset2init) +{ + enum ib_qp_state qp_cur_state = 0, qp_new_state = 0; + int cnt = 0, qp_attr_idx = 0, ret = 0; + + enum ib_qp_statetrans statetrans; + struct hcp_modify_qp_control_block *mqpcb = NULL; + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + u64 update_mask = 0; + u64 h_ret = H_SUCCESS; + int bad_wqe_cnt = 0; + int squeue_locked = 0; + unsigned long spl_flags = 0; + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ibqp_type=%x " + "new qp_state=%x attribute_mask=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, + attr->qp_state, attr_mask); + + /* do query_qp to obtain current attr values */ + mqpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (mqpcb == NULL) { + ret = -ENOMEM; + EDEB_ERR(4, "Could not get zeroed page for mqpcb " + "ehca_qp=%p qp_num=%x ", my_qp, ibqp->qp_num); + goto modify_qp_exit0; + } + + h_ret = hipz_h_query_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + mqpcb, my_qp->galpas.kernel); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_h_query_qp() failed " + "ehca_qp=%p qp_num=%x h_ret=%lx", + my_qp, ibqp->qp_num, h_ret); + ret = ehca2ib_return_code(h_ret); + goto modify_qp_exit1; + } + EDEB(7, "ehca_qp=%p qp_num=%x ehca_qp_state=%x", + my_qp, ibqp->qp_num, mqpcb->qp_state); + + qp_cur_state = ehca2ib_qp_state(mqpcb->qp_state); + + if (qp_cur_state == -EINVAL) { /* invalid qp state */ + ret = -EINVAL; + EDEB_ERR(4, "Invalid current ehca_qp_state=%x " + "ehca_qp=%p qp_num=%x", + mqpcb->qp_state, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + /* + * circumvention to set aqp0 initial state to init + * as expected by IB spec + */ + if (smi_reset2init == 0 && + ibqp->qp_type == IB_QPT_SMI && + qp_cur_state == IB_QPS_RESET && + (attr_mask & IB_QP_STATE) && + attr->qp_state == IB_QPS_INIT) { /* RESET -> INIT */ + struct ib_qp_attr smiqp_attr = { + .qp_state = IB_QPS_INIT, + .port_num = my_qp->init_attr.port_num, + .pkey_index = 0, + .qkey = 0 + }; + int smiqp_attr_mask = IB_QP_STATE | IB_QP_PORT | + IB_QP_PKEY_INDEX | IB_QP_QKEY; + int smirc = internal_modify_qp( + ibqp, &smiqp_attr, smiqp_attr_mask, 1); + if (smirc) { + EDEB_ERR(4, "SMI RESET -> INIT failed. 
" + "ehca_modify_qp() rc=%x", smirc); + ret = H_PARAMETER; + goto modify_qp_exit1; + } + qp_cur_state = IB_QPS_INIT; + EDEB(7, "SMI RESET -> INIT succeeded"); + } + /* is transmitted current state equal to "real" current state */ + if ((attr_mask & IB_QP_CUR_STATE) && + qp_cur_state != attr->cur_qp_state) { + ret = -EINVAL; + EDEB_ERR(4, "Invalid IB_QP_CUR_STATE attr->curr_qp_state=%x <>" + " actual cur_qp_state=%x. ehca_qp=%p qp_num=%x", + attr->cur_qp_state, qp_cur_state, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + EDEB(7, "ehca_qp=%p qp_num=%x current qp_state=%x " + "new qp_state=%x attribute_mask=%x", + my_qp, ibqp->qp_num, qp_cur_state, attr->qp_state, attr_mask); + + qp_new_state = attr_mask & IB_QP_STATE ? attr->qp_state : qp_cur_state; + if (!smi_reset2init && + !ib_modify_qp_is_ok(qp_cur_state, qp_new_state, ibqp->qp_type, + attr_mask)) { + ret = -EINVAL; + EDEB_ERR(4, "Invalid qp transition new_state=%x cur_state=%x " + "ehca_qp=%p qp_num=%x attr_mask=%x", + qp_new_state, qp_cur_state, my_qp, ibqp->qp_num, + attr_mask); + goto modify_qp_exit1; + } + + if ((mqpcb->qp_state = ib2ehca_qp_state(qp_new_state))) + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1); + else { + ret = -EINVAL; + EDEB_ERR(4, "Invalid new qp state=%x " + "ehca_qp=%p qp_num=%x", + qp_new_state, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + /* retrieve state transition struct to get req and opt attrs */ + statetrans = get_modqp_statetrans(qp_cur_state, qp_new_state); + if (statetrans < 0) { + ret = -EINVAL; + EDEB_ERR(4, " qp_cur_state=%x " + "new_qp_state=%x State_xsition=%x " + "ehca_qp=%p qp_num=%x", + qp_cur_state, qp_new_state, + statetrans, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + qp_attr_idx = ib2ehcaqptype(ibqp->qp_type); + + if (qp_attr_idx < 0) { + ret = qp_attr_idx; + EDEB_ERR(4, "Invalid QP type=%x ehca_qp=%p qp_num=%x", + ibqp->qp_type, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + EDEB(7, "ehca_qp=%p qp_num=%x qp_state_xsit=%x", + my_qp, ibqp->qp_num, statetrans); + + /* sqe -> rts: set purge bit of bad wqe before actual trans */ + if ((my_qp->qp_type == IB_QPT_UD || + my_qp->qp_type == IB_QPT_GSI || + my_qp->qp_type == IB_QPT_SMI) && + statetrans == IB_QPST_SQE2RTS) { + /* mark next free wqe if kernel */ + if (my_qp->uspace_squeue == 0) { + struct ehca_wqe *wqe = NULL; + /* lock send queue */ + spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); + squeue_locked = 1; + /* mark next free wqe */ + wqe = (struct ehca_wqe*) + ipz_qeit_get(&my_qp->ipz_squeue); + wqe->optype = wqe->wqef = 0xff; + EDEB(7, "qp_num=%x next_free_wqe=%p", + ibqp->qp_num, wqe); + } + ret = prepare_sqe_rts(my_qp, shca, &bad_wqe_cnt); + if (ret) { + EDEB_ERR(4, "prepare_sqe_rts() failed " + "ehca_qp=%p qp_num=%x ret=%x", + my_qp, ibqp->qp_num, ret); + goto modify_qp_exit2; + } + } + + /* + * enable RDMA_Atomic_Control if reset->init und reliable con + * this is necessary since gen2 does not provide that flag, + * but pHyp requires it + */ + if (statetrans == IB_QPST_RESET2INIT && + (ibqp->qp_type == IB_QPT_RC || ibqp->qp_type == IB_QPT_UC)) { + mqpcb->rdma_atomic_ctrl = 3; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RDMA_ATOMIC_CTRL, 1); + } + /* circ. 
pHyp requires #RDMA/Atomic Resp Res for UC INIT -> RTR */ + if (statetrans == IB_QPST_INIT2RTR && + (ibqp->qp_type == IB_QPT_UC) && + !(attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)) { + mqpcb->rdma_nr_atomic_resp_res = 1; /* default to 1 */ + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES, 1); + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + mqpcb->prim_p_key_idx = attr->pkey_index; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_P_KEY_IDX, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_PKEY_INDEX update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_PORT) { + if (attr->port_num < 1 || attr->port_num > shca->num_ports) { + ret = -EINVAL; + EDEB_ERR(4, "Invalid port=%x. " + "ehca_qp=%p qp_num=%x num_ports=%x", + attr->port_num, my_qp, ibqp->qp_num, + shca->num_ports); + goto modify_qp_exit2; + } + mqpcb->prim_phys_port = attr->port_num; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_PHYS_PORT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_PORT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_QKEY) { + mqpcb->qkey = attr->qkey; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_QKEY, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_QKEY update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_AV) { + int ah_mult = ib_rate_to_mult(attr->ah_attr.static_rate); + int ehca_mult = ib_rate_to_mult(shca->sport[my_qp-> + init_attr.port_num].rate); + + mqpcb->dlid = attr->ah_attr.dlid; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID, 1); + mqpcb->source_path_bits = attr->ah_attr.src_path_bits; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS, 1); + mqpcb->service_level = attr->ah_attr.sl; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL, 1); + + if (ah_mult < ehca_mult) + mqpcb->max_static_rate = (ah_mult > 0) ? + ((ehca_mult - 1) / ah_mult) : 0; + else + mqpcb->max_static_rate = 0; + + EDEB(7, " ipd=mqpcb->max_static_rate set %x " + " ah_mult=%x ehca_mult=%x " + " attr->ah_attr.static_rate=%x", + mqpcb->max_static_rate,ah_mult,ehca_mult, + attr->ah_attr.static_rate); + + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE, 1); + + /* + * only if GRH is TRUE we might consider SOURCE_GID_IDX + * and DEST_GID otherwise phype will return H_ATTR_PARM!!! 
+ */ + if (attr->ah_attr.ah_flags == IB_AH_GRH) { + mqpcb->send_grh_flag = 1 << 31; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1); + mqpcb->source_gid_idx = attr->ah_attr.grh.sgid_index; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX, 1); + + for (cnt = 0; cnt < 16; cnt++) + mqpcb->dest_gid.byte[cnt] = + attr->ah_attr.grh.dgid.raw[cnt]; + + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_GID, 1); + mqpcb->flow_label = attr->ah_attr.grh.flow_label; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_FLOW_LABEL, 1); + mqpcb->hop_limit = attr->ah_attr.grh.hop_limit; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_HOP_LIMIT, 1); + mqpcb->traffic_class = attr->ah_attr.grh.traffic_class; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_TRAFFIC_CLASS, 1); + } + + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_AV update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_PATH_MTU) { + mqpcb->path_mtu = attr->path_mtu; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PATH_MTU, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_PATH_MTU update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_TIMEOUT) { + mqpcb->timeout = attr->timeout; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_TIMEOUT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_TIMEOUT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RETRY_CNT) { + mqpcb->retry_count = attr->retry_cnt; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RETRY_COUNT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_RETRY_CNT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RNR_RETRY) { + mqpcb->rnr_retry_count = attr->rnr_retry; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RNR_RETRY_COUNT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_RNR_RETRY update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RQ_PSN) { + mqpcb->receive_psn = attr->rq_psn; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RECEIVE_PSN, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_RQ_PSN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) { + mqpcb->rdma_nr_atomic_resp_res = attr->max_dest_rd_atomic < 3 ? + attr->max_dest_rd_atomic : 2; /* max is 2 */ + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_MAX_DEST_RD_ATOMIC " + "update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) { + mqpcb->rdma_atomic_outst_dest_qp = attr->max_rd_atomic < 3 ? + attr->max_rd_atomic : 2; + update_mask |= + EHCA_BMASK_SET + (MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_MAX_QP_RD_ATOMIC " + "update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_ALT_PATH) { + int ah_mult = ib_rate_to_mult(attr->alt_ah_attr.static_rate); + int ehca_mult = ib_rate_to_mult( + shca->sport[my_qp->init_attr.port_num].rate); + + mqpcb->dlid_al = attr->alt_ah_attr.dlid; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID_AL, 1); + mqpcb->source_path_bits_al = attr->alt_ah_attr.src_path_bits; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS_AL, 1); + mqpcb->service_level_al = attr->alt_ah_attr.sl; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL_AL, 1); + + if (ah_mult < ehca_mult) + mqpcb->max_static_rate = (ah_mult > 0) ? 
+ ((ehca_mult - 1) / ah_mult) : 0; + else + mqpcb->max_static_rate_al = 0; + + EDEB(7, " ipd=mqpcb->max_static_rate set %x," + " ah_mult=%x ehca_mult=%x", + mqpcb->max_static_rate,ah_mult,ehca_mult); + + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE_AL, 1); + + /* + * only if GRH is TRUE we might consider SOURCE_GID_IDX + * and DEST_GID otherwise phype will return H_ATTR_PARM!!! + */ + if (attr->alt_ah_attr.ah_flags == IB_AH_GRH) { + mqpcb->send_grh_flag_al = 1 << 31; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG_AL, 1); + mqpcb->source_gid_idx_al = + attr->alt_ah_attr.grh.sgid_index; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX_AL, 1); + + for (cnt = 0; cnt < 16; cnt++) + mqpcb->dest_gid_al.byte[cnt] = + attr->alt_ah_attr.grh.dgid.raw[cnt]; + + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_DEST_GID_AL, 1); + mqpcb->flow_label_al = attr->alt_ah_attr.grh.flow_label; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_FLOW_LABEL_AL, 1); + mqpcb->hop_limit_al = attr->alt_ah_attr.grh.hop_limit; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_HOP_LIMIT_AL, 1); + mqpcb->traffic_class_al = + attr->alt_ah_attr.grh.traffic_class; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_TRAFFIC_CLASS_AL, 1); + } + + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_ALT_PATH update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + mqpcb->min_rnr_nak_timer_field = attr->min_rnr_timer; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_MIN_RNR_TIMER update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_SQ_PSN) { + mqpcb->send_psn = attr->sq_psn; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_PSN, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_SQ_PSN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_DEST_QPN) { + mqpcb->dest_qp_nr = attr->dest_qp_num; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_QP_NR, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_DEST_QPN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_PATH_MIG_STATE) { + mqpcb->path_migration_state = attr->path_mig_state; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_PATH_MIGRATION_STATE, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_PATH_MIG_STATE update_mask=%lx", my_qp, + ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_CAP) { + mqpcb->max_nr_outst_send_wr = attr->cap.max_send_wr+1; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MAX_NR_OUTST_SEND_WR, 1); + mqpcb->max_nr_outst_recv_wr = attr->cap.max_recv_wr+1; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MAX_NR_OUTST_RECV_WR, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_CAP update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + /* no support for max_send/recv_sge yet */ + } + + EDEB_DMP(7, mqpcb, 4*70, "ehca_qp=%p qp_num=%x", my_qp, ibqp->qp_num); + + h_ret = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, my_qp->galpas.kernel); + + if (h_ret != H_SUCCESS) { + ret = ehca2ib_return_code(h_ret); + EDEB_ERR(4, "hipz_h_modify_qp() failed rc=%lx " + "ehca_qp=%p qp_num=%x", + h_ret, my_qp, ibqp->qp_num); + goto modify_qp_exit2; + } + + if ((my_qp->qp_type == IB_QPT_UD || + my_qp->qp_type == IB_QPT_GSI || + my_qp->qp_type == IB_QPT_SMI) && + statetrans == IB_QPST_SQE2RTS) { + /* doorbell to reprocessing wqes */ + iosync(); /* serialize GAL register access */ + hipz_update_sqa(my_qp, bad_wqe_cnt-1); + EDEB(6, "doorbell for %x 
wqes", bad_wqe_cnt); + } + + if (statetrans == IB_QPST_RESET2INIT || + statetrans == IB_QPST_INIT2INIT) { + mqpcb->qp_enable = 1; + mqpcb->qp_state = EHCA_QPS_INIT; + update_mask = 0; + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_ENABLE, 1); + + EDEB(7, "ehca_qp=%p qp_num=%x " + "RESET_2_INIT needs an additional enable " + "-> update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + + h_ret = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, + my_qp->galpas.kernel); + + if (h_ret != H_SUCCESS) { + ret = ehca2ib_return_code(h_ret); + EDEB_ERR(4, "ENABLE in context of " + "RESET_2_INIT failed! " + "Maybe you didn't get a LID" + "h_ret=%lx ehca_qp=%p qp_num=%x", + h_ret, my_qp, ibqp->qp_num); + goto modify_qp_exit2; + } + } + + if (statetrans == IB_QPST_ANY2RESET) { + ipz_qeit_reset(&my_qp->ipz_rqueue); + ipz_qeit_reset(&my_qp->ipz_squeue); + } + + if (attr_mask & IB_QP_QKEY) + my_qp->qkey = attr->qkey; + +modify_qp_exit2: + if (squeue_locked) { /* this means: sqe -> rts */ + spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags); + my_qp->sqerr_purgeflag = 1; + } + +modify_qp_exit1: + kfree(mqpcb); + +modify_qp_exit0: + EDEB_EX(7, "ehca_qp=%p qp_num=%x ibqp_type=%x ret=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, ret); + return ret; +} + +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + int ret = 0; + struct ehca_qp *my_qp = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(attr); + EHCA_CHECK_ADR(ibqp->device); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ibqp_type=%x attr_mask=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, attr_mask); + + my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + ret = -EINVAL; + } else + ret = internal_modify_qp(ibqp, attr, attr_mask, 0); + + EDEB_EX(7, "ehca_qp=%p qp_num=%x ibqp_type=%x ret=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, ret); + return ret; +} + +int ehca_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + struct hcp_modify_qp_control_block *qpcb = NULL; + struct ipz_adapter_handle adapter_handle; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + int cnt = 0, ret = 0; + u64 h_ret = H_SUCCESS; + + EHCA_CHECK_ADR(qp); + EHCA_CHECK_ADR(qp_attr); + EHCA_CHECK_DEVICE(qp->device); + + my_qp = container_of(qp, struct ehca_qp, ib_qp); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x " + "qp_attr=%p qp_attr_mask=%x qp_init_attr=%p", + my_qp, qp->qp_num, qp_attr, qp_attr_mask, qp_init_attr); + + my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + ret = -EINVAL; + goto query_qp_exit0; + } + + shca = container_of(qp->device, struct ehca_shca, ib_device); + adapter_handle = shca->ipz_hca_handle; + + if (qp_attr_mask & QP_ATTR_QUERY_NOT_SUPPORTED) { + ret = -EINVAL; + EDEB_ERR(4,"Invalid attribute mask " + "ehca_qp=%p qp_num=%x qp_attr_mask=%x ", + my_qp, qp->qp_num, qp_attr_mask); + goto query_qp_exit0; + } + + qpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL ); + if (!qpcb) { + ret = -ENOMEM; + 
EDEB_ERR(4,"Out of memory for qpcb " + "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); + goto query_qp_exit0; + } + + h_ret = hipz_h_query_qp(adapter_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + qpcb, my_qp->galpas.kernel); + + if (h_ret != H_SUCCESS) { + ret = ehca2ib_return_code(h_ret); + EDEB_ERR(4,"hipz_h_query_qp() failed " + "ehca_qp=%p qp_num=%x h_ret=%lx", + my_qp, qp->qp_num, h_ret); + goto query_qp_exit1; + } + + qp_attr->cur_qp_state = ehca2ib_qp_state(qpcb->qp_state); + qp_attr->qp_state = qp_attr->cur_qp_state; + if (qp_attr->cur_qp_state == -EINVAL) { + ret = -EINVAL; + EDEB_ERR(4,"Got invalid ehca_qp_state=%x " + "ehca_qp=%p qp_num=%x", + qpcb->qp_state, my_qp, qp->qp_num); + goto query_qp_exit1; + } + + if (qp_attr->qp_state == IB_QPS_SQD) + qp_attr->sq_draining = 1; + + qp_attr->qkey = qpcb->qkey; + qp_attr->path_mtu = qpcb->path_mtu; + qp_attr->path_mig_state = qpcb->path_migration_state; + qp_attr->rq_psn = qpcb->receive_psn; + qp_attr->sq_psn = qpcb->send_psn; + qp_attr->min_rnr_timer = qpcb->min_rnr_nak_timer_field; + qp_attr->cap.max_send_wr = qpcb->max_nr_outst_send_wr-1; + qp_attr->cap.max_recv_wr = qpcb->max_nr_outst_recv_wr-1; + /* UD_AV CIRCUMVENTION */ + if (my_qp->qp_type == IB_QPT_UD) { + qp_attr->cap.max_send_sge = + qpcb->actual_nr_sges_in_sq_wqe - 2; + qp_attr->cap.max_recv_sge = + qpcb->actual_nr_sges_in_rq_wqe - 2; + } else { + qp_attr->cap.max_send_sge = + qpcb->actual_nr_sges_in_sq_wqe; + qp_attr->cap.max_recv_sge = + qpcb->actual_nr_sges_in_rq_wqe; + } + + qp_attr->cap.max_inline_data = my_qp->sq_max_inline_data_size; + qp_attr->dest_qp_num = qpcb->dest_qp_nr; + + qp_attr->pkey_index = + EHCA_BMASK_GET(MQPCB_PRIM_P_KEY_IDX, qpcb->prim_p_key_idx); + + qp_attr->port_num = + EHCA_BMASK_GET(MQPCB_PRIM_PHYS_PORT, qpcb->prim_phys_port); + + qp_attr->timeout = qpcb->timeout; + qp_attr->retry_cnt = qpcb->retry_count; + qp_attr->rnr_retry = qpcb->rnr_retry_count; + + qp_attr->alt_pkey_index = + EHCA_BMASK_GET(MQPCB_PRIM_P_KEY_IDX, qpcb->alt_p_key_idx); + + qp_attr->alt_port_num = qpcb->alt_phys_port; + qp_attr->alt_timeout = qpcb->timeout_al; + + /* primary av */ + qp_attr->ah_attr.sl = qpcb->service_level; + + if (qpcb->send_grh_flag) { + qp_attr->ah_attr.ah_flags = IB_AH_GRH; + } + + qp_attr->ah_attr.static_rate = qpcb->max_static_rate; + qp_attr->ah_attr.dlid = qpcb->dlid; + qp_attr->ah_attr.src_path_bits = qpcb->source_path_bits; + qp_attr->ah_attr.port_num = qp_attr->port_num; + + /* primary GRH */ + qp_attr->ah_attr.grh.traffic_class = qpcb->traffic_class; + qp_attr->ah_attr.grh.hop_limit = qpcb->hop_limit; + qp_attr->ah_attr.grh.sgid_index = qpcb->source_gid_idx; + qp_attr->ah_attr.grh.flow_label = qpcb->flow_label; + + for (cnt = 0; cnt < 16; cnt++) + qp_attr->ah_attr.grh.dgid.raw[cnt] = + qpcb->dest_gid.byte[cnt]; + + /* alternate AV */ + qp_attr->alt_ah_attr.sl = qpcb->service_level_al; + if (qpcb->send_grh_flag_al) { + qp_attr->alt_ah_attr.ah_flags = IB_AH_GRH; + } + + qp_attr->alt_ah_attr.static_rate = qpcb->max_static_rate_al; + qp_attr->alt_ah_attr.dlid = qpcb->dlid_al; + qp_attr->alt_ah_attr.src_path_bits = qpcb->source_path_bits_al; + + /* alternate GRH */ + qp_attr->alt_ah_attr.grh.traffic_class = qpcb->traffic_class_al; + qp_attr->alt_ah_attr.grh.hop_limit = qpcb->hop_limit_al; + qp_attr->alt_ah_attr.grh.sgid_index = qpcb->source_gid_idx_al; + qp_attr->alt_ah_attr.grh.flow_label = qpcb->flow_label_al; + + for (cnt = 0; cnt < 16; cnt++) + qp_attr->alt_ah_attr.grh.dgid.raw[cnt] = + qpcb->dest_gid_al.byte[cnt]; + + /* return init attributes given 
in ehca_create_qp */ + if (qp_init_attr) + *qp_init_attr = my_qp->init_attr; + + EDEB(7, "ehca_qp=%p qp_number=%x dest_qp_number=%x " + "dlid=%x path_mtu=%x dest_gid=%lx_%lx " + "service_level=%x qp_state=%x", + my_qp, qpcb->qp_number, qpcb->dest_qp_nr, + qpcb->dlid, qpcb->path_mtu, + qpcb->dest_gid.dw[0], qpcb->dest_gid.dw[1], + qpcb->service_level, qpcb->qp_state); + + EDEB_DMP(7, qpcb, 4*70, "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); + +query_qp_exit1: + kfree(qpcb); + +query_qp_exit0: + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x", + my_qp, qp->qp_num, ret); + return ret; +} + +int ehca_destroy_qp(struct ib_qp *ibqp) +{ + extern struct ehca_module ehca_module; + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + struct ehca_pfqp *qp_pf = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + u32 qp_num = 0; + int ret = 0; + u64 h_ret = H_SUCCESS; + u8 port_num = 0; + enum ib_qp_type qp_type; + unsigned long flags; + + EHCA_CHECK_ADR(ibqp); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + qp_num = ibqp->qp_num; + qp_pf = &my_qp->pf; + + shca = container_of(ibqp->device, struct ehca_shca, ib_device); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x", my_qp, ibqp->qp_num); + + my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + if (my_qp->send_cq) { + ret = ehca_cq_unassign_qp(my_qp->send_cq, + my_qp->real_qp_num); + if (ret) { + EDEB_ERR(4, "Couldn't unassign qp from send_cq " + "ret=%x qp_num=%x cq_num=%x", + ret, my_qp->ib_qp.qp_num, + my_qp->send_cq->cq_number); + goto destroy_qp_exit0; + } + } + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + idr_remove(&ehca_qp_idr, my_qp->token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + /* un-mmap if vma alloc */ + if (my_qp->uspace_rqueue) { + ret = ehca_munmap(my_qp->uspace_rqueue, + my_qp->ipz_rqueue.queue_length); + ret = ehca_munmap(my_qp->uspace_squeue, + my_qp->ipz_squeue.queue_length); + ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE); + } + + h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_h_destroy_qp() failed " + "rc=%lx ehca_qp=%p qp_num=%x", + h_ret, qp_pf, qp_num); + goto destroy_qp_exit0; + } + + port_num = my_qp->init_attr.port_num; + qp_type = my_qp->init_attr.qp_type; + + /* no support for IB_QPT_SMI yet */ + if (qp_type == IB_QPT_GSI) { + struct ib_event event; + + EDEB(4, "device %s: port %x is inactive.", + shca->ib_device.name, port_num); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port_num; + shca->sport[port_num - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + } + + ipz_queue_dtor(&my_qp->ipz_rqueue); + ipz_queue_dtor(&my_qp->ipz_squeue); + kmem_cache_free(ehca_module.cache_qp, my_qp); + +destroy_qp_exit0: + ret = ehca2ib_return_code(h_ret); + EDEB_EX(7,"ret=%x", ret); + return ret; +} diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c new file mode 100644 index 0000000..f1afb94 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -0,0 +1,694 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * post_send/recv, poll_cq, req_notify + * + * Authors: Waleri Fomin + * Hoang-Nam Nguyen + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. 
+ * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + + +#define DEB_PREFIX "reqs" + +#include +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" +#include "hipz_fns.h" + +static inline int ehca_write_rwqe(struct ipz_queue *ipz_rqueue, + struct ehca_wqe *wqe_p, + struct ib_recv_wr *recv_wr) +{ + u8 cnt_ds; + if (unlikely((recv_wr->num_sge < 0) || + (recv_wr->num_sge > ipz_rqueue->act_nr_of_sg))) { + EDEB_ERR(4, "Invalid number of WQE SGE. 
" + "num_sqe=%x max_nr_of_sg=%x", + recv_wr->num_sge, ipz_rqueue->act_nr_of_sg); + return -EINVAL; /* invalid SG list length */ + } + + /* clear wqe header until sglist */ + memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list)); + + wqe_p->work_request_id = recv_wr->wr_id; + wqe_p->nr_of_data_seg = recv_wr->num_sge; + + for (cnt_ds = 0; cnt_ds < recv_wr->num_sge; cnt_ds++) { + wqe_p->u.all_rcv.sg_list[cnt_ds].vaddr = + recv_wr->sg_list[cnt_ds].addr; + wqe_p->u.all_rcv.sg_list[cnt_ds].lkey = + recv_wr->sg_list[cnt_ds].lkey; + wqe_p->u.all_rcv.sg_list[cnt_ds].length = + recv_wr->sg_list[cnt_ds].length; + } + + if (IS_EDEB_ON(7)) { + EDEB(7, "RECEIVE WQE written into ipz_rqueue=%p", ipz_rqueue); + EDEB_DMP(7, wqe_p, 16*(6 + wqe_p->nr_of_data_seg), "recv wqe"); + } + + return 0; +} + +#if defined(DEBUG_GSI_SEND_WR) + +/* need ib_mad struct */ +#include + +static void trace_send_wr_ud(const struct ib_send_wr *send_wr) +{ + int idx = 0; + int j = 0; + while (send_wr) { + struct ib_mad_hdr *mad_hdr = send_wr->wr.ud.mad_hdr; + struct ib_sge *sge = send_wr->sg_list; + EDEB(4, "send_wr#%x wr_id=%lx num_sge=%x " + "send_flags=%x opcode=%x",idx, send_wr->wr_id, + send_wr->num_sge, send_wr->send_flags, send_wr->opcode); + if (mad_hdr) { + EDEB(4, "send_wr#%x mad_hdr base_version=%x " + "mgmt_class=%x class_version=%x method=%x " + "status=%x class_specific=%x tid=%lx attr_id=%x " + "resv=%x attr_mod=%x", + idx, mad_hdr->base_version, mad_hdr->mgmt_class, + mad_hdr->class_version, mad_hdr->method, + mad_hdr->status, mad_hdr->class_specific, + mad_hdr->tid, mad_hdr->attr_id, mad_hdr->resv, + mad_hdr->attr_mod); + } + for (j = 0; j < send_wr->num_sge; j++) { + u8 *data = (u8 *) abs_to_virt(sge->addr); + EDEB(4, "send_wr#%x sge#%x addr=%p length=%x lkey=%x", + idx, j, data, sge->length, sge->lkey); + /* assume length is n*16 */ + EDEB_DMP(4, data, sge->length, "send_wr#%x sge#%x", + idx, j); + sge++; + } /* eof for j */ + idx++; + send_wr = send_wr->next; + } /* eof while send_wr */ +} + +#endif /* DEBUG_GSI_SEND_WR */ + +static inline int ehca_write_swqe(struct ehca_qp *qp, + struct ehca_wqe *wqe_p, + const struct ib_send_wr *send_wr) +{ + u32 idx; + u64 dma_length; + struct ehca_av *my_av; + u32 remote_qkey = send_wr->wr.ud.remote_qkey; + + if (unlikely((send_wr->num_sge < 0) || + (send_wr->num_sge > qp->ipz_squeue.act_nr_of_sg))) { + EDEB_ERR(4, "Invalid number of WQE SGE. 
" + "num_sqe=%x max_nr_of_sg=%x", + send_wr->num_sge, qp->ipz_squeue.act_nr_of_sg); + return -EINVAL; /* invalid SG list length */ + } + + /* clear wqe header until sglist */ + memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list)); + + wqe_p->work_request_id = send_wr->wr_id; + + switch (send_wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + wqe_p->optype = WQE_OPTYPE_SEND; + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + wqe_p->optype = WQE_OPTYPE_RDMAWRITE; + break; + case IB_WR_RDMA_READ: + wqe_p->optype = WQE_OPTYPE_RDMAREAD; + break; + default: + EDEB_ERR(4, "Invalid opcode=%x", send_wr->opcode); + return -EINVAL; /* invalid opcode */ + } + + wqe_p->wqef = (send_wr->opcode) & WQEF_HIGH_NIBBLE; + + wqe_p->wr_flag = 0; + + if (send_wr->send_flags & IB_SEND_SIGNALED) + wqe_p->wr_flag |= WQE_WRFLAG_REQ_SIGNAL_COM; + + if (send_wr->opcode == IB_WR_SEND_WITH_IMM || + send_wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) { + /* this might not work as long as HW does not support it */ + wqe_p->immediate_data = be32_to_cpu(send_wr->imm_data); + wqe_p->wr_flag |= WQE_WRFLAG_IMM_DATA_PRESENT; + } + + wqe_p->nr_of_data_seg = send_wr->num_sge; + + switch (qp->qp_type) { + case IB_QPT_SMI: + case IB_QPT_GSI: + /* no break is intential here */ + case IB_QPT_UD: + /* IB 1.2 spec C10-15 compliance */ + if (send_wr->wr.ud.remote_qkey & 0x80000000) + remote_qkey = qp->qkey; + + wqe_p->destination_qp_number = send_wr->wr.ud.remote_qpn << 8; + wqe_p->local_ee_context_qkey = remote_qkey; + if (!send_wr->wr.ud.ah) { + EDEB_ERR(4, "wr.ud.ah is NULL. qp=%p", qp); + return -EINVAL; + } + my_av = container_of(send_wr->wr.ud.ah, struct ehca_av, ib_ah); + wqe_p->u.ud_av.ud_av = my_av->av; + + /* + * omitted check of IB_SEND_INLINE + * since HW does not support it + */ + for (idx = 0; idx < send_wr->num_sge; idx++) { + wqe_p->u.ud_av.sg_list[idx].vaddr = + send_wr->sg_list[idx].addr; + wqe_p->u.ud_av.sg_list[idx].lkey = + send_wr->sg_list[idx].lkey; + wqe_p->u.ud_av.sg_list[idx].length = + send_wr->sg_list[idx].length; + } /* eof for idx */ + if (qp->qp_type == IB_QPT_SMI || + qp->qp_type == IB_QPT_GSI) + wqe_p->u.ud_av.ud_av.pmtu = 1; + if (qp->qp_type == IB_QPT_GSI) { + wqe_p->pkeyi = send_wr->wr.ud.pkey_index; +#ifdef DEBUG_GSI_SEND_WR + trace_send_wr_ud(send_wr); +#endif /* DEBUG_GSI_SEND_WR */ + } + break; + + case IB_QPT_UC: + if (send_wr->send_flags & IB_SEND_FENCE) + wqe_p->wr_flag |= WQE_WRFLAG_FENCE; + /* no break is intentional here */ + case IB_QPT_RC: + /* TODO: atomic not implemented */ + wqe_p->u.nud.remote_virtual_adress = + send_wr->wr.rdma.remote_addr; + wqe_p->u.nud.rkey = send_wr->wr.rdma.rkey; + + /* + * omitted checking of IB_SEND_INLINE + * since HW does not support it + */ + dma_length = 0; + for (idx = 0; idx < send_wr->num_sge; idx++) { + wqe_p->u.nud.sg_list[idx].vaddr = + send_wr->sg_list[idx].addr; + wqe_p->u.nud.sg_list[idx].lkey = + send_wr->sg_list[idx].lkey; + wqe_p->u.nud.sg_list[idx].length = + send_wr->sg_list[idx].length; + dma_length += send_wr->sg_list[idx].length; + } /* eof idx */ + wqe_p->u.nud.atomic_1st_op_dma_len = dma_length; + + break; + + default: + EDEB_ERR(4, "Invalid qptype=%x", qp->qp_type); + return -EINVAL; + } + + if (IS_EDEB_ON(7)) { + EDEB(7, "SEND WQE written into queue qp=%p ", qp); + EDEB_DMP(7, wqe_p, 16*(6 + wqe_p->nr_of_data_seg), "send wqe"); + } + return 0; +} + +/* map_ib_wc_status converts raw cqe_status to ib_wc_status */ +static inline void map_ib_wc_status(u32 cqe_status, + enum ib_wc_status *wc_status) +{ + if 
(unlikely(cqe_status & WC_STATUS_ERROR_BIT)) { + switch (cqe_status & 0x3F) { + case 0x01: + case 0x21: + *wc_status = IB_WC_LOC_LEN_ERR; + break; + case 0x02: + case 0x22: + *wc_status = IB_WC_LOC_QP_OP_ERR; + break; + case 0x03: + case 0x23: + *wc_status = IB_WC_LOC_EEC_OP_ERR; + break; + case 0x04: + case 0x24: + *wc_status = IB_WC_LOC_PROT_ERR; + break; + case 0x05: + case 0x25: + *wc_status = IB_WC_WR_FLUSH_ERR; + break; + case 0x06: + *wc_status = IB_WC_MW_BIND_ERR; + break; + case 0x07: /* remote error - look into bits 20:24 */ + switch ((cqe_status + & WC_STATUS_REMOTE_ERROR_FLAGS) >> 11) { + case 0x0: + /* + * PSN Sequence Error! + * couldn't find a matching status! + */ + *wc_status = IB_WC_GENERAL_ERR; + break; + case 0x1: + *wc_status = IB_WC_REM_INV_REQ_ERR; + break; + case 0x2: + *wc_status = IB_WC_REM_ACCESS_ERR; + break; + case 0x3: + *wc_status = IB_WC_REM_OP_ERR; + break; + case 0x4: + *wc_status = IB_WC_REM_INV_RD_REQ_ERR; + break; + } + break; + case 0x08: + *wc_status = IB_WC_RETRY_EXC_ERR; + break; + case 0x09: + *wc_status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case 0x0A: + case 0x2D: + *wc_status = IB_WC_REM_ABORT_ERR; + break; + case 0x0B: + case 0x2E: + *wc_status = IB_WC_INV_EECN_ERR; + break; + case 0x0C: + case 0x2F: + *wc_status = IB_WC_INV_EEC_STATE_ERR; + break; + case 0x0D: + *wc_status = IB_WC_BAD_RESP_ERR; + break; + case 0x10: + /* WQE purged */ + *wc_status = IB_WC_WR_FLUSH_ERR; + break; + default: + *wc_status = IB_WC_FATAL_ERR; + + } + } else + *wc_status = IB_WC_SUCCESS; +} + +int ehca_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + struct ehca_qp *my_qp = NULL; + struct ib_send_wr *cur_send_wr = NULL; + struct ehca_wqe *wqe_p = NULL; + int wqe_cnt = 0; + int ret = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_ADR(qp); + my_qp = container_of(qp, struct ehca_qp, ib_qp); + EHCA_CHECK_QP(my_qp); + EHCA_CHECK_ADR(send_wr); + EDEB_EN(7, "ehca_qp=%p qp_num=%x send_wr=%p bad_send_wr=%p", + my_qp, qp->qp_num, send_wr, bad_send_wr); + + /* LOCK the QUEUE */ + spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); + + /* loop processes list of send reqs */ + for (cur_send_wr = send_wr; cur_send_wr != NULL; + cur_send_wr = cur_send_wr->next) { + u64 start_offset = my_qp->ipz_squeue.current_q_offset; + /* get pointer next to free WQE */ + wqe_p = ipz_qeit_get_inc(&my_qp->ipz_squeue); + if (unlikely(!wqe_p)) { + /* too many posted work requests: queue overflow */ + if (bad_send_wr) + *bad_send_wr = cur_send_wr; + if (wqe_cnt == 0) { + ret = -ENOMEM; + EDEB_ERR(4, "Too many posted WQEs qp_num=%x", + qp->qp_num); + } + goto post_send_exit0; + } + /* write a SEND WQE into the QUEUE */ + ret = ehca_write_swqe(my_qp, wqe_p, cur_send_wr); + /* + * if something failed, + * reset the free entry pointer to the start value + */ + if (unlikely(ret)) { + my_qp->ipz_squeue.current_q_offset = start_offset; + *bad_send_wr = cur_send_wr; + if (wqe_cnt == 0) { + ret = -EINVAL; + EDEB_ERR(4, "Could not write WQE qp_num=%x", + qp->qp_num); + } + goto post_send_exit0; + } + wqe_cnt++; + EDEB(7, "ehca_qp=%p qp_num=%x wqe_cnt=%d", + my_qp, qp->qp_num, wqe_cnt); + } /* eof for cur_send_wr */ + +post_send_exit0: + /* UNLOCK the QUEUE */ + spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags); + iosync(); /* serialize GAL register access */ + hipz_update_sqa(my_qp, wqe_cnt); + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x wqe_cnt=%d", + my_qp, qp->qp_num, ret, wqe_cnt); + return ret; +} + +int ehca_post_recv(struct ib_qp *qp, + struct ib_recv_wr 
*recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + struct ehca_qp *my_qp = NULL; + struct ib_recv_wr *cur_recv_wr = NULL; + struct ehca_wqe *wqe_p = NULL; + int wqe_cnt = 0; + int ret = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_ADR(qp); + my_qp = container_of(qp, struct ehca_qp, ib_qp); + EHCA_CHECK_QP(my_qp); + EHCA_CHECK_ADR(recv_wr); + EDEB_EN(7, "ehca_qp=%p qp_num=%x recv_wr=%p bad_recv_wr=%p", + my_qp, qp->qp_num, recv_wr, bad_recv_wr); + + /* LOCK the QUEUE */ + spin_lock_irqsave(&my_qp->spinlock_r, spl_flags); + + /* loop processes list of send reqs */ + for (cur_recv_wr = recv_wr; cur_recv_wr != NULL; + cur_recv_wr = cur_recv_wr->next) { + u64 start_offset = my_qp->ipz_rqueue.current_q_offset; + /* get pointer next to free WQE */ + wqe_p = ipz_qeit_get_inc(&my_qp->ipz_rqueue); + if (unlikely(!wqe_p)) { + /* too many posted work requests: queue overflow */ + if (bad_recv_wr) + *bad_recv_wr = cur_recv_wr; + if (wqe_cnt == 0) { + ret = -ENOMEM; + EDEB_ERR(4, "Too many posted WQEs qp_num=%x", + qp->qp_num); + } + goto post_recv_exit0; + } + /* write a RECV WQE into the QUEUE */ + ret = ehca_write_rwqe(&my_qp->ipz_rqueue, wqe_p, + cur_recv_wr); + /* + * if something failed, + * reset the free entry pointer to the start value + */ + if (unlikely(ret)) { + my_qp->ipz_rqueue.current_q_offset = start_offset; + *bad_recv_wr = cur_recv_wr; + if (wqe_cnt == 0) { + ret = -EINVAL; + EDEB_ERR(4, "Could not write WQE qp_num=%x", + qp->qp_num); + } + goto post_recv_exit0; + } + wqe_cnt++; + EDEB(7, "ehca_qp=%p qp_num=%x wqe_cnt=%d", + my_qp, qp->qp_num, wqe_cnt); + } /* eof for cur_recv_wr */ + +post_recv_exit0: + spin_unlock_irqrestore(&my_qp->spinlock_r, spl_flags); + iosync(); /* serialize GAL register access */ + hipz_update_rqa(my_qp, wqe_cnt); + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x wqe_cnt=%d", + my_qp, qp->qp_num, ret, wqe_cnt); + return ret; +} + +/* + * ib_wc_opcode table converts ehca wc opcode to ib + * Since we use zero to indicate invalid opcode, the actual ib opcode must + * be decremented!!! 
+ */ +static const u8 ib_wc_opcode[255] = { + [0x01] = IB_WC_RECV+1, + [0x02] = IB_WC_RECV_RDMA_WITH_IMM+1, + [0x04] = IB_WC_BIND_MW+1, + [0x08] = IB_WC_FETCH_ADD+1, + [0x10] = IB_WC_COMP_SWAP+1, + [0x20] = IB_WC_RDMA_WRITE+1, + [0x40] = IB_WC_RDMA_READ+1, + [0x80] = IB_WC_SEND+1 +}; + +/* internal function to poll one entry of cq */ +static inline int ehca_poll_cq_one(struct ib_cq *cq, struct ib_wc *wc) +{ + int ret = 0; + struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); + struct ehca_cqe *cqe = NULL; + int cqe_count = 0; + + EDEB_EN(7, "ehca_cq=%p cq_num=%x wc=%p", my_cq, my_cq->cq_number, wc); + +poll_cq_one_read_cqe: + cqe = (struct ehca_cqe *) + ipz_qeit_get_inc_valid(&my_cq->ipz_queue); + if (!cqe) { + ret = -EAGAIN; + EDEB(7, "Completion queue is empty ehca_cq=%p cq_num=%x " + "ret=%x", my_cq, my_cq->cq_number, ret); + goto poll_cq_one_exit0; + } + + /* prevents loads being reordered across this point */ + rmb(); + + cqe_count++; + if (unlikely(cqe->status & WC_STATUS_PURGE_BIT)) { + struct ehca_qp *qp=ehca_cq_get_qp(my_cq, cqe->local_qp_number); + int purgeflag = 0; + unsigned long spl_flags = 0; + if (!qp) { + EDEB_ERR(4, "cq_num=%x qp_num=%x " + "could not find qp -> ignore cqe", + my_cq->cq_number, cqe->local_qp_number); + EDEB_DMP(4, cqe, 64, "cq_num=%x qp_num=%x", + my_cq->cq_number, cqe->local_qp_number); + /* ignore this purged cqe */ + goto poll_cq_one_read_cqe; + } + spin_lock_irqsave(&qp->spinlock_s, spl_flags); + purgeflag = qp->sqerr_purgeflag; + spin_unlock_irqrestore(&qp->spinlock_s, spl_flags); + + if (purgeflag) { + EDEB(6, "Got CQE with purged bit qp_num=%x src_qp=%x", + cqe->local_qp_number, cqe->remote_qp_number); + EDEB_DMP(6, cqe, 64, "qp_num=%x src_qp=%x", + cqe->local_qp_number, cqe->remote_qp_number); + /* + * ignore this to avoid double cqes of bad wqe + * that caused sqe and turn off purge flag + */ + qp->sqerr_purgeflag = 0; + goto poll_cq_one_read_cqe; + } + } + + /* tracing cqe */ + if (IS_EDEB_ON(7)) { + EDEB(7, "Received COMPLETION ehca_cq=%p cq_num=%x -----", + my_cq, my_cq->cq_number); + EDEB_DMP(7, cqe, 64, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + EDEB(7, "ehca_cq=%p cq_num=%x -------------------------", + my_cq, my_cq->cq_number); + } + + /* we got a completion! */ + wc->wr_id = cqe->work_request_id; + + /* eval ib_wc_opcode */ + wc->opcode = ib_wc_opcode[cqe->optype]-1; + if (unlikely(wc->opcode == -1)) { + EDEB_ERR(4, "Invalid cqe->OPType=%x cqe->status=%x " + "ehca_cq=%p cq_num=%x", + cqe->optype, cqe->status, my_cq, my_cq->cq_number); + /* dump cqe for other infos */ + EDEB_DMP(4, cqe, 64, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + /* update also queue adder to throw away this entry!!! 
*/ + goto poll_cq_one_exit0; + } + /* eval ib_wc_status */ + if (unlikely(cqe->status & WC_STATUS_ERROR_BIT)) { + /* complete with errors */ + map_ib_wc_status(cqe->status, &wc->status); + wc->vendor_err = wc->status; + } else + wc->status = IB_WC_SUCCESS; + + wc->qp_num = cqe->local_qp_number; + wc->byte_len = cqe->nr_bytes_transferred; + wc->pkey_index = cqe->pkey_index; + wc->slid = cqe->rlid; + wc->dlid_path_bits = cqe->dlid; + wc->src_qp = cqe->remote_qp_number; + wc->wc_flags = cqe->w_completion_flags; + wc->imm_data = cpu_to_be32(cqe->immediate_data); + wc->sl = cqe->service_level; + + if (wc->status != IB_WC_SUCCESS) + EDEB(6, "ehca_cq=%p cq_num=%x WARNING unsuccessful cqe " + "OPType=%x status=%x qp_num=%x src_qp=%x wr_id=%lx cqe=%p", + my_cq, my_cq->cq_number, cqe->optype, cqe->status, + cqe->local_qp_number, cqe->remote_qp_number, + cqe->work_request_id, cqe); + +poll_cq_one_exit0: + if (cqe_count > 0) + hipz_update_feca(my_cq, cqe_count); + + EDEB_EX(7, "ret=%x ehca_cq=%p cq_number=%x wc=%p " + "status=%x opcode=%x qp_num=%x byte_len=%x", + ret, my_cq, my_cq->cq_number, wc, wc->status, + wc->opcode, wc->qp_num, wc->byte_len); + + return ret; +} + +int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc) +{ + struct ehca_cq *my_cq = NULL; + int nr = 0; + struct ib_wc *current_wc = NULL; + int ret = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_CQ(cq); + EHCA_CHECK_ADR(wc); + + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EHCA_CHECK_CQ(my_cq); + + EDEB_EN(7, "ehca_cq=%p cq_num=%x num_entries=%d wc=%p", + my_cq, my_cq->cq_number, num_entries, wc); + + if (num_entries < 1) { + EDEB_ERR(4, "Invalid num_entries=%d ehca_cq=%p cq_num=%x", + num_entries, my_cq, my_cq->cq_number); + ret = -EINVAL; + goto poll_cq_exit0; + } + + current_wc = wc; + spin_lock_irqsave(&my_cq->spinlock, spl_flags); + for (nr = 0; nr < num_entries; nr++) { + ret = ehca_poll_cq_one(cq, current_wc); + if (ret) + break; + current_wc++; + } /* eof for nr */ + spin_unlock_irqrestore(&my_cq->spinlock, spl_flags); + if (ret == -EAGAIN || !ret) + ret = nr; + +poll_cq_exit0: + EDEB_EX(7, "ehca_cq=%p cq_num=%x ret=%x wc=%p nr_entries=%d", + my_cq, my_cq->cq_number, ret, wc, nr); + + return ret; +} + +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +{ + struct ehca_cq *my_cq = NULL; + int ret = 0; + + EHCA_CHECK_CQ(cq); + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EHCA_CHECK_CQ(my_cq); + EDEB_EN(7, "ehca_cq=%p cq_num=%x cq_notif=%x", + my_cq, my_cq->cq_number, cq_notify); + + switch (cq_notify) { + case IB_CQ_SOLICITED: + hipz_set_cqx_n0(my_cq, 1); + break; + case IB_CQ_NEXT_COMP: + hipz_set_cqx_n1(my_cq, 1); + break; + default: + return -EINVAL; + } + + EDEB_EX(7, "ehca_cq=%p cq_num=%x ret=%x", + my_cq, my_cq->cq_number, ret); + + return ret; +} diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c b/drivers/infiniband/hw/ehca/ehca_sqp.c new file mode 100644 index 0000000..d2c5552 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c @@ -0,0 +1,123 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * SQP functions + * + * Authors: Khadija Souissi + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + + +#define DEB_PREFIX "e_qp" + +#include +#include +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" + + +extern int ehca_create_aqp1(struct ehca_shca *shca, struct ehca_sport *sport); +extern int ehca_destroy_aqp1(struct ehca_sport *sport); + +extern int ehca_port_act_time; + +/** + * ehca_define_sqp - Defines special queue pair 1 (GSI QP). When special queue + * pair is created successfully, the corresponding port gets active. + * + * Define Special Queue pair 0 (SMI QP) is still not supported. + * + * @qp_init_attr: Queue pair init attributes with port and queue pair type + */ + +u64 ehca_define_sqp(struct ehca_shca *shca, + struct ehca_qp *ehca_qp, + struct ib_qp_init_attr *qp_init_attr) +{ + + u32 pma_qp_nr = 0; + u32 bma_qp_nr = 0; + u64 ret = H_SUCCESS; + u8 port = qp_init_attr->port_num; + int counter = 0; + + EDEB_EN(7, "port=%x qp_type=%x", + port, qp_init_attr->qp_type); + + shca->sport[port - 1].port_state = IB_PORT_DOWN; + + switch (qp_init_attr->qp_type) { + case IB_QPT_SMI: + /* function not supported yet */ + break; + case IB_QPT_GSI: + ret = hipz_h_define_aqp1(shca->ipz_hca_handle, + ehca_qp->ipz_qp_handle, + ehca_qp->galpas.kernel, + (u32) qp_init_attr->port_num, + &pma_qp_nr, &bma_qp_nr); + + if (ret != H_SUCCESS) { + EDEB_ERR(4, "Can't define AQP1 for port %x. rc=%lx", + port, ret); + goto ehca_define_aqp1; + } + break; + default: + ret = H_PARAMETER; + goto ehca_define_aqp1; + } + + while ((shca->sport[port - 1].port_state != IB_PORT_ACTIVE) && + (counter < ehca_port_act_time)) { + EDEB(6, "... 
wait until port %x is active",
+ port);
+ msleep_interruptible(1000);
+ counter++;
+ }
+
+ if (counter == ehca_port_act_time) {
+ EDEB_ERR(4, "Port %x is not active.", port);
+ ret = H_HARDWARE;
+ }
+
+ehca_define_aqp1:
+ EDEB_EX(7, "ret=%lx", ret);
+
+ return ret;
+}
-- 
1.4.1

From rolandd at cisco.com Thu Aug 17 13:11:00 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 17 Aug 2006 13:11:00 -0700
Subject: [openib-general] [PATCH 01/13] IB/ehca: hca
In-Reply-To: <20068171311.QJ2lcO2NjghtFOX6@cisco.com>
Message-ID: <20068171311.qHSUlh5t6lpV4BeW@cisco.com>

 drivers/infiniband/hw/ehca/ehca_hca.c   |  282 +++++++++++++++++++++++++++++++
 drivers/infiniband/hw/ehca/ehca_mcast.c |  200 ++++++++++++++++++++++
 2 files changed, 482 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
new file mode 100644
index 0000000..7a871b2
--- /dev/null
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -0,0 +1,282 @@
+/*
+ * IBM eServer eHCA Infiniband device driver for Linux on POWER
+ *
+ * HCA query functions
+ *
+ * Authors: Heiko J Schick
+ * Christoph Raisch
+ *
+ * Copyright (c) 2005 IBM Corporation
+ *
+ * All rights reserved.
+ *
+ * This source code is distributed under a dual license of GPL v2.0 and OpenIB
+ * BSD.
+ *
+ * OpenIB BSD License
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials
+ * provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#undef DEB_PREFIX +#define DEB_PREFIX "shca" + +#include "ehca_tools.h" + +#include "hcp_if.h" + +int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) +{ + int ret = 0; + struct ehca_shca *shca; + struct hipz_query_hca *rblock; + + EDEB_EN(7, ""); + + memset(props, 0, sizeof(struct ib_device_attr)); + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_device0; + } + + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query device properties"); + ret = -EINVAL; + goto query_device1; + } + props->fw_ver = rblock->hw_ver; + props->max_mr_size = rblock->max_mr_size; + props->vendor_id = rblock->vendor_id >> 8; + props->vendor_part_id = rblock->vendor_part_id >> 16; + props->hw_ver = rblock->hw_ver; + props->max_qp = min_t(int, rblock->max_qp, INT_MAX); + props->max_qp_wr = min_t(int, rblock->max_wqes_wq, INT_MAX); + props->max_sge = min_t(int, rblock->max_sge, INT_MAX); + props->max_sge_rd = min_t(int, rblock->max_sge_rd, INT_MAX); + props->max_cq = min_t(int, rblock->max_cq, INT_MAX); + props->max_cqe = min_t(int, rblock->max_cqe, INT_MAX); + props->max_mr = min_t(int, rblock->max_mr, INT_MAX); + props->max_mw = min_t(int, rblock->max_mw, INT_MAX); + props->max_pd = min_t(int, rblock->max_pd, INT_MAX); + props->max_ah = min_t(int, rblock->max_ah, INT_MAX); + props->max_fmr = min_t(int, rblock->max_mr, INT_MAX); + props->max_srq = 0; + props->max_srq_wr = 0; + props->max_srq_sge = 0; + props->max_pkeys = 16; + props->local_ca_ack_delay + = rblock->local_ca_ack_delay; + props->max_raw_ipv6_qp + = min_t(int, rblock->max_raw_ipv6_qp, INT_MAX); + props->max_raw_ethy_qp + = min_t(int, rblock->max_raw_ethy_qp, INT_MAX); + props->max_mcast_grp + = min_t(int, rblock->max_mcast_grp, INT_MAX); + props->max_mcast_qp_attach + = min_t(int, rblock->max_mcast_qp_attach, INT_MAX); + props->max_total_mcast_qp_attach + = min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX); + +query_device1: + kfree(rblock); + +query_device0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + int ret = 0; + struct ehca_shca *shca; + struct hipz_query_port *rblock; + + EDEB_EN(7, "port=%x", port); + + memset(props, 0, sizeof(struct ib_port_attr)); + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_port0; + } + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_port1; + } + + props->state = rblock->state; + + switch (rblock->max_mtu) { + case 0x1: + props->active_mtu = props->max_mtu = IB_MTU_256; + break; + case 0x2: + props->active_mtu = props->max_mtu = IB_MTU_512; + break; + case 0x3: + props->active_mtu = props->max_mtu = IB_MTU_1024; + break; + case 0x4: + props->active_mtu = props->max_mtu = IB_MTU_2048; + break; + case 0x5: + props->active_mtu = props->max_mtu = IB_MTU_4096; + break; + default: + EDEB_ERR(4, "Unknown MTU size: %x.", rblock->max_mtu); + } + + props->gid_tbl_len = rblock->gid_tbl_len; + props->max_msg_sz = rblock->max_msg_sz; + props->bad_pkey_cntr = rblock->bad_pkey_cntr; + props->qkey_viol_cntr = rblock->qkey_viol_cntr; + props->pkey_tbl_len = 
rblock->pkey_tbl_len; + props->lid = rblock->lid; + props->sm_lid = rblock->sm_lid; + props->lmc = rblock->lmc; + props->sm_sl = rblock->sm_sl; + props->subnet_timeout = rblock->subnet_timeout; + props->init_type_reply = rblock->init_type_reply; + + props->active_width = IB_WIDTH_12X; + props->active_speed = 0x1; + +query_port1: + kfree(rblock); + +query_port0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) +{ + int ret = 0; + struct ehca_shca *shca; + struct hipz_query_port *rblock; + + EDEB_EN(7, "port=%x index=%x", port, index); + + if (index > 16) { + EDEB_ERR(4, "Invalid index: %x.", index); + ret = -EINVAL; + goto query_pkey0; + } + + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_pkey0; + } + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_pkey1; + } + + memcpy(pkey, &rblock->pkey_entries + index, sizeof(u16)); + +query_pkey1: + kfree(rblock); + +query_pkey0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + int ret = 0; + struct ehca_shca *shca; + struct hipz_query_port *rblock; + + EDEB_EN(7, "port=%x index=%x", port, index); + + if (index > 255) { + EDEB_ERR(4, "Invalid index: %x.", index); + ret = -EINVAL; + goto query_gid0; + } + + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_gid0; + } + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_gid1; + } + + memcpy(&gid->raw[0], &rblock->gid_prefix, sizeof(u64)); + memcpy(&gid->raw[8], &rblock->guid_entries[index], sizeof(u64)); + +query_gid1: + kfree(rblock); + +query_gid0: + EDEB_EX(7, "ret=%x GID=%lx%lx", ret, + *(u64 *) & gid->raw[0], + *(u64 *) & gid->raw[8]); + + return ret; +} + +int ehca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + int ret = 0; + + EDEB_EN(7, "port=%x", port); + + /* Not implemented yet. */ + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} diff --git a/drivers/infiniband/hw/ehca/ehca_mcast.c b/drivers/infiniband/hw/ehca/ehca_mcast.c new file mode 100644 index 0000000..5c5b024 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_mcast.c @@ -0,0 +1,200 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * mcast functions + * + * Authors: Khadija Souissi + * Waleri Fomin + * Reinhard Ernst + * Hoang-Nam Nguyen + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#define DEB_PREFIX "mcas" + +#include +#include +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" + +#include "hcp_if.h" + +#define MAX_MC_LID 0xFFFE +#define MIN_MC_LID 0xC000 /* Multicast limits */ +#define EHCA_VALID_MULTICAST_GID(gid) ((gid)[0] == 0xFF) +#define EHCA_VALID_MULTICAST_LID(lid) (((lid) >= MIN_MC_LID) && ((lid) <= MAX_MC_LID)) + +int ehca_attach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + union ib_gid my_gid; + u64 subnet_prefix; + u64 interface_id; + u64 h_ret = H_SUCCESS; + int ret = 0; + + EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(gid); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EHCA_CHECK_QP(my_qp); + if (ibqp->qp_type != IB_QPT_UD) { + EDEB_ERR(4, "invalid qp_type %x gid, ret=%x", + ibqp->qp_type, EINVAL); + return -EINVAL; + } + + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + EHCA_CHECK_ADR(shca); + + if (!(EHCA_VALID_MULTICAST_GID(gid->raw))) { + EDEB_ERR(4, "gid is not valid mulitcast gid ret=%x", + EINVAL); + return -EINVAL; + } else if ((lid < MIN_MC_LID) || (lid > MAX_MC_LID)) { + EDEB_ERR(4, "lid=%x is not valid mulitcast lid ret=%x", + lid, EINVAL); + return -EINVAL; + } + + memcpy(&my_gid.raw, gid->raw, sizeof(union ib_gid)); + + subnet_prefix = be64_to_cpu(my_gid.global.subnet_prefix); + interface_id = be64_to_cpu(my_gid.global.interface_id); + h_ret = hipz_h_attach_mcqp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + my_qp->galpas.kernel, + lid, subnet_prefix, interface_id); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, + "ehca_qp=%p qp_num=%x hipz_h_attach_mcqp() failed " + "h_ret=%lx", my_qp, ibqp->qp_num, h_ret); + } + ret = ehca2ib_return_code(h_ret); + + EDEB_EX(7, "mcast attach ret=%x\n" + "ehca_qp=%p qp_num=%x lid=%x\n" + "my_gid= %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n", + ret, my_qp, ibqp->qp_num, lid, + my_gid.raw[0], my_gid.raw[1], + my_gid.raw[2], my_gid.raw[3], + my_gid.raw[4], my_gid.raw[5], + my_gid.raw[6], my_gid.raw[7], + my_gid.raw[8], my_gid.raw[9], + my_gid.raw[10], my_gid.raw[11], + my_gid.raw[12], my_gid.raw[13], + my_gid.raw[14], my_gid.raw[15]); + + return ret; +} + +int ehca_detach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + union ib_gid my_gid; + u64 subnet_prefix; + u64 interface_id; + u64 h_ret = H_SUCCESS; + int ret = 0; + + 
EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(gid); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EHCA_CHECK_QP(my_qp); + if (ibqp->qp_type != IB_QPT_UD) { + EDEB_ERR(4, "invalid qp_type %x gid, ret=%x", + ibqp->qp_type, EINVAL); + return -EINVAL; + } + + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + EHCA_CHECK_ADR(shca); + + if (!(EHCA_VALID_MULTICAST_GID(gid->raw))) { + EDEB_ERR(4, "gid is not valid mulitcast gid ret=%x", + EINVAL); + return -EINVAL; + } else if ((lid < MIN_MC_LID) || (lid > MAX_MC_LID)) { + EDEB_ERR(4, "lid=%x is not valid mulitcast lid ret=%x", + lid, EINVAL); + return -EINVAL; + } + + EDEB_EN(7, "dgid=%p qp_numl=%x lid=%x", + gid, ibqp->qp_num, lid); + + memcpy(&my_gid.raw, gid->raw, sizeof(union ib_gid)); + + subnet_prefix = be64_to_cpu(my_gid.global.subnet_prefix); + interface_id = be64_to_cpu(my_gid.global.interface_id); + h_ret = hipz_h_detach_mcqp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + my_qp->galpas.kernel, + lid, subnet_prefix, interface_id); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, + "ehca_qp=%p qp_num=%x hipz_h_detach_mcqp() failed " + "h_ret=%lx", my_qp, ibqp->qp_num, h_ret); + } + ret = ehca2ib_return_code(h_ret); + + EDEB_EX(7, "mcast detach ret=%x\n" + "ehca_qp=%p qp_num=%x lid=%x\n" + "my_gid= %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n", + ret, my_qp, ibqp->qp_num, lid, + my_gid.raw[0], my_gid.raw[1], + my_gid.raw[2], my_gid.raw[3], + my_gid.raw[4], my_gid.raw[5], + my_gid.raw[6], my_gid.raw[7], + my_gid.raw[8], my_gid.raw[9], + my_gid.raw[10], my_gid.raw[11], + my_gid.raw[12], my_gid.raw[13], + my_gid.raw[14], my_gid.raw[15]); + + return ret; +} -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:00 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:00 -0700 Subject: [openib-general] [PATCH 00/13] IB/ehca: uverbs In-Reply-To: <2006817139.e1epJYk9xVvFdTao@cisco.com> Message-ID: <20068171311.QJ2lcO2NjghtFOX6@cisco.com> drivers/infiniband/hw/ehca/ehca_uverbs.c | 400 ++++++++++++++++++++++++++++++ 1 files changed, 400 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c new file mode 100644 index 0000000..c148c23 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -0,0 +1,400 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * userspace support verbs + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
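Both the attach and detach paths above gate on the same validity rules: the GID must carry 0xFF in its first byte and the LID must fall in the multicast range 0xC000-0xFFFE. A tiny standalone sketch of those two checks, mirroring the EHCA_VALID_MULTICAST_GID/LID macros (illustrative only, not part of the patch):

#include <stdint.h>
#include <stdbool.h>

#define MIN_MC_LID 0xC000 /* multicast LID range, as in the macros above */
#define MAX_MC_LID 0xFFFE

/* A multicast GID always starts with the 0xFF prefix byte. */
static bool valid_mcast_gid(const uint8_t raw[16])
{
	return raw[0] == 0xFF;
}

/* Multicast LIDs occupy 0xC000..0xFFFE. */
static bool valid_mcast_lid(uint16_t lid)
{
	return lid >= MIN_MC_LID && lid <= MAX_MC_LID;
}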
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "uver" + +#include + +#include "ehca_classes.h" +#include "ehca_iverbs.h" +#include "ehca_mrmw.h" +#include "ehca_tools.h" +#include "hcp_if.h" + +struct ib_ucontext *ehca_alloc_ucontext(struct ib_device *device, + struct ib_udata *udata) +{ + struct ehca_ucontext *my_context = NULL; + + EHCA_CHECK_ADR_P(device); + EDEB_EN(7, "device=%p name=%s", device, device->name); + + my_context = kzalloc(sizeof *my_context, GFP_KERNEL); + if (!my_context) { + EDEB_ERR(4, "Out of memory device=%p", device); + return ERR_PTR(-ENOMEM); + } + + EDEB_EX(7, "device=%p ucontext=%p", device, my_context); + + return &my_context->ib_ucontext; +} + +int ehca_dealloc_ucontext(struct ib_ucontext *context) +{ + struct ehca_ucontext *my_context = NULL; + EHCA_CHECK_ADR(context); + EDEB_EN(7, "ucontext=%p", context); + my_context = container_of(context, struct ehca_ucontext, ib_ucontext); + kfree(my_context); + EDEB_EN(7, "ucontext=%p", context); + return 0; +} + +struct page *ehca_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct page *mypage = NULL; + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... 
*/ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 cur_pid = current->tgid; + unsigned long flags; + + EDEB_EN(7, "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx " + "address=%lx", + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset, + address); + + if (q_type == 1) { /* CQ */ + struct ehca_cq *cq = NULL; + u64 offset; + void *vaddr = NULL; + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, idr_handle); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + if (cq->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, cq->ownpid); + return NOPAGE_SIGBUS; + } + + /* make sure this mmap really belongs to the authorized user */ + if (!cq) { + EDEB_ERR(4, "cq is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + if (rsrc_type == 2) { + EDEB(6, "cq=%p cq queuearea", cq); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&cq->ipz_queue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } + } else if (q_type == 2) { /* QP */ + struct ehca_qp *qp = NULL; + struct ehca_pd *pd = NULL; + u64 offset; + void *vaddr = NULL; + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, idr_handle); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + + pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, pd->ownpid); + return NOPAGE_SIGBUS; + } + + /* make sure this mmap really belongs to the authorized user */ + if (!qp) { + EDEB_ERR(4, "qp is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + if (rsrc_type == 2) { /* rqueue */ + EDEB(6, "qp=%p qp rqueuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } else if (rsrc_type == 3) { /* squeue */ + EDEB(6, "qp=%p qp squeuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } + } + + if (!mypage) { + EDEB_ERR(4, "Invalid page adr==NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + get_page(mypage); + EDEB_EX(7, "page adr=%p", mypage); + return mypage; +} + +static struct vm_operations_struct ehcau_vm_ops = { + .nopage = ehca_nopage, +}; + +int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... 
*/ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 ret = -EFAULT; /* assume the worst */ + u64 vsize = 0; /* must be calculated/set below */ + u64 physical = 0; /* must be calculated/set below */ + u32 cur_pid = current->tgid; + unsigned long flags; + + EDEB_EN(7, "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx", + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset); + + if (q_type == 1) { /* CQ */ + struct ehca_cq *cq; + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, idr_handle); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + if (cq->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, cq->ownpid); + return -ENOMEM; + } + + /* make sure this mmap really belongs to the authorized user */ + if (!cq) + return -EINVAL; + if (!cq->ib_cq.uobject) + return -EINVAL; + if (cq->ib_cq.uobject->context != context) + return -EINVAL; + if (rsrc_type == 1) { /* galpa fw handle */ + EDEB(6, "cq=%p cq triggerarea", cq); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != EHCA_PAGESIZE) { + EDEB_ERR(4, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + ret = -EINVAL; + goto mmap_exit0; + } + + physical = cq->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, physical); + ret = remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret) { + EDEB_ERR(4, "remap_pfn_range() failed ret=%x", + ret); + ret = -ENOMEM; + } + goto mmap_exit0; + } else if (rsrc_type == 2) { /* cq queue_addr */ + EDEB(6, "cq=%p cq q_addr", cq); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else { + EDEB_ERR(6, "bad resource type %x", rsrc_type); + ret = -EINVAL; + goto mmap_exit0; + } + } else if (q_type == 2) { /* QP */ + struct ehca_qp *qp = NULL; + struct ehca_pd *pd = NULL; + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, idr_handle); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (pd->ownpid != cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, pd->ownpid); + return -ENOMEM; + } + + /* make sure this mmap really belongs to the authorized user */ + if (!qp || !qp->ib_qp.uobject || + qp->ib_qp.uobject->context != context) { + EDEB(6, "qp=%p, uobject=%p, context=%p", + qp, qp->ib_qp.uobject, qp->ib_qp.uobject->context); + ret = -EINVAL; + goto mmap_exit0; + } + if (rsrc_type == 1) { /* galpa fw handle */ + EDEB(6, "qp=%p qp triggerarea", qp); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != EHCA_PAGESIZE) { + EDEB_ERR(4, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + ret = -EINVAL; + goto mmap_exit0; + } + + physical = qp->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, physical); + ret = remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret) { + EDEB_ERR(4, "remap_pfn_range() failed ret=%x", + ret); + ret = -ENOMEM; + } + goto mmap_exit0; + } else if (rsrc_type == 2) { /* qp rqueue_addr */ + EDEB(6, "qp=%p qp rqueue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else if (rsrc_type == 3) { /* qp 
squeue_addr */ + EDEB(6, "qp=%p qp squeue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else { + EDEB_ERR(4, "bad resource type %x", rsrc_type); + ret = -EINVAL; + goto mmap_exit0; + } + } else { + EDEB_ERR(4, "bad queue type %x", q_type); + ret = -EINVAL; + goto mmap_exit0; + } + +mmap_exit0: + EDEB_EX(7, "ret=%x", ret); + return ret; +} + +int ehca_mmap_nopage(u64 foffset, u64 length, void ** mapped, + struct vm_area_struct ** vma) +{ + EDEB_EN(7, "foffset=%lx length=%lx", foffset, length); + down_write(¤t->mm->mmap_sem); + *mapped = (void*)do_mmap(NULL,0, length, PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + foffset); + up_write(¤t->mm->mmap_sem); + if (!(*mapped)) { + EDEB_ERR(4, "couldn't mmap foffset=%lx length=%lx", + foffset, length); + return -EINVAL; + } + + *vma = find_vma(current->mm, (u64)*mapped); + if (!(*vma)) { + down_write(¤t->mm->mmap_sem); + do_munmap(current->mm, 0, length); + up_write(¤t->mm->mmap_sem); + EDEB_ERR(4, "couldn't find vma queue=%p", *mapped); + return -EINVAL; + } + (*vma)->vm_flags |= VM_RESERVED; + (*vma)->vm_ops = &ehcau_vm_ops; + + EDEB_EX(7, "mapped=%p", *mapped); + return 0; +} + +int ehca_mmap_register(u64 physical, void ** mapped, + struct vm_area_struct ** vma) +{ + int ret = 0; + unsigned long vsize; + /* ehca hw supports only 4k page */ + ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma); + if (ret) { + EDEB(4, "could'nt mmap physical=%lx", physical); + return ret; + } + + (*vma)->vm_flags |= VM_RESERVED; + vsize = (*vma)->vm_end - (*vma)->vm_start; + if (vsize != EHCA_PAGESIZE) { + EDEB_ERR(4, "invalid vsize=%lx", + (*vma)->vm_end - (*vma)->vm_start); + ret = -EINVAL; + return ret; + } + + (*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot); + (*vma)->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, physical); + ret = remap_pfn_range((*vma), (*vma)->vm_start, + physical >> PAGE_SHIFT, vsize, + (*vma)->vm_page_prot); + if (ret) { + EDEB_ERR(4, "remap_pfn_range() failed ret=%x", ret); + ret = -ENOMEM; + } + return ret; + +} + +int ehca_munmap(unsigned long addr, size_t len) { + int ret = 0; + struct mm_struct *mm = current->mm; + if (mm) { + down_write(&mm->mmap_sem); + ret = do_munmap(mm, addr, len); + up_write(&mm->mmap_sem); + } + return ret; +} -- 1.4.1 From rolandd at cisco.com Thu Aug 17 13:11:01 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:11:01 -0700 Subject: [openib-general] [PATCH 04/13] IB/ehca: mrmw In-Reply-To: <20068171311.VUo6fig31aLNQqvN@cisco.com> Message-ID: <20068171311.Erm4R4ERt5Mpsgua@cisco.com> drivers/infiniband/hw/ehca/ehca_mrmw.c | 2472 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_mrmw.h | 143 ++ 2 files changed, 2615 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c new file mode 100644 index 0000000..99160d7 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -0,0 +1,2472 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * MR/MW functions + * + * Authors: Dietmar Decker + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
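The nopage and mmap handlers above recover everything they need from the 64-bit mmap file offset: the idr handle sits in the upper 32 bits, the queue type (1 = CQ, 2 = QP) in bits 28-31, and the resource type (1 = galpa trigger page, 2 = CQ/receive queue, 3 = send queue) in bits 24-27. A standalone sketch of that packing and the matching decode (helper names are hypothetical, not from the patch):

#include <stdint.h>

/* Pack an idr handle, queue type and resource type into an mmap offset
 * with the same bit layout that ehca_nopage()/ehca_mmap() decode above. */
static uint64_t pack_mmap_offset(uint32_t idr_handle,
				 uint32_t q_type, uint32_t rsrc_type)
{
	return ((uint64_t)idr_handle << 32) |
	       ((uint64_t)(q_type & 0xF) << 28) |
	       ((uint64_t)(rsrc_type & 0xF) << 24);
}

/* The reverse mapping, matching the decode in the handlers above. */
static void unpack_mmap_offset(uint64_t off, uint32_t *idr_handle,
			       uint32_t *q_type, uint32_t *rsrc_type)
{
	*idr_handle = off >> 32;
	*q_type = (off >> 28) & 0xF;
	*rsrc_type = (off >> 24) & 0xF;
}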
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "mrmw" + +#include + +#include "ehca_iverbs.h" +#include "ehca_mrmw.h" +#include "hcp_if.h" +#include "hipz_hw.h" + +extern int ehca_use_hp_mr; + +static struct ehca_mr *ehca_mr_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_mr *me; + + me = kmem_cache_alloc(ehca_module.cache_mr, SLAB_KERNEL); + if (me) { + memset(me, 0, sizeof(struct ehca_mr)); + spin_lock_init(&me->mrlock); + EDEB_EX(7, "ehca_mr=%p sizeof(ehca_mr_t)=%x", me, + (u32) sizeof(struct ehca_mr)); + } else { + EDEB_ERR(3, "alloc failed"); + } + + return me; +} + +static void ehca_mr_delete(struct ehca_mr *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_mr, me); +} + +static struct ehca_mw *ehca_mw_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_mw *me; + + me = kmem_cache_alloc(ehca_module.cache_mw, SLAB_KERNEL); + if (me) { + memset(me, 0, sizeof(struct ehca_mw)); + spin_lock_init(&me->mwlock); + EDEB_EX(7, "ehca_mw=%p sizeof(ehca_mw_t)=%x", me, + (u32) sizeof(struct ehca_mw)); + } else { + EDEB_ERR(3, "alloc failed"); + } + + return me; +} + +static void ehca_mw_delete(struct ehca_mw *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_mw, me); +} + +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *ib_mr = NULL; + int ret = 0; + struct ehca_mr *e_maxmr = NULL; + struct ehca_pd *e_pd = NULL; + struct ehca_shca *shca = NULL; + + EDEB_EN(7, "pd=%p mr_access_flags=%x", pd, mr_access_flags); + + EHCA_CHECK_PD_P(pd); + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + if (shca->maxmr) { + e_maxmr = ehca_mr_new(); + if (!e_maxmr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto get_dma_mr_exit0; + } + + ret = ehca_reg_maxmr(shca, e_maxmr, (u64*)KERNELBASE, + mr_access_flags, e_pd, + &e_maxmr->ib.ib_mr.lkey, + &e_maxmr->ib.ib_mr.rkey); + if (ret) { + ib_mr = ERR_PTR(ret); + goto get_dma_mr_exit0; + } + ib_mr = &e_maxmr->ib.ib_mr; + } else { + EDEB_ERR(4, "no 
internal max-MR exist!"); + ib_mr = ERR_PTR(-EINVAL); + goto get_dma_mr_exit0; + } + +get_dma_mr_exit0: + if (IS_ERR(ib_mr)) + EDEB_EX(4, "rc=%lx pd=%p mr_access_flags=%x ", + PTR_ERR(ib_mr), pd, mr_access_flags); + else + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + return ib_mr; +} /* end ehca_get_dma_mr() */ + +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *ib_mr = NULL; + int ret = 0; + struct ehca_mr *e_mr = NULL; + struct ehca_shca *shca = NULL; + struct ehca_pd *e_pd = NULL; + u64 size = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + u32 num_pages_mr = 0; + u32 num_pages_4k = 0; /* 4k portion "pages" */ + + EDEB_EN(7, "pd=%p phys_buf_array=%p num_phys_buf=%x " + "mr_access_flags=%x iova_start=%p", pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + + EHCA_CHECK_PD_P(pd); + if ((num_phys_buf <= 0) || ehca_adr_bad(phys_buf_array)) { + EDEB_ERR(4, "bad input values: num_phys_buf=%x " + "phys_buf_array=%p", num_phys_buf, phys_buf_array); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* + * Remote Write Access requires Local Write Access + * Remote Atomic Access requires Local Write Access + */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + + /* check physical buffer list and calculate size */ + ret = ehca_mr_chk_buf_and_calc_size(phys_buf_array, num_phys_buf, + iova_start, &size); + if (ret) { + ib_mr = ERR_PTR(ret); + goto reg_phys_mr_exit0; + } + if ((size == 0) || + (((u64)iova_start + size) < (u64)iova_start)) { + EDEB_ERR(4, "bad input values: size=%lx iova_start=%p", + size, iova_start); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto reg_phys_mr_exit0; + } + + /* determine number of MR pages */ + num_pages_mr = ((((u64)iova_start % PAGE_SIZE) + size + + PAGE_SIZE - 1) / PAGE_SIZE); + num_pages_4k = ((((u64)iova_start % EHCA_PAGESIZE) + size + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + + /* register MR on HCA */ + if (ehca_mr_is_maxmr(size, iova_start)) { + e_mr->flags |= EHCA_MR_FLAG_MAXMR; + ret = ehca_reg_maxmr(shca, e_mr, iova_start, mr_access_flags, + e_pd, &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (ret) { + ib_mr = ERR_PTR(ret); + goto reg_phys_mr_exit1; + } + } else { + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_4k = num_pages_4k; + pginfo.num_phys_buf = num_phys_buf; + pginfo.phys_buf_array = phys_buf_array; + pginfo.next_4k = (((u64)iova_start & ~PAGE_MASK) / + EHCA_PAGESIZE); + + ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags, + e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (ret) { + ib_mr = ERR_PTR(ret); + goto reg_phys_mr_exit1; + } + } + + /* successful registration of all pages */ + ib_mr = &e_mr->ib.ib_mr; + goto reg_phys_mr_exit0; + +reg_phys_mr_exit1: + ehca_mr_delete(e_mr); 
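Aside, not part of the patch: num_pages_mr and num_pages_4k above are simply "how many pages does [iova_start, iova_start + size) touch", i.e. the size plus the leading in-page offset, rounded up to the page size. For example, a 10,000-byte region that starts 100 bytes into a 4 KiB page spans (100 + 10000 + 4095) / 4096 = 3 pages. Expressed as a small helper (name hypothetical):

#include <stdint.h>

/* Pages of size page_size touched by a region starting at addr with the
 * given size in bytes; same rounding as the num_pages computations above. */
static uint64_t pages_spanned(uint64_t addr, uint64_t size, uint64_t page_size)
{
	return ((addr % page_size) + size + page_size - 1) / page_size;
}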
+reg_phys_mr_exit0: + if (IS_ERR(ib_mr)) + EDEB_EX(4, "rc=%lx pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + PTR_ERR(ib_mr), pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + else + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + return ib_mr; +} /* end ehca_reg_phys_mr() */ + +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, + struct ib_umem *region, + int mr_access_flags, + struct ib_udata *udata) +{ + struct ib_mr *ib_mr = NULL; + struct ehca_mr *e_mr = NULL; + struct ehca_shca *shca = NULL; + struct ehca_pd *e_pd = NULL; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + int ret = 0; + u32 num_pages_mr = 0; + u32 num_pages_4k = 0; /* 4k portion "pages" */ + + EDEB_EN(7, "pd=%p region=%p mr_access_flags=%x udata=%p", + pd, region, mr_access_flags, udata); + + EHCA_CHECK_PD_P(pd); + if (ehca_adr_bad(region)) { + EDEB_ERR(4, "bad input values: region=%p", region); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* + * Remote Write Access requires Local Write Access + * Remote Atomic Access requires Local Write Access + */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + EDEB(7, "user_base=%lx virt_base=%lx length=%lx offset=%x page_size=%x " + "chunk_list.next=%p", + region->user_base, region->virt_base, region->length, + region->offset, region->page_size, region->chunk_list.next); + if (region->page_size != PAGE_SIZE) { + EDEB_ERR(4, "page size not supported, region->page_size=%x", + region->page_size); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + + if ((region->length == 0) || + ((region->virt_base + region->length) < region->virt_base)) { + EDEB_ERR(4, "bad input values: length=%lx virt_base=%lx", + region->length, region->virt_base); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto reg_user_mr_exit0; + } + + /* determine number of MR pages */ + num_pages_mr = (((region->virt_base % PAGE_SIZE) + region->length + + PAGE_SIZE - 1) / PAGE_SIZE); + num_pages_4k = (((region->virt_base % EHCA_PAGESIZE) + region->length + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + + /* register MR on HCA */ + pginfo.type = EHCA_MR_PGI_USER; + pginfo.num_pages = num_pages_mr; + pginfo.num_4k = num_pages_4k; + pginfo.region = region; + pginfo.next_4k = region->offset / EHCA_PAGESIZE; + pginfo.next_chunk = list_prepare_entry(pginfo.next_chunk, + (®ion->chunk_list), + list); + + ret = ehca_reg_mr(shca, e_mr, (u64*)region->virt_base, + region->length, mr_access_flags, e_pd, &pginfo, + &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); + if (ret) { + ib_mr = ERR_PTR(ret); + goto reg_user_mr_exit1; + } + + /* successful registration of all pages */ + ib_mr = &e_mr->ib.ib_mr; + goto reg_user_mr_exit0; + +reg_user_mr_exit1: + ehca_mr_delete(e_mr); +reg_user_mr_exit0: + if (IS_ERR(ib_mr)) + EDEB_EX(4, "rc=%lx pd=%p region=%p mr_access_flags=%x " + "udata=%p", + PTR_ERR(ib_mr), pd, region, mr_access_flags, 
udata); + else + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + return ib_mr; +} /* end ehca_reg_user_mr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + int ret = 0; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_mr = NULL; + u64 new_size = 0; + u64 *new_start = NULL; + u32 new_acl = 0; + struct ehca_pd *new_pd = NULL; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + unsigned long sl_flags; + u32 num_pages_mr = 0; + u32 num_pages_4k = 0; /* 4k portion "pages" */ + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + EDEB_EN(7, "mr=%p mr_rereg_mask=%x pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + mr, mr_rereg_mask, pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + EHCA_CHECK_MR(mr); + my_pd = container_of(mr->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + (my_pd->ownpid != cur_pid)) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + ret = -EINVAL; + goto rereg_phys_mr_exit0; + } + + if (!(mr_rereg_mask & IB_MR_REREG_TRANS)) { + /* TODO not supported, because PHYP rereg hCall needs pages */ + EDEB_ERR(4, "rereg without IB_MR_REREG_TRANS not supported yet," + " mr_rereg_mask=%x", mr_rereg_mask); + ret = -EINVAL; + goto rereg_phys_mr_exit0; + } + + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + if (mr_rereg_mask & IB_MR_REREG_PD) { + EHCA_CHECK_PD(pd); + } + + if ((mr_rereg_mask & + ~(IB_MR_REREG_TRANS | IB_MR_REREG_PD | IB_MR_REREG_ACCESS)) || + (mr_rereg_mask == 0)) { + ret = -EINVAL; + goto rereg_phys_mr_exit0; + } + + shca = container_of(mr->device, struct ehca_shca, ib_device); + + /* check other parameters */ + if (e_mr == shca->maxmr) { + /* should be impossible, however reject to be sure */ + EDEB_ERR(3, "rereg internal max-MR impossible, mr=%p " + "shca->maxmr=%p mr->lkey=%x", + mr, shca->maxmr, mr->lkey); + ret = -EINVAL; + goto rereg_phys_mr_exit0; + } + if (mr_rereg_mask & IB_MR_REREG_TRANS) { /* transl., i.e. 
addr/size */ + if (e_mr->flags & EHCA_MR_FLAG_FMR) { + EDEB_ERR(4, "not supported for FMR, mr=%p flags=%x", + mr, e_mr->flags); + ret = -EINVAL; + goto rereg_phys_mr_exit0; + } + if (ehca_adr_bad(phys_buf_array) || num_phys_buf <= 0) { + EDEB_ERR(4, "bad input values: mr_rereg_mask=%x " + "phys_buf_array=%p num_phys_buf=%x", + mr_rereg_mask, phys_buf_array, num_phys_buf); + ret = -EINVAL; + goto rereg_phys_mr_exit0; + } + } + if ((mr_rereg_mask & IB_MR_REREG_ACCESS) && /* change ACL */ + (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)))) { + /* + * Remote Write Access requires Local Write Access + * Remote Atomic Access requires Local Write Access + */ + EDEB_ERR(4, "bad input values: mr_rereg_mask=%x " + "mr_access_flags=%x", mr_rereg_mask, mr_access_flags); + ret = -EINVAL; + goto rereg_phys_mr_exit0; + } + + /* set requested values dependent on rereg request */ + spin_lock_irqsave(&e_mr->mrlock, sl_flags); + new_start = e_mr->start; /* new == old address */ + new_size = e_mr->size; /* new == old length */ + new_acl = e_mr->acl; /* new == old access control */ + new_pd = container_of(mr->pd,struct ehca_pd,ib_pd); /*new == old PD*/ + + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + new_start = iova_start; /* change address */ + /* check physical buffer list and calculate size */ + ret = ehca_mr_chk_buf_and_calc_size(phys_buf_array, + num_phys_buf, iova_start, + &new_size); + if (ret) + goto rereg_phys_mr_exit1; + if ((new_size == 0) || + (((u64)iova_start + new_size) < (u64)iova_start)) { + EDEB_ERR(4, "bad input values: new_size=%lx " + "iova_start=%p", new_size, iova_start); + ret = -EINVAL; + goto rereg_phys_mr_exit1; + } + num_pages_mr = ((((u64)new_start % PAGE_SIZE) + new_size + + PAGE_SIZE - 1) / PAGE_SIZE); + num_pages_4k = ((((u64)new_start % EHCA_PAGESIZE) + new_size + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_4k = num_pages_4k; + pginfo.num_phys_buf = num_phys_buf; + pginfo.phys_buf_array = phys_buf_array; + pginfo.next_4k = (((u64)iova_start & ~PAGE_MASK) / + EHCA_PAGESIZE); + } + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + new_acl = mr_access_flags; + if (mr_rereg_mask & IB_MR_REREG_PD) + new_pd = container_of(pd, struct ehca_pd, ib_pd); + + EDEB(7, "mr=%p new_start=%p new_size=%lx new_acl=%x new_pd=%p " + "num_pages_mr=%x num_pages_4k=%x", e_mr, new_start, new_size, + new_acl, new_pd, num_pages_mr, num_pages_4k); + + ret = ehca_rereg_mr(shca, e_mr, new_start, new_size, new_acl, + new_pd, &pginfo, &tmp_lkey, &tmp_rkey); + if (ret) + goto rereg_phys_mr_exit1; + + /* successful reregistration */ + if (mr_rereg_mask & IB_MR_REREG_PD) + mr->pd = pd; + mr->lkey = tmp_lkey; + mr->rkey = tmp_rkey; + +rereg_phys_mr_exit1: + spin_unlock_irqrestore(&e_mr->mrlock, sl_flags); +rereg_phys_mr_exit0: + if (ret) + EDEB_EX(4, "ret=%x mr=%p mr_rereg_mask=%x pd=%p " + "phys_buf_array=%p num_phys_buf=%x mr_access_flags=%x " + "iova_start=%p", + ret, mr, mr_rereg_mask, pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + else + EDEB_EX(7, "mr=%p mr_rereg_mask=%x pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + mr, mr_rereg_mask, pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + return ret; +} /* end ehca_rereg_phys_mr() */ + +/*----------------------------------------------------------------------*/ + +int 
ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_mr = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + unsigned long sl_flags; + struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + + EDEB_EN(7, "mr=%p mr_attr=%p", mr, mr_attr); + + EHCA_CHECK_MR(mr); + + my_pd = container_of(mr->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + (my_pd->ownpid != cur_pid)) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + ret = -EINVAL; + goto query_mr_exit0; + } + + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + if (ehca_adr_bad(mr_attr)) { + EDEB_ERR(4, "bad input values: mr_attr=%p", mr_attr); + ret = -EINVAL; + goto query_mr_exit0; + } + if ((e_mr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not supported for FMR, mr=%p e_mr=%p " + "e_mr->flags=%x", mr, e_mr, e_mr->flags); + ret = -EINVAL; + goto query_mr_exit0; + } + + shca = container_of(mr->device, struct ehca_shca, ib_device); + memset(mr_attr, 0, sizeof(struct ib_mr_attr)); + spin_lock_irqsave(&e_mr->mrlock, sl_flags); + + h_ret = hipz_h_query_mr(shca->ipz_hca_handle, e_mr, &hipzout); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_mr_query failed, h_ret=%lx mr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", + h_ret, mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, mr->lkey); + ret = ehca_mrmw_map_hrc_query_mr(h_ret); + goto query_mr_exit1; + } + mr_attr->pd = mr->pd; + mr_attr->device_virt_addr = hipzout.vaddr; + mr_attr->size = hipzout.len; + mr_attr->lkey = hipzout.lkey; + mr_attr->rkey = hipzout.rkey; + ehca_mrmw_reverse_map_acl(&hipzout.acl, &mr_attr->mr_access_flags); + +query_mr_exit1: + spin_unlock_irqrestore(&e_mr->mrlock, sl_flags); +query_mr_exit0: + if (ret) + EDEB_EX(4, "ret=%x mr=%p mr_attr=%p", ret, mr, mr_attr); + else + EDEB_EX(7, "pd=%p device_virt_addr=%lx size=%lx " + "mr_access_flags=%x lkey=%x rkey=%x", + mr_attr->pd, mr_attr->device_virt_addr, + mr_attr->size, mr_attr->mr_access_flags, + mr_attr->lkey, mr_attr->rkey); + return ret; +} /* end ehca_query_mr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_dereg_mr(struct ib_mr *mr) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_mr = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + EDEB_EN(7, "mr=%p", mr); + + EHCA_CHECK_MR(mr); + my_pd = container_of(mr->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + (my_pd->ownpid != cur_pid)) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + ret = -EINVAL; + goto dereg_mr_exit0; + } + + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + shca = container_of(mr->device, struct ehca_shca, ib_device); + + if ((e_mr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not supported for FMR, mr=%p e_mr=%p " + "e_mr->flags=%x", mr, e_mr, e_mr->flags); + ret = -EINVAL; + goto dereg_mr_exit0; + } else if (e_mr == shca->maxmr) { + /* should be impossible, however reject to be sure */ + EDEB_ERR(3, "dereg internal max-MR impossible, mr=%p " + "shca->maxmr=%p mr->lkey=%x", + mr, shca->maxmr, mr->lkey); + ret = -EINVAL; + goto dereg_mr_exit0; + } + + /* TODO: BUSY: MR still has bound window(s) */ + h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mr failed, h_ret=%lx shca=%p e_mr=%p" + " 
hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", + h_ret, shca, e_mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, mr->lkey); + ret = ehca_mrmw_map_hrc_free_mr(h_ret); + goto dereg_mr_exit0; + } + + /* successful deregistration */ + ehca_mr_delete(e_mr); + +dereg_mr_exit0: + if (ret) + EDEB_EX(4, "ret=%x mr=%p", ret, mr); + else + EDEB_EX(7, ""); + return ret; +} /* end ehca_dereg_mr() */ + +/*----------------------------------------------------------------------*/ + +struct ib_mw *ehca_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *ib_mw = NULL; + u64 h_ret = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mw *e_mw = NULL; + struct ehca_pd *e_pd = NULL; + struct ehca_mw_hipzout_parms hipzout = {{0},0}; + + EDEB_EN(7, "pd=%p", pd); + + EHCA_CHECK_PD_P(pd); + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mw = ehca_mw_new(); + if (!e_mw) { + ib_mw = ERR_PTR(-ENOMEM); + goto alloc_mw_exit0; + } + + h_ret = hipz_h_alloc_resource_mw(shca->ipz_hca_handle, e_mw, + e_pd->fw_pd, &hipzout); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_mw_allocate failed, h_ret=%lx shca=%p " + "hca_hndl=%lx mw=%p", h_ret, shca, + shca->ipz_hca_handle.handle, e_mw); + ib_mw = ERR_PTR(ehca_mrmw_map_hrc_alloc(h_ret)); + goto alloc_mw_exit1; + } + /* successful MW allocation */ + e_mw->ipz_mw_handle = hipzout.handle; + e_mw->ib_mw.rkey = hipzout.rkey; + ib_mw = &e_mw->ib_mw; + goto alloc_mw_exit0; + +alloc_mw_exit1: + ehca_mw_delete(e_mw); +alloc_mw_exit0: + if (IS_ERR(ib_mw)) + EDEB_EX(4, "rc=%lx pd=%p", PTR_ERR(ib_mw), pd); + else + EDEB_EX(7, "ib_mw=%p rkey=%x", ib_mw, ib_mw->rkey); + return ib_mw; +} /* end ehca_alloc_mw() */ + +/*----------------------------------------------------------------------*/ + +int ehca_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + int ret = 0; + + /* TODO: not supported up to now */ + EDEB_ERR(4, "bind MW currently not supported by HCAD"); + ret = -EPERM; + goto bind_mw_exit0; + +bind_mw_exit0: + if (ret) + EDEB_EX(4, "ret=%x qp=%p mw=%p mw_bind=%p", + ret, qp, mw, mw_bind); + else + EDEB_EX(7, "qp=%p mw=%p mw_bind=%p", qp, mw, mw_bind); + return ret; +} /* end ehca_bind_mw() */ + +/*----------------------------------------------------------------------*/ + +int ehca_dealloc_mw(struct ib_mw *mw) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mw *e_mw = NULL; + + EDEB_EN(7, "mw=%p", mw); + + EHCA_CHECK_MW(mw); + e_mw = container_of(mw, struct ehca_mw, ib_mw); + shca = container_of(mw->device, struct ehca_shca, ib_device); + + h_ret = hipz_h_free_resource_mw(shca->ipz_hca_handle, e_mw); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mw failed, h_ret=%lx shca=%p mw=%p " + "rkey=%x hca_hndl=%lx mw_hndl=%lx", + h_ret, shca, mw, mw->rkey, shca->ipz_hca_handle.handle, + e_mw->ipz_mw_handle.handle); + ret = ehca_mrmw_map_hrc_free_mw(h_ret); + goto dealloc_mw_exit0; + } + /* successful deallocation */ + ehca_mw_delete(e_mw); + +dealloc_mw_exit0: + if (ret) + EDEB_EX(4, "ret=%x mw=%p", ret, mw); + else + EDEB_EX(7, ""); + return ret; +} /* end ehca_dealloc_mw() */ + +/*----------------------------------------------------------------------*/ + +struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *ib_fmr = NULL; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_fmr = NULL; + int ret = 0; + struct ehca_pd *e_pd = NULL; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + 
struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + + EDEB_EN(7, "pd=%p mr_access_flags=%x fmr_attr=%p", + pd, mr_access_flags, fmr_attr); + + EHCA_CHECK_PD_P(pd); + if (ehca_adr_bad(fmr_attr)) { + EDEB_ERR(4, "bad input values: fmr_attr=%p", fmr_attr); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + + EDEB(7, "max_pages=%x max_maps=%x page_shift=%x", + fmr_attr->max_pages, fmr_attr->max_maps, fmr_attr->page_shift); + + /* check other parameters */ + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* + * Remote Write Access requires Local Write Access + * Remote Atomic Access requires Local Write Access + */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if (mr_access_flags & IB_ACCESS_MW_BIND) { + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if ((fmr_attr->max_pages == 0) || (fmr_attr->max_maps == 0)) { + EDEB_ERR(4, "bad input values: fmr_attr->max_pages=%x " + "fmr_attr->max_maps=%x fmr_attr->page_shift=%x", + fmr_attr->max_pages, fmr_attr->max_maps, + fmr_attr->page_shift); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if (((1 << fmr_attr->page_shift) != EHCA_PAGESIZE) && + ((1 << fmr_attr->page_shift) != PAGE_SIZE)) { + EDEB_ERR(4, "unsupported fmr_attr->page_shift=%x", + fmr_attr->page_shift); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_fmr = ehca_mr_new(); + if (!e_fmr) { + ib_fmr = ERR_PTR(-ENOMEM); + goto alloc_fmr_exit0; + } + e_fmr->flags |= EHCA_MR_FLAG_FMR; + + /* register MR on HCA */ + ret = ehca_reg_mr(shca, e_fmr, NULL, + fmr_attr->max_pages * (1 << fmr_attr->page_shift), + mr_access_flags, e_pd, &pginfo, + &tmp_lkey, &tmp_rkey); + if (ret) { + ib_fmr = ERR_PTR(ret); + goto alloc_fmr_exit1; + } + + /* successful */ + e_fmr->fmr_page_size = 1 << fmr_attr->page_shift; + e_fmr->fmr_max_pages = fmr_attr->max_pages; + e_fmr->fmr_max_maps = fmr_attr->max_maps; + e_fmr->fmr_map_cnt = 0; + ib_fmr = &e_fmr->ib.ib_fmr; + goto alloc_fmr_exit0; + +alloc_fmr_exit1: + ehca_mr_delete(e_fmr); +alloc_fmr_exit0: + if (IS_ERR(ib_fmr)) + EDEB_EX(4, "rc=%lx pd=%p mr_access_flags=%x " + "fmr_attr=%p", PTR_ERR(ib_fmr), pd, + mr_access_flags, fmr_attr); + else + EDEB_EX(7, "ib_fmr=%p tmp_lkey=%x tmp_rkey=%x", + ib_fmr, tmp_lkey, tmp_rkey); + return ib_fmr; +} /* end ehca_alloc_fmr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, + int list_len, + u64 iova) +{ + int ret = 0; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_fmr = NULL; + struct ehca_pd *e_pd = NULL; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + + EDEB_EN(7, "fmr=%p page_list=%p list_len=%x iova=%lx", + fmr, page_list, list_len, iova); + + EHCA_CHECK_FMR(fmr); + e_fmr = container_of(fmr, struct ehca_mr, ib.ib_fmr); + shca = container_of(fmr->device, struct ehca_shca, ib_device); + e_pd = container_of(fmr->pd, struct ehca_pd, ib_pd); + + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + ret = 
-EINVAL; + goto map_phys_fmr_exit0; + } + ret = ehca_fmr_check_page_list(e_fmr, page_list, list_len); + if (ret) + goto map_phys_fmr_exit0; + if (iova % e_fmr->fmr_page_size) { + /* only whole-numbered pages */ + EDEB_ERR(4, "bad iova, iova=%lx fmr_page_size=%x", + iova, e_fmr->fmr_page_size); + ret = -EINVAL; + goto map_phys_fmr_exit0; + } + if (e_fmr->fmr_map_cnt >= e_fmr->fmr_max_maps) { + /* HCAD does not limit the maps, however trace this anyway */ + EDEB(6, "map limit exceeded, fmr=%p e_fmr->fmr_map_cnt=%x " + "e_fmr->fmr_max_maps=%x", + fmr, e_fmr->fmr_map_cnt, e_fmr->fmr_max_maps); + } + + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_pages = list_len; + pginfo.num_4k = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE); + pginfo.page_list = page_list; + pginfo.next_4k = ((iova & (e_fmr->fmr_page_size-1)) / + EHCA_PAGESIZE); + + ret = ehca_rereg_mr(shca, e_fmr, (u64*)iova, + list_len * e_fmr->fmr_page_size, + e_fmr->acl, e_pd, &pginfo, &tmp_lkey, &tmp_rkey); + if (ret) + goto map_phys_fmr_exit0; + + /* successful reregistration */ + e_fmr->fmr_map_cnt++; + e_fmr->ib.ib_fmr.lkey = tmp_lkey; + e_fmr->ib.ib_fmr.rkey = tmp_rkey; + +map_phys_fmr_exit0: + if (ret) + EDEB_EX(4, "ret=%x fmr=%p page_list=%p list_len=%x iova=%lx", + ret, fmr, page_list, list_len, iova); + else + EDEB_EX(7, "lkey=%x rkey=%x", + e_fmr->ib.ib_fmr.lkey, e_fmr->ib.ib_fmr.rkey); + return ret; +} /* end ehca_map_phys_fmr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_unmap_fmr(struct list_head *fmr_list) +{ + int ret = 0; + struct ib_fmr *ib_fmr = NULL; + struct ehca_shca *shca = NULL; + struct ehca_shca *prev_shca = NULL; + struct ehca_mr *e_fmr = NULL; + u32 num_fmr = 0; + u32 unmap_fmr_cnt = 0; + + EDEB_EN(7, "fmr_list=%p", fmr_list); + + /* check all FMR belong to same SHCA, and check internal flag */ + list_for_each_entry(ib_fmr, fmr_list, list) { + prev_shca = shca; + shca = container_of(ib_fmr->device, struct ehca_shca, + ib_device); + EHCA_CHECK_FMR(ib_fmr); + e_fmr = container_of(ib_fmr, struct ehca_mr, ib.ib_fmr); + if ((shca != prev_shca) && prev_shca) { + EDEB_ERR(4, "SHCA mismatch, shca=%p prev_shca=%p " + "e_fmr=%p", shca, prev_shca, e_fmr); + ret = -EINVAL; + goto unmap_fmr_exit0; + } + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + ret = -EINVAL; + goto unmap_fmr_exit0; + } + num_fmr++; + } + + /* loop over all FMRs to unmap */ + list_for_each_entry(ib_fmr, fmr_list, list) { + unmap_fmr_cnt++; + e_fmr = container_of(ib_fmr, struct ehca_mr, ib.ib_fmr); + shca = container_of(ib_fmr->device, struct ehca_shca, + ib_device); + ret = ehca_unmap_one_fmr(shca, e_fmr); + if (ret) { + /* unmap failed, stop unmapping of rest of FMRs */ + EDEB_ERR(4, "unmap of one FMR failed, stop rest, " + "e_fmr=%p num_fmr=%x unmap_fmr_cnt=%x lkey=%x", + e_fmr, num_fmr, unmap_fmr_cnt, + e_fmr->ib.ib_fmr.lkey); + goto unmap_fmr_exit0; + } + } + +unmap_fmr_exit0: + if (ret) + EDEB_EX(4, "ret=%x fmr_list=%p num_fmr=%x unmap_fmr_cnt=%x", + ret, fmr_list, num_fmr, unmap_fmr_cnt); + else + EDEB_EX(7, "num_fmr=%x", num_fmr); + return ret; +} /* end ehca_unmap_fmr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_dealloc_fmr(struct ib_fmr *fmr) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_fmr = NULL; + + EDEB_EN(7, "fmr=%p", fmr); + + EHCA_CHECK_FMR(fmr); + e_fmr = container_of(fmr, struct ehca_mr, ib.ib_fmr); + shca = 
container_of(fmr->device, struct ehca_shca, ib_device); + + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + ret = -EINVAL; + goto free_fmr_exit0; + } + + h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mr failed, h_ret=%lx e_fmr=%p " + "hca_hndl=%lx fmr_hndl=%lx fmr->lkey=%x", + h_ret, e_fmr, shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, fmr->lkey); + ehca_mrmw_map_hrc_free_mr(h_ret); + goto free_fmr_exit0; + } + /* successful deregistration */ + ehca_mr_delete(e_fmr); + +free_fmr_exit0: + if (ret) + EDEB_EX(4, "ret=%x fmr=%p", ret, fmr); + else + EDEB_EX(7, ""); + return ret; +} /* end ehca_dealloc_fmr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_reg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, /*OUT*/ + u32 *rkey) /*OUT*/ +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + u32 hipz_acl = 0; + struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x e_pd=%p " + "pginfo=%p num_pages=%lx num_4k=%lx", shca, e_mr, iova_start, + size, acl, e_pd, pginfo, pginfo->num_pages, pginfo->num_4k); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + if (ehca_use_hp_mr == 1) + hipz_acl |= 0x00000001; + + h_ret = hipz_h_alloc_resource_mr(shca->ipz_hca_handle, e_mr, + (u64)iova_start, size, hipz_acl, + e_pd->fw_pd, &hipzout); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_alloc_mr failed, h_ret=%lx hca_hndl=%lx", + h_ret, shca->ipz_hca_handle.handle); + ret = ehca_mrmw_map_hrc_alloc(h_ret); + goto ehca_reg_mr_exit0; + } + + e_mr->ipz_mr_handle = hipzout.handle; + + ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); + if (ret) + goto ehca_reg_mr_exit1; + + /* successful registration */ + e_mr->num_pages = pginfo->num_pages; + e_mr->num_4k = pginfo->num_4k; + e_mr->start = iova_start; + e_mr->size = size; + e_mr->acl = acl; + *lkey = hipzout.lkey; + *rkey = hipzout.rkey; + goto ehca_reg_mr_exit0; + +ehca_reg_mr_exit1: + h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); + if (h_ret != H_SUCCESS) { + EDEB_ERR(1, "h_ret=%lx shca=%p e_mr=%p iova_start=%p " + "size=%lx acl=%x e_pd=%p lkey=%x pginfo=%p " + "num_pages=%lx num_4k=%lx ret=%x", h_ret, shca, e_mr, + iova_start, size, acl, e_pd, hipzout.lkey, pginfo, + pginfo->num_pages, pginfo->num_4k, ret); + EDEB_ERR(1, "internal error in ehca_reg_mr, not recoverable"); + } +ehca_reg_mr_exit0: + if (ret) + EDEB_EX(4, "ret=%x shca=%p e_mr=%p iova_start=%p size=%lx " + "acl=%x e_pd=%p pginfo=%p num_pages=%lx num_4k=%lx", + ret, shca, e_mr, iova_start, size, acl, e_pd, pginfo, + pginfo->num_pages, pginfo->num_4k); + else + EDEB_EX(7, "ret=%x lkey=%x rkey=%x", ret, *lkey, *rkey); + return ret; +} /* end ehca_reg_mr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_reg_mr_rpages(struct ehca_shca *shca, + struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + u32 rnum = 0; + u64 rpage = 0; + u32 i; + u64 *kpage = NULL; + + EDEB_EN(7, "shca=%p e_mr=%p pginfo=%p num_pages=%lx num_4k=%lx", + shca, e_mr, pginfo, pginfo->num_pages, pginfo->num_4k); + + kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!kpage) { + EDEB_ERR(4, "kpage alloc failed"); + ret = -ENOMEM; + goto 
ehca_reg_mr_rpages_exit0; + } + + /* max 512 pages per shot */ + for (i = 0; i < ((pginfo->num_4k + 512 - 1) / 512); i++) { + + if (i == ((pginfo->num_4k + 512 - 1) / 512) - 1) { + rnum = pginfo->num_4k % 512; /* last shot */ + if (rnum == 0) + rnum = 512; /* last shot is full */ + } else + rnum = 512; + + if (rnum > 1) { + ret = ehca_set_pagebuf(e_mr, pginfo, rnum, kpage); + if (ret) { + EDEB_ERR(4, "ehca_set_pagebuf bad rc, ret=%x " + "rnum=%x kpage=%p", ret, rnum, kpage); + ret = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + rpage = virt_to_abs(kpage); + if (!rpage) { + EDEB_ERR(4, "kpage=%p i=%x", kpage, i); + ret = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + } else { /* rnum==1 */ + ret = ehca_set_pagebuf_1(e_mr, pginfo, &rpage); + if (ret) { + EDEB_ERR(4, "ehca_set_pagebuf_1 bad rc, " + "ret=%x i=%x", ret, i); + ret = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + } + + EDEB(9, "i=%x rnum=%x rpage=%lx", i, rnum, rpage); + + h_ret = hipz_h_register_rpage_mr(shca->ipz_hca_handle, e_mr, + 0, /* pagesize 4k */ + 0, rpage, rnum); + + if (i == ((pginfo->num_4k + 512 - 1) / 512) - 1) { + /* + * check for 'registration complete'==H_SUCCESS + * and for 'page registered'==H_PAGE_REGISTERED + */ + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "last hipz_reg_rpage_mr failed, " + "h_ret=%lx e_mr=%p i=%x hca_hndl=%lx " + "mr_hndl=%lx lkey=%x", h_ret, e_mr, i, + shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, + e_mr->ib.ib_mr.lkey); + ret = ehca_mrmw_map_hrc_rrpg_last(h_ret); + break; + } else + ret = 0; + } else if (h_ret != H_PAGE_REGISTERED) { + EDEB_ERR(4, "hipz_reg_rpage_mr failed, h_ret=%lx " + "e_mr=%p i=%x lkey=%x hca_hndl=%lx " + "mr_hndl=%lx", h_ret, e_mr, i, + e_mr->ib.ib_mr.lkey, + shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle); + ret = ehca_mrmw_map_hrc_rrpg_notlast(h_ret); + break; + } else + ret = 0; + } /* end for(i) */ + + +ehca_reg_mr_rpages_exit1: + kfree(kpage); +ehca_reg_mr_rpages_exit0: + if (ret) + EDEB_EX(4, "ret=%x shca=%p e_mr=%p pginfo=%p num_pages=%lx " + "num_4k=%lx", ret, shca, e_mr, pginfo, + pginfo->num_pages, pginfo->num_4k); + else + EDEB_EX(7, "ret=%x", ret); + return ret; +} /* end ehca_reg_mr_rpages() */ + +/*----------------------------------------------------------------------*/ + +inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + u32 acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, /*OUT*/ + u32 *rkey) /*OUT*/ +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + u32 hipz_acl = 0; + u64 *kpage = NULL; + u64 rpage = 0; + struct ehca_mr_pginfo pginfo_save; + struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x " + "e_pd=%p pginfo=%p num_pages=%lx num_4k=%lx", shca, e_mr, + iova_start, size, acl, e_pd, pginfo, pginfo->num_pages, + pginfo->num_4k); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!kpage) { + EDEB_ERR(4, "kpage alloc failed"); + ret = -ENOMEM; + goto ehca_rereg_mr_rereg1_exit0; + } + + pginfo_save = *pginfo; + ret = ehca_set_pagebuf(e_mr, pginfo, pginfo->num_4k, kpage); + if (ret) { + EDEB_ERR(4, "set pagebuf failed, e_mr=%p pginfo=%p type=%x " + "num_pages=%lx num_4k=%lx kpage=%p", e_mr, pginfo, + pginfo->type, pginfo->num_pages, pginfo->num_4k,kpage); + goto ehca_rereg_mr_rereg1_exit1; + } + rpage = virt_to_abs(kpage); + if (!rpage) { + EDEB_ERR(4, "kpage=%p", kpage); + ret = -EFAULT; + goto 
ehca_rereg_mr_rereg1_exit1; + } + h_ret = hipz_h_reregister_pmr(shca->ipz_hca_handle, e_mr, + (u64)iova_start, size, hipz_acl, + e_pd->fw_pd, rpage, &hipzout); + if (h_ret != H_SUCCESS) { + /* + * reregistration unsuccessful, try it again with the 3 hCalls, + * e.g. this is required in case H_MR_CONDITION + * (MW bound or MR is shared) + */ + EDEB(6, "hipz_h_reregister_pmr failed (Rereg1), h_ret=%lx " + "e_mr=%p", h_ret, e_mr); + *pginfo = pginfo_save; + ret = -EAGAIN; + } else if ((u64*)hipzout.vaddr != iova_start) { + EDEB_ERR(4, "PHYP changed iova_start in rereg_pmr, " + "iova_start=%p iova_start_out=%lx e_mr=%p " + "mr_handle=%lx lkey=%x lkey_out=%x", iova_start, + hipzout.vaddr, e_mr, e_mr->ipz_mr_handle.handle, + e_mr->ib.ib_mr.lkey, hipzout.lkey); + ret = -EFAULT; + } else { + /* + * successful reregistration + * note: start and start_out are identical for eServer HCAs + */ + e_mr->num_pages = pginfo->num_pages; + e_mr->num_4k = pginfo->num_4k; + e_mr->start = iova_start; + e_mr->size = size; + e_mr->acl = acl; + *lkey = hipzout.lkey; + *rkey = hipzout.rkey; + } + +ehca_rereg_mr_rereg1_exit1: + kfree(kpage); +ehca_rereg_mr_rereg1_exit0: + if ( ret && (ret != -EAGAIN) ) + EDEB_EX(4, "ret=%x h_ret=%lx lkey=%x rkey=%x pginfo=%p " + "num_pages=%lx num_4k=%lx", ret, h_ret, *lkey, *rkey, + pginfo, pginfo->num_pages, pginfo->num_4k); + else + EDEB_EX(7, "ret=%x h_ret=%lx lkey=%x rkey=%x pginfo=%p " + "num_pages=%lx num_4k=%lx", ret, h_ret, *lkey, *rkey, + pginfo, pginfo->num_pages, pginfo->num_4k); + return ret; +} /* end ehca_rereg_mr_rereg1() */ + +/*----------------------------------------------------------------------*/ + +int ehca_rereg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + int rereg_1_hcall = 1; /* 1: use hipz_h_reregister_pmr directly */ + int rereg_3_hcall = 0; /* 1: use 3 hipz calls for reregistration */ + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x " + "e_pd=%p pginfo=%p num_pages=%lx num_4k=%lx", shca, e_mr, + iova_start, size, acl, e_pd, pginfo, pginfo->num_pages, + pginfo->num_4k); + + /* first determine reregistration hCall(s) */ + if ((pginfo->num_4k > 512) || (e_mr->num_4k > 512) || + (pginfo->num_4k > e_mr->num_4k)) { + EDEB(7, "Rereg3 case, pginfo->num_4k=%lx " + "e_mr->num_4k=%x", pginfo->num_4k, e_mr->num_4k); + rereg_1_hcall = 0; + rereg_3_hcall = 1; + } + + if (e_mr->flags & EHCA_MR_FLAG_MAXMR) { /* check for max-MR */ + rereg_1_hcall = 0; + rereg_3_hcall = 1; + e_mr->flags &= ~EHCA_MR_FLAG_MAXMR; + EDEB(4, "Rereg MR for max-MR! 
e_mr=%p", e_mr); + } + + if (rereg_1_hcall) { + ret = ehca_rereg_mr_rereg1(shca, e_mr, iova_start, size, + acl, e_pd, pginfo, lkey, rkey); + if (ret) { + if (ret == -EAGAIN) + rereg_3_hcall = 1; + else + goto ehca_rereg_mr_exit0; + } + } + + if (rereg_3_hcall) { + struct ehca_mr save_mr; + + /* first deregister old MR */ + h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mr failed, h_ret=%lx e_mr=%p " + "hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", + h_ret, e_mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, + e_mr->ib.ib_mr.lkey); + ret = ehca_mrmw_map_hrc_free_mr(h_ret); + goto ehca_rereg_mr_exit0; + } + /* clean ehca_mr_t, without changing struct ib_mr and lock */ + save_mr = *e_mr; + ehca_mr_deletenew(e_mr); + + /* set some MR values */ + e_mr->flags = save_mr.flags; + e_mr->fmr_page_size = save_mr.fmr_page_size; + e_mr->fmr_max_pages = save_mr.fmr_max_pages; + e_mr->fmr_max_maps = save_mr.fmr_max_maps; + e_mr->fmr_map_cnt = save_mr.fmr_map_cnt; + + ret = ehca_reg_mr(shca, e_mr, iova_start, size, acl, + e_pd, pginfo, lkey, rkey); + if (ret) { + u32 offset = (u64)(&e_mr->flags) - (u64)e_mr; + memcpy(&e_mr->flags, &(save_mr.flags), + sizeof(struct ehca_mr) - offset); + goto ehca_rereg_mr_exit0; + } + } + +ehca_rereg_mr_exit0: + if (ret) + EDEB_EX(4, "ret=%x shca=%p e_mr=%p iova_start=%p size=%lx " + "acl=%x e_pd=%p pginfo=%p num_pages=%lx lkey=%x rkey=%x" + " rereg_1_hcall=%x rereg_3_hcall=%x", ret, shca, e_mr, + iova_start, size, acl, e_pd, pginfo, pginfo->num_pages, + *lkey, *rkey, rereg_1_hcall, rereg_3_hcall); + else + EDEB_EX(7, "ret=%x shca=%p e_mr=%p iova_start=%p size=%lx " + "acl=%x e_pd=%p pginfo=%p num_pages=%lx lkey=%x rkey=%x" + " rereg_1_hcall=%x rereg_3_hcall=%x", ret, shca, e_mr, + iova_start, size, acl, e_pd, pginfo, pginfo->num_pages, + *lkey, *rkey, rereg_1_hcall, rereg_3_hcall); + + return ret; +} /* end ehca_rereg_mr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_unmap_one_fmr(struct ehca_shca *shca, + struct ehca_mr *e_fmr) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + int rereg_1_hcall = 1; /* 1: use hipz_mr_reregister directly */ + int rereg_3_hcall = 0; /* 1: use 3 hipz calls for unmapping */ + struct ehca_pd *e_pd = NULL; + struct ehca_mr save_fmr; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + + EDEB_EN(7, "shca=%p e_fmr=%p", shca, e_fmr); + + /* first check if reregistration hCall can be used for unmap */ + if (e_fmr->fmr_max_pages > 512) { + rereg_1_hcall = 0; + rereg_3_hcall = 1; + } + + e_pd = container_of(e_fmr->ib.ib_fmr.pd, struct ehca_pd, ib_pd); + + if (rereg_1_hcall) { + /* + * note: after using rereg hcall with len=0, + * rereg hcall must be used again for registering pages + */ + h_ret = hipz_h_reregister_pmr(shca->ipz_hca_handle, e_fmr, 0, + 0, 0, e_pd->fw_pd, 0, &hipzout); + if (h_ret != H_SUCCESS) { + /* + * should not happen, because length checked above, + * FMRs are not shared and no MW bound to FMRs + */ + EDEB_ERR(4, "hipz_reregister_pmr failed (Rereg1), " + "h_ret=%lx e_fmr=%p hca_hndl=%lx mr_hndl=%lx " + "lkey=%x lkey_out=%x", h_ret, e_fmr, + shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, + e_fmr->ib.ib_fmr.lkey, hipzout.lkey); + rereg_3_hcall = 1; + } else { + /* successful reregistration */ + e_fmr->start = NULL; + e_fmr->size = 0; + tmp_lkey = hipzout.lkey; + tmp_rkey = 
hipzout.rkey; + } + } + + if (rereg_3_hcall) { + struct ehca_mr save_mr; + + /* first free old FMR */ + h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mr failed, h_ret=%lx e_fmr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", h_ret, + e_fmr, shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, + e_fmr->ib.ib_fmr.lkey); + ret = ehca_mrmw_map_hrc_free_mr(h_ret); + goto ehca_unmap_one_fmr_exit0; + } + /* clean ehca_mr_t, without changing lock */ + save_fmr = *e_fmr; + ehca_mr_deletenew(e_fmr); + + /* set some MR values */ + e_fmr->flags = save_fmr.flags; + e_fmr->fmr_page_size = save_fmr.fmr_page_size; + e_fmr->fmr_max_pages = save_fmr.fmr_max_pages; + e_fmr->fmr_max_maps = save_fmr.fmr_max_maps; + e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt; + e_fmr->acl = save_fmr.acl; + + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_pages = 0; + pginfo.num_4k = 0; + ret = ehca_reg_mr(shca, e_fmr, NULL, + (e_fmr->fmr_max_pages * e_fmr->fmr_page_size), + e_fmr->acl, e_pd, &pginfo, &tmp_lkey, + &tmp_rkey); + if (ret) { + u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr; + memcpy(&e_fmr->flags, &(save_mr.flags), + sizeof(struct ehca_mr) - offset); + goto ehca_unmap_one_fmr_exit0; + } + } + +ehca_unmap_one_fmr_exit0: + EDEB_EX(7, "ret=%x tmp_lkey=%x tmp_rkey=%x fmr_max_pages=%x " + "rereg_1_hcall=%x rereg_3_hcall=%x", ret, tmp_lkey, tmp_rkey, + e_fmr->fmr_max_pages, rereg_1_hcall, rereg_3_hcall); + return ret; +} /* end ehca_unmap_one_fmr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_reg_smr(struct ehca_shca *shca, + struct ehca_mr *e_origmr, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, /*OUT*/ + u32 *rkey) /*OUT*/ +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + u32 hipz_acl = 0; + struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + + EDEB_EN(7,"shca=%p e_origmr=%p e_newmr=%p iova_start=%p acl=%x e_pd=%p", + shca, e_origmr, e_newmr, iova_start, acl, e_pd); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr, + (u64)iova_start, hipz_acl, e_pd->fw_pd, + &hipzout); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_reg_smr failed, h_ret=%lx shca=%p e_origmr=%p" + " e_newmr=%p iova_start=%p acl=%x e_pd=%p hca_hndl=%lx" + " mr_hndl=%lx lkey=%x", h_ret, shca, e_origmr, e_newmr, + iova_start, acl, e_pd, shca->ipz_hca_handle.handle, + e_origmr->ipz_mr_handle.handle, + e_origmr->ib.ib_mr.lkey); + ret = ehca_mrmw_map_hrc_reg_smr(h_ret); + goto ehca_reg_smr_exit0; + } + /* successful registration */ + e_newmr->num_pages = e_origmr->num_pages; + e_newmr->num_4k = e_origmr->num_4k; + e_newmr->start = iova_start; + e_newmr->size = e_origmr->size; + e_newmr->acl = acl; + e_newmr->ipz_mr_handle = hipzout.handle; + *lkey = hipzout.lkey; + *rkey = hipzout.rkey; + goto ehca_reg_smr_exit0; + +ehca_reg_smr_exit0: + if (ret) + EDEB_EX(4, "ret=%x shca=%p e_origmr=%p e_newmr=%p " + "iova_start=%p acl=%x e_pd=%p", + ret, shca, e_origmr, e_newmr, iova_start, acl, e_pd); + else + EDEB_EX(7, "ret=%x lkey=%x rkey=%x", ret, *lkey, *rkey); + return ret; +} /* end ehca_reg_smr() */ + +/*----------------------------------------------------------------------*/ + +/* register internal max-MR to internal SHCA */ +int ehca_reg_internal_maxmr( + struct ehca_shca *shca, + struct ehca_pd *e_pd, + struct ehca_mr **e_maxmr) /*OUT*/ +{ + int ret = 0; + struct ehca_mr *e_mr = NULL; + u64 
*iova_start = NULL; + u64 size_maxmr = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ib_phys_buf ib_pbuf; + u32 num_pages_mr = 0; + u32 num_pages_4k = 0; /* 4k portion "pages" */ + + EDEB_EN(7, "shca=%p e_pd=%p e_maxmr=%p", shca, e_pd, e_maxmr); + + if (ehca_adr_bad(shca) || ehca_adr_bad(e_pd) || ehca_adr_bad(e_maxmr)) { + EDEB_ERR(4, "bad input values: shca=%p e_pd=%p e_maxmr=%p", + shca, e_pd, e_maxmr); + ret = -EINVAL; + goto ehca_reg_internal_maxmr_exit0; + } + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + ret = -ENOMEM; + goto ehca_reg_internal_maxmr_exit0; + } + e_mr->flags |= EHCA_MR_FLAG_MAXMR; + + /* register internal max-MR on HCA */ + size_maxmr = (u64)high_memory - PAGE_OFFSET; + EDEB(7, "high_memory=%p PAGE_OFFSET=%lx", high_memory, PAGE_OFFSET); + iova_start = (u64*)KERNELBASE; + ib_pbuf.addr = 0; + ib_pbuf.size = size_maxmr; + num_pages_mr = ((((u64)iova_start % PAGE_SIZE) + size_maxmr + + PAGE_SIZE - 1) / PAGE_SIZE); + num_pages_4k = ((((u64)iova_start % EHCA_PAGESIZE) + size_maxmr + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_4k = num_pages_4k; + pginfo.num_phys_buf = 1; + pginfo.phys_buf_array = &ib_pbuf; + + ret = ehca_reg_mr(shca, e_mr, iova_start, size_maxmr, 0, e_pd, + &pginfo, &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (ret) { + EDEB_ERR(4, "reg of internal max MR failed, e_mr=%p " + "iova_start=%p size_maxmr=%lx num_pages_mr=%x " + "num_pages_4k=%x", e_mr, iova_start, size_maxmr, + num_pages_mr, num_pages_4k); + goto ehca_reg_internal_maxmr_exit1; + } + + /* successful registration of all pages */ + e_mr->ib.ib_mr.device = e_pd->ib_pd.device; + e_mr->ib.ib_mr.pd = &e_pd->ib_pd; + e_mr->ib.ib_mr.uobject = NULL; + atomic_inc(&(e_pd->ib_pd.usecnt)); + atomic_set(&(e_mr->ib.ib_mr.usecnt), 0); + *e_maxmr = e_mr; + goto ehca_reg_internal_maxmr_exit0; + +ehca_reg_internal_maxmr_exit1: + ehca_mr_delete(e_mr); +ehca_reg_internal_maxmr_exit0: + if (ret) + EDEB_EX(4, "ret=%x shca=%p e_pd=%p e_maxmr=%p", + ret, shca, e_pd, e_maxmr); + else + EDEB_EX(7, "*e_maxmr=%p lkey=%x rkey=%x", + *e_maxmr, (*e_maxmr)->ib.ib_mr.lkey, + (*e_maxmr)->ib.ib_mr.rkey); + return ret; +} /* end ehca_reg_internal_maxmr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_reg_maxmr(struct ehca_shca *shca, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, + u32 *rkey) +{ + int ret = 0; + u64 h_ret = H_SUCCESS; + struct ehca_mr *e_origmr = shca->maxmr; + u32 hipz_acl = 0; + struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + + EDEB_EN(7,"shca=%p e_origmr=%p e_newmr=%p iova_start=%p acl=%x e_pd=%p", + shca, e_origmr, e_newmr, iova_start, acl, e_pd); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr, + (u64)iova_start, hipz_acl, e_pd->fw_pd, + &hipzout); + if (h_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_reg_smr failed, h_ret=%lx e_origmr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", + h_ret, e_origmr, shca->ipz_hca_handle.handle, + e_origmr->ipz_mr_handle.handle, + e_origmr->ib.ib_mr.lkey); + ret = ehca_mrmw_map_hrc_reg_smr(h_ret); + goto ehca_reg_maxmr_exit0; + } + /* successful registration */ + e_newmr->num_pages = e_origmr->num_pages; + e_newmr->num_4k = e_origmr->num_4k; + e_newmr->start = iova_start; + e_newmr->size = e_origmr->size; + e_newmr->acl = acl; + 
e_newmr->ipz_mr_handle = hipzout.handle; + *lkey = hipzout.lkey; + *rkey = hipzout.rkey; + +ehca_reg_maxmr_exit0: + EDEB_EX(7, "ret=%x lkey=%x rkey=%x", ret, *lkey, *rkey); + return ret; +} /* end ehca_reg_maxmr() */ + +/*----------------------------------------------------------------------*/ + +int ehca_dereg_internal_maxmr(struct ehca_shca *shca) +{ + int ret = 0; + struct ehca_mr *e_maxmr = NULL; + struct ib_pd *ib_pd = NULL; + + EDEB_EN(7, "shca=%p shca->maxmr=%p", shca, shca->maxmr); + + if (!shca->maxmr) { + EDEB_ERR(4, "bad call, shca=%p", shca); + ret = -EINVAL; + goto ehca_dereg_internal_maxmr_exit0; + } + + e_maxmr = shca->maxmr; + ib_pd = e_maxmr->ib.ib_mr.pd; + shca->maxmr = NULL; /* remove internal max-MR indication from SHCA */ + + ret = ehca_dereg_mr(&e_maxmr->ib.ib_mr); + if (ret) { + EDEB_ERR(3, "dereg internal max-MR failed, " + "ret=%x e_maxmr=%p shca=%p lkey=%x", + ret, e_maxmr, shca, e_maxmr->ib.ib_mr.lkey); + shca->maxmr = e_maxmr; + goto ehca_dereg_internal_maxmr_exit0; + } + + atomic_dec(&ib_pd->usecnt); + +ehca_dereg_internal_maxmr_exit0: + if (ret) + EDEB_EX(4, "ret=%x shca=%p shca->maxmr=%p", + ret, shca, shca->maxmr); + else + EDEB_EX(7, ""); + return ret; +} /* end ehca_dereg_internal_maxmr() */ + +/*----------------------------------------------------------------------*/ + +/* + * check physical buffer array of MR verbs for validness and + * calculates MR size + */ +int ehca_mr_chk_buf_and_calc_size(struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + u64 *iova_start, + u64 *size) +{ + struct ib_phys_buf *pbuf = phys_buf_array; + u64 size_count = 0; + u32 i; + + if (num_phys_buf == 0) { + EDEB_ERR(4, "bad phys buf array len, num_phys_buf=0"); + return -EINVAL; + } + /* check first buffer */ + if (((u64)iova_start & ~PAGE_MASK) != (pbuf->addr & ~PAGE_MASK)) { + EDEB_ERR(4, "iova_start/addr mismatch, iova_start=%p " + "pbuf->addr=%lx pbuf->size=%lx", + iova_start, pbuf->addr, pbuf->size); + return -EINVAL; + } + if (((pbuf->addr + pbuf->size) % PAGE_SIZE) && + (num_phys_buf > 1)) { + EDEB_ERR(4, "addr/size mismatch in 1st buf, pbuf->addr=%lx " + "pbuf->size=%lx", pbuf->addr, pbuf->size); + return -EINVAL; + } + + for (i = 0; i < num_phys_buf; i++) { + if ((i > 0) && (pbuf->addr % PAGE_SIZE)) { + EDEB_ERR(4, "bad address, i=%x pbuf->addr=%lx " + "pbuf->size=%lx", i, pbuf->addr, pbuf->size); + return -EINVAL; + } + if (((i > 0) && /* not 1st */ + (i < (num_phys_buf - 1)) && /* not last */ + (pbuf->size % PAGE_SIZE)) || (pbuf->size == 0)) { + EDEB_ERR(4, "bad size, i=%x pbuf->size=%lx", + i, pbuf->size); + return -EINVAL; + } + size_count += pbuf->size; + pbuf++; + } + + *size = size_count; + return 0; +} /* end ehca_mr_chk_buf_and_calc_size() */ + +/*----------------------------------------------------------------------*/ + +/* check page list of map FMR verb for validness */ +int ehca_fmr_check_page_list(struct ehca_mr *e_fmr, + u64 *page_list, + int list_len) +{ + u32 i; + u64 *page = NULL; + + if (ehca_adr_bad(page_list)) { + EDEB_ERR(4, "bad page_list, page_list=%p fmr=%p", + page_list, e_fmr); + return -EINVAL; + } + + if ((list_len == 0) || (list_len > e_fmr->fmr_max_pages)) { + EDEB_ERR(4, "bad list_len, list_len=%x e_fmr->fmr_max_pages=%x " + "fmr=%p", list_len, e_fmr->fmr_max_pages, e_fmr); + return -EINVAL; + } + + /* each page must be aligned */ + page = page_list; + for (i = 0; i < list_len; i++) { + if (*page % e_fmr->fmr_page_size) { + EDEB_ERR(4, "bad page, i=%x *page=%lx page=%p " + "fmr=%p fmr_page_size=%x", + i, *page, page, e_fmr, 
e_fmr->fmr_page_size); + return -EINVAL; + } + page++; + } + + return 0; +} /* end ehca_fmr_check_page_list() */ + +/*----------------------------------------------------------------------*/ + +/* setup page buffer from page info */ +int ehca_set_pagebuf(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) +{ + int ret = 0; + struct ib_umem_chunk *prev_chunk = NULL; + struct ib_umem_chunk *chunk = NULL; + struct ib_phys_buf *pbuf = NULL; + u64 *fmrlist = NULL; + u64 num4k = 0; + u64 pgaddr = 0; + u64 offs4k = 0; + u32 i = 0; + u32 j = 0; + + EDEB_EN(7, "pginfo=%p type=%x num_pages=%lx num_4k=%lx next_buf=%lx " + "next_4k=%lx number=%x kpage=%p page_cnt=%lx page_4k_cnt=%lx " + "next_listelem=%lx region=%p next_chunk=%p next_nmap=%lx", + pginfo, pginfo->type, pginfo->num_pages, pginfo->num_4k, + pginfo->next_buf, pginfo->next_4k, number, kpage, + pginfo->page_cnt, pginfo->page_4k_cnt, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + + if (pginfo->type == EHCA_MR_PGI_PHYS) { + /* loop over desired phys_buf_array entries */ + while (i < number) { + pbuf = pginfo->phys_buf_array + pginfo->next_buf; + num4k = ((pbuf->addr % EHCA_PAGESIZE) + pbuf->size + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE; + offs4k = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + while (pginfo->next_4k < offs4k + num4k) { + /* sanity check */ + if ((pginfo->page_cnt >= pginfo->num_pages) || + (pginfo->page_4k_cnt >= pginfo->num_4k)) { + EDEB_ERR(4, "page_cnt >= num_pages, " + "page_cnt=%lx num_pages=%lx " + "page_4k_cnt=%lx num_4k=%lx " + "i=%x", pginfo->page_cnt, + pginfo->num_pages, + pginfo->page_4k_cnt, + pginfo->num_4k, i); + ret = -EFAULT; + } + *kpage = phys_to_abs( + (pbuf->addr & EHCA_PAGEMASK) + + (pginfo->next_4k * EHCA_PAGESIZE)); + if ( !(*kpage) && pbuf->addr ) { + EDEB_ERR(4, "pbuf->addr=%lx " + "pbuf->size=%lx next_4k=%lx", + pbuf->addr, pbuf->size, + pginfo->next_4k); + ret = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + if (pginfo->next_4k % + (PAGE_SIZE / EHCA_PAGESIZE) == 0) + (pginfo->page_cnt)++; + kpage++; + i++; + if (i >= number) break; + } + if (pginfo->next_4k >= offs4k + num4k) { + (pginfo->next_buf)++; + pginfo->next_4k = 0; + } + } + } else if (pginfo->type == EHCA_MR_PGI_USER) { + /* loop over desired chunk entries */ + chunk = pginfo->next_chunk; + prev_chunk = pginfo->next_chunk; + list_for_each_entry_continue(chunk, + (&(pginfo->region->chunk_list)), + list) { + EDEB(9, "chunk->page_list[0]=%lx", + (u64)sg_dma_address(&chunk->page_list[0])); + for (i = pginfo->next_nmap; i < chunk->nmap; ) { + pgaddr = ( page_to_pfn(chunk->page_list[i].page) + << PAGE_SHIFT ); + *kpage = phys_to_abs(pgaddr + + (pginfo->next_4k * + EHCA_PAGESIZE)); + EDEB(9,"pgaddr=%lx *kpage=%lx next_4k=%lx", + pgaddr, *kpage, pginfo->next_4k); + if ( !(*kpage) ) { + EDEB_ERR(4, "pgaddr=%lx " + "chunk->page_list[i]=%lx i=%x " + "next_4k=%lx mr=%p", pgaddr, + (u64)sg_dma_address( + &chunk->page_list[i]), + i, pginfo->next_4k, e_mr); + ret = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + kpage++; + if (pginfo->next_4k % + (PAGE_SIZE / EHCA_PAGESIZE) == 0) { + (pginfo->page_cnt)++; + (pginfo->next_nmap)++; + pginfo->next_4k = 0; + i++; + } + j++; + if (j >= number) break; + } + if ((pginfo->next_nmap >= chunk->nmap) && + (j >= number)) { + pginfo->next_nmap = 0; + prev_chunk = chunk; + break; + } else if (pginfo->next_nmap >= chunk->nmap) { + pginfo->next_nmap = 0; + 
prev_chunk = chunk; + } else if (j >= number) + break; + else + prev_chunk = chunk; + } + pginfo->next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->region->chunk_list)), + list); + } else if (pginfo->type == EHCA_MR_PGI_FMR) { + /* loop over desired page_list entries */ + fmrlist = pginfo->page_list + pginfo->next_listelem; + for (i = 0; i < number; i++) { + *kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + + pginfo->next_4k * EHCA_PAGESIZE); + if ( !(*kpage) ) { + EDEB_ERR(4, "*fmrlist=%lx fmrlist=%p " + "next_listelem=%lx next_4k=%lx", + *fmrlist, fmrlist, + pginfo->next_listelem,pginfo->next_4k); + ret = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + kpage++; + if (pginfo->next_4k % + (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) { + (pginfo->page_cnt)++; + (pginfo->next_listelem)++; + fmrlist++; + pginfo->next_4k = 0; + } + } + } else { + EDEB_ERR(4, "bad pginfo->type=%x", pginfo->type); + ret = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + +ehca_set_pagebuf_exit0: + if (ret) + EDEB_EX(4, "ret=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "num_4k=%lx next_buf=%lx next_4k=%lx number=%x " + "kpage=%p page_cnt=%lx page_4k_cnt=%lx i=%x " + "next_listelem=%lx region=%p next_chunk=%p " + "next_nmap=%lx", ret, e_mr, pginfo, pginfo->type, + pginfo->num_pages, pginfo->num_4k, pginfo->next_buf, + pginfo->next_4k, number, kpage, pginfo->page_cnt, + pginfo->page_4k_cnt, i, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + else + EDEB_EX(7, "ret=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "num_4k=%lx next_buf=%lx next_4k=%lx number=%x " + "kpage=%p page_cnt=%lx page_4k_cnt=%lx i=%x " + "next_listelem=%lx region=%p next_chunk=%p " + "next_nmap=%lx", ret, e_mr, pginfo, pginfo->type, + pginfo->num_pages, pginfo->num_4k, pginfo->next_buf, + pginfo->next_4k, number, kpage, pginfo->page_cnt, + pginfo->page_4k_cnt, i, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + return ret; +} /* end ehca_set_pagebuf() */ + +/*----------------------------------------------------------------------*/ + +/* setup 1 page from page info page buffer */ +int ehca_set_pagebuf_1(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u64 *rpage) +{ + int ret = 0; + struct ib_phys_buf *tmp_pbuf = NULL; + u64 *fmrlist = NULL; + struct ib_umem_chunk *chunk = NULL; + struct ib_umem_chunk *prev_chunk = NULL; + u64 pgaddr = 0; + u64 num4k = 0; + u64 offs4k = 0; + + EDEB_EN(7, "pginfo=%p type=%x num_pages=%lx num_4k=%lx next_buf=%lx " + "next_4k=%lx rpage=%p page_cnt=%lx page_4k_cnt=%lx " + "next_listelem=%lx region=%p next_chunk=%p next_nmap=%lx", + pginfo, pginfo->type, pginfo->num_pages, pginfo->num_4k, + pginfo->next_buf, pginfo->next_4k, rpage, pginfo->page_cnt, + pginfo->page_4k_cnt, pginfo->next_listelem, pginfo->region, + pginfo->next_chunk, pginfo->next_nmap); + + if (pginfo->type == EHCA_MR_PGI_PHYS) { + /* sanity check */ + if ((pginfo->page_cnt >= pginfo->num_pages) || + (pginfo->page_4k_cnt >= pginfo->num_4k)) { + EDEB_ERR(4, "page_cnt >= num_pages, page_cnt=%lx " + "num_pages=%lx page_4k_cnt=%lx num_4k=%lx", + pginfo->page_cnt, pginfo->num_pages, + pginfo->page_4k_cnt, pginfo->num_4k); + ret = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + tmp_pbuf = pginfo->phys_buf_array + pginfo->next_buf; + num4k = ((tmp_pbuf->addr % EHCA_PAGESIZE) + tmp_pbuf->size + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE; + offs4k = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + *rpage = phys_to_abs((tmp_pbuf->addr & 
EHCA_PAGEMASK) + + (pginfo->next_4k * EHCA_PAGESIZE)); + if ( !(*rpage) && tmp_pbuf->addr ) { + EDEB_ERR(4, "tmp_pbuf->addr=%lx" + " tmp_pbuf->size=%lx next_4k=%lx", + tmp_pbuf->addr, tmp_pbuf->size, + pginfo->next_4k); + ret = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + if (pginfo->next_4k % (PAGE_SIZE / EHCA_PAGESIZE) == 0) + (pginfo->page_cnt)++; + if (pginfo->next_4k >= offs4k + num4k) { + (pginfo->next_buf)++; + pginfo->next_4k = 0; + } + } else if (pginfo->type == EHCA_MR_PGI_USER) { + chunk = pginfo->next_chunk; + prev_chunk = pginfo->next_chunk; + list_for_each_entry_continue(chunk, + (&(pginfo->region->chunk_list)), + list) { + pgaddr = ( page_to_pfn(chunk->page_list[ + pginfo->next_nmap].page) + << PAGE_SHIFT); + *rpage = phys_to_abs(pgaddr + + (pginfo->next_4k * EHCA_PAGESIZE)); + EDEB(9,"pgaddr=%lx *rpage=%lx next_4k=%lx", pgaddr, + *rpage, pginfo->next_4k); + if ( !(*rpage) ) { + EDEB_ERR(4, "pgaddr=%lx chunk->page_list[]=%lx " + "next_nmap=%lx next_4k=%lx mr=%p", + pgaddr, (u64)sg_dma_address( + &chunk->page_list[ + pginfo->next_nmap]), + pginfo->next_nmap, pginfo->next_4k, + e_mr); + ret = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + if (pginfo->next_4k % + (PAGE_SIZE / EHCA_PAGESIZE) == 0) { + (pginfo->page_cnt)++; + (pginfo->next_nmap)++; + pginfo->next_4k = 0; + } + if (pginfo->next_nmap >= chunk->nmap) { + pginfo->next_nmap = 0; + prev_chunk = chunk; + } + break; + } + pginfo->next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->region->chunk_list)), + list); + } else if (pginfo->type == EHCA_MR_PGI_FMR) { + fmrlist = pginfo->page_list + pginfo->next_listelem; + *rpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + + pginfo->next_4k * EHCA_PAGESIZE); + if ( !(*rpage) ) { + EDEB_ERR(4, "*fmrlist=%lx fmrlist=%p next_listelem=%lx " + "next_4k=%lx", *fmrlist, fmrlist, + pginfo->next_listelem, pginfo->next_4k); + ret = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + if (pginfo->next_4k % + (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) { + (pginfo->page_cnt)++; + (pginfo->next_listelem)++; + pginfo->next_4k = 0; + } + } else { + EDEB_ERR(4, "bad pginfo->type=%x", pginfo->type); + ret = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + +ehca_set_pagebuf_1_exit0: + if (ret) + EDEB_EX(4, "ret=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "num_4k=%lx next_buf=%lx next_4k=%lx rpage=%p " + "page_cnt=%lx page_4k_cnt=%lx next_listelem=%lx " + "region=%p next_chunk=%p next_nmap=%lx", ret, e_mr, + pginfo, pginfo->type, pginfo->num_pages, pginfo->num_4k, + pginfo->next_buf, pginfo->next_4k, rpage, + pginfo->page_cnt, pginfo->page_4k_cnt, + pginfo->next_listelem, pginfo->region, + pginfo->next_chunk, pginfo->next_nmap); + else + EDEB_EX(7, "ret=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "num_4k=%lx next_buf=%lx next_4k=%lx rpage=%p " + "page_cnt=%lx page_4k_cnt=%lx next_listelem=%lx " + "region=%p next_chunk=%p next_nmap=%lx", ret, e_mr, + pginfo, pginfo->type, pginfo->num_pages, pginfo->num_4k, + pginfo->next_buf, pginfo->next_4k, rpage, + pginfo->page_cnt, pginfo->page_4k_cnt, + pginfo->next_listelem, pginfo->region, + pginfo->next_chunk, pginfo->next_nmap); + return ret; +} /* end ehca_set_pagebuf_1() */ + +/*----------------------------------------------------------------------*/ + +/* + * check MR if it is a max-MR, i.e. 
uses whole memory + * in case it's a max-MR 1 is returned, else 0 + */ +int ehca_mr_is_maxmr(u64 size, + u64 *iova_start) +{ + /* a MR is treated as max-MR only if it fits following: */ + if ((size == ((u64)high_memory - PAGE_OFFSET)) && + (iova_start == (void*)KERNELBASE)) { + EDEB(6, "this is a max-MR"); + return 1; + } else + return 0; +} /* end ehca_mr_is_maxmr() */ + +/*----------------------------------------------------------------------*/ + +/* map access control for MR/MW. This routine is used for MR and MW. */ +void ehca_mrmw_map_acl(int ib_acl, + u32 *hipz_acl) +{ + *hipz_acl = 0; + if (ib_acl & IB_ACCESS_REMOTE_READ) + *hipz_acl |= HIPZ_ACCESSCTRL_R_READ; + if (ib_acl & IB_ACCESS_REMOTE_WRITE) + *hipz_acl |= HIPZ_ACCESSCTRL_R_WRITE; + if (ib_acl & IB_ACCESS_REMOTE_ATOMIC) + *hipz_acl |= HIPZ_ACCESSCTRL_R_ATOMIC; + if (ib_acl & IB_ACCESS_LOCAL_WRITE) + *hipz_acl |= HIPZ_ACCESSCTRL_L_WRITE; + if (ib_acl & IB_ACCESS_MW_BIND) + *hipz_acl |= HIPZ_ACCESSCTRL_MW_BIND; +} /* end ehca_mrmw_map_acl() */ + +/*----------------------------------------------------------------------*/ + +/* sets page size in hipz access control for MR/MW. */ +void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl) /*INOUT*/ +{ + return; /* HCA supports only 4k */ +} /* end ehca_mrmw_set_pgsize_hipz_acl() */ + +/*----------------------------------------------------------------------*/ + +/* + * reverse map access control for MR/MW. + * This routine is used for MR and MW. + */ +void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, + int *ib_acl) /*OUT*/ +{ + *ib_acl = 0; + if (*hipz_acl & HIPZ_ACCESSCTRL_R_READ) + *ib_acl |= IB_ACCESS_REMOTE_READ; + if (*hipz_acl & HIPZ_ACCESSCTRL_R_WRITE) + *ib_acl |= IB_ACCESS_REMOTE_WRITE; + if (*hipz_acl & HIPZ_ACCESSCTRL_R_ATOMIC) + *ib_acl |= IB_ACCESS_REMOTE_ATOMIC; + if (*hipz_acl & HIPZ_ACCESSCTRL_L_WRITE) + *ib_acl |= IB_ACCESS_LOCAL_WRITE; + if (*hipz_acl & HIPZ_ACCESSCTRL_MW_BIND) + *ib_acl |= IB_ACCESS_MW_BIND; +} /* end ehca_mrmw_reverse_map_acl() */ + + +/*----------------------------------------------------------------------*/ + +/* + * map HIPZ rc to IB retcodes for MR/MW allocations + * Used for hipz_mr_reg_alloc and hipz_mw_alloc. 
+ */ +int ehca_mrmw_map_hrc_alloc(const u64 hipz_rc) +{ + switch (hipz_rc) { + case H_SUCCESS: /* successful completion */ + return 0; + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RT_PARM: /* invalid resource type */ + case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ + case H_MLENGTH_PARM: /* invalid memory length */ + case H_MEM_ACCESS_PARM: /* invalid access controls */ + case H_CONSTRAINED: /* resource constraint */ + return -EINVAL; + case H_BUSY: /* long busy */ + return -EBUSY; + default: + return -EINVAL; + } +} /* end ehca_mrmw_map_hrc_alloc() */ + +/*----------------------------------------------------------------------*/ + +/* + * map HIPZ rc to IB retcodes for MR register rpage + * Used for hipz_h_register_rpage_mr at registering last page + */ +int ehca_mrmw_map_hrc_rrpg_last(const u64 hipz_rc) +{ + switch (hipz_rc) { + case H_SUCCESS: /* registration complete */ + return 0; + case H_PAGE_REGISTERED: /* page registered */ + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ +/* case H_QT_PARM: invalid queue type */ + case H_PARAMETER: /* + * invalid logical address, + * or count zero or greater 512 + */ + case H_TABLE_FULL: /* page table full */ + case H_HARDWARE: /* HCA not operational */ + return -EINVAL; + case H_BUSY: /* long busy */ + return -EBUSY; + default: + return -EINVAL; + } +} /* end ehca_mrmw_map_hrc_rrpg_last() */ + +/*----------------------------------------------------------------------*/ + +/* + * map HIPZ rc to IB retcodes for MR register rpage + * Used for hipz_h_register_rpage_mr at registering one page, but not last page + */ +int ehca_mrmw_map_hrc_rrpg_notlast(const u64 hipz_rc) +{ + switch (hipz_rc) { + case H_PAGE_REGISTERED: /* page registered */ + return 0; + case H_SUCCESS: /* registration complete */ + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ +/* case H_QT_PARM: invalid queue type */ + case H_PARAMETER: /* + * invalid logical address, + * or count zero or greater 512 + */ + case H_TABLE_FULL: /* page table full */ + case H_HARDWARE: /* HCA not operational */ + return -EINVAL; + case H_BUSY: /* long busy */ + return -EBUSY; + default: + return -EINVAL; + } +} /* end ehca_mrmw_map_hrc_rrpg_notlast() */ + +/*----------------------------------------------------------------------*/ + +/* map HIPZ rc to IB retcodes for MR query. Used for hipz_mr_query. 
*/ +int ehca_mrmw_map_hrc_query_mr(const u64 hipz_rc) +{ + switch (hipz_rc) { + case H_SUCCESS: /* successful completion */ + return 0; + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ + return -EINVAL; + case H_BUSY: /* long busy */ + return -EBUSY; + default: + return -EINVAL; + } +} /* end ehca_mrmw_map_hrc_query_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* + * map HIPZ rc to IB retcodes for freeing MR resource + * Used for hipz_h_free_resource_mr + */ +int ehca_mrmw_map_hrc_free_mr(const u64 hipz_rc) +{ + switch (hipz_rc) { + case H_SUCCESS: /* resource freed */ + return 0; + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ + case H_R_STATE: /* invalid resource state */ + case H_HARDWARE: /* HCA not operational */ + return -EINVAL; + case H_RESOURCE: /* Resource in use */ + case H_BUSY: /* long busy */ + return -EBUSY; + default: + return -EINVAL; + } +} /* end ehca_mrmw_map_hrc_free_mr() */ + +/*----------------------------------------------------------------------*/ + +/* + * map HIPZ rc to IB retcodes for freeing MW resource + * Used for hipz_h_free_resource_mw + */ +int ehca_mrmw_map_hrc_free_mw(const u64 hipz_rc) +{ + switch (hipz_rc) { + case H_SUCCESS: /* resource freed */ + return 0; + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ + case H_R_STATE: /* invalid resource state */ + case H_HARDWARE: /* HCA not operational */ + return -EINVAL; + case H_RESOURCE: /* Resource in use */ + case H_BUSY: /* long busy */ + return -EBUSY; + default: + return -EINVAL; + } +} /* end ehca_mrmw_map_hrc_free_mw() */ + +/*----------------------------------------------------------------------*/ + +/* + * map HIPZ rc to IB retcodes for SMR registrations + * Used for hipz_h_register_smr. 
+ */ +int ehca_mrmw_map_hrc_reg_smr(const u64 hipz_rc) +{ + switch (hipz_rc) { + case H_SUCCESS: /* successful completion */ + return 0; + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ + case H_MEM_PARM: /* invalid MR virtual address */ + case H_MEM_ACCESS_PARM: /* invalid access controls */ + case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ + return -EINVAL; + case H_BUSY: /* long busy */ + return -EBUSY; + default: + return -EINVAL; + } +} /* end ehca_mrmw_map_hrc_reg_smr() */ + +/*----------------------------------------------------------------------*/ + +/* + * MR destructor and constructor + * used in Reregister MR verb, sets all fields in ehca_mr_t to 0, + * except struct ib_mr and spinlock + */ +void ehca_mr_deletenew(struct ehca_mr *mr) +{ + mr->flags = 0; + mr->num_pages = 0; + mr->num_4k = 0; + mr->acl = 0; + mr->start = NULL; + mr->fmr_page_size = 0; + mr->fmr_max_pages = 0; + mr->fmr_max_maps = 0; + mr->fmr_map_cnt = 0; + memset(&mr->ipz_mr_handle, 0, sizeof(mr->ipz_mr_handle)); + memset(&mr->galpas, 0, sizeof(mr->galpas)); + mr->nr_of_pages = 0; + mr->pagearray = NULL; +} /* end ehca_mr_deletenew() */ diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h new file mode 100644 index 0000000..01899b4 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h @@ -0,0 +1,143 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * MR/MW declarations and inline functions + * + * Authors: Dietmar Decker + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef _EHCA_MRMW_H_ +#define _EHCA_MRMW_H_ + +#undef DEB_PREFIX +#define DEB_PREFIX "mrmw" + +int ehca_reg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey); + +int ehca_reg_mr_rpages(struct ehca_shca *shca, + struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo); + +int ehca_rereg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int mr_access_flags, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey); + +int ehca_unmap_one_fmr(struct ehca_shca *shca, + struct ehca_mr *e_fmr); + +int ehca_reg_smr(struct ehca_shca *shca, + struct ehca_mr *e_origmr, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, + u32 *rkey); + +int ehca_reg_internal_maxmr(struct ehca_shca *shca, + struct ehca_pd *e_pd, + struct ehca_mr **maxmr); + +int ehca_reg_maxmr(struct ehca_shca *shca, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, + u32 *rkey); + +int ehca_dereg_internal_maxmr(struct ehca_shca *shca); + +int ehca_mr_chk_buf_and_calc_size(struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + u64 *iova_start, + u64 *size); + +int ehca_fmr_check_page_list(struct ehca_mr *e_fmr, + u64 *page_list, + int list_len); + +int ehca_set_pagebuf(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage); + +int ehca_set_pagebuf_1(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u64 *rpage); + +int ehca_mr_is_maxmr(u64 size, + u64 *iova_start); + +void ehca_mrmw_map_acl(int ib_acl, + u32 *hipz_acl); + +void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl); + +void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, + int *ib_acl); + +int ehca_mrmw_map_hrc_alloc(const u64 hipz_rc); + +int ehca_mrmw_map_hrc_rrpg_last(const u64 hipz_rc); + +int ehca_mrmw_map_hrc_rrpg_notlast(const u64 hipz_rc); + +int ehca_mrmw_map_hrc_query_mr(const u64 hipz_rc); + +int ehca_mrmw_map_hrc_free_mr(const u64 hipz_rc); + +int ehca_mrmw_map_hrc_free_mw(const u64 hipz_rc); + +int ehca_mrmw_map_hrc_reg_smr(const u64 hipz_rc); + +void ehca_mr_deletenew(struct ehca_mr *mr); + +#endif /*_EHCA_MRMW_H_*/ -- 1.4.1 From rdreier at cisco.com Thu Aug 17 13:13:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:13:57 -0700 Subject: [openib-general] InfiniBand merge plans for 2.6.19 Message-ID: Here's a short summary of what I plan to merge for 2.6.19. Some of this is already in infiniband.git[1], while some still needs to be merged up. Highlights: o iWARP core support[2]. This updates drivers/infiniband to work with devices that do RDMA over IP/ethernet in addition to InfiniBand devices. As a first user of this support, I also plan to merge the amso1100[3] driver for Ammasso RNIC. I will post this for review one more time after I pull it into my git tree for last minute cleanups. But if you feel this iWARP support should not be merged, please let me know why now. o IBM eHCA driver, which supports IBM pSeries-specific InfiniBand hardware. This is in the ehca branch of infiniband.git, and I will post it for review one more time. My feeling is that more cleanups are certainly possible, but this driver is "good enough to merge" now and has languished out of tree for long enough. I'm certainly happy to merge cleanup patches, though. o mmap()ed userspace work queues for ipath. 
This is a performance enhancement for QLogic/PathScale HCAs but it does touch core stuff in minor ways. Should not be controversial. o I also have the following minor changes queued in the for-2.6.19 branch of infiniband.git: Ishai Rabinovitz: IB/srp: Add port/device attributes James Lentini: IB/mthca: Include the header we really want Michael S. Tsirkin: IB/mthca: Don't use privileged UAR for kernel access IB/ipoib: Fix flush/start xmit race (from code review) Roland Dreier: IB/uverbs: Use idr_read_cq() where appropriate IB/uverbs: Fix lockdep warning when QP is created with 2 CQs [1] git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git [2] http://thread.gmane.org/gmane.linux.network/40903 [3] http://thread.gmane.org/gmane.linux.drivers.openib/28657 From swise at opengridcomputing.com Thu Aug 17 13:30:46 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 17 Aug 2006 15:30:46 -0500 Subject: [openib-general] [PATCH][RDMA CM] IB mcast fix Message-ID: <1155846646.31290.79.camel@stevo-desktop> Set the QKEY to a common global value for all UD QPs and multicast groups created by the RDMA CM. Signed-off-by: swise at opengridcomputing.com ---- Index: src/userspace/librdmacm/include/rdma/rdma_cma_ib.h =================================================================== --- src/userspace/librdmacm/include/rdma/rdma_cma_ib.h (revision 9004) +++ src/userspace/librdmacm/include/rdma/rdma_cma_ib.h (working copy) @@ -59,4 +59,10 @@ struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, uint32_t *remote_qkey); +/* + * Global qkey value for all UD QPs and multicast groups created via the + * RDMA CM. + */ +#define RDMA_UD_QKEY 0x01234567 + #endif /* RDMA_CMA_IB_H */ Index: src/userspace/librdmacm/src/cma.c =================================================================== --- src/userspace/librdmacm/src/cma.c (revision 9004) +++ src/userspace/librdmacm/src/cma.c (working copy) @@ -701,7 +701,7 @@ qp_attr.port_num = id_priv->id.port_num; qp_attr.qp_state = IBV_QPS_INIT; - qp_attr.qkey = ntohs(rdma_get_src_port(&id_priv->id)); + qp_attr.qkey = RDMA_UD_QKEY; ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_QKEY); if (ret) Index: src/linux-kernel/infiniband/include/rdma/rdma_cm_ib.h =================================================================== --- src/linux-kernel/infiniband/include/rdma/rdma_cm_ib.h (revision 9004) +++ src/linux-kernel/infiniband/include/rdma/rdma_cm_ib.h (working copy) @@ -82,4 +82,10 @@ */ int rdma_set_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info); +/* + * Global qkey value for all UD QPs and multicast groups created via the + * RDMA CM. 
+ */ +#define RDMA_UD_QKEY 0x01234567 + #endif /* RDMA_CM_IB_H */ Index: src/linux-kernel/infiniband/core/cma.c =================================================================== --- src/linux-kernel/infiniband/core/cma.c (revision 9004) +++ src/linux-kernel/infiniband/core/cma.c (working copy) @@ -2172,7 +2172,7 @@ ib_addr_get_sgid(dev_addr, &rec.port_gid); rec.pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); rec.join_state = 1; - rec.qkey = sin->sin_addr.s_addr; + rec.qkey = cpu_to_be32(RDMA_UD_QKEY); comp_mask = IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE | From rdreier at cisco.com Thu Aug 17 13:31:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 Aug 2006 13:31:37 -0700 Subject: [openib-general] [PATCH 00/16] IB/ehca: introduction In-Reply-To: <2006817139.43eVtRoa2IK8yOPl@cisco.com> (Roland Dreier's message of "Thu, 17 Aug 2006 13:09:27 -0700") References: <2006817139.43eVtRoa2IK8yOPl@cisco.com> Message-ID: Sorry-- my patchbombing script blew up in the middle, and I didn't restart quite correctly. But I'm pretty sure all 16 patches did make it out, although the numbering is screwy. The correct series is: 01/16, 02/16, 00/13, 01/13, ..., 13/13 I'm not going to spam everybody and resend to all the lists, but I'm happy to resend privately to anyone who asks, or you can clone the git tree to get the series git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ehca Thanks, Roland From mshefty at ichips.intel.com Thu Aug 17 14:12:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 17 Aug 2006 14:12:31 -0700 Subject: [openib-general] [PATCH][RDMA CM] IB mcast fix In-Reply-To: <1155846646.31290.79.camel@stevo-desktop> References: <1155846646.31290.79.camel@stevo-desktop> Message-ID: <44E4DBBF.7070000@ichips.intel.com> Steve Wise wrote: > Index: src/linux-kernel/infiniband/core/cma.c > =================================================================== > --- src/linux-kernel/infiniband/core/cma.c (revision 9004) > +++ src/linux-kernel/infiniband/core/cma.c (working copy) > @@ -2172,7 +2172,7 @@ > ib_addr_get_sgid(dev_addr, &rec.port_gid); > rec.pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); > rec.join_state = 1; > - rec.qkey = sin->sin_addr.s_addr; > + rec.qkey = cpu_to_be32(RDMA_UD_QKEY); The qkey for IB UD QPs must also be updated. - Sean From swise at opengridcomputing.com Thu Aug 17 14:16:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 17 Aug 2006 16:16:26 -0500 Subject: [openib-general] [PATCH][RDMA CM] IB mcast fix In-Reply-To: <44E4DBBF.7070000@ichips.intel.com> References: <1155846646.31290.79.camel@stevo-desktop> <44E4DBBF.7070000@ichips.intel.com> Message-ID: <1155849386.31290.80.camel@stevo-desktop> On Thu, 2006-08-17 at 14:12 -0700, Sean Hefty wrote: > Steve Wise wrote: > > Index: src/linux-kernel/infiniband/core/cma.c > > =================================================================== > > --- src/linux-kernel/infiniband/core/cma.c (revision 9004) > > +++ src/linux-kernel/infiniband/core/cma.c (working copy) > > @@ -2172,7 +2172,7 @@ > > ib_addr_get_sgid(dev_addr, &rec.port_gid); > > rec.pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); > > rec.join_state = 1; > > - rec.qkey = sin->sin_addr.s_addr; > > + rec.qkey = cpu_to_be32(RDMA_UD_QKEY); > > The qkey for IB UD QPs must also be updated. > You mean for kernel mode right? I did it for user mode. 
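To make the qkey requirement in this thread concrete: a UD sender stamps every datagram with the receiver's qkey, and the receiving QP drops any packet whose qkey does not match the value it was transitioned to INIT with. A single well-known constant therefore only works if every component, userspace librdmacm consumers and kernel RDMA CM users alike, programs and advertises the same value. The sketch below is illustrative only and is not part of the patch above; the helper name and its inputs (address handle, remote QPN, registered buffer and lkey) are hypothetical, and the <rdma/rdma_cma_ib.h> include path assumes the librdmacm header as modified by the patch.

/* Post one UD datagram stamped with the shared qkey.  Illustrative
 * sketch only -- ah, remote_qpn, buf, len and lkey are assumed to come
 * from address resolution and a registered MR elsewhere. */
#include <stdint.h>
#include <infiniband/verbs.h>
#include <rdma/rdma_cma_ib.h>	/* RDMA_UD_QKEY */

static int send_one_datagram(struct ibv_qp *qp, struct ibv_ah *ah,
			     uint32_t remote_qpn, void *buf, uint32_t len,
			     uint32_t lkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t) buf,
		.length = len,
		.lkey   = lkey,
	};
	struct ibv_send_wr wr = {
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_SEND,
		.send_flags = IBV_SEND_SIGNALED,
	};
	struct ibv_send_wr *bad_wr;

	wr.wr.ud.ah         = ah;
	wr.wr.ud.remote_qpn = remote_qpn;
	/* Must equal the qkey the remote QP was moved to INIT with; with the
	 * patch above that is always RDMA_UD_QKEY, whether the remote QP was
	 * set up by the userspace librdmacm or by the kernel RDMA CM. */
	wr.wr.ud.remote_qkey = RDMA_UD_QKEY;

	return ibv_post_send(qp, &wr, &bad_wr);
}

Picking one global constant trades per-connection qkey negotiation for interoperability: any RDMA CM consumer can exchange UD datagrams or join a multicast group without agreeing on a qkey out of band, which is also why the same value has to be applied consistently on the kernel side, as the follow-up below points out.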
From mshefty at ichips.intel.com Thu Aug 17 14:24:22 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 17 Aug 2006 14:24:22 -0700 Subject: [openib-general] [PATCH][RDMA CM] IB mcast fix In-Reply-To: <1155849386.31290.80.camel@stevo-desktop> References: <1155846646.31290.79.camel@stevo-desktop> <44E4DBBF.7070000@ichips.intel.com> <1155849386.31290.80.camel@stevo-desktop> Message-ID: <44E4DE86.3030908@ichips.intel.com> Steve Wise wrote: > You mean for kernel mode right? I did it for user mode. Yes - see cma_sidr_rep_handler() and cma_send_sidr_rep() in kernel cma.c. You'll want to run cmatose, mckey, and udaddy to verify that all changes are correct. - Sean From arnd.bergmann at de.ibm.com Thu Aug 17 16:34:37 2006 From: arnd.bergmann at de.ibm.com (Arnd Bergmann) Date: Fri, 18 Aug 2006 01:34:37 +0200 Subject: [openib-general] [PATCH 13/13] IB/ehca: makefiles/kconfig In-Reply-To: <20068171311.WDFBWw0F6z9B3Qes@cisco.com> References: <20068171311.WDFBWw0F6z9B3Qes@cisco.com> Message-ID: <200608180134.39050.arnd.bergmann@de.ibm.com> On Thursday 17 August 2006 22:11, Roland Dreier wrote: > + > +CFLAGS += -DEHCA_USE_HCALL -DEHCA_USE_HCALL_KERNEL This seems really pointless, since you're always defining these macros to the same value. Just drop the CFLAGS and remove the code that depends on them being different. Arnd <>< From arnd.bergmann at de.ibm.com Thu Aug 17 16:44:17 2006 From: arnd.bergmann at de.ibm.com (Arnd Bergmann) Date: Fri, 18 Aug 2006 01:44:17 +0200 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: <20068171311.X1v1Q4Gk1v3wd7qJ@cisco.com> References: <20068171311.X1v1Q4Gk1v3wd7qJ@cisco.com> Message-ID: <200608180144.19149.arnd.bergmann@de.ibm.com> On Thursday 17 August 2006 22:11, Roland Dreier wrote: > + * IS_EDEB_ON - Checks if debug is on for the given level. > + */ > +#define IS_EDEB_ON(level) \ > +((ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__) & \ > +  0x100000000L) == 0) > + > +#define EDEB_P_GENERIC(level,idstring,format,args...) \ > +do { \ > +       u64 ehca_edeb_filterresult =                                    \ > +               ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__);\ > +       if ((ehca_edeb_filterresult & 0x100000000L) == 0)               \ > +               printk("PU%04x %08x:%s " idstring " "format "\n",       \ > +                      get_paca()->paca_index, (u32)(ehca_edeb_filterresult), \ > +                      __func__,  ##args);                              \ > +} while (1 == 0) These macros are responsible for 61% of the object code size of your module. This is completely insane. Please get rid of that crap entirely and replace it with dev_info/dev_dbg/dev_warn calls where appropriate! Arnd <>< From zhushisongzhu at yahoo.com Thu Aug 17 21:32:36 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Thu, 17 Aug 2006 21:32:36 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817113921.GH2630@mellanox.co.il> Message-ID: <20060818043236.41436.qmail@web36913.mail.mud.yahoo.com> lsmod | grep sdp ib_sdp 46768 0 rdma_cm 27144 1 ib_sdp ib_core 60032 10 ib_sdp,rdma_cm, ib_mthca,ib_ipoib,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > (3) one time linux kernel on the client crashed. I > > copy the output from the screen. 
> > Process sdp (pid:4059, threadinfo 0000010036384000 > > task 000001003ea10030) > > Call > > > Trace:{:ib_sdp:sdp_destroy_workto} > > {:ib_sdp:sdp_destroy_qp+77} > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > {worker_thread+419}{default_wake_function+0} > > > {default_wake_function+0}{keventd_create_kthread+0} > > > {worker_thread+0}{keventd_create_kthread+0} > > > {kthread+200}{child_rip+8} > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 31 ff > 45 > > 31 ed 4c 89 > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > CR2:0000000000000004 > > <0>kernel panic-not syncing:Oops > > > > zhu > > Hmm, the stack dump does not match my sources. Is > this OFED rc1? > Could you send me the sdp_main.o and sdp_main.c > files from your system please? > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: sdp_main.c URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: sdp_main.o Type: application/octet-stream Size: 341904 bytes Desc: 716185431-sdp_main.o URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ib_sdp.ko Type: application/octet-stream Size: 807552 bytes Desc: 3841283756-ib_sdp.ko URL: From Administrator at thomson.net Thu Aug 17 21:34:55 2006 From: Administrator at thomson.net (Administrator at thomson.net) Date: Fri, 18 Aug 2006 06:34:55 +0200 Subject: [openib-general] [MailServer Notification]To Recipient file blocking settings matched and action taken. Message-ID: <005301c6c27f$a74ffc70$47d80b8d@eu.thmulti.com> ScanMail for Microsoft Exchange has blocked an attachment. Sender = openib-general-bounces at openib.org Recipient(s) = Michael S. Tsirkin; openib-general at openib.org Subject = Re: [openib-general] why sdp connections cost so much memory Scanning time = 8/18/2006 6:34:54 AM Action on file blocking: The attachment ib_sdp.ko matches the file blocking settings. ScanMail has Quarantined it. The attachment was quarantined to G:\TrendQuarantine\ib_sdp44e5436e24.ko_. Warning to Recipient: Action taken by attachment blocking. ib_sdp.ko/Quarantined Michael S. Tsirkin; openib-general at openib.org From Administrator at thomson.net Thu Aug 17 21:34:54 2006 From: Administrator at thomson.net (Administrator at thomson.net) Date: Fri, 18 Aug 2006 06:34:54 +0200 Subject: [openib-general] [MailServer Notification]To Recipient file blocking settings matched and action taken. Message-ID: <005001c6c27f$a722aae0$47d80b8d@eu.thmulti.com> ScanMail for Microsoft Exchange has blocked an attachment. Sender = openib-general-bounces at openib.org Recipient(s) = Michael S. Tsirkin; openib-general at openib.org Subject = Re: [openib-general] why sdp connections cost so much memory Scanning time = 8/18/2006 6:34:54 AM Action on file blocking: The attachment sdp_main.o matches the file blocking settings. ScanMail has Quarantined it. The attachment was quarantined to G:\TrendQuarantine\sdp_main44e5436e23.o_. Warning to Recipient: Action taken by attachment blocking. sdp_main.o/Quarantined Michael S. 
Tsirkin; openib-general at openib.org From zhushisongzhu at yahoo.com Thu Aug 17 21:59:11 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Thu, 17 Aug 2006 21:59:11 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <44E482F0.6010203@mellanox.co.il> Message-ID: <20060818045911.37011.qmail@web36903.mail.mud.yahoo.com> OS: RHEL 4.3 kernel: 2.6.9-34.ELsmp On Server: (0) eth0: 192.12.10.24 ib0: 193.12.10.24 (1) sh zhuset ulimit -n 64000 (2) squid.sdp -d 10 -f squid2.conf (I have changed squid to squid.sdp, squid.sdp is always listening on sdp socket 3129) On Client: eth0: 192.12.10.14 ib0: 193.12.10.14 SIMPLE_LIBSDP=1 LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so ab -c 100 -n 100 -X 193.12.10.14:3129 http://192.12.10.130/car.jpg ab version: [root at IB-TEST squid.test]# ab -V This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0 Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/ [root at IB-TEST squid.test]# Web Server: IIS 6.0 web Page: car.jpg ( about 53K) zhu --- Tziporet Koren wrote: > > > zhu shi song wrote: > > I have changed SDP_RX_SIZE from 0x40 to 1 and > rebuilt > > ib_sdp.ko. But kernel always crashed. > > zhu > > > > Hi Zhu, > Can you send us instructions of the test/application > you are running so > we can try to reproduce it here too? > > We also need to know the system & kernel you are > using > > Thanks, > Tziporet > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com -------------- next part -------------- A non-text attachment was scrubbed... Name: sdp.rar Type: application/octet-stream Size: 1186543 bytes Desc: 1800015349-sdp.rar URL: From rkuchimanchi at silverstorm.com Thu Aug 17 23:29:15 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Fri, 18 Aug 2006 11:59:15 +0530 Subject: [openib-general] SRP - order of wait_for_completion() and completion() in srp_remove_one() In-Reply-To: References: <44E47C72.4090609@silverstorm.com> Message-ID: <44E55E3B.9030506@silverstorm.com> Roland Dreier wrote: > Ramachandra> I was wondering if it is not possible for the release > Ramachandra> function to run and signal the completion before we > Ramachandra> call wait_for_completion(). > > Sure, that would seem to be possible. Why is that a problem? > > - R. Sorry, I just realized that there will be no problem with that. Thanks. Regards, Ram From zhushisongzhu at yahoo.com Fri Aug 18 00:25:56 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Fri, 18 Aug 2006 00:25:56 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817113921.GH2630@mellanox.co.il> Message-ID: <20060818072556.83097.qmail@web36909.mail.mud.yahoo.com> please see the attachment. zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > (3) one time linux kernel on the client crashed. I > > copy the output from the screen. 
> > Process sdp (pid:4059, threadinfo 0000010036384000 > > task 000001003ea10030) > > Call > > > Trace:{:ib_sdp:sdp_destroy_workto} > > {:ib_sdp:sdp_destroy_qp+77} > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > {worker_thread+419}{default_wake_function+0} > > > {default_wake_function+0}{keventd_create_kthread+0} > > > {worker_thread+0}{keventd_create_kthread+0} > > > {kthread+200}{child_rip+8} > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 31 ff > 45 > > 31 ed 4c 89 > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > CR2:0000000000000004 > > <0>kernel panic-not syncing:Oops > > > > zhu > > Hmm, the stack dump does not match my sources. Is > this OFED rc1? > Could you send me the sdp_main.o and sdp_main.c > files from your system please? > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com -------------- next part -------------- A non-text attachment was scrubbed... Name: sdp_main.rar Type: application/octet-stream Size: 257523 bytes Desc: 1678057268-sdp_main.rar URL: From johnt1johnt2 at gmail.com Fri Aug 18 02:36:53 2006 From: johnt1johnt2 at gmail.com (john t) Date: Fri, 18 Aug 2006 15:06:53 +0530 Subject: [openib-general] Ib question Message-ID: Hi, In the example code there are things like: attr.max_dest_rd_atomic = 1; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.sl = 0; attr.ah_attr.src_path_bits = 0; attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.max_rd_atomic = 1; initattr.cap.max_recv_wr = 1; /* I guess this specifies the length of recv q */ initattr.cap.max_send_wr = tx_depth; /* I guess this specifies the length of send q */ initattr.cap.max_inline_data = MAX_INLINE; /* 400 */ attr.pkey_index = 0; What is the meaning of above fields or where can I find the definition of above fields ? Can I change the value of fields like "timeout" or should it be always set to a fixed value. Like TCP I guess there would be a state transition diagram for IB (QP state machine). Can someone point me to that? In my application, I get an error message "IBV_WC_WR_FLUSH_ERR" and sometimes "IBV_WC_RETRY_EXC_ERR" while polling a CQ after posting some write commands. What could be the reason for that ? Also, can someone point me to a document where I can find the meaning of different error status values (enum ibv_wc_status) returned by "ibv_poll_cq" Regards, John T -------------- next part -------------- An HTML attachment was scrubbed... URL: From johnt1johnt2 at gmail.com Fri Aug 18 02:36:53 2006 From: johnt1johnt2 at gmail.com (john t) Date: Fri, 18 Aug 2006 15:06:53 +0530 Subject: [openib-general] Ib question Message-ID: Hi, In the example code there are things like: attr.max_dest_rd_atomic = 1; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.sl = 0; attr.ah_attr.src_path_bits = 0; attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.max_rd_atomic = 1; initattr.cap.max_recv_wr = 1; /* I guess this specifies the length of recv q */ initattr.cap.max_send_wr = tx_depth; /* I guess this specifies the length of send q */ initattr.cap.max_inline_data = MAX_INLINE; /* 400 */ attr.pkey_index = 0; What is the meaning of above fields or where can I find the definition of above fields ? Can I change the value of fields like "timeout" or should it be always set to a fixed value. 
Like TCP I guess there would be a state transition diagram for IB (QP state machine). Can someone point me to that? In my application, I get an error message "IBV_WC_WR_FLUSH_ERR" and sometimes "IBV_WC_RETRY_EXC_ERR" while polling a CQ after posting some write commands. What could be the reason for that ? Also, can someone point me to a document where I can find the meaning of different error status values (enum ibv_wc_status) returned by "ibv_poll_cq" Regards, John T -------------- next part -------------- An HTML attachment was scrubbed... URL: From yhkim93 at keti.re.kr Fri Aug 18 02:37:36 2006 From: yhkim93 at keti.re.kr (=?ks_c_5601-1987?B?sei/tciv?=) Date: Fri, 18 Aug 2006 18:37:36 +0900 Subject: [openib-general] How to cross-compile to ppc32 from infiniband driver in IBGD-1.8.2 ? Message-ID: <20060818093757.8676E3B0003@sentry-two.sandia.gov> I am doing cross-compile to ppc440SPe (ppc 32bit) from IBGD-1.8.2. Who ever been cross-compile it? I don’t know exactly how to modify Makefile and source code Help me please. -------------- next part -------------- An HTML attachment was scrubbed... URL: From k_mahesh85 at yahoo.co.in Fri Aug 18 04:07:54 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Fri, 18 Aug 2006 12:07:54 +0100 (BST) Subject: [openib-general] Ib question In-Reply-To: Message-ID: <20060818110757.7822.qmail@web8326.mail.in.yahoo.com> you can find all the required information in the IB spec.s download the specification from IBTA john t wrote: Hi, In the example code there are things like: attr.max_dest_rd_atomic = 1; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.sl = 0; attr.ah_attr.src_path_bits = 0; attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.max_rd_atomic = 1; initattr.cap.max_recv_wr = 1; /* I guess this specifies the length of recv q */ initattr.cap.max_send_wr = tx_depth; /* I guess this specifies the length of send q */ initattr.cap.max_inline_data = MAX_INLINE; /* 400 */ attr.pkey_index = 0; What is the meaning of above fields or where can I find the definition of above fields ? Can I change the value of fields like "timeout" or should it be always set to a fixed value. Like TCP I guess there would be a state transition diagram for IB (QP state machine). Can someone point me to that? In my application, I get an error message "IBV_WC_WR_FLUSH_ERR" and sometimes "IBV_WC_RETRY_EXC_ERR" while polling a CQ after posting some write commands. What could be the reason for that ? Also, can someone point me to a document where I can find the meaning of different error status values (enum ibv_wc_status) returned by "ibv_poll_cq" Regards, John T _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general --------------------------------- Here's a new way to find what you're looking for - Yahoo! Answers Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get it NOW -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.rex at s2001.tu-chemnitz.de Fri Aug 18 06:23:41 2006 From: robert.rex at s2001.tu-chemnitz.de (Robert Rex) Date: Fri, 18 Aug 2006 15:23:41 +0200 (MEST) Subject: [openib-general] [PATCH] huge pages support Message-ID: Hello, I've also worked on the same topic. Here is what I've done so far as I successfully tested it on mthca and ehca. I'd appreciate for comments and suggestions. 
diff -Nurp a/drivers/infiniband/core/uverbs_mem.c b/drivers/infiniband/core/uverbs_mem.c --- old/drivers/infiniband/core/uverbs_mem.c 2006-08-15 05:42:06.000000000 -0700 +++ new/drivers/infiniband/core/uverbs_mem.c 2006-08-18 04:22:22.000000000 -0700 @@ -36,6 +36,7 @@ #include #include +#include #include "uverbs.h" @@ -73,6 +74,8 @@ int ib_umem_get(struct ib_device *dev, s unsigned long lock_limit; unsigned long cur_base; unsigned long npages; + unsigned long region_page_mask, region_page_shift, region_page_size; + int use_hugepages; int ret = 0; int off; int i; @@ -84,19 +87,39 @@ int ib_umem_get(struct ib_device *dev, s if (!page_list) return -ENOMEM; + down_read(¤t->mm->mmap_sem); + if (is_vm_hugetlb_page(find_vma(current->mm, (unsigned long) addr))) { + use_hugepages = 1; + region_page_mask = HPAGE_MASK; + region_page_size = HPAGE_SIZE; + } else { + use_hugepages = 0; + region_page_mask = PAGE_MASK; + region_page_size = PAGE_SIZE; + } + up_read(¤t->mm->mmap_sem); + + region_page_shift = ffs(region_page_size) - 1; + mem->user_base = (unsigned long) addr; mem->length = size; - mem->offset = (unsigned long) addr & ~PAGE_MASK; - mem->page_size = PAGE_SIZE; + mem->offset = (unsigned long) addr & ~region_page_mask; + mem->page_size = region_page_size; mem->writable = write; INIT_LIST_HEAD(&mem->chunk_list); - npages = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT; + npages = ((size + mem->offset + (region_page_size - 1)) & + (~(region_page_size - 1))) >> region_page_shift; down_write(¤t->mm->mmap_sem); - locked = npages + current->mm->locked_vm; + if (use_hugepages) + locked = npages * (HPAGE_SIZE / PAGE_SIZE) + + current->mm->locked_vm; + else + locked = npages + current->mm->locked_vm; + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { @@ -104,19 +127,34 @@ int ib_umem_get(struct ib_device *dev, s goto out; } - cur_base = (unsigned long) addr & PAGE_MASK; + cur_base = (unsigned long) addr & region_page_mask; while (npages) { - ret = get_user_pages(current, current->mm, cur_base, - min_t(int, npages, - PAGE_SIZE / sizeof (struct page *)), - 1, !write, page_list, NULL); + if (!use_hugepages) { + ret = get_user_pages(current, current->mm, cur_base, + min_t(int, npages, PAGE_SIZE + / sizeof (struct page *)), + 1, !write, page_list, NULL); - if (ret < 0) - goto out; + if (ret < 0) + goto out; + + cur_base += ret * PAGE_SIZE; + npages -= ret; + } else { + while (npages && (ret <= PAGE_SIZE / + sizeof (struct page *))) { + if (get_user_pages(current, current->mm, + cur_base, 1, 1, !write, + &page_list[ret], NULL) < 0) + goto out; + + ret++; + cur_base += HPAGE_SIZE; + npages--; + } - cur_base += ret * PAGE_SIZE; - npages -= ret; + } off = 0; @@ -133,7 +171,7 @@ int ib_umem_get(struct ib_device *dev, s for (i = 0; i < chunk->nents; ++i) { chunk->page_list[i].page = page_list[i + off]; chunk->page_list[i].offset = 0; - chunk->page_list[i].length = PAGE_SIZE; + chunk->page_list[i].length = region_page_size; } chunk->nmap = dma_map_sg(dev->dma_device, From thomas.bub at thomson.net Fri Aug 18 06:49:59 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Fri, 18 Aug 2006 15:49:59 +0200 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM Message-ID: OK Can I still use the LID, GUID and SubnetID for connection establishment then? Then Gen1 counterpart has no IP over IB running. I'm using OFED-1.0.1. Do you have a quick link where to find the latest Headers? 
(Sorry for the dumb question) Thanks Thomas -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Thursday, August 17, 2006 6:50 PM To: Bub Thomas Cc: Sean Hefty; openib-general at openib.org; Erez Cohen Subject: Re: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM Bub Thomas wrote: > I'm getting a little puzzled. > For me it seems as if we are moving in the wrong direction. > I don't have a RDMA CM on the Gen1 counterpart that my gen2 application > is talking too. The RDMA CM is only used on the local (active or client) side to obtain a path record, which is needed by the libibcm. Using the librdmacm allows cmpost to get a path record given only the remote IP address or host name. The connection is established using the IB CM through libibcm. > If yes you have to explain me what the two different versions: > rdma_cm.h > and > rdma_cma.h rdma_cm.h defines the kernel interface to the RDMA CM. rdma_cma.h defines the userspace interface. > The cmpost.c was using rdma_cma.h up to now but the missing defines are > located in rdma_cm.h Can you verify that you have the latest version of rdma_cma.h? - Sean From thomas.bub at thomson.net Fri Aug 18 06:53:21 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Fri, 18 Aug 2006 15:53:21 +0200 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 Message-ID: I looked into my original libibcm problem today, that made me looking into cmpost, discussed in a sperate thread It seems as if the problem I had there was not in my code but the libibcm not being able to open the device /dev/infiniband/ucm0. All IB modules are installed and the modules loaded the /dev/infiniband devices available are: crw------- 1 root root 231, 64 Aug 17 15:14 issm0 crw-rw-rw- 1 root root 10, 62 Aug 17 15:14 rdma_cm crw------- 1 root root 231, 0 Aug 17 15:14 umad0 crw-rw-rw- 1 root root 231, 192 Aug 17 15:14 uverbs0 The exact error message is: libibcm: error <-1:2> opening device I'm running OFED-1.0.1 on an HP xw9300 running SLES9 SP3 in the x86_64 flawor. Thanks Thomas Bub ............................................................ Thomas Bub Grass Valley Germany GmbH Brunnenweg 9 64331 Weiterstadt, Germany Tel: +49 6150 104 147 Fax: +49 6150 104 656 Email: Thomas.Bub at thomson.net www.GrassValley.com ............................................................ -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Fri Aug 18 07:54:04 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 18 Aug 2006 17:54:04 +0300 Subject: [openib-general] [patch] libsdp typo in config_parser Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302B9A097@mtlexch01.mtl.com> Committed to the trunk (not 1.1 branch) Thanks. > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Bernhard Fischer > Sent: Thursday, August 17, 2006 9:27 PM > To: openib-general at openib.org > Cc: rep.nop at aon.at > Subject: [openib-general] [patch] libsdp typo in config_parser > > Hi, > > The attached trivial patch fixes a typo in the debugging output of libsdp's > config parser. > > Please apply. 
> > Signed-off-by: Bernhard Fischer From RAISCH at de.ibm.com Fri Aug 18 08:35:54 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Fri, 18 Aug 2006 17:35:54 +0200 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: <200608180144.19149.arnd.bergmann@de.ibm.com> Message-ID: abergman > > +#define EDEB_P_GENERIC(level,idstring,format,args...) \ > > These macros are responsible for 61% of the object code size of your module. > ...Please get rid of that crap entirely and replace > it with dev_info/dev_dbg/dev_warn calls where appropriate! > > Arnd <>< we'll change these EDEBs to a wrapper around dev_err, dev_dbg and dev_warn as it's done in the mthca driver. All EDEB_EN and EDEB_EX will be removed, that type of tracing can be done if needed by kprobes. There are a few cases where we won't get to a dev, for these few places we'll use a simple wrapper around printk, as done in ipoib. Hope that's the "official" way how to implement it in ib drivers. Gruss / Regards . . . Christoph R From rep.nop at aon.at Fri Aug 18 08:36:00 2006 From: rep.nop at aon.at (Bernhard Fischer) Date: Fri, 18 Aug 2006 17:36:00 +0200 Subject: [openib-general] [patch] libsdp typo in config_parser In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302B9A097@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302B9A097@mtlexch01.mtl.com> Message-ID: <20060818153600.GA8080@aon.at> On Fri, Aug 18, 2006 at 05:54:04PM +0300, Eitan Zahavi wrote: >Committed to the trunk (not 1.1 branch) >Thanks. > Thank you. PS: I think there is another occurance in srp_daemon that i forgot to include in the diff, fwiw. PPS: IIRC the traffic sent via SDP did not show up in the packet-counters of the corresponding ipoib device last time i looked. Is this still the case? Asking because i'm seeing it accounted to my ib0 device now although libsdp's log indicates that the app is using SDP.. TIA for any hint From HNGUYEN at de.ibm.com Fri Aug 18 08:43:13 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Fri, 18 Aug 2006 17:43:13 +0200 Subject: [openib-general] [PATCH 13/13] IB/ehca: makefiles/kconfig In-Reply-To: <200608180134.39050.arnd.bergmann@de.ibm.com> Message-ID: abergman at de.ltcfwd.linux.ibm.com wrote on 18.08.2006 01:34:37: > On Thursday 17 August 2006 22:11, Roland Dreier wrote: > > + > > +CFLAGS += -DEHCA_USE_HCALL -DEHCA_USE_HCALL_KERNEL > > This seems really pointless, since you're always defining these > macros to the same value. > > Just drop the CFLAGS and remove the code that depends on them > being different. Yes, that's true. Those defines are unnecessary. We'll throw them out. Thx Hoang-Nam Nguyen From mshefty at ichips.intel.com Fri Aug 18 09:05:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 18 Aug 2006 09:05:46 -0700 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 In-Reply-To: References: Message-ID: <44E5E55A.4080705@ichips.intel.com> Bub Thomas wrote: > It seems as if the problem I had there was not in my code but the > libibcm not being able to open the device /dev/infiniband/ucm0. You will need to load ib_ucm, which exports the IB CM to userspace. - Sean From mshefty at ichips.intel.com Fri Aug 18 09:08:38 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 18 Aug 2006 09:08:38 -0700 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM In-Reply-To: References: Message-ID: <44E5E606.2070403@ichips.intel.com> Bub Thomas wrote: > Can I still use the LID, GUID and SubnetID for connection establishment > then? 
Then Gen1 counterpart has no IP over IB running. If IPoIB is not running, then you will need to use the IB CM directly. The RDMA CM uses ARP to resolve IP addresses to GIDs. > I'm using OFED-1.0.1. > Do you have a quick link where to find the latest Headers? > (Sorry for the dumb question) https://openfabrics.org/svn/gen2/trunk/src/ https://openfabrics.org/svn/gen2/trunk/src/userspace/libibcm https://openfabrics.org/svn/gen2/trunk/src/userspace/librdmacm https://openfabrics.org/svn/gen2/trunk/src/linux-kernel/infiniband/ - Sean From arnd.bergmann at de.ibm.com Fri Aug 18 09:21:49 2006 From: arnd.bergmann at de.ibm.com (Arnd Bergmann) Date: Fri, 18 Aug 2006 18:21:49 +0200 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: References: Message-ID: <200608181821.49961.arnd.bergmann@de.ibm.com> On Friday 18 August 2006 17:35, Christoph Raisch wrote: > we'll change these EDEBs to a wrapper around dev_err, dev_dbg and dev_warn > as it's done in the mthca driver. > > ... > > Hope that's the "official" way how to implement it in ib drivers. I guess it would be even better to just use the dev_* macros directly instead of having your own wrapper. You can do that in both ehca and ehea. Arnd <>< From mst at mellanox.co.il Fri Aug 18 09:21:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 18 Aug 2006 19:21:35 +0300 Subject: [openib-general] InfiniBand merge plans for 2.6.19 In-Reply-To: References: Message-ID: <20060818162135.GA20206@mellanox.co.il> Quoting r. Roland Dreier : > o I also have the following minor changes queued in the > for-2.6.19 branch of infiniband.git: Could you please consider IB/mthca: recover from device errors as well? -- MST From rdreier at cisco.com Fri Aug 18 09:31:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 18 Aug 2006 09:31:09 -0700 Subject: [openib-general] InfiniBand merge plans for 2.6.19 In-Reply-To: <20060818162135.GA20206@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 18 Aug 2006 19:21:35 +0300") References: <20060818162135.GA20206@mellanox.co.il> Message-ID: Michael> Could you please consider IB/mthca: recover from device Michael> errors as well? Yes, I will. There's still plenty of time before 2.6.19 opens up. From rdreier at cisco.com Fri Aug 18 10:49:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 18 Aug 2006 10:49:28 -0700 Subject: [openib-general] [PATCH] IB/mthca: No userspace SRQs if HCA doesn't have SRQ support Message-ID: This should fix a crash on HCAs with firmware too old to support SRQs. commit 5beba53230351b2d77c317c22e66c415f2ebaf02 Author: Roland Dreier Date: Fri Aug 18 10:41:46 2006 -0700 IB/mthca: No userspace SRQs if HCA doesn't have SRQ support Leave all SRQ methods out of the device's uverbs_cmd_mask if the device doesn't have SRQ support (because of ancient firmware) so that we don't allow userspace to call the driver's create_srq method. This fixes a userspace-triggerable oops caused by ib_uverbs_create_srq() following the device's ->create_srq function pointer, which will be NULL if the device doesn't support SRQs.
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 230ae21..265b1d1 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1287,11 +1287,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | - (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) | - (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | - (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | - (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | - (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST); dev->ib_dev.node_type = IB_NODE_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; @@ -1316,6 +1312,11 @@ int mthca_register_device(struct mthca_d dev->ib_dev.modify_srq = mthca_modify_srq; dev->ib_dev.query_srq = mthca_query_srq; dev->ib_dev.destroy_srq = mthca_destroy_srq; + dev->ib_dev.uverbs_cmd_mask |= + (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | + (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | + (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); if (mthca_is_memfree(dev)) dev->ib_dev.post_srq_recv = mthca_arbel_post_srq_recv; From swise at opengridcomputing.com Fri Aug 18 11:35:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 18 Aug 2006 13:35:24 -0500 Subject: [openib-general] [PATCH v2][RDMA CM] IB mcast fix In-Reply-To: <44E4DE86.3030908@ichips.intel.com> References: <1155846646.31290.79.camel@stevo-desktop> <44E4DBBF.7070000@ichips.intel.com> <1155849386.31290.80.camel@stevo-desktop> <44E4DE86.3030908@ichips.intel.com> Message-ID: <1155926124.17288.64.camel@stevo-desktop> Version 2: - Updated SIDR CMA code to use global qkey. - Updated udaddy.c to use global qkey. - Tested with cmatose, udaddy, mckey over mthca/IB. ----- Set the QKEY to a common global value for all UD QPs and multicast groups created by the RDMA CM. Signed-off-by: swise at opengridcomputing.com Index: src/userspace/librdmacm/include/rdma/rdma_cma_ib.h =================================================================== --- src/userspace/librdmacm/include/rdma/rdma_cma_ib.h (revision 9004) +++ src/userspace/librdmacm/include/rdma/rdma_cma_ib.h (working copy) @@ -59,4 +59,10 @@ struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, uint32_t *remote_qkey); +/* + * Global qkey value for all UD QPs and multicast groups created via the + * RDMA CM. 
+ */ +#define RDMA_UD_QKEY 0x01234567 + #endif /* RDMA_CMA_IB_H */ Index: src/userspace/librdmacm/src/cma.c =================================================================== --- src/userspace/librdmacm/src/cma.c (revision 9004) +++ src/userspace/librdmacm/src/cma.c (working copy) @@ -701,7 +701,7 @@ qp_attr.port_num = id_priv->id.port_num; qp_attr.qp_state = IBV_QPS_INIT; - qp_attr.qkey = ntohs(rdma_get_src_port(&id_priv->id)); + qp_attr.qkey = RDMA_UD_QKEY; ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_QKEY); if (ret) Index: src/userspace/librdmacm/examples/udaddy.c =================================================================== --- src/userspace/librdmacm/examples/udaddy.c (revision 9004) +++ src/userspace/librdmacm/examples/udaddy.c (working copy) @@ -434,7 +434,7 @@ node->ah = ibv_create_ah_from_wc(node->pd, wc, node->mem, node->cma_id->port_num); node->remote_qpn = ntohl(wc->imm_data); - node->remote_qkey = ntohs(rdma_get_dst_port(node->cma_id)); + node->remote_qkey = RDMA_UD_QKEY; } static int poll_cqs(void) Index: src/linux-kernel/infiniband/include/rdma/rdma_cm_ib.h =================================================================== --- src/linux-kernel/infiniband/include/rdma/rdma_cm_ib.h (revision 9004) +++ src/linux-kernel/infiniband/include/rdma/rdma_cm_ib.h (working copy) @@ -82,4 +82,10 @@ */ int rdma_set_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info); +/* + * Global qkey value for all UD QPs and multicast groups created via the + * RDMA CM. + */ +#define RDMA_UD_QKEY 0x01234567 + #endif /* RDMA_CM_IB_H */ Index: src/linux-kernel/infiniband/core/cma.c =================================================================== --- src/linux-kernel/infiniband/core/cma.c (revision 9004) +++ src/linux-kernel/infiniband/core/cma.c (working copy) @@ -1822,7 +1822,7 @@ break; } route = &id_priv->id.route; - if (rep->qkey != ntohs(cma_port(&route->addr.dst_addr))) { + if (rep->qkey != RDMA_UD_QKEY) { event = RDMA_CM_EVENT_UNREACHABLE; status = -EINVAL; break; @@ -2012,7 +2012,7 @@ rep.status = status; if (status == IB_SIDR_SUCCESS) { rep.qp_num = id_priv->qp_num; - rep.qkey = ntohs(cma_port(&id_priv->id.route.addr.src_addr)); + rep.qkey = RDMA_UD_QKEY; } return ib_send_cm_sidr_rep(id_priv->cm_id.ib, &rep); @@ -2172,7 +2172,7 @@ ib_addr_get_sgid(dev_addr, &rec.port_gid); rec.pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); rec.join_state = 1; - rec.qkey = sin->sin_addr.s_addr; + rec.qkey = cpu_to_be32(RDMA_UD_QKEY); comp_mask = IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE | From eitan at mellanox.co.il Fri Aug 18 12:05:35 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 18 Aug 2006 22:05:35 +0300 Subject: [openib-general] [patch] libsdp typo in config_parser Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302B9A0AA@mtlexch01.mtl.com> Hi Bernhard SDP traffic will not show on the IPoIB counters. It does no go through IPoIB. You can use lsmod | grep ib_sdp to see how many connections are made over SDP. Exact number of packets and data can flowing through the IB port can be obtained by : /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Bernhard Fischer [mailto:rep.nop at aon.at] > Sent: Friday, August 18, 2006 6:36 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] [patch] libsdp typo in config_parser > > On Fri, Aug 18, 2006 at 05:54:04PM +0300, Eitan Zahavi wrote: > >Committed to the trunk (not 1.1 branch) Thanks. > > > Thank you. > PS: I think there is another occurance in srp_daemon that i forgot to include in > the diff, fwiw. > > PPS: IIRC the traffic sent via SDP did not show up in the packet-counters of the > corresponding ipoib device last time i looked. Is this still the case? Asking > because i'm seeing it accounted to my ib0 device now although libsdp's log > indicates that the app is using SDP.. > > TIA for any hint From rep.nop at aon.at Fri Aug 18 12:22:03 2006 From: rep.nop at aon.at (Bernhard Fischer) Date: Fri, 18 Aug 2006 21:22:03 +0200 Subject: [openib-general] [patch] libsdp typo in config_parser In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302B9A0AA@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302B9A0AA@mtlexch01.mtl.com> Message-ID: <20060818192203.GA15241@aon.at> On Fri, Aug 18, 2006 at 10:05:35PM +0300, Eitan Zahavi wrote: >Hi Bernhard > >SDP traffic will not show on the IPoIB counters. It does no go through >IPoIB. That's what i thought, thanks for confirming. >You can use >lsmod | grep ib_sdp >to see how many connections are made over SDP. Running lam via 2 nodes, on 2 CPUs each, i see: # lsmod | grep ib_sdp ib_sdp 28184 4 rdma_cm 27912 1 ib_sdp ib_core 53632 12 ib_ucm,ib_uverbs,ib_sdp,rdma_cm,ib_cm,ib_local_sa,ib_umad,ib_ipoib,ib_multicast,ib_sa,ib_mthca,ib_mad I did start lamboot with libsdp.so preloaded: $ LD_PRELOAD=/usr/local/lib64/libsdp.so lamboot l $ lamnodes C -c -n node13ib.infiniband node13ib.infiniband node15ib.infiniband node15ib.infiniband $ LD_PRELOAD=/usr/local/lib64/libsdp.so mpirun -np 4 /there/vasp/20060503/vasp.4.6/vasp.mpi Still, ifconfig ib0 (which hosts node??ib.infiniband on 10.100.0.0/24) shows that the communication is being sent over ipoib as ifconfigs counters constantly go up when communicating (only one user is active on the system). $ /sbin/ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.100.0.13 Bcast:10.100.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:182037964 errors:0 dropped:0 overruns:0 frame:0 TX packets:183607689 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:189334244937 (180563.2 Mb) TX bytes:194777918565 (185754.6 Mb) My libsdp.conf looks like this: $ cat /usr/local/etc/libsdp.conf #log min-level 1 destination file libsdp.log use both connect * 10.100.0.0/24:* use both server * 10.100.0.0/24:* So i fear i'm missing something crucial. Ideas? >Exact number of packets and data can flowing through the IB port can be >obtained by : >/sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets >/sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets $ for i in /sys/class/infiniband/mthca0/ports/1/counters/*packets;do echo -n $i:' ' ; cat $i;done /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets: 185010549 /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets: 186584856 PS: The different pingpong test (which have outdated names in the openib wiki, btw) do work just fine if run from the very same user, so i think that the basic verbs communication would work proper. 
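As a footnote to the counter discussion above, here is a small self-contained sketch of reading those per-port sysfs counters from a program instead of the shell; the HCA name "mthca0" and port 1 are simply the values that appear in this thread and would differ on other systems.

#include <stdio.h>
#include <stdlib.h>

/* Read one counter such as
 *   /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets
 * and return its value, or -1 on error. */
static long long read_port_counter(const char *hca, int port, const char *counter)
{
	char path[256], buf[64];
	long long val = -1;
	FILE *f;

	snprintf(path, sizeof path,
		 "/sys/class/infiniband/%s/ports/%d/counters/%s",
		 hca, port, counter);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(buf, sizeof buf, f))
		val = strtoll(buf, NULL, 0);
	fclose(f);
	return val;
}

int main(void)
{
	printf("xmit %lld rcv %lld\n",
	       read_port_counter("mthca0", 1, "port_xmit_packets"),
	       read_port_counter("mthca0", 1, "port_rcv_packets"));
	return 0;
}

Note that, as pointed out above, these port counters cover all traffic on the IB port (SDP, IPoIB and verbs tests alike), while the ib0 interface statistics only count what went through the IPoIB netdevice.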
From mshefty at ichips.intel.com Fri Aug 18 14:02:42 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 18 Aug 2006 14:02:42 -0700 Subject: [openib-general] [PATCH v2][RDMA CM] IB mcast fix In-Reply-To: <1155926124.17288.64.camel@stevo-desktop> References: <1155846646.31290.79.camel@stevo-desktop> <44E4DBBF.7070000@ichips.intel.com> <1155849386.31290.80.camel@stevo-desktop> <44E4DE86.3030908@ichips.intel.com> <1155926124.17288.64.camel@stevo-desktop> Message-ID: <44E62AF2.6030409@ichips.intel.com> Thanks! committed in 9008. - Sean From rdreier at cisco.com Fri Aug 18 14:11:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 18 Aug 2006 14:11:23 -0700 Subject: [openib-general] [PATCH] huge pages support In-Reply-To: (Robert Rex's message of "Fri, 18 Aug 2006 15:23:41 +0200 (MEST)") References: Message-ID: Looks like you don't support regions that have both huge pages and normal pages either. And this: > + while (npages && (ret <= PAGE_SIZE / > + sizeof (struct page *))) { > + if (get_user_pages(current, current->mm, > + cur_base, 1, 1, !write, > + &page_list[ret], NULL) < 0) > + goto out; > + > + ret++; > + cur_base += HPAGE_SIZE; > + npages--; > + } seems a little iffy as well. As I said I think the solution is to add a little more support to the core stuff in mm/ to make this simpler -- something like a function like get_user_pages() except returning the page list in a struct scatterlist so that it can return pages of different sizes would make sense to me. - R. From sweitzen at cisco.com Fri Aug 18 17:03:52 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 18 Aug 2006 17:03:52 -0700 Subject: [openib-general] [patch] libsdp typo in config_parser Message-ID: Running an MPI command with LD_PRELOAD=libsdp.so at the beginning won't cause SDP to be used on remote nodes. You have to find a way to load libsdp.so on all nodes, this might work better: LD_PRELOAD=libsdp.so mpirun -np 4 env LD_PRELOAD=libsdp.so /there/vasp/20060503/vasp.4.6/vasp.mpi Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > Bernhard Fischer > Sent: Friday, August 18, 2006 12:22 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] [patch] libsdp typo in config_parser > > On Fri, Aug 18, 2006 at 10:05:35PM +0300, Eitan Zahavi wrote: > >Hi Bernhard > > > >SDP traffic will not show on the IPoIB counters. It does no > go through > >IPoIB. > > That's what i thought, thanks for confirming. > >You can use > >lsmod | grep ib_sdp > >to see how many connections are made over SDP. > > Running lam via 2 nodes, on 2 CPUs each, i see: > # lsmod | grep ib_sdp > ib_sdp 28184 4 > rdma_cm 27912 1 ib_sdp > ib_core 53632 12 > ib_ucm,ib_uverbs,ib_sdp,rdma_cm,ib_cm,ib_local_sa,ib_umad,ib_i > poib,ib_multicast,ib_sa,ib_mthca,ib_mad > > I did start lamboot with libsdp.so preloaded: > $ LD_PRELOAD=/usr/local/lib64/libsdp.so lamboot l > $ lamnodes C -c -n > node13ib.infiniband > node13ib.infiniband > node15ib.infiniband > node15ib.infiniband > $ LD_PRELOAD=/usr/local/lib64/libsdp.so mpirun -np 4 > /there/vasp/20060503/vasp.4.6/vasp.mpi > > Still, ifconfig ib0 (which hosts node??ib.infiniband on > 10.100.0.0/24) shows that the > communication is being sent over ipoib as ifconfigs counters > constantly > go up when communicating (only one user is active on the system). 
> $ /sbin/ifconfig ib0 > ib0 Link encap:UNSPEC HWaddr > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > inet addr:10.100.0.13 Bcast:10.100.0.255 > Mask:255.255.255.0 > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 > RX packets:182037964 errors:0 dropped:0 overruns:0 frame:0 > TX packets:183607689 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:189334244937 (180563.2 Mb) TX > bytes:194777918565 (185754.6 Mb) > > My libsdp.conf looks like this: > $ cat /usr/local/etc/libsdp.conf > #log min-level 1 destination file libsdp.log > use both connect * 10.100.0.0/24:* > use both server * 10.100.0.0/24:* > > So i fear i'm missing something crucial. > Ideas? > > >Exact number of packets and data can flowing through the IB > port can be > >obtained by : > >/sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets > >/sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets > > $ for i in > /sys/class/infiniband/mthca0/ports/1/counters/*packets;do > echo -n $i:' ' ; cat $i;done > /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets > : 185010549 > /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packet > s: 186584856 > > PS: The different pingpong test (which have outdated names in > the openib > wiki, btw) do work just fine if run from the very same user, > so i think > that the basic verbs communication would work proper. > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Sat Aug 19 12:26:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 19 Aug 2006 22:26:06 +0300 Subject: [openib-general] InfiniBand merge plans for 2.6.19 In-Reply-To: References: Message-ID: <20060819192606.GA3765@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: InfiniBand merge plans for 2.6.19 > > Michael> Cold you oplease consider IB/mthca: recover from device > Michael> errors as well? > > Yes, I will. There's still plenty of time before 2.6.19 opens up. > Right, thanks. -- MST From eitan at mellanox.co.il Sun Aug 20 00:39:11 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 20 Aug 2006 10:39:11 +0300 Subject: [openib-general] [patch] libsdp typo in config_parser Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302B9A1CC@mtlexch01.mtl.com> Hi Bernhard, The only thing I can think of is the chance you did not distribute the libsdp config fiel to all nodes. Please try to change the "log" directives to something like log min-level 1 destination file libsdp.log run MPI and send the log file /tmp/libsdp.log Eitan > -----Original Message----- > From: Bernhard Fischer [mailto:rep.nop at aon.at] > Sent: Friday, August 18, 2006 10:22 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] [patch] libsdp typo in config_parser > > On Fri, Aug 18, 2006 at 10:05:35PM +0300, Eitan Zahavi wrote: > >Hi Bernhard > > > >SDP traffic will not show on the IPoIB counters. It does no go through > >IPoIB. > > That's what i thought, thanks for confirming. > >You can use > >lsmod | grep ib_sdp > >to see how many connections are made over SDP. 
> > Running lam via 2 nodes, on 2 CPUs each, i see: > # lsmod | grep ib_sdp > ib_sdp 28184 4 > rdma_cm 27912 1 ib_sdp > ib_core 53632 12 > ib_ucm,ib_uverbs,ib_sdp,rdma_cm,ib_cm,ib_local_sa,ib_umad,ib_ipoib,ib_mu > lticast,ib_sa,ib_mthca,ib_mad > > I did start lamboot with libsdp.so preloaded: > $ LD_PRELOAD=/usr/local/lib64/libsdp.so lamboot l $ lamnodes C -c -n > node13ib.infiniband node13ib.infiniband node15ib.infiniband > node15ib.infiniband $ LD_PRELOAD=/usr/local/lib64/libsdp.so mpirun -np 4 > /there/vasp/20060503/vasp.4.6/vasp.mpi > > Still, ifconfig ib0 (which hosts node??ib.infiniband on 10.100.0.0/24) shows that > the communication is being sent over ipoib as ifconfigs counters constantly go > up when communicating (only one user is active on the system). > $ /sbin/ifconfig ib0 > ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00- > 00-00-00 > inet addr:10.100.0.13 Bcast:10.100.0.255 Mask:255.255.255.0 > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 > RX packets:182037964 errors:0 dropped:0 overruns:0 frame:0 > TX packets:183607689 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:189334244937 (180563.2 Mb) TX bytes:194777918565 > (185754.6 Mb) > > My libsdp.conf looks like this: > $ cat /usr/local/etc/libsdp.conf > #log min-level 1 destination file libsdp.log > use both connect * 10.100.0.0/24:* > use both server * 10.100.0.0/24:* > > So i fear i'm missing something crucial. > Ideas? > > >Exact number of packets and data can flowing through the IB port can be > >obtained by : > >/sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets > >/sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets > > $ for i in /sys/class/infiniband/mthca0/ports/1/counters/*packets;do echo -n > $i:' ' ; cat $i;done > /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets: 185010549 > /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets: 186584856 > > PS: The different pingpong test (which have outdated names in the openib > wiki, btw) do work just fine if run from the very same user, so i think that the > basic verbs communication would work proper. From tziporet at mellanox.co.il Sun Aug 20 04:16:32 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 20 Aug 2006 14:16:32 +0300 Subject: [openib-general] How to cross-compile to ppc32 from infiniband driver in IBGD-1.8.2 ? In-Reply-To: <20060818093757.8676E3B0003@sentry-two.sandia.gov> References: <20060818093757.8676E3B0003@sentry-two.sandia.gov> Message-ID: <44E84490.1040509@mellanox.co.il> In general IBGD does not supporting cross compiling Also VAPI that is the low level driver was not tested on PPC, so probably it does not support it too. Tziporet ??? wrote: > I am doing cross-compile to ppc440SPe (ppc 32bit) from IBGD-1.8.2. > > Who ever been cross-compile it? I don’t know exactly how to modify > Makefile and source code > > Help me please. 
> > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ogerlitz at voltaire.com Sun Aug 20 04:30:17 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 20 Aug 2006 14:30:17 +0300 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E4A4DC.2040200@ichips.intel.com> References: <44E3914C.4010903@ichips.intel.com> <44E3AD24.4070200@ichips.intel.com> <44E42670.5040308@voltaire.com> <44E4A4DC.2040200@ichips.intel.com> Message-ID: <44E847C9.40408@voltaire.com> Sean Hefty wrote: > Or Gerlitz wrote: >> If you don't mind (also related to the patch you have sent Eric of >> randomizing the initial local cm id) to get into this deeper, can we do > There's an issue trying to randomize the initial local CM ID. The way > the IDR works, if you start at a high value, then the IDR size grows up > to the size of the first value, which can result in memory allocation > failures. In my tests, using a random value would frequently result in > connection failures because of low memory. > My conclusion is that the local ID assignment in the IB CM needs to be > reworked, or we will run into a condition that after X number of > connections have been established, we will be unable to create any new > connections, even if the previous connections have all been destroyed. How about (for the meantime, till this rework is designed && done) going to projecting the initial random local id into the range of (say) [0-1022] (i think 1023 is prime, if not choose a prime near it) this way with very good probability and with very little overhead on memory consumption a client connect/reboot/"reconnect" would work. Or. From tziporet at mellanox.co.il Sun Aug 20 04:33:47 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 20 Aug 2006 14:33:47 +0300 Subject: [openib-general] [PATCH] huge pages support In-Reply-To: References: <1155824446.11238.8.camel@localhost> Message-ID: <44E8489B.3050009@mellanox.co.il> Roland Dreier wrote: > My first reaction on reading this is that it looks like we need some > more help from mm/ to make this cleaner. > Hi Roland, We need huge pages support in 1.1 release, since some customers are working with huge pages. My suggestion is that we insert this patch (with your comments fixed) under fixes in OFED 1.1. In parallel you (or if you prefer we can do it) will work with the kernel people for a cleaner support. Is this a good resolution for 1.1? Tziporet From ogerlitz at voltaire.com Sun Aug 20 04:45:19 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 20 Aug 2006 14:45:19 +0300 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E4A4DC.2040200@ichips.intel.com> References: <44E3914C.4010903@ichips.intel.com> <44E3AD24.4070200@ichips.intel.com> <44E42670.5040308@voltaire.com> <44E4A4DC.2040200@ichips.intel.com> Message-ID: <44E84B4F.80806@voltaire.com> >>> + } else >>> + cm_issue_rej(work->port, work->mad_recv_wc, >>> + IB_CM_REJ_STALE_CONN, >>> CM_MSG_RESPONSE_REQ, >>> + NULL, 0); >> >> >> what is this case? there is no entry but there is >> remote or entries??? > If we get here, this means that the REQ was a new REQ and not a > duplicate, but the remote_id or remote_qpn is already in use. 
We need > to reject the new REQ as containing stale data. I don't follow, if we get to the else case its as of cm_get_id() returning NULL. This holds when idr_find() returns NULL or when the entry returned is associated with a different remote_id, so what makes you to conclude that "the remote_id or remote_qpn is already in use"??? > +static struct cm_id_private * cm_get_id(__be32 local_id, __be32 remote_id) > +{ > + struct cm_id_private *cm_id_priv; > + > + cm_id_priv = idr_find(&cm.local_id_table, (__force int) local_id); > + if (cm_id_priv) { > + if (cm_id_priv->id.remote_id == remote_id) > + atomic_inc(&cm_id_priv->refcount); > + else > + cm_id_priv = NULL; > + } > + > + return cm_id_priv; > +} Or. From ogerlitz at voltaire.com Sun Aug 20 04:53:48 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 20 Aug 2006 14:53:48 +0300 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E421CE.1050404@voltaire.com> References: <44E3914C.4010903@ichips.intel.com> <44E421CE.1050404@voltaire.com> Message-ID: <44E84D4C.9090104@voltaire.com> This email appear in the archive, but seems not to be distributed to the subscribers so i am reposting it. Or Gerlitz wrote: > Arlin Davis wrote: >> We are running into connection reject issues (IB_CM_REJ_STALE_CONN) >> with our application under heavy load and lots of connections. >> >> We occassionally get a reject based on the QP being in timewait state >> leftover from a prior connection. It appears that the CM keeps track >> of the QP's in timewait state on both sides of the connection, > > How did you verify that? the CM generated REJ with IB_CM_REJ_STALE_CONN > in two flows for the passive side (ie rejecting a REQ) and one flow for > the active side (ie rejecting a REP). > >> How can a consumer know for sure that the new QP will not be in a >> timewait state according to the CM? Does it make sense to push the >> timewait functionality down into verbs? If not, is there a way for the >> CM to hold a reference to the QP until the timewait expires? > > Just to emphasize what Sean has pointed out, you are asking how can a CM > consumer know that a **local** QPN is not in the timewait state > according to the **remote** CM. Since the issue is with the remote CM, > it seems to me that pushing down timewait into verbs is not the correct > direction to look at. > > Or. > From ogerlitz at voltaire.com Sun Aug 20 04:54:14 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 20 Aug 2006 14:54:14 +0300 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E42670.5040308@voltaire.com> References: <44E3914C.4010903@ichips.intel.com> <44E3AD24.4070200@ichips.intel.com> <44E42670.5040308@voltaire.com> Message-ID: <44E84D66.3040800@voltaire.com> This email appear in the archive, but seems not to be distributed to the subscribers so i am reposting it. Or Gerlitz wrote: > Sean Hefty wrote: >> Even if we pushed timewait handling under verbs, a user could always >> get a QP that the remote side thinks is connected. The original >> connection could fail to disconnect because of lost DREQs. So, >> locally, the QP could have exited timewait, while the remote side >> still thinks that it's connected. > > Sean, > > If you don't mind (also related to the patch you have sent Eric of > randomizing the initial local cm id) to get into this deeper, can we do > here a quick code review of the REQ matching logic? I wrote what i > understand below. 
> >> static struct cm_id_private * cm_match_req(struct cm_work *work, >> + struct cm_id_private >> *cm_id_priv) >> +{ >> + struct cm_id_private *listen_cm_id_priv, *cur_cm_id_priv; >> + struct cm_timewait_info *timewait_info; >> + struct cm_req_msg *req_msg; >> + unsigned long flags; >> + >> + req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; >> + >> + /* Check for duplicate REQ and stale connections. */ >> + spin_lock_irqsave(&cm.lock, flags); >> + timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); >> + if (!timewait_info) >> + timewait_info = >> cm_insert_remote_qpn(cm_id_priv->timewait_info); > > This if() holds when entry is present in > remote_id_table OR entry is present in > remote_qpn_table > >> + if (timewait_info) { >> + cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, >> + >> timewait_info->work.remote_id); > > + spin_unlock_irqrestore(&cm.lock, flags); >> + if (cur_cm_id_priv) { >> + cm_dup_req_handler(work, cur_cm_id_priv); >> + cm_deref_id(cur_cm_id_priv); > > entry exists in local_id_table, looking on > dup_req_handler() i see it sends REP when the id is in "MRA sent" and > sends a STALE_CONN REJ when the id is in timewait state, else it does > nothing. > >> + } else >> + cm_issue_rej(work->port, work->mad_recv_wc, >> + IB_CM_REJ_STALE_CONN, >> CM_MSG_RESPONSE_REQ, >> + NULL, 0); > > what is this case? there is no entry but there is > remote or entries??? > >> + goto error; >> + } > > Or. > From ogerlitz at voltaire.com Sun Aug 20 04:55:12 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 20 Aug 2006 14:55:12 +0300 (IDT) Subject: [openib-general] void-ness of struct netdevice::set_multicast_list() problematic with IPoIB In-Reply-To: References: Message-ID: This email appear in the archive, but seems not to be distributed to the subscribers so i am reposting it. > Roland, > > If i understand correct someone can attempt (*) setting IFF_ALLMULTI or IFF_PROMISC > for an IPoIB device, and there is no very to return -EINVAL (or whatever) on that. > > This is since they (eg net/ipv4/ipmr.c and friends) just set the flags and later > call the device set_multicast_list() function, which is void... > > I don't think its a big deal, maybe we can just print an unsupported warning > from ipoib_set_mcast_list() if either of the flags is set. > > All of this is based on my understanding that IB does not support those flags, > please correct me if i am wrong here... > > Or. > > (*) eg by calling packet(7) with PACKET_ADD_MEMBERSHIP and setting PACKET_MR_PROMISC or > PACKET_MR_ALLMULTI > From rdreier at cisco.com Sun Aug 20 08:47:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 20 Aug 2006 08:47:33 -0700 Subject: [openib-general] [PATCH] huge pages support In-Reply-To: <44E8489B.3050009@mellanox.co.il> (Tziporet Koren's message of "Sun, 20 Aug 2006 14:33:47 +0300") References: <1155824446.11238.8.camel@localhost> <44E8489B.3050009@mellanox.co.il> Message-ID: Tziporet> Hi Roland, We need huge pages support in 1.1 release, Tziporet> since some customers are working with huge pages. My Tziporet> suggestion is that we insert this patch (with your Tziporet> comments fixed) under fixes in OFED 1.1. In parallel you Tziporet> (or if you prefer we can do it) will work with the Tziporet> kernel people for a cleaner support. Tziporet> Is this a good resolution for 1.1? Sure, you are building the OFED release so you can put whatever you want into it. - R. 
From rdreier at cisco.com Sun Aug 20 08:49:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 20 Aug 2006 08:49:54 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E847C9.40408@voltaire.com> (Or Gerlitz's message of "Sun, 20 Aug 2006 14:30:17 +0300") References: <44E3914C.4010903@ichips.intel.com> <44E3AD24.4070200@ichips.intel.com> <44E42670.5040308@voltaire.com> <44E4A4DC.2040200@ichips.intel.com> <44E847C9.40408@voltaire.com> Message-ID: Or> How about (for the meantime, till this rework is designed && Or> done) going to projecting the initial random local id into the Or> range of (say) [0-1022] (i think 1023 is prime, if not choose Or> a prime near it) this way with very good probability and with Or> very little overhead on memory consumption a client Or> connect/reboot/"reconnect" would work. Of course 1023 is not prime -- since (a^2 - b^2) = (a - b) * (a + b), it follows 2 ^ 10 - 1 = (2^5 - 1) * (2^5 + 1) = 31 * 33. I don't see why you care about the range being prime, but the closest primes to 1024 are 1021 and 1031. - R. From sashak at voltaire.com Sun Aug 20 09:05:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 20 Aug 2006 19:05:38 +0300 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed Message-ID: <20060820160538.12435.23041.stgit@sashak.voltaire.com> In case when OpenSM log file overflows filesystem and write() fails with 'No space left on device' try to truncate the log file and wrap-around logging. Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_log.c | 23 +++++++++++++++-------- 1 files changed, 15 insertions(+), 8 deletions(-) diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c index 668e9a6..b4700c8 100644 --- a/osm/opensm/osm_log.c +++ b/osm/opensm/osm_log.c @@ -58,6 +58,7 @@ #include #include #include #include +#include #ifndef WIN32 #include @@ -152,6 +153,7 @@ #endif cl_spinlock_acquire( &p_log->lock ); #ifdef WIN32 GetLocalTime(&st); + _retry: ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, pid, buffer); @@ -159,6 +161,7 @@ #ifdef WIN32 #else pid = pthread_self(); tim = time(NULL); + _retry: ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", ((result.tm_mon < 12) && (result.tm_mon >= 0) ? month_str[result.tm_mon] : "???"), @@ -166,6 +169,18 @@ #else result.tm_min, result.tm_sec, usecs, pid, buffer); #endif /* WIN32 */ + + if (ret >= 0) + log_exit_count = 0; + else if (errno == ENOSPC && log_exit_count < 3) { + int fd = fileno(p_log->out_port); + fprintf(stderr, "log write failed: %s. Will truncate the log file.\n", + strerror(errno)); + ftruncate(fd, 0); + lseek(fd, 0, SEEK_SET); + log_exit_count++; + goto _retry; + } /* Flush log on errors too. @@ -174,14 +189,6 @@ #endif /* WIN32 */ fflush( p_log->out_port ); cl_spinlock_release( &p_log->lock ); - - if (ret < 0) - { - if (log_exit_count++ < 10) - { - fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n"); - } - } } } From rdreier at cisco.com Sun Aug 20 09:02:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 20 Aug 2006 09:02:14 -0700 Subject: [openib-general] [PATCH] huge pages support In-Reply-To: (Roland Dreier's message of "Sun, 20 Aug 2006 08:47:33 -0700") References: <1155824446.11238.8.camel@localhost> <44E8489B.3050009@mellanox.co.il> Message-ID: Roland> Sure, you are building the OFED release so you can put Roland> whatever you want into it. 
...although maybe it would be a good idea to follow the approach of the second patch posted, and make multiple get_user_pages() calls, skipping along by HPAGE_SIZE. This avoids having all the extra work of follow_hugetlb_page() creating extra fake pages and then calling put_page() many times. - R. From sashak at voltaire.com Sun Aug 20 09:24:36 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 20 Aug 2006 19:24:36 +0300 Subject: [openib-general] [PATH TRIVIAL] opensm: management/Makefile: build rules improvement Message-ID: <20060820162436.GU18411@sashak.voltaire.com> Some minor additions to management/Makefile build rules - now this will run autogen.sh and ./configure (without options) if needed. Signed-off-by: Sasha Khapyorsky --- Makefile | 12 ++++++------ 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/Makefile b/Makefile index 770112a..9004f00 100644 --- a/Makefile +++ b/Makefile @@ -64,17 +64,17 @@ depend: rmdep subdirs .PHONY : subdirs subdirs: @for i in $(SUBDIRS); do \ - if [ -e $$i/Makefile ]; then \ - if !(cd $$i; make $(BUILD_TARG)); then exit 1; fi \ - fi \ + test -x $$i/configure || ( cd $$i && ./autogen.sh || exit 1 ) ; \ + test -e $$i/Makefile || ( cd $$i && ./configure || exit 1 ) ; \ + ( cd $$i && make ) || exit 1 ; \ done .PHONY : libs_install libs_install: @for i in $(LIBS); do \ - if [ -e $$i/Makefile ]; then \ - if !(cd $$i; make install); then exit 1; fi \ - fi \ + test -x $$i/configure || ( cd $$i && ./autogen.sh || exit 1 ) ; \ + test -e $$i/Makefile || ( cd $$i && ./configure || exit 1 ) ; \ + ( cd $$i && make && make install ) || exit 1 ; \ done export BUILD_TARG From sashak at voltaire.com Sun Aug 20 09:27:20 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 20 Aug 2006 19:27:20 +0300 Subject: [openib-general] [PATCH TRIVIAL] opensm: GUID net to host conversion in log prints Message-ID: <20060820162720.GV18411@sashak.voltaire.com> Print GUID value in host byte order. Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_sa_path_record.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c index 67df71a..36e9061 100644 --- a/osm/opensm/osm_sa_path_record.c +++ b/osm/opensm/osm_sa_path_record.c @@ -415,7 +415,7 @@ __osm_pr_rcv_get_path_parms( "__osm_pr_rcv_get_path_parms: " "New smallest MTU = %u at destination port 0x%016" PRIx64 "\n", mtu, - osm_physp_get_port_guid( p_physp ) ); + cl_ntoh64(osm_physp_get_port_guid( p_physp )) ); } } @@ -428,7 +428,7 @@ __osm_pr_rcv_get_path_parms( "__osm_pr_rcv_get_path_parms: " "New smallest rate = %u at destination port 0x%016" PRIx64 "\n", rate, - osm_physp_get_port_guid( p_physp ) ); + cl_ntoh64(osm_physp_get_port_guid( p_physp )) ); } } From halr at voltaire.com Sun Aug 20 10:01:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Aug 2006 13:01:55 -0400 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <20060820160538.12435.23041.stgit@sashak.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> Message-ID: <1156093312.9855.162745.camel@hal.voltaire.com> Hi Sasha, On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote: > In case when OpenSM log file overflows filesystem and write() fails with > 'No space left on device' try to truncate the log file and wrap-around > logging. 
Should it be an (admin) option as to whether to truncate the file or not or is there no way to continue without logging (other than this) once the log file fills the disk ? See comment below as well. -- Hal > Signed-off-by: Sasha Khapyorsky > --- > > osm/opensm/osm_log.c | 23 +++++++++++++++-------- > 1 files changed, 15 insertions(+), 8 deletions(-) > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c > index 668e9a6..b4700c8 100644 > --- a/osm/opensm/osm_log.c > +++ b/osm/opensm/osm_log.c > @@ -58,6 +58,7 @@ #include > #include > #include > #include > +#include > > #ifndef WIN32 > #include > @@ -152,6 +153,7 @@ #endif > cl_spinlock_acquire( &p_log->lock ); > #ifdef WIN32 > GetLocalTime(&st); > + _retry: > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", > st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, > pid, buffer); > @@ -159,6 +161,7 @@ #ifdef WIN32 > #else > pid = pthread_self(); > tim = time(NULL); > + _retry: > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", > ((result.tm_mon < 12) && (result.tm_mon >= 0) ? > month_str[result.tm_mon] : "???"), > @@ -166,6 +169,18 @@ #else > result.tm_min, result.tm_sec, > usecs, pid, buffer); > #endif /* WIN32 */ > + > + if (ret >= 0) > + log_exit_count = 0; > + else if (errno == ENOSPC && log_exit_count < 3) { > + int fd = fileno(p_log->out_port); > + fprintf(stderr, "log write failed: %s. Will truncate the log file.\n", > + strerror(errno)); > + ftruncate(fd, 0); Should return from ftruncate be checked here ? > + lseek(fd, 0, SEEK_SET); > + log_exit_count++; > + goto _retry; > + } > > /* > Flush log on errors too. > @@ -174,14 +189,6 @@ #endif /* WIN32 */ > fflush( p_log->out_port ); > > cl_spinlock_release( &p_log->lock ); > - > - if (ret < 0) > - { > - if (log_exit_count++ < 10) > - { > - fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n"); > - } > - } > } > } > From sashak at voltaire.com Sun Aug 20 10:18:08 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 20 Aug 2006 20:18:08 +0300 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156093312.9855.162745.camel@hal.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> Message-ID: <20060820171808.GZ18411@sashak.voltaire.com> On 13:01 Sun 20 Aug , Hal Rosenstock wrote: > Hi Sasha, > > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote: > > In case when OpenSM log file overflows filesystem and write() fails with > > 'No space left on device' try to truncate the log file and wrap-around > > logging. > > Should it be an (admin) option as to whether to truncate the file or not > or is there no way to continue without logging (other than this) once > the log file fills the disk ? In theory OpenSM may continue, but don't think it is good idea to leave overflowed disk on the SM machine (by default it is '/var/log'). For me truncating there looks as reasonable default behavior, don't think we need the option. > > See comment below as well. 
> > -- Hal > > > Signed-off-by: Sasha Khapyorsky > > --- > > > > osm/opensm/osm_log.c | 23 +++++++++++++++-------- > > 1 files changed, 15 insertions(+), 8 deletions(-) > > > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c > > index 668e9a6..b4700c8 100644 > > --- a/osm/opensm/osm_log.c > > +++ b/osm/opensm/osm_log.c > > @@ -58,6 +58,7 @@ #include > > #include > > #include > > #include > > +#include > > > > #ifndef WIN32 > > #include > > @@ -152,6 +153,7 @@ #endif > > cl_spinlock_acquire( &p_log->lock ); > > #ifdef WIN32 > > GetLocalTime(&st); > > + _retry: > > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", > > st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, > > pid, buffer); > > @@ -159,6 +161,7 @@ #ifdef WIN32 > > #else > > pid = pthread_self(); > > tim = time(NULL); > > + _retry: > > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", > > ((result.tm_mon < 12) && (result.tm_mon >= 0) ? > > month_str[result.tm_mon] : "???"), > > @@ -166,6 +169,18 @@ #else > > result.tm_min, result.tm_sec, > > usecs, pid, buffer); > > #endif /* WIN32 */ > > + > > + if (ret >= 0) > > + log_exit_count = 0; > > + else if (errno == ENOSPC && log_exit_count < 3) { > > + int fd = fileno(p_log->out_port); > > + fprintf(stderr, "log write failed: %s. Will truncate the log file.\n", > > + strerror(errno)); > > + ftruncate(fd, 0); > > Should return from ftruncate be checked here ? May be checked, but I don't think that potential ftruncate() failure should change the flow - in case of failure we will try to continue with lseek() anyway (in order to wrap around the file at least). Sasha > > > + lseek(fd, 0, SEEK_SET); > > + log_exit_count++; > > + goto _retry; > > + } > > > > /* > > Flush log on errors too. > > @@ -174,14 +189,6 @@ #endif /* WIN32 */ > > fflush( p_log->out_port ); > > > > cl_spinlock_release( &p_log->lock ); > > - > > - if (ret < 0) > > - { > > - if (log_exit_count++ < 10) > > - { > > - fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n"); > > - } > > - } > > } > > } > > > From sean.hefty at intel.com Sun Aug 20 10:14:41 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 20 Aug 2006 10:14:41 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E847C9.40408@voltaire.com> Message-ID: <000001c6c47c$1ffe0dd0$35cd180a@amr.corp.intel.com> >How about (for the meantime, till this rework is designed && done) going >to projecting the initial random local id into the range of (say) >[0-1022] (i think 1023 is prime, if not choose a prime near it) this way >with very good probability and with very little overhead on memory >consumption a client connect/reboot/"reconnect" would work. If we record a base offset, we can start at any random number. We just need to always add/subtract the base when getting a value from the IDR. - Sean From sean.hefty at intel.com Sun Aug 20 10:27:42 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 20 Aug 2006 10:27:42 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E84D4C.9090104@voltaire.com> Message-ID: <000101c6c47d$f18f6320$35cd180a@amr.corp.intel.com> >> Just to emphasize what Sean has pointed out, you are asking how can a CM >> consumer know that a **local** QPN is not in the timewait state >> according to the **remote** CM. Since the issue is with the remote CM, >> it seems to me that pushing down timewait into verbs is not the correct >> direction to look at. 
We should still ensure that we don't give a user a local QPN that we know is in timewait. For example, a user 1 connects over a QP, transfers some data, then destroys the QP. User 2 allocates a new QP. Can user 2 get the same QP as the user 1? If so, user 2 is likely to see a stale connection. An option at this point is for user 2 to destroy the QP and allocate a new one. If they do this, will they get the same QP again? Now imagine if user 1 had created 1000 connections. I believe that we should make things as easy on user 2 as possible, including reducing the chance of giving them a QP that the remote side is likely to have in timewait. - Sean From sean.hefty at intel.com Sun Aug 20 10:37:39 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 20 Aug 2006 10:37:39 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44E84B4F.80806@voltaire.com> Message-ID: <000201c6c47f$555ee190$35cd180a@amr.corp.intel.com> >> If we get here, this means that the REQ was a new REQ and not a >> duplicate, but the remote_id or remote_qpn is already in use. We need >> to reject the new REQ as containing stale data. > >I don't follow, if we get to the else case its as of cm_get_id() >returning NULL. This holds when idr_find() returns NULL or when the >entry returned is associated with a different remote_id, so what makes >you to conclude that "the remote_id or remote_qpn is already in use"??? When a new REQ is received, we enter its timewait structure into two trees: one sorted by remote ID, one sorted by remote QPN. If the REQ is new, both would succeed, and timewait_info would be NULL. Since timewait_info is not NULL, we are dealing with a REQ that re-uses the same remote ID or same remote QPN. If the new REQ has the same remote ID (get_cm_id() returns non-NULL), we treat it as a duplicate, otherwise it's marked as stale. - Sean From rdreier at cisco.com Sun Aug 20 13:21:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 20 Aug 2006 13:21:24 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <000001c6c47c$1ffe0dd0$35cd180a@amr.corp.intel.com> (Sean Hefty's message of "Sun, 20 Aug 2006 10:14:41 -0700") References: <000001c6c47c$1ffe0dd0$35cd180a@amr.corp.intel.com> Message-ID: Sean> If we record a base offset, we can start at any random Sean> number. We just need to always add/subtract the base when Sean> getting a value from the IDR. Good point -- or better still, we could XOR in a random bit pattern. That way we don't have to keep straight when to add and when to subtract. - R. From bugzilla-daemon at openib.org Sun Aug 20 17:05:01 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sun, 20 Aug 2006 17:05:01 -0700 (PDT) Subject: [openib-general] [Bug 202] New: System hangs on shutdown, build 459 Message-ID: <20060821000501.AE71B2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=202 Summary: System hangs on shutdown, build 459 Product: OpenFabrics Windows Version: unspecified Platform: X86-64 OS/Version: Other Status: NEW Severity: major Priority: P2 Component: Core AssignedTo: bugzilla at openib.org ReportedBy: jbottorff at xsigo.com When running Windows 2003 SP1 checked 64-bit with driver verifier enabled (all drivers verified with all options except low resource simulation and disk integrity checking) and I do cyclic shutdown tests, the IBAL driver sometimes asserts (checked IB stack build). 
The test script connects via TCP to WMI and requests a system reboot every 7 minutes. On 32-bit Windows, this assert happens after just a few reboot cycles. It took 65 reboot cycles on 64-bit Windows. I'm reporting this on 64-bit Windows because the object reference dump is useful (it's garbage on 32-bit Windows). This was on a clean OS install with only the IB drivers (build 459) installed (the OS install CD was the only other source of running code). The hardware is a dual processor Xeon. IPoIB drivers were installed on both ports. The host was attached to an IB switch with a couple of other hosts attached. The IB adapter as reported by vstat is: hca_idx=0 pci_location={BUS=NA,DEV/FUNC=NA} uplink={BUS=PCI_E, SPEED=2.5 Gbps, WIDTH=x8} vendor_id=0x02c9 vendor_part_id=0x6282 hw_ver=0xa0 fw_ver=5.01.0400 PSID=MT_0140000001 node_guid=0002:c902:0040:1d20 num_phys_ports=2 port=1 port_state=PORT_ACTIVE (4) sm_lid=0x0001 port_lid=0x000e port_lmc=0x0 max_mtu=2048 (4) port=2 port_state=PORT_ACTIVE (4) sm_lid=0x0001 port_lid=0x0011 port_lmc=0x0 max_mtu=2048 (4) Following is the kernel debugger log that includes the object reference dump that occurs if you ignore the assert. TERMSRV: Last WinStation reset ~0:ib_modify_ca() !ERROR!: IB_INVALID_CA_HANDLE ~0:ib_modify_ca() !ERROR!: IB_INVALID_CA_HANDLE NatTriggerTimer: scheduling DPC NatTriggerTimer: scheduling DPC Process.Thread : 0000000000000004.0000000000000038 (System) is trying to create key: ObjectAttributes = FFFFFADFE4A718D8 The caller should not rely on data written to the registry after shutdown... GetMaxLana : Failed to open registryNetbios : GetMaxLana failed with status c0000001 Netbios : device not found \Device\NetBT_Tcpip_{345D2F3E-23A1-4793-A1FE-D741C4AE0240} Waiting on: \Driver\VERIFIER_FILTER e6868c40 irp (ce7b4e10) SetPower-Shutdown status 0 HvpGetCellMapped called after shutdown for Hive = FFFFFA800079A000 Cell = 372b0 Waiting on: \Driver\VERIFIER_FILTER e6868c40 irp (ce7b4e10) SetPower-Shutdown status 0 Waiting on: \Driver\VERIFIER_FILTER e6868c40 irp (ce7b4e10) SetPower-Shutdown status 0 HvpGetCellMapped called after shutdown for Hive = FFFFFA800078B000 Cell = c05238 ~1:sync_destroy_obj() !ERROR!: Error waiting for references to be released - delaying. ~1:print_al_obj() !ERROR!: AL object fffffabec6602b70(AL_OBJ_TYPE_CI_CA), parent: fffffabec53eee70 ref_cnt: 1 Waiting on: \Driver\VERIFIER_FILTER e6868c40 irp (ce7b4e10) SetPower-Shutdown status 0 Waiting on: \Driver\VERIFIER_FILTER e6868c40 irp (ce7b4e10) SetPower-Shutdown status 0 Process.Thread : 0000000000000334.0000000000000410 (wmiprvse.exe) is trying to create key: ObjectAttributes = 00000000011DE108 The caller should not rely on data written to the registry after shutdown... Process.Thread : 0000000000000334.0000000000000410 (wmiprvse.exe) is trying to create key: ObjectAttributes = 00000000011DDFE0 The caller should not rely on data written to the registry after shutdown... 520: WORK_QUEUE: no work for 180000 ms: committing suicide... 520: WORK_QUEUE: worker thread exiting Waiting on: \Driver\VERIFIER_FILTER e6868c40 irp (ce7b4e10) SetPower-Shutdown status 0 *** Assertion failed: cl_status == CL_SUCCESS *** Source File: k:\windows-openib\src\winib-459\core\al\al_common.c, line 535 Break repeatedly, break Once, Ignore, terminate Process, or terminate Thread (boipt)? i i ~1:sync_destroy_obj() !ERROR!: Forcing object destruction. 
~1:print_al_obj() !ERROR!: AL object fffffabec6602b70(AL_OBJ_TYPE_CI_CA), parent: fffffabec53eee70 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5db2d60(AL_OBJ_TYPE_H_AL), parent: fffffabec53eee70 ref_cnt: 1037 ~1:print_al_obj() !ERROR!: AL object fffffabec5db6e20(AL_OBJ_TYPE_PNP_MGR), parent: fffffabec53eee70 ref_cnt: 12 ~1:print_al_obj() !ERROR!: AL object fffffabec5ddcd00(AL_OBJ_TYPE_H_MAD_POOL), parent: fffffabec5db2d60 ref_cnt: 3 ~1:print_al_obj() !ERROR!: AL object fffffabec5604ca0(AL_OBJ_TYPE_RES_MGR), parent: fffffabec53eee70 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd80a8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd8230(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd83b8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd8540(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd86c8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd8850(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd89d8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd8b60(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd8ce8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5dd8e70(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e60f8(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e6238(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e6378(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e64b8(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e65f8(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e6738(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e6878(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e69b8(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e6af8(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e6c38(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e6d78(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec56e6eb8(AL_OBJ_TYPE_H_MR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ce018(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ce150(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ce288(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ce3c0(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ce4f8(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object 
fffffadfe67ce630(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ce768(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ce8a0(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ce9d8(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ceb10(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67cec48(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ced80(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffadfe67ceeb8(AL_OBJ_TYPE_H_FMR), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5df0ca0(AL_OBJ_TYPE_SMI), parent: fffffabec53eee70 ref_cnt: 4 ~1:print_al_obj() !ERROR!: AL object fffffabec5c04d70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5df4d70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5c02ec0(AL_OBJ_TYPE_SA_REQ_SVC), parent: fffffabec53eee70 ref_cnt: 2 ~1:print_al_obj() !ERROR!: AL object fffffabec5c1cd70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5c20a40(AL_OBJ_TYPE_CM), parent: fffffabec53eee70 ref_cnt: 2 ~1:print_al_obj() !ERROR!: AL object fffffabec5c32d70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5c58ea0(AL_OBJ_TYPE_DM), parent: fffffabec53eee70 ref_cnt: 3 ~1:print_al_obj() !ERROR!: AL object fffffabec5c42d70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5c26d70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5bc89b0(AL_OBJ_TYPE_IOC_PNP_MGR), parent: fffffabec53eee70 ref_cnt: 2 ~1:print_al_obj() !ERROR!: AL object fffffabec5c4cd70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5c62d70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec5c9ed70(AL_OBJ_TYPE_H_PNP), parent: fffffabec5db2d60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec6602b70(AL_OBJ_TYPE_CI_CA), parent: fffffabec53eee70 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec660cea0(AL_OBJ_TYPE_H_CA), parent: fffffabec5db2d60 ref_cnt: 3 ~1:print_al_obj() !ERROR!: AL object fffffabec6660e90(AL_OBJ_TYPE_H_PD), parent: fffffabec660cea0 ref_cnt: 2 ~1:print_al_obj() !ERROR!: AL object fffffabec6668e90(AL_OBJ_TYPE_H_PD), parent: fffffabec660cea0 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec666ae70(AL_OBJ_TYPE_H_POOL_KEY), parent: fffffabec6660e90 ref_cnt: 1025 ~1:print_al_obj() !ERROR!: AL object fffffabec741cd60(AL_OBJ_TYPE_SMI), parent: fffffabec5df0ca0 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7c86d70(AL_OBJ_TYPE_H_QP), parent: fffffabec6668e90 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7c90d20(AL_OBJ_TYPE_H_MAD_SVC), parent: fffffabec7c86d70 ref_cnt: 2 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea60a8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea6230(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object 
fffffabec7ea63b8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea6540(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea66c8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea6850(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea69d8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea6b60(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea6ce8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7ea6e70(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1a0a8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1a230(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1a3b8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1a540(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1a6c8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1a850(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1a9d8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1ab60(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1ace8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f1ae70(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f920a8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f92230(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f923b8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f92540(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f926c8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f92850(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f929d8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f92b60(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f92ce8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7f92e70(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec747cd60(AL_OBJ_TYPE_H_AL), parent: fffffabec53eee70 ref_cnt: 2 ~1:print_al_obj() !ERROR!: AL object fffffabec7e7ad70(AL_OBJ_TYPE_H_PNP), parent: fffffabec747cd60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabec7e8ad60(AL_OBJ_TYPE_H_AL), parent: fffffabec53eee70 ref_cnt: 2 ~1:print_al_obj() !ERROR!: AL object fffffabec7fe6d70(AL_OBJ_TYPE_H_PNP), parent: fffffabec7e8ad60 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89c0a8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 
ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89c230(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89c3b8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89c540(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89c6c8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89c850(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89c9d8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89cb60(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89cce8(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 ~1:print_al_obj() !ERROR!: AL object fffffabecd89ce70(AL_OBJ_TYPE_H_AV), parent: 0000000000000000 ref_cnt: 1 VideoPortPowerDispatch: ERROR IN MINIPORT! VideoPortPowerDispatch: Miniport cannot refuse set power request VideoPortPowerDispatch: ERROR IN MINIPORT! VideoPortPowerDispatch: Miniport cannot refuse set power request Shutdown occurred...unloading all symbol tables. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From dneog at cisco.com Sun Aug 20 21:39:44 2006 From: dneog at cisco.com (Dipak Neog (dneog)) Date: Mon, 21 Aug 2006 10:09:44 +0530 Subject: [openib-general] Mellanox SRP target implementation Message-ID: <6CD20DE6DA82B7428FD0EB44721B3D2801C0CC1E@xmb-blr-417.apac.cisco.com> Hi, Can anybody tell me where to find and download the mellanox "SRP target" implementation code which was supposed to be released to openib? Thanks, Dipak -------------- next part -------------- An HTML attachment was scrubbed... URL: From eli at mellanox.co.il Sun Aug 20 23:06:15 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 21 Aug 2006 09:06:15 +0300 Subject: [openib-general] [PATCH] huge pages support Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30249FAF8@mtlexch01.mtl.com> Oh that one I have not seen... but it looks like this is the approach I took in VAPI and somehow it looked cumbersome to me. -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier Sent: Sunday, August 20, 2006 7:02 PM To: Tziporet Koren Cc: Openib-general at openib.org Subject: Re: [PATCH] huge pages support Roland> Sure, you are building the OFED release so you can put Roland> whatever you want into it. ...although maybe it would be a good idea to follow the approach of the second patch posted, and make multiple get_user_pages() calls, skipping along by HPAGE_SIZE. This avoids having all the extra work of follow_hugetlb_page() creating extra fake pages and then calling put_page() many times. - R. 
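A minimal sketch of the per-huge-page pinning approach described above, for illustration only: get_user_pages() is called once per HPAGE_SIZE step, so a single page reference is taken per huge page instead of one reference per 4K constituent via follow_hugetlb_page(). The function name pin_hugetlb_region() is made up here, the 2.6.17-era get_user_pages() signature is assumed, and the caller is assumed to have checked that the region is HPAGE_SIZE-aligned and hugetlb-backed; the actual patch under discussion may differ.

/* Assumes <linux/mm.h>, <linux/sched.h>, <linux/hugetlb.h>. */
static int pin_hugetlb_region(unsigned long start, unsigned long length,
                              struct page **pages, int max_pages)
{
        unsigned long addr;
        int i = 0, ret;

        down_read(&current->mm->mmap_sem);
        for (addr = start; addr < start + length && i < max_pages;
             addr += HPAGE_SIZE, ++i) {
                /* One get_user_pages() call, one page reference per huge page. */
                ret = get_user_pages(current, current->mm, addr, 1,
                                     1 /* write */, 0 /* force */,
                                     &pages[i], NULL);
                if (ret < 1)
                        goto unpin;
        }
        up_read(&current->mm->mmap_sem);
        return i;               /* number of huge pages pinned */

unpin:
        while (--i >= 0)
                put_page(pages[i]);
        up_read(&current->mm->mmap_sem);
        return -EFAULT;
}

Teardown then needs only one put_page() per huge page, which is the saving relative to a single get_user_pages() call that walks every 4K page of the region.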
_______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tziporet at mellanox.co.il Mon Aug 21 00:45:33 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 21 Aug 2006 10:45:33 +0300 Subject: [openib-general] Mellanox SRP target implementation In-Reply-To: <6CD20DE6DA82B7428FD0EB44721B3D2801C0CC1E@xmb-blr-417.apac.cisco.com> References: <6CD20DE6DA82B7428FD0EB44721B3D2801C0CC1E@xmb-blr-417.apac.cisco.com> Message-ID: <44E9649D.7070603@mellanox.co.il> Dipak Neog (dneog) wrote: > Hi, > > Can anybody tell me where to find and download the mellanox "SRP target" > implementation code which was supposed to be released to openib? > > Thanks, > Dipak > > Mellanox SRP target is under: https://openib.org/svn/trunk/contrib/mellanox/gen1/ib_srpt/ Note that we have made some fixes from the time we posted this code. The code is also available as part of Mellanox gen1 package - IBGD (available on Mellanox web site) Tziporet From thomas.bub at thomson.net Mon Aug 21 01:18:39 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Mon, 21 Aug 2006 10:18:39 +0200 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM Message-ID: Sean, as I understand cmpost.c and simple.c where originally pure libibcm examples. Is there any other test code utilizing the libibcm? Thanks Thomas -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Friday, August 18, 2006 6:09 PM To: Bub Thomas Cc: Sean Hefty; openib-general at openib.org; Erez Cohen Subject: Re: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM Bub Thomas wrote: > Can I still use the LID, GUID and SubnetID for connection establishment > then? Then Gen1 counterpart has no IP over IB running. If IPoIB is not running, then you will need to use the IB CM directly. The RDMA CM uses ARP to resolve IP addresses to GIDs. > I'm using OFED-1.0.1. > Do you have a quick link where to find the latest Headers? > (Sorry for the dumb question) https://openfabrics.org/svn/gen2/trunk/src/ https://openfabrics.org/svn/gen2/trunk/src/userspace/libibcm https://openfabrics.org/svn/gen2/trunk/src/userspace/librdmacm https://openfabrics.org/svn/gen2/trunk/src/linux-kernel/infiniband/ - Sean From thomas.bub at thomson.net Mon Aug 21 01:30:06 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Mon, 21 Aug 2006 10:30:06 +0200 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 Message-ID: Sean, ib_ ucm is already loaded!? 
Here is the list of all loaded ib modules and their dependencies: ib_rds 37656 0 ib_ucm 21512 0 ib_srp 33924 0 ib_sdp 45468 0 rdma_cm 26760 3 rdma_ucm,ib_rds,ib_sdp ib_addr 10504 1 rdma_cm ib_cm 39952 3 ib_ucm,ib_srp,rdma_cm ib_local_sa 14232 2 rdma_ucm,rdma_cm findex 6784 1 ib_local_sa ib_ipoib 59800 0 ib_sa 18196 4 ib_srp,rdma_cm,ib_local_sa,ib_ipoib ib_uverbs 47408 1 rdma_ucm ib_umad 20272 0 ib_ipath 70424 0 ipath_core 179524 1 ib_ipath ib_mthca 140336 0 ib_mad 43304 5 ib_cm,ib_local_sa,ib_sa,ib_umad,ib_mthca ib_core 59520 14 ib_rds,ib_ucm,ib_srp,ib_sdp,rdma_cm,ib_cm,ib_local_sa,ib_ipoib,ib_sa,ib_ uverbs,ib_umad,ib_ipath,ib_mthca,ib_mad scsi_mod 140177 6 ib_srp,qla2xxx,scsi_transport_fc,libata,mptscsih,sd_mod Thanks Thomas -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Friday, August 18, 2006 6:06 PM To: Bub Thomas Cc: openib-general at openib.org Subject: Re: [openib-general] libibcm can't open /dev/infiniband/ucm0 Bub Thomas wrote: > It seems as if the problem I had there was not in my code but the > libibcm not being able to open the device /dev/infiniband/ucm0. You will need to load ib_ucm, which exports the IB CM to userspace. - Sean From krkumar2 at in.ibm.com Mon Aug 21 02:01:32 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 21 Aug 2006 14:31:32 +0530 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: Message-ID: Hi Roland & Sean, What is your opinion on this patch set ? Anything else needs to be done for acceptance ? Thanks, - KK openib-general-bounces at openib.org wrote on 08/16/2006 11:42:35 AM: > Hi James, > > Sorry for the delay, we had a long weekend. > > > > > My opinion is that the create_qp taking generic parameters is > > > > correct, only subsequent calls may need to use transport specific > > > > calls/arguments. Infact rdma_create_qp uses the ibv_create_qp (now > > > > changed to rdmav_create_qp) call internally. > > > > > > If you want to have a generic rdmav_create_qp() call, there needs to > > > be programmatic way for the API consumer to determine what type of QP > > > (iWARP vs. IB) was created. > > > > > > I don't see any way to do that in your patch: > > > > I think the QP is associated with the transport type indirectly through > > the context. It can be queried with ibv_get_transport_type verb. A > > renamed rdma_get_transport type would probably suffice. > > Correct. Opening the device using rdmav_open_device with argument provided > by > the ULP will provide the context, which is used by subsequent calls to > transparently > make use of other calls. Either Steve or I can provide the > rdmav_get_transport_type() > call to return the actual device (transport) type. > > > > I like the new approach you are taking (keeping 1 verbs library and > > > adding rdmav_ symbol names). This change to transport neutral names is > > > > long overdue. > > > > > > When you finish with the userspace APIs, I hope you will update the > > > kernel APIs as well. > > Sure. > > Thanks, > > - KK > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From devesh28 at gmail.com Mon Aug 21 02:59:11 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Mon, 21 Aug 2006 15:29:11 +0530 Subject: [openib-general] QP SQD State processing. 
Message-ID: <309a667c0608210259me94ae25jddb78dca0a8f8de8@mail.gmail.com> Hello all, I am facing a problem in interpreting the statement given in the IB Spec. It is regarding asynchronous event generation by a QP in the SQD state. The Spec says: C10-35 When transitioning into the SQD state, the QP/EE's send logic must cease processing any additional messages. It must also complete any outstanding messages on a message boundary, and process any incoming acknowledgements. The CI must not begin processing additional messages which had not begun execution when the state transition occurred. What is the meaning of "message boundary"? Is it a descriptor which is under processing, or does it mean something else? How does the HCA behave? Does the HCA generate the event immediately after completing processing of the current descriptor, or only after completing all the descriptors for which the DB ring was done? Thanks From ianjiang.ict at gmail.com Mon Aug 21 04:53:37 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Mon, 21 Aug 2006 19:53:37 +0800 Subject: [openib-general] Mellanox SRP target implementation Message-ID: <7b2fa1820608210453x7d47f352p4713147a92c04b86@mail.gmail.com> Hi Dipak, You can find it at https://openib.org/svn/trunk/contrib/mellanox/gen1/ib_srpt/ -- Ian Jiang From halr at voltaire.com Mon Aug 21 05:16:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Aug 2006 08:16:55 -0400 Subject: [openib-general] [PATCH TRIVIAL] opensm: GUID net to host conversion in log prints In-Reply-To: <20060820162720.GV18411@sashak.voltaire.com> References: <20060820162720.GV18411@sashak.voltaire.com> Message-ID: <1156162596.9855.195921.camel@hal.voltaire.com> On Sun, 2006-08-20 at 12:27, Sasha Khapyorsky wrote: > Print GUID value in host byte order. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From halr at voltaire.com Mon Aug 21 05:29:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Aug 2006 08:29:50 -0400 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <20060820160538.12435.23041.stgit@sashak.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> Message-ID: <1156163374.9855.196244.camel@hal.voltaire.com> On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote: > In case when OpenSM log file overflows filesystem and write() fails with > 'No space left on device' try to truncate the log file and wrap-around > logging. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From halr at voltaire.com Mon Aug 21 05:41:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Aug 2006 08:41:40 -0400 Subject: [openib-general] [PATH TRIVIAL] opensm: management/Makefile: build rules improvement In-Reply-To: <20060820162436.GU18411@sashak.voltaire.com> References: <20060820162436.GU18411@sashak.voltaire.com> Message-ID: <1156164094.9855.196567.camel@hal.voltaire.com> On Sun, 2006-08-20 at 12:24, Sasha Khapyorsky wrote: > Some minor additions to management/Makefile build rules - now this will > run autogen.sh and ./configure (without options) if needed. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From halr at voltaire.com Mon Aug 21 05:44:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Aug 2006 08:44:53 -0400 Subject: [openib-general] [PATCH 08/13] IB/ehca: qp In-Reply-To: <20068171311.7Z4EtLP0ZYtya78R@cisco.com> References: <20068171311.7Z4EtLP0ZYtya78R@cisco.com> Message-ID: <1156164265.9855.196633.camel@hal.voltaire.com> On Thu, 2006-08-17 at 16:11, Roland Dreier wrote: [snip...]
> diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c b/drivers/infiniband/hw/ehca/ehca_sqp.c > new file mode 100644 > index 0000000..d2c5552 > --- /dev/null > +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c > @@ -0,0 +1,123 @@ > +/* > + * IBM eServer eHCA Infiniband device driver for Linux on POWER > + * > + * SQP functions > + * > + * Authors: Khadija Souissi > + * Heiko J Schick [snip...] > + > +extern int ehca_create_aqp1(struct ehca_shca *shca, struct ehca_sport *sport); > +extern int ehca_destroy_aqp1(struct ehca_sport *sport); > + > +extern int ehca_port_act_time; > + > +/** > + * ehca_define_sqp - Defines special queue pair 1 (GSI QP). When special queue > + * pair is created successfully, the corresponding port gets active. > + * > + * Define Special Queue pair 0 (SMI QP) is still not supported. > + * > + * @qp_init_attr: Queue pair init attributes with port and queue pair type > + */ > + > +u64 ehca_define_sqp(struct ehca_shca *shca, > + struct ehca_qp *ehca_qp, > + struct ib_qp_init_attr *qp_init_attr) > +{ > + > + u32 pma_qp_nr = 0; > + u32 bma_qp_nr = 0; > + u64 ret = H_SUCCESS; > + u8 port = qp_init_attr->port_num; > + int counter = 0; > + > + EDEB_EN(7, "port=%x qp_type=%x", > + port, qp_init_attr->qp_type); > + > + shca->sport[port - 1].port_state = IB_PORT_DOWN; > + > + switch (qp_init_attr->qp_type) { > + case IB_QPT_SMI: > + /* function not supported yet */ > + break; > + case IB_QPT_GSI: > + ret = hipz_h_define_aqp1(shca->ipz_hca_handle, > + ehca_qp->ipz_qp_handle, > + ehca_qp->galpas.kernel, > + (u32) qp_init_attr->port_num, > + &pma_qp_nr, &bma_qp_nr); > + > + if (ret != H_SUCCESS) { > + EDEB_ERR(4, "Can't define AQP1 for port %x. rc=%lx", > + port, ret); > + goto ehca_define_aqp1; > + } > + break; > + default: > + ret = H_PARAMETER; > + goto ehca_define_aqp1; > + } > + > + while ((shca->sport[port - 1].port_state != IB_PORT_ACTIVE) && > + (counter < ehca_port_act_time)) { > + EDEB(6, "... wait until port %x is active", > + port); > + msleep_interruptible(1000); > + counter++; > + } > + > + if (counter == ehca_port_act_time) { > + EDEB_ERR(4, "Port %x is not active.", port); > + ret = H_HARDWARE; > + } > + > +ehca_define_aqp1: > + EDEB_EX(7, "ret=%lx", ret); > + > + return ret; > +} I, for one, was hoping that the timer based transition to active for QP1 would have been resolved before being submitted. Any idea on the plan to resolve this ? -- Hal From RAISCH at de.ibm.com Mon Aug 21 05:59:39 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Mon, 21 Aug 2006 14:59:39 +0200 Subject: [openib-general] [PATCH 08/13] IB/ehca: qp In-Reply-To: <1156164265.9855.196633.camel@hal.voltaire.com> Message-ID: > > I, for one, was hoping that the timer based transition to active for QP1 > would have been resolved before being submitted. Any idea on the plan to > resolve this ? > > -- Hal > > > We're testing it. As I mentioned before, this requires a change in the system firmware. I personally don't think this will show up in firmware in time for 2.6.19 merge window. So unfortunately we still need that loop. Regards . . . 
Christoph R From halr at voltaire.com Mon Aug 21 06:22:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Aug 2006 09:22:03 -0400 Subject: [openib-general] OpenSM/osm_sa_path_record.c: In __osm_pr_rcv_get_lid_pair_path, remove double calculation of reversible path Message-ID: <1156166466.9855.197583.camel@hal.voltaire.com> OpenSM/osm_sa_path_record.c: In __osm_pr_rcv_get_lid_pair_path, remove double calculation of reversible path Pointed out by Sasha Khapyorsky Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 9021) +++ opensm/osm_sa_path_record.c (working copy) @@ -728,7 +728,7 @@ __osm_pr_rcv_get_lid_pair_path( rev_path_status = __osm_pr_rcv_get_path_parms( p_rcv, p_pr, p_dest_port, p_src_port, src_lid_ho, comp_mask, &rev_path_parms ); - path_parms.reversible = (rev_path_status == IB_SUCCESS); + path_parms.reversible = ( rev_path_status == IB_SUCCESS ); /* did we get a Reversible Path compmask ? */ /* @@ -738,11 +738,6 @@ __osm_pr_rcv_get_lid_pair_path( */ if( comp_mask & IB_PR_COMPMASK_REVERSIBLE ) { - /* now try the reversible path */ - rev_path_status = __osm_pr_rcv_get_path_parms( p_rcv, p_pr, p_dest_port, - p_src_port, src_lid_ho, - comp_mask, &rev_path_parms ); - path_parms.reversible = (rev_path_status == IB_SUCCESS); if( (! path_parms.reversible && ( p_pr->num_path & 0x80 ) ) ) { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, From mst at mellanox.co.il Mon Aug 21 06:22:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 Aug 2006 16:22:22 +0300 Subject: [openib-general] restore missing PCI registers after reset In-Reply-To: <20060726164246.GE9871@suse.de> References: <20060726164246.GE9871@suse.de> Message-ID: <20060821132222.GM13693@mellanox.co.il> Quoting r. Greg KH : > Subject: Re: restore missing PCI registers after reset > > On Wed, Jul 26, 2006 at 07:32:26PM +0300, Michael S. Tsirkin wrote: > > Quoting r. Greg KH : > > > I think pci_restore_state() already restores the msi and msix state, > > > take a look at the latest kernel version :) > > > > Yes, I know :) > > but I am not talking abotu MSI/MSI-X, I am talking about the following: > > > > > PCI-X device: PCI-X command register > > > > > PCI-X bridge: upstream and downstream split transaction registers > > > > > PCI Express : PCI Express device control and link control registers > > > > these register values include maxumum MTU for PCI express and other vital > > data. > > Make up a patch that shows how you would save these in a generic way and > we can discuss it. I know people have talked about saving the extended > PCI config space for devices that need it, so that might be all you > need to do here. Like this? -- Restore PCI Express capability registers after PM event. This includes maxumum MTU for PCI express and other vital data. Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 9f79dd6..198b200 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -443,6 +443,52 @@ pci_power_t pci_choose_state(struct pci_ EXPORT_SYMBOL(pci_choose_state); +static int __pci_save_pcie_state(struct pci_dev *dev) +{ + int pos, i = 0; + struct pci_cap_saved_state *save_state; + u16 *cap; + + pos = pci_find_capability(dev, PCI_CAP_ID_EXP); + if (pos <= 0) + return 0; + + save_state = kzalloc(sizeof(struct pci_cap_saved_state) + sizeof(u16) * 4, + GFP_KERNEL); + if (!save_state) { + printk(KERN_ERR "Out of memory in pci_save_pcie_state\n"); + return -ENOMEM; + } + cap = (u16 *)&save_state->data[0]; + + pci_read_config_word(dev, pos + PCI_EXP_DEVCTL, &cap[i++]); + pci_read_config_word(dev, pos + PCI_EXP_LNKCTL, &cap[i++]); + pci_read_config_word(dev, pos + PCI_EXP_SLTCTL, &cap[i++]); + pci_read_config_word(dev, pos + PCI_EXP_RTCTL, &cap[i++]); + pci_add_saved_cap(dev, save_state); + return 0; +} + +static void __pci_restore_pcie_state(struct pci_dev *dev) +{ + int i = 0, pos; + struct pci_cap_saved_state *save_state; + u16 *cap; + + save_state = pci_find_saved_cap(dev, PCI_CAP_ID_EXP); + pos = pci_find_capability(dev, PCI_CAP_ID_EXP); + if (!save_state || pos <= 0) + return; + cap = (u16 *)&save_state->data[0]; + + pci_write_config_word(dev, pos + PCI_EXP_DEVCTL, cap[i++]); + pci_write_config_word(dev, pos + PCI_EXP_LNKCTL, cap[i++]); + pci_write_config_word(dev, pos + PCI_EXP_SLTCTL, cap[i++]); + pci_write_config_word(dev, pos + PCI_EXP_RTCTL, cap[i++]); + pci_remove_saved_cap(save_state); + kfree(save_state); +} + /** * pci_save_state - save the PCI configuration space of a device before suspending * @dev: - PCI device that we're dealing with @@ -458,6 +504,8 @@ pci_save_state(struct pci_dev *dev) return i; if ((i = pci_save_msix_state(dev)) != 0) return i; + if ((i = __pci_save_pcie_state(dev)) != 0) + return i; return 0; } @@ -471,6 +519,9 @@ pci_restore_state(struct pci_dev *dev) int i; int val; + /* PCI Express register must be restored first */ + __pci_restore_pcie_state(dev); + /* * The Base Address register should be programmed before the command * register(s) -- MST From rdreier at cisco.com Mon Aug 21 07:44:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 Aug 2006 07:44:19 -0700 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: (Krishna Kumar2's message of "Mon, 21 Aug 2006 14:31:32 +0530") References: Message-ID: Krishna> Hi Roland & Sean, What is your opinion on this patch set Krishna> ? Anything else needs to be done for acceptance ? It's a very low priority for me, since it's a pain to merge and a pain to maintain, and I don't see any urgency in renaming functions. I'll try to get to it after everything else I want for libibverbs 1.1 is done (expose device type, memory windows, reregister memory region at least) - R. From rdreier at cisco.com Mon Aug 21 07:45:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 Aug 2006 07:45:45 -0700 Subject: [openib-general] [PATCH] huge pages support In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30249FAF8@mtlexch01.mtl.com> (Eli Cohen's message of "Mon, 21 Aug 2006 09:06:15 +0300") References: <6AB138A2AB8C8E4A98B9C0C3D52670E30249FAF8@mtlexch01.mtl.com> Message-ID: Eli> Oh that one I have not seen... but it looks like this is the Eli> approach I took in VAPI and somehow it looked cumbersome to Eli> me. I guess you could benchmark and see if there's a measurable difference. 
put_page() is an atomic operation so doing it 512 times more often seems like it might be significant. - R. From kliteyn at mellanox.co.il Mon Aug 21 08:27:53 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 21 Aug 2006 18:27:53 +0300 Subject: [openib-general] [PATCHv2] osm: OSM crash TRIVIAL bug fix In-Reply-To: <1155821729.9855.38600.camel@hal.voltaire.com> References: <1155560727.9532.39151.camel@hal.voltaire.com> <1155821329.13896.37.camel@kliteynik.yok.mtl.com> <1155821729.9855.38600.camel@hal.voltaire.com> Message-ID: <1156174073.12684.51.camel@kliteynik.yok.mtl.com> Hi Hal. My answers below. I did check the code using the setup that discovered it. This patch should make its way into both trunk and OFED 1.1 branch. Please let me know if there is anything else required for it. Thanks, Yevgeny On Thu, 2006-08-17 at 09:35 -0400, Hal Rosenstock wrote: > Hi Yevgeny, > > On Thu, 2006-08-17 at 09:28, Yevgeny Kliteynik wrote: > > Hi Hal. > > > > > This line wrapped so there is something wrong with your mailer. > > > > I'm using a different mailer now, so I hope that it's ok now. > > Guess we'll see with your next patch with a long line... > > > > > + m->v = NULL; /* just make sure we do not point tofree'd madw */ > > > > > > Also, is this line really needed (and if so why) ? I know you did say > > > "it cleans up old pointers to retrieved madw" but this shouldn't be > > > accessed, right ? > > > > You're right, it shouldn't be accessed. > > Does the fix checked in work as is now ? Did you reverify ? Yes, I did. > > But generally, it's a good practice to assign a null to > > any pointer that points to a freed memory, and should not > > be in use any more. > > It's also good practice that when an issue is found in one place to look > for other occurences of the same issue. > > I'm also not sure this is the general approach that OpenSM takes. Right, I will try to clean these areas once I get to read them. > -- Hal > > > > Also, if this is added here, there are other places where the same > > > thing should be done ? > > > > I just examined this area of code, so this is what I saw. > -- Regards, Yevgeny Kliteynik Mellanox Technologies LTD Tel: +972-4-909-7200 ext: 394 Fax: +972-4-959-3245 P.O. Box 586 Yokneam 20692 ISRAEL From halr at voltaire.com Mon Aug 21 08:25:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Aug 2006 11:25:46 -0400 Subject: [openib-general] [PATCHv2] osm: OSM crash TRIVIAL bug fix In-Reply-To: <1156174073.12684.51.camel@kliteynik.yok.mtl.com> References: <1155560727.9532.39151.camel@hal.voltaire.com> <1155821329.13896.37.camel@kliteynik.yok.mtl.com> <1155821729.9855.38600.camel@hal.voltaire.com> <1156174073.12684.51.camel@kliteynik.yok.mtl.com> Message-ID: <1156173944.1889.164.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2006-08-21 at 11:27, Yevgeny Kliteynik wrote: > Hi Hal. > > My answers below. > I did check the code using the setup that discovered it. > > This patch should make its way into both trunk and OFED 1.1 branch. It's been on both since 8/14. -- Hal > Please let me know if there is anything else required for it. > > Thanks, > > Yevgeny > > On Thu, 2006-08-17 at 09:35 -0400, Hal Rosenstock wrote: > > Hi Yevgeny, > > > > On Thu, 2006-08-17 at 09:28, Yevgeny Kliteynik wrote: > > > Hi Hal. > > > > > > > This line wrapped so there is something wrong with your mailer. > > > > > > I'm using a different mailer now, so I hope that it's ok now. > > > > Guess we'll see with your next patch with a long line... 
> > > > > > > + m->v = NULL; /* just make sure we do not point tofree'd madw */ > > > > > > > > Also, is this line really needed (and if so why) ? I know you did say > > > > "it cleans up old pointers to retrieved madw" but this shouldn't be > > > > accessed, right ? > > > > > > You're right, it shouldn't be accessed. > > > > Does the fix checked in work as is now ? Did you reverify ? > > Yes, I did. > > > > But generally, it's a good practice to assign a null to > > > any pointer that points to a freed memory, and should not > > > be in use any more. > > > > It's also good practice that when an issue is found in one place to look > > for other occurences of the same issue. > > > > I'm also not sure this is the general approach that OpenSM takes. > > Right, I will try to clean these areas once I get to read them. > > > -- Hal > > > > > > Also, if this is added here, there are other places where the same > > > > thing should be done ? > > > > > > I just examined this area of code, so this is what I saw. > > From bos at pathscale.com Mon Aug 21 08:45:17 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 21 Aug 2006 08:45:17 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: <20060816184910.GD5796@mellanox.co.il> References: <20060816183329.GB5796@mellanox.co.il> <20060816184910.GD5796@mellanox.co.il> Message-ID: <1156175117.18663.32.camel@chalcedony.pathscale.com> On Wed, 2006-08-16 at 21:49 +0300, Michael S. Tsirkin wrote: > Woops, only now noticed that this was wrt the ipath driver, not mthca as I > thought. Of course I didn't mean it - I don't edit ipath code in SVN, pathscale > guys do that. We don't actually use SVN to develop the driver either. For kernel stuff, I think it's become just a dumping ground for changes that people have made in their own private trees. This makes it not a suitable place to be pulling driver sources from. Hello everyone, A reminder to mark your calendar for the InfiniBand Trade Association (IBTA) and the OpenFabrics Alliance (OFA) Fall 2006 Developers Conference on September 25, 2006 at the Moscone Center West in San Francisco. The event is being held in co-location with the Fall 2006 Intel Developer Forum. If you are an application developer, systems vendor, hardware/software solution provider or end user of the technology, please join us for presentations and collaborative sessions that will highlight the recent advancements of the InfiniBand specification and available software solutions. The one-day conference begins at 8:30 a.m. with keynotes from Jim Pappas, director of initiative marketing at Intel Corporation, and Krish Ramakrishnan, vice president and general manager of Server Switching at Cisco. In addition, we have an exciting day planned including: * End users sharing experiences on real-life deployment and usage of the technology * Highlights of the recent advancements of the InfiniBand specification by IBTA * Updates on available InfiniBand-supported software solutions from OFA and industry partners * Collaborative sessions and discussions about future joint developments between IBTA and OFA Attendees who register by September 1st can do so for the early-bird rate of $149. Afterwards, the standard registration fee is $199. 
To register for the event, please visit: www.acteva.com/go/IBTAOFADevCon06 Special discount offered to those registering for IDF: If you haven't yet registered for IDF, we invite you to take advantage of an exclusive discount being offered to those attending the IBTA and OFA conference. Attendees may purchase conference passes to IDF at a discounted rate of $700 - a savings of $995 off the standard rate. To register for IDF and receive this discount, please visit: www.intel.com/idf/us/fall2006/registration IBTA and OFA Member Bulk Code: FCAGRCTA The Intel Developer Forum in San Francisco offers attendees over 130 hours of technology training to choose from, led by top Intel and industry engineers who provide critical training that will help you solve your day-to-day, real-time problems. Linh Dinh For OFA & IBTA 206-322-1167, ext. 115 linhd at owenmedia.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.j.woodruff at intel.com Mon Aug 21 09:43:22 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 21 Aug 2006 09:43:22 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED Message-ID: Brian Wrote, >On Wed, 2006-08-16 at 21:49 +0300, Michael S. Tsirkin wrote: >> Woops, only now noticed that this was wrt the ipath driver, not mthca as I >> thought. Of course I didn't mean it - I don't edit ipath code in SVN, pathscale >> guys do that. >We don't actually use SVN to develop the driver either. For kernel >stuff, I think it's become just a dumping ground for changes that people >have made in their own private trees. This makes it not a suitable >place to be pulling driver sources from. > References: Message-ID: <44E9E300.1050808@ichips.intel.com> Krishna Kumar2 wrote: > What is your opinion on this patch set ? Anything else needs to be done > for acceptance ? I don't have any issues with it, but Roland would need to commit the changes to verbs as the first step. - Sean From mshefty at ichips.intel.com Mon Aug 21 09:47:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 09:47:33 -0700 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 In-Reply-To: References: Message-ID: <44E9E3A5.5030008@ichips.intel.com> Bub Thomas wrote: > Here is the list of all loaded ib modules and their dependencies: > > ib_rds 37656 0 > ib_ucm 21512 0 Did you update udev rules to create the device? - Sean From mshefty at ichips.intel.com Mon Aug 21 09:51:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 09:51:48 -0700 Subject: [openib-general] [PATCH] cmpost: allow cmpost to build with latest RDMA CM In-Reply-To: References: Message-ID: <44E9E4A4.7050801@ichips.intel.com> Bub Thomas wrote: > as I understand cmpost.c and simple.c where originally pure libibcm > examples. simple.c was originally a pure libibcm example, but it never actually established any connections. Cmpost has always relied on a separate library to obtain path record information. > Is there any other test code utilizing the libibcm? Cmpost is the only test code that I've written to use the libibcm. - Sean From sashak at voltaire.com Mon Aug 21 10:22:40 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 21 Aug 2006 20:22:40 +0300 Subject: [openib-general] [PATCH] opensm: osm_sa_path_record: mcast destination detection fix Message-ID: <20060821172240.32517.92315.stgit@sashak.voltaire.com> Return error when mcast destination is not consistently indicated. 
Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_sa_path_record.c | 16 +++++++++++----- 1 files changed, 11 insertions(+), 5 deletions(-) diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c index caa9f32..6b0fb28 100644 --- a/osm/opensm/osm_sa_path_record.c +++ b/osm/opensm/osm_sa_path_record.c @@ -1486,7 +1486,7 @@ __osm_pr_match_mgrp_attributes( /********************************************************************** **********************************************************************/ -static boolean_t +static int __osm_pr_rcv_check_mcast_dest( IN osm_pr_rcv_t* const p_rcv, IN const osm_madw_t* const p_madw ) @@ -1494,7 +1494,7 @@ __osm_pr_rcv_check_mcast_dest( const ib_path_rec_t* p_pr; const ib_sa_mad_t* p_sa_mad; ib_net64_t comp_mask; - boolean_t is_multicast = FALSE; + unsigned is_multicast = 0; OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_check_mcast_dest ); @@ -1514,11 +1514,13 @@ __osm_pr_rcv_check_mcast_dest( { if( cl_ntoh16( p_pr->dlid ) >= IB_LID_MCAST_START_HO && cl_ntoh16( p_pr->dlid ) <= IB_LID_MCAST_END_HO ) - is_multicast = TRUE; - else if( is_multicast ) + is_multicast = 1; + else if( is_multicast ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_check_mcast_dest: ERR 1F12: " "PathRecord request indicates MGID but not MLID\n" ); + return -1; + } } Exit: @@ -1693,6 +1695,7 @@ osm_pr_rcv_process( cl_qlist_t pr_list; ib_net16_t sa_status; osm_port_t* requester_port; + int ret; OSM_LOG_ENTER( p_rcv->p_log, osm_pr_rcv_process ); @@ -1737,7 +1740,10 @@ osm_pr_rcv_process( cl_plock_acquire( p_rcv->p_lock ); /* Handle multicast destinations separately */ - if( __osm_pr_rcv_check_mcast_dest( p_rcv, p_madw ) ) + if( (ret = __osm_pr_rcv_check_mcast_dest( p_rcv, p_madw )) < 0) + goto Exit; + + if(ret > 0) goto McastDest; osm_log( p_rcv->p_log, OSM_LOG_DEBUG, From halr at voltaire.com Mon Aug 21 10:45:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Aug 2006 13:45:04 -0400 Subject: [openib-general] [PATCH] opensm: osm_sa_path_record: mcast destination detection fix In-Reply-To: <20060821172240.32517.92315.stgit@sashak.voltaire.com> References: <20060821172240.32517.92315.stgit@sashak.voltaire.com> Message-ID: <1156182301.1889.3131.camel@hal.voltaire.com> On Mon, 2006-08-21 at 13:22, Sasha Khapyorsky wrote: > Return error when mcast destination is not consistently indicated. > > Signed-off-by: Sasha Khapyorsky Thanks. 
Applied (to both trunk and 1.1) with the following minor changes below: > osm/opensm/osm_sa_path_record.c | 16 +++++++++++----- > 1 files changed, 11 insertions(+), 5 deletions(-) > > diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c > index caa9f32..6b0fb28 100644 > --- a/osm/opensm/osm_sa_path_record.c > +++ b/osm/opensm/osm_sa_path_record.c > @@ -1486,7 +1486,7 @@ __osm_pr_match_mgrp_attributes( > > /********************************************************************** > **********************************************************************/ > -static boolean_t > +static int > __osm_pr_rcv_check_mcast_dest( > IN osm_pr_rcv_t* const p_rcv, > IN const osm_madw_t* const p_madw ) > @@ -1494,7 +1494,7 @@ __osm_pr_rcv_check_mcast_dest( > const ib_path_rec_t* p_pr; > const ib_sa_mad_t* p_sa_mad; > ib_net64_t comp_mask; > - boolean_t is_multicast = FALSE; > + unsigned is_multicast = 0; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_check_mcast_dest ); > > @@ -1514,11 +1514,13 @@ __osm_pr_rcv_check_mcast_dest( > { > if( cl_ntoh16( p_pr->dlid ) >= IB_LID_MCAST_START_HO && > cl_ntoh16( p_pr->dlid ) <= IB_LID_MCAST_END_HO ) > - is_multicast = TRUE; > - else if( is_multicast ) > + is_multicast = 1; > + else if( is_multicast ) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_check_mcast_dest: ERR 1F12: " > "PathRecord request indicates MGID but not MLID\n" ); > + return -1; I made this go through the exit so the routine end log message is put into the log. > + } > } > > Exit: > @@ -1693,6 +1695,7 @@ osm_pr_rcv_process( > cl_qlist_t pr_list; > ib_net16_t sa_status; > osm_port_t* requester_port; > + int ret; > > OSM_LOG_ENTER( p_rcv->p_log, osm_pr_rcv_process ); > > @@ -1737,7 +1740,10 @@ osm_pr_rcv_process( > cl_plock_acquire( p_rcv->p_lock ); > > /* Handle multicast destinations separately */ > - if( __osm_pr_rcv_check_mcast_dest( p_rcv, p_madw ) ) > + if( (ret = __osm_pr_rcv_check_mcast_dest( p_rcv, p_madw )) < 0) I added a send of SA status error for IB_MAD_STATUS_INVALID_FIELD here as well as an unlock. > + goto Exit; > + > + if(ret > 0) > goto McastDest; > > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, -- Hal From sashak at voltaire.com Mon Aug 21 10:59:27 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 21 Aug 2006 20:59:27 +0300 Subject: [openib-general] [PATCH] opensm: osm_sa_path_record: mcast destination detection fix In-Reply-To: <1156182301.1889.3131.camel@hal.voltaire.com> References: <20060821172240.32517.92315.stgit@sashak.voltaire.com> <1156182301.1889.3131.camel@hal.voltaire.com> Message-ID: <20060821175927.GB32576@sashak.voltaire.com> On 13:45 Mon 21 Aug , Hal Rosenstock wrote: > On Mon, 2006-08-21 at 13:22, Sasha Khapyorsky wrote: > > Return error when mcast destination is not consistently indicated. > > > > Signed-off-by: Sasha Khapyorsky > > Thanks. 
Applied (to both trunk and 1.1) with the following minor changes > below: > > > osm/opensm/osm_sa_path_record.c | 16 +++++++++++----- > > 1 files changed, 11 insertions(+), 5 deletions(-) > > > > diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c > > index caa9f32..6b0fb28 100644 > > --- a/osm/opensm/osm_sa_path_record.c > > +++ b/osm/opensm/osm_sa_path_record.c > > @@ -1486,7 +1486,7 @@ __osm_pr_match_mgrp_attributes( > > > > /********************************************************************** > > **********************************************************************/ > > -static boolean_t > > +static int > > __osm_pr_rcv_check_mcast_dest( > > IN osm_pr_rcv_t* const p_rcv, > > IN const osm_madw_t* const p_madw ) > > @@ -1494,7 +1494,7 @@ __osm_pr_rcv_check_mcast_dest( > > const ib_path_rec_t* p_pr; > > const ib_sa_mad_t* p_sa_mad; > > ib_net64_t comp_mask; > > - boolean_t is_multicast = FALSE; > > + unsigned is_multicast = 0; > > > > OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_check_mcast_dest ); > > > > @@ -1514,11 +1514,13 @@ __osm_pr_rcv_check_mcast_dest( > > { > > if( cl_ntoh16( p_pr->dlid ) >= IB_LID_MCAST_START_HO && > > cl_ntoh16( p_pr->dlid ) <= IB_LID_MCAST_END_HO ) > > - is_multicast = TRUE; > > - else if( is_multicast ) > > + is_multicast = 1; > > + else if( is_multicast ) { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > "__osm_pr_rcv_check_mcast_dest: ERR 1F12: " > > "PathRecord request indicates MGID but not MLID\n" ); > > + return -1; > > I made this go through the exit so the routine end log message is put > into the log. Right. Now there is 'is_multicast = -1' - you may want to change is_multicast type to int (now it is unsigned). > > > + } > > } > > > > Exit: > > @@ -1693,6 +1695,7 @@ osm_pr_rcv_process( > > cl_qlist_t pr_list; > > ib_net16_t sa_status; > > osm_port_t* requester_port; > > + int ret; > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pr_rcv_process ); > > > > @@ -1737,7 +1740,10 @@ osm_pr_rcv_process( > > cl_plock_acquire( p_rcv->p_lock ); > > > > /* Handle multicast destinations separately */ > > - if( __osm_pr_rcv_check_mcast_dest( p_rcv, p_madw ) ) > > + if( (ret = __osm_pr_rcv_check_mcast_dest( p_rcv, p_madw )) < 0) > > I added a send of SA status error for IB_MAD_STATUS_INVALID_FIELD here > as well as an unlock. Sure. Sasha > > > + goto Exit; > > + > > + if(ret > 0) > > goto McastDest; > > > > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > > -- Hal > From halr at voltaire.com Mon Aug 21 11:04:38 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Aug 2006 14:04:38 -0400 Subject: [openib-general] [PATCH] opensm: osm_sa_path_record: mcast destination detection fix In-Reply-To: <20060821175927.GB32576@sashak.voltaire.com> References: <20060821172240.32517.92315.stgit@sashak.voltaire.com> <1156182301.1889.3131.camel@hal.voltaire.com> <20060821175927.GB32576@sashak.voltaire.com> Message-ID: <1156183476.1889.3565.camel@hal.voltaire.com> On Mon, 2006-08-21 at 13:59, Sasha Khapyorsky wrote: > On 13:45 Mon 21 Aug , Hal Rosenstock wrote: > > On Mon, 2006-08-21 at 13:22, Sasha Khapyorsky wrote: > > > Return error when mcast destination is not consistently indicated. > > > > > > Signed-off-by: Sasha Khapyorsky > > > > Thanks. 
Applied (to both trunk and 1.1) with the following minor changes > > below: > > > > > osm/opensm/osm_sa_path_record.c | 16 +++++++++++----- > > > 1 files changed, 11 insertions(+), 5 deletions(-) > > > > > > diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c > > > index caa9f32..6b0fb28 100644 > > > --- a/osm/opensm/osm_sa_path_record.c > > > +++ b/osm/opensm/osm_sa_path_record.c > > > @@ -1486,7 +1486,7 @@ __osm_pr_match_mgrp_attributes( > > > > > > /********************************************************************** > > > **********************************************************************/ > > > -static boolean_t > > > +static int > > > __osm_pr_rcv_check_mcast_dest( > > > IN osm_pr_rcv_t* const p_rcv, > > > IN const osm_madw_t* const p_madw ) > > > @@ -1494,7 +1494,7 @@ __osm_pr_rcv_check_mcast_dest( > > > const ib_path_rec_t* p_pr; > > > const ib_sa_mad_t* p_sa_mad; > > > ib_net64_t comp_mask; > > > - boolean_t is_multicast = FALSE; > > > + unsigned is_multicast = 0; > > > > > > OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_check_mcast_dest ); > > > > > > @@ -1514,11 +1514,13 @@ __osm_pr_rcv_check_mcast_dest( > > > { > > > if( cl_ntoh16( p_pr->dlid ) >= IB_LID_MCAST_START_HO && > > > cl_ntoh16( p_pr->dlid ) <= IB_LID_MCAST_END_HO ) > > > - is_multicast = TRUE; > > > - else if( is_multicast ) > > > + is_multicast = 1; > > > + else if( is_multicast ) { > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > "__osm_pr_rcv_check_mcast_dest: ERR 1F12: " > > > "PathRecord request indicates MGID but not MLID\n" ); > > > + return -1; > > > > I made this go through the exit so the routine end log message is put > > into the log. > > Right. > > Now there is 'is_multicast = -1' - you may want to change is_multicast > type to int (now it is unsigned). Thanks. Just did that (to both trunk and 1.1). -- Hal > > > > > > + } > > > } > > > > > > Exit: > > > @@ -1693,6 +1695,7 @@ osm_pr_rcv_process( > > > cl_qlist_t pr_list; > > > ib_net16_t sa_status; > > > osm_port_t* requester_port; > > > + int ret; > > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pr_rcv_process ); > > > > > > @@ -1737,7 +1740,10 @@ osm_pr_rcv_process( > > > cl_plock_acquire( p_rcv->p_lock ); > > > > > > /* Handle multicast destinations separately */ > > > - if( __osm_pr_rcv_check_mcast_dest( p_rcv, p_madw ) ) > > > + if( (ret = __osm_pr_rcv_check_mcast_dest( p_rcv, p_madw )) < 0) > > > > I added a send of SA status error for IB_MAD_STATUS_INVALID_FIELD here > > as well as an unlock. > > Sure. 
> > Sasha > > > > > > + goto Exit; > > > + > > > + if(ret > 0) > > > goto McastDest; > > > > > > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > > > > -- Hal > From tziporet at mellanox.co.il Mon Aug 21 11:49:51 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 21 Aug 2006 21:49:51 +0300 Subject: [openib-general] OFED 1.1-rc2 is ready Message-ID: <44EA004F.2060608@mellanox.co.il> Hi, OFED 1.1-RC2 is available on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ File: OFED-1.1-rc2.tgz Please report any issues in bugzilla http://openib.org/bugzilla/ Tziporet & Vlad ------------------------------------------------------------------------------------- Release details: ================ Build_id: OFED-1.1-rc2 openib-1.1 (REV=9037) # User space https://openib.org/svn/gen2/branches/1.1/src/userspace Git: ref: refs/heads/ofed_1_1 commit a13195d7ca0f047f479a58b2a81ff2b796eb8fa4 # MPI mpi_osu-0.9.7-mlx2.2.0.tgz openmpi-1.1-1.src.rpm mpitests-2.0-0.src.rpm OS support: =========== Novell: - SLES 9.0 SP3* - SLES10 (official release)* Redhat: - Redhat EL4 up3 - Redhat EL4 up4* (not supported yet) kernel.org: - Kernel 2.6.17* * Changed from 1.0 release Note: Redhat EL4 up2, Fedora C4 and SuSE Pro 10 were dropped from the list. We keep the backport patches for these OSes and make sure OFED compiles and loads properly, but will not do a full QA cycle. Systems: ======== * x86_64 * x86 * ia64 * ppc64 Main changes from OFED-1.1-rc1: =============================== 1. ipath driver: - Compilation passes on all systems except SLES9 SP3. - See the list of changes in the ipath driver at the end 2. SDP: - Fixed an issue where 32-bit systems ran out of low memory when opening hundreds of sockets. - Added out-of-band and message peek support; telnet and ftp are now working 3. SRP - a new srp_daemon was added - see explanation at the end 4. IPoIB: High availability support using a user-level daemon. The daemon is located under /userspace/ipoibtools/. See explanation at the end. 5. Added Madeye utility 6. Added verbs fork support. Should work from kernel 2.6.16 7. Fatal error support in mthca 8. iSER support in the install script for SLES 10 was fixed 9. Diagnostic tools do not require opensm installation. For this, the following changes were made to the opensm RPM: opensm-devel was removed New packages were added: libosmcomp libosmcomp-devel libosmvendor libosmvendor-devel libopensm libopensm-devel 10. bug fixes: - SRP: Add local_ib_device/local_ib_port attributes to srp scsi_host - mthca: fence bit supported; fixed deadlock in destroy qp - ipoib: connectivity lost on sm lid change - OSM: fix to work with Cisco stack Limitations and known issues: ============================= 1. SDP: For Mellanox Sinai HCAs one must use the latest FW version (1.1.000). 2. SDP: Get peer name is not working properly 3. SDP: Scalability issue when many connections are opened 4. ipath driver does not compile on SLES9 SP3 5. RHEL4 up4 is not supported due to problems in the backport patches. Missing features that should be completed for RC3: ================================================== 1. Core: Huge pages fix 2. IPoIB high availability does not support multicast groups 3. Support RHEL4 up4 Changes in the ipath driver: ============================ * lock resource limit counters correctly * fix for crash on module unload, if cfgports < portcnt * fix handling of kpiobufs * drop requirement that PIO buffers be mmaped write-only * merge ipath_core and ib_ipath drivers * simplify layering code * simplify debugging code after ipath_core and ib_ipath merger * remove stale references to userspace SMA * More changes to support InfiniPath on PowerPC 970 systems. * add new minor device to allow sending of diag packets * do not allow use of CQ entries with invalid counts * account for attached QPs correctly * support new QLogic product naming scheme * add serial number to hardware freeze error message * be more strict about testing the modify QP verb * validate path_mig_state properly * put a limit on the number of QPs that can be created * handle sq_sig_all field correctly * allow SMA to be disabled * fix return value from ipath_poll * print warning if LID not acquired within one minute * allow direct control of Rx polarity inversion srp_daemon explanation: ======================= srp_daemon is a tool that identifies SRP targets in the fabric. Each srp_daemon instance operates on one port. On boot it performs a full rescan of the fabric and then waits for the following events: - a join of a new target to the fabric - a change in the capabilities of a machine that becomes a target - an SA change - an expiration of a predefined timeout When there is an SA change or a timeout expiration, srp_daemon performs a full rescan of the fabric. For each target srp_daemon finds, it checks whether it is already connected to that port; if it is not connected, srp_daemon can either print the target details or connect to it. Run srp_daemon -h for usage. IPoIB HA daemon: ================ The IPoIB HA daemon can be configured in the /etc/infiniband/openib.conf file: # Enable IPoIB High Availability daemon IPOIBHA_ENABLE=yes # PRIMARY_IPOIB_DEV=ib0 # BACKUP_IPOIB_DEV=ib1 The default for PRIMARY_IPOIB_DEV is ib0 and for BACKUP_IPOIB_DEV is ib1. From sashak at voltaire.com Mon Aug 21 12:11:01 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 21 Aug 2006 22:11:01 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <44EA004F.2060608@mellanox.co.il> References: <44EA004F.2060608@mellanox.co.il> Message-ID: <20060821191101.GF32576@sashak.voltaire.com> On 21:49 Mon 21 Aug , Tziporet Koren wrote: > Hi, > > OFED 1.1-RC2 is avilable on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > File: OFED-1.1-rc2.tgz BTW, why is it necessary to put binary *.tgz files under Subversion? We are not going to change them and are not interested in history tracking. Sasha > Please report any issues in bugzilla http://openib.org/bugzilla/ > > Tziporet & Vlad > ------------------------------------------------------------------------------------- > > Release details: > ================ > > Build_id: > OFED-1.1-rc2 > > openib-1.1 (REV=9037) > # User space > https://openib.org/svn/gen2/branches/1.1/src/userspace > Git: > ref: refs/heads/ofed_1_1 > commit a13195d7ca0f047f479a58b2a81ff2b796eb8fa4 > > # MPI > mpi_osu-0.9.7-mlx2.2.0.tgz > openmpi-1.1-1.src.rpm > mpitests-2.0-0.src.rpm > > > OS support: > =========== > Novell: > - SLES 9.0 SP3* > - SLES10 (official release)* > Redhat: > - Redhat EL4 up3 > - Redhat EL4 up4* (not supported yet) > kernel.org: > - Kernel 2.6.17* > * Changed from 1.0 release > > Note: Redhat EL4 up2, Fedora C4 and SuSE Pro 10 were dropped from the list.
> We keep the backport patches for these OSes and make sure OFED compile and > loaded properly but will not do full QA cycle. > > Systems: > ======== > * x86_64 > * x86 > * ia64 > * ppc64 > > Main changes from OFED-1.1-rc1: > =============================== > 1. ipath driver: > - Compilation pass on all systems, except SLES9 SP3. > - See list of changes in the ipath driver at the end > 2. SDP: > - Fixed issue with 32 bit systems run out of low memory when opening hundreds of sockets. > - Added out of band and message peek support; telnet and ftp are now working > 3. SRP - a new srp_daemon was added - see explanation at the end > 4. IPoIB: High availability support using a daemon in user level. > Daemon is located under /userspace/ipoibtools/. See explanation at the end. > 5. Added Madeye utility > 6. Added verbs fork support. Should work from kernel 2.6.16 > 7. Fatal error support in mthca > 8. iSER support in install script for SLES 10 was fixed > 9. Diagnostic tools does not requires opensm installation. > For this the following changes were done to opensm RPM: > opensm-devel was removed > New packages were added: > libosmcomp > libosmcomp-devel > libosmvendor > libosmvendor-devel > libopensm > libopensm-devel > 10. bug fixes: > - SRP: Add local_ib_device/local_ib_port attributes to srp scsi_host > - mthca: fence bit supported; fixed deadlock in destroy qp > - ipoib: connectivity lost on sm lid change > - OSM: fix to work with Cisco stack > > > Limitations and known issues: > ============================= > 1. SDP: For Mellanox Sinai HCAs one must use latest FW version (1.1.000). > 2. SDP: Get peer name is not working properly > 3. SDP: Scalability issue when many connections are opened > 4. ipath driver does not compile on SLES9 SP3 > 5. RHEL4 up4 is not supported due to problems in the backport patches. > > > Missing features that should be completed for RC3: > ================================================== > 1. Core: Huge pages fix > 2. IPoIB high availability does not support multicast groups > 3. Support RHEL4 up4 > > Changes in the ipath driver: > ============================ > * lock resource limit counters correctly > * fix for crash on module unload, if cfgports < portcnt > * fix handling of kpiobufs > * drop requirement that PIO buffers be mmaped write-only > * merge ipath_core and ib_ipath drivers > * simplify layering code > * simplify debugging code after ipath_core and ib_ipath merger > * remove stale references to userspace SMA > * More changes to support InfiniPath on PowerPC 970 systems. > * add new minor device to allow sending of diag packets > * do not allow use of CQ entries with invalid counts > * account for attached QPs correctly > * support new QLogic product naming scheme > * add serial number to hardware freeze error message > * be more strict about testing the modify QP verb > * validate path_mig_state properly > * put a limit on the number of QPs that can be created > * handle sq_sig_all field correctly > * allow SMA to be disabled > * fix return value from ipath_poll > * print warning if LID not acquired within one minute > * allow direct control of Rx polarity inversion > > srp_daemon explanation: > ======================= > srp_daemon is a tool that identifies SRP targets in the fabric. > > Each srp_daemon instance is operating on one port. 
> On boot it performs a full rescan of the fabric and waits to srp_daemon events: > - a join of a new target to the fabric > - a change in the capabilities of a machine that becomes a target > - an SA change > - an expiration of a predefined timeout > > When there is an SA change or a timeout expiration srp_daemon perform a full rescan of the fabric. > > for each target srp_daemon finds, it checks if it is already connected to that port, if it is not connected, srp_daemon can either print the target details or connect to it. > > Run srp_daemon -h for usage. > > > IPoIB HA daemon: > ================ > The IPoIB HA daemon can be configured in /etc/infiniband/openib.conf file: > > # Enable IPoIB High Availability daemon > IPOIBHA_ENABLE=yes > # PRIMARY_IPOIB_DEV=ib0 > # BACKUP_IPOIB_DEV=ib1 > > The default for PRIMARY_IPOIB_DEV is ib0 and for BACKUP_IPOIB_DEV is ib1. > > > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > From tziporet at mellanox.co.il Mon Aug 21 12:08:09 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 21 Aug 2006 22:08:09 +0300 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: References: Message-ID: <44EA0499.8070604@mellanox.co.il> Woodruff, Robert J wrote: > > making life very difficult. I assumed that the latest working code > would be put into SVN, as it has been in the past. If this is not the > case, > then please tell me where I can get the latest HCA drivers that will > work > with Sean's latest code that is in SVN. > > woody > You can take the code from the new OFED 1.1-rc2 Tziporet From robert.j.woodruff at intel.com Mon Aug 21 12:28:19 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 21 Aug 2006 12:28:19 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED Message-ID: Tziporet wrote, >Woodruff, Robert J wrote: >> >> making life very difficult. I assumed that the latest working code >> would be put into SVN, as it has been in the past. If this is not the >> case, >> then please tell me where I can get the latest HCA drivers that will >> work >> with Sean's latest code that is in SVN. >> >> woody >> >You can take the code from the new OFED 1.1-rc2 >Tziporet I can do that as long as it is compatible with the latest changes that Sean has been making and checking into SVN. It would actually be best if you could check your latest driver (if that is the one that is in OFED 1.1) into SVN, so that I can pull it from one place. Same goes for the pathscale driver. Once you have a stable version working with OFED 1.1, can you also check it into the trunk SVN. Bottom line is we either need to keep the code in the trunk up to date or remove it, but having multiple data bases with different versions is somewhat confusing. As for the Mellanox driver, I do not have any problem with the version that is currently in the SVN trunk, it works just fine, and if you only want to put new stable versions in from time to time, that is OK. I would just like to get some version of the pathscale driver and backport patches that will work with the SVN trunk. If that is going to be the OFED 1.1 version, that is fine, but would request that the Qlogic folks check their stable version into SVN or remove what is there if they are not going to keep it up to date. woody From mst at mellanox.co.il Mon Aug 21 12:36:29 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 21 Aug 2006 22:36:29 +0300 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: References: Message-ID: <20060821193629.GA15760@mellanox.co.il> Quoting r. Woodruff, Robert J : > Subject: RE: [openfabrics-ewg] Rollup patch for ipath and OFED > > Brian Wrote, > >On Wed, 2006-08-16 at 21:49 +0300, Michael S. Tsirkin wrote: > > >> Woops, only now noticed that this was wrt the ipath driver, not mthca > as I > >> thought. Of course I didn't mean it - I don't edit ipath code in SVN, > pathscale > >> guys do that. > > >We don't actually use SVN to develop the driver either. For kernel > >stuff, I think it's become just a dumping ground for changes that > people > >have made in their own private trees. This makes it not a suitable > >place to be pulling driver sources from. > > > > I am just looking for stable HCA drivers that will work with the rest of > the latest code that is in SVN. Sean is putting all his changes into SVN > and we need to test them with the HCA drivers and the rest of the stack. > > At this point, I have not a clue where the latest code is for all the > different components and it is making life very difficult. I assumed that the > latest working code would be put into SVN, as it has been in the past. If this > is not the case, then please tell me where I can get the latest HCA drivers > that will work with Sean's latest code that is in SVN. Simply put, ideally each component should be developed separately against upstream versions of the rest of them. Maybe Sean can start publishing git trees with his stuff? AFAIK, there are several developments - sa cache, multicast module, userspace cma access ... and these could even go into separate branches, making it easy for people to mix and match. Sean? -- MST From mst at mellanox.co.il Mon Aug 21 12:43:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 Aug 2006 22:43:36 +0300 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: References: Message-ID: <20060821194336.GB15760@mellanox.co.il> Quoting r. Woodruff, Robert J : > Bottom line is we either need to keep the code in the trunk up to date > or remove it, but having multiple data bases with different versions > is somewhat confusing. Since kernel level code is in kernel.org git trees, as long as we do keep kernel code in subversion this automatically implies multiple different databases. No? -- MST From robert.j.woodruff at intel.com Mon Aug 21 12:55:51 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 21 Aug 2006 12:55:51 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED Message-ID: Michael wrote, Quoting r. Woodruff, Robert J : >> Bottom line is we either need to keep the code in the trunk up to date >> or remove it, but having multiple data bases with different versions >> is somewhat confusing. >Since kernel level code is in kernel.org git trees, >as long as we do keep kernel code in subversion this automatically >implies multiple different databases. No? >-- >MST Yes, there are multiple databases and this is confusing. 
However, I don't think there are any backport patches and such in the kernel.org trees, so we need some database somewhere that has the latest version of the code under development that people can test against; otherwise people will likely be developing and testing against different branches, and when the time comes to integrate it all for the next kernel release, issues could arise. In the past, we always kept the latest development version of the code in SVN and could easily build it to test. I have said this in the past but will say it again: if people want to use git instead of SVN for the kernel code, that is fine, but in that case we should have one git tree somewhere that has all the latest code that we can pull from, and then remove the kernel code from SVN to avoid confusion. woody From mshefty at ichips.intel.com Mon Aug 21 13:18:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 13:18:37 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: <20060821193629.GA15760@mellanox.co.il> References: <20060821193629.GA15760@mellanox.co.il> Message-ID: <44EA151D.1090502@ichips.intel.com> Michael S. Tsirkin wrote: > Simply put, ideally each component should be developed > separately against upstream versions of the rest of > them. While this sounds good, it implies that the components are somewhat isolated from each other, which often isn't the case. > Maybe Sean can start publishing git trees with his stuff? > AFAIK, there are several developments - sa cache, multicast module, > userspace cma access ... and these could even go into separate branches, > making it easy for people to mix and match. OpenFabrics as an organization made the decision to use SVN as its code repository for open-source development. We've discussed using something else a few times, without ever reaching consensus. Until OpenFabrics decides to move to a different source control tool or follow a different development model, I think it's best for all developers to be consistent. - Sean From mst at mellanox.co.il Mon Aug 21 13:16:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 Aug 2006 23:16:14 +0300 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: References: Message-ID: <20060821201614.GC15760@mellanox.co.il> Quoting r. Woodruff, Robert J : > We should have one git tree somewhere that has all the latest code that we can > pull from I just don't think "latest code" is a well defined entity in a distributed development environment. -- MST From ardavis at ichips.intel.com Mon Aug 21 13:33:50 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 21 Aug 2006 13:33:50 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: <20060821201614.GC15760@mellanox.co.il> References: <20060821201614.GC15760@mellanox.co.il> Message-ID: <44EA18AE.5020205@ichips.intel.com> Michael S. Tsirkin wrote: >Quoting r. Woodruff, Robert J : > > >>We should have one git tree somewhere that has all the latest code that we can >>pull from >> >> > >I just don't think "latest code" is a well defined entity in a distributed >development environment. > > > kernel.org is well defined. How are we any different?
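To make the distinction being debated in this thread concrete, the two starting points the posters refer to can be checked out roughly as in the sketch below. The OFED branch URL is the one quoted in the release announcement above; the trunk path and the local directory names are assumptions added only for illustration.

  # Stable OFED 1.1 release branch (URL from the OFED 1.1-rc2 announcement):
  svn checkout https://openib.org/svn/gen2/branches/1.1/ ofed-1.1
  # Development tip (the "gen2 trunk" in SVN) that patches on this list are
  # applied against; the exact trunk path is an assumption:
  svn checkout https://openib.org/svn/gen2/trunk/ gen2-trunk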
From rdreier at cisco.com Mon Aug 21 13:45:02 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 Aug 2006 13:45:02 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: <44EA18AE.5020205@ichips.intel.com> (Arlin Davis's message of "Mon, 21 Aug 2006 13:33:50 -0700") References: <20060821201614.GC15760@mellanox.co.il> <44EA18AE.5020205@ichips.intel.com> Message-ID: Arlin> kernel.org is well defined. How are we any different? Sure, there is the Linus's latest tree. But there's also Len Brown's latest ACPI tree, James Bottomley's latest SCSI tree, Jeff Garzik's latest net driver tree, etc. - R. From mst at mellanox.co.il Mon Aug 21 13:46:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 Aug 2006 23:46:27 +0300 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: <44EA18AE.5020205@ichips.intel.com> References: <44EA18AE.5020205@ichips.intel.com> Message-ID: <20060821204627.GA17157@mellanox.co.il> Quoting r. Arlin Davis : > >>We should have one git tree somewhere that has all the latest code that we > >>can pull from > >> > >> > > > >I just don't think "latest code" is a well defined entity in a distributed > >development environment. > > > > > > kernel.org is well defined. How are we any different? > It depends on what you are talking about. Pls look again at http://www.kernel.org/: The latest stable version of the Linux kernel is: 2.6.17.9 The latest prepatch for the stable Linux kernel tree is: 2.6.18-rc4 The latest snapshot for the stable Linux kernel tree is: 2.6.18-rc4-git1 kernel.org *stable kernels* are well defined. And we have the same with OFED releases and it is not different - so are you saying Woody can just take latest OFED release? If so I agree. But *development* is not usually done on stable tree - it is merged there. See the difference? -- MST From sean.hefty at intel.com Mon Aug 21 14:13:02 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 14:13:02 -0700 Subject: [openib-general] [openfabrics-ewg] Rollup patch for ipath and OFED In-Reply-To: <20060821204627.GA17157@mellanox.co.il> Message-ID: <000001c6c566$964bbd80$8698070a@amr.corp.intel.com> >But *development* is not usually done on stable tree - it is merged there. >See the difference? Let's keep this simple. We submit patches (which are expected to compile and run) against the "latest" code. Today, that is the tip of gen2 branch in SVN. - Sean From sean.hefty at intel.com Mon Aug 21 16:40:12 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 16:40:12 -0700 Subject: [openib-general] [PATCH v3] ib_sa: require SA registration Message-ID: <000501c6c57b$2594dd00$8698070a@amr.corp.intel.com> Require registration with SA module, to prevent module text from going away while sa query callback is still running, and update all users. Signed-off-by: Michael S. Tsirkin Signed-off-by: Sean Hefty --- Changes from the previous post include: * Move struct ib_sa_client definition external. Index: include/rdma/ib_sa.h =================================================================== --- include/rdma/ib_sa.h (revision 8928) +++ include/rdma/ib_sa.h (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2006 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -36,6 +37,9 @@ #ifndef IB_SA_H #define IB_SA_H +#include + +#include #include #include @@ -250,11 +254,28 @@ struct ib_sa_service_rec { u64 data64[2]; }; +struct ib_sa_client { + atomic_t users; + struct completion comp; +}; + +/** + * ib_sa_register_client - Register an SA client. + */ +void ib_sa_register_client(struct ib_sa_client *client); + +/** + * ib_sa_unregister_client - Deregister an SA client. + * @client: Client object to deregister. + */ +void ib_sa_unregister_client(struct ib_sa_client *client); + struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -264,7 +285,8 @@ int ib_sa_path_rec_get(struct ib_device void *context, struct ib_sa_query **query); -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -275,7 +297,8 @@ int ib_sa_mcmember_rec_query(struct ib_d void *context, struct ib_sa_query **query); -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, @@ -288,6 +311,7 @@ int ib_sa_service_rec_query(struct ib_de /** * ib_sa_mcmember_rec_set - Start an MCMember set query + * @client:SA client * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -312,7 +336,8 @@ int ib_sa_service_rec_query(struct ib_de * cancel the query. */ static inline int -ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_set(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -322,7 +347,7 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, timeout_ms, retries, gfp_mask, callback, @@ -331,6 +356,7 @@ ib_sa_mcmember_rec_set(struct ib_device /** * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @client:SA client * @device:device to send query on * @port_num: port number to send query on * @rec:MCMember Record to send in query @@ -355,7 +381,8 @@ ib_sa_mcmember_rec_set(struct ib_device * cancel the query. 
*/ static inline int -ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, +ib_sa_mcmember_rec_delete(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -365,7 +392,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, + return ib_sa_mcmember_rec_query(client, device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, timeout_ms, retries, gfp_mask, callback, Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 8928) +++ core/sa_query.c (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. * Copyright (c) 2005 Voltaire, Inc.  All rights reserved. + * Copyright (c) 2006 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -74,6 +75,7 @@ struct ib_sa_device { struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); + struct ib_sa_client *client; struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; @@ -413,6 +415,38 @@ static void ib_sa_event(struct ib_event_ } } +void ib_sa_register_client(struct ib_sa_client *client) +{ + atomic_set(&client->users, 1); + init_completion(&client->comp); +} +EXPORT_SYMBOL(ib_sa_register_client); + +static inline void ib_sa_client_get(struct ib_sa_query *query, + struct ib_sa_client *client) +{ + atomic_inc(&client->users); + query->client = client; +} + +static inline void deref_client(struct ib_sa_client *client) +{ + if (atomic_dec_and_test(&client->users)) + complete(&client->comp); +} + +static inline void ib_sa_client_put(struct ib_sa_query *query) +{ + deref_client(query->client); +} + +void ib_sa_unregister_client(struct ib_sa_client *client) +{ + deref_client(client); + wait_for_completion(&client->comp); +} +EXPORT_SYMBOL(ib_sa_unregister_client); + /** * ib_sa_cancel_query - try to cancel an SA query * @id:ID of query to cancel @@ -613,6 +647,7 @@ static void ib_sa_path_rec_release(struc /** * ib_sa_path_rec_get - Start a Path get query + * @client:SA client * @device:device to send query on * @port_num: port number to send query on * @rec:Path Record to send in query @@ -636,7 +671,8 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. 
*/ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -671,6 +707,7 @@ int ib_sa_path_rec_get(struct ib_device goto err1; } + ib_sa_client_get(&query->sa_query, client); query->callback = callback; query->context = context; @@ -696,6 +733,7 @@ int ib_sa_path_rec_get(struct ib_device err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -728,6 +766,7 @@ static void ib_sa_service_rec_release(st /** * ib_sa_service_rec_query - Start Service Record operation + * @client:SA client * @device:device to send request on * @port_num: port number to send request on * @method:SA method - should be get, set, or delete @@ -753,7 +792,8 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. */ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, +int ib_sa_service_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, @@ -793,6 +833,7 @@ int ib_sa_service_rec_query(struct ib_de goto err1; } + ib_sa_client_get(&query->sa_query, client); query->callback = callback; query->context = context; @@ -819,6 +860,7 @@ int ib_sa_service_rec_query(struct ib_de err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -849,7 +891,8 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, +int ib_sa_mcmember_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, @@ -885,6 +928,7 @@ int ib_sa_mcmember_rec_query(struct ib_d goto err1; } + ib_sa_client_get(&query->sa_query, client); query->callback = callback; query->context = context; @@ -911,6 +955,7 @@ int ib_sa_mcmember_rec_query(struct ib_d err2: *sa_query = NULL; + ib_sa_client_put(&query->sa_query); ib_free_send_mad(query->sa_query.mad_buf); err1: @@ -947,6 +992,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(mad_send_wc->send_buf); kref_put(&query->sm_ah->ref, free_sm_ah); + ib_sa_client_put(query); query->release(query); } Index: core/multicast.c =================================================================== --- core/multicast.c (revision 8928) +++ core/multicast.c (working copy) @@ -63,6 +63,7 @@ static struct ib_client mcast_client = { .remove = mcast_remove_one }; +static struct ib_sa_client sa_client; static struct ib_event_handler event_handler; static struct workqueue_struct *mcast_wq; static union ib_gid mgid0; @@ -305,8 +306,8 @@ static int send_join(struct mcast_group int ret; group->last_join = member; - ret = ib_sa_mcmember_rec_set(port->dev->device, port->port_num, - &member->multicast.rec, + ret = ib_sa_mcmember_rec_set(&sa_client, port->dev->device, + port->port_num, &member->multicast.rec, member->multicast.comp_mask, retry_timer, retries, GFP_KERNEL, join_handler, group, &group->query); @@ -326,7 +327,8 @@ static int send_leave(struct mcast_group rec = group->rec; rec.join_state = leave_state; - ret = ib_sa_mcmember_rec_delete(port->dev->device, 
port->port_num, &rec, + ret = ib_sa_mcmember_rec_delete(&sa_client, port->dev->device, + port->port_num, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_JOIN_STATE, @@ -770,12 +772,15 @@ static int __init mcast_init(void) if (!mcast_wq) return -ENOMEM; + ib_sa_register_client(&sa_client); + ret = ib_register_client(&mcast_client); if (ret) goto err; return 0; err: + ib_sa_unregister_client(&sa_client); destroy_workqueue(mcast_wq); return ret; } @@ -783,6 +788,7 @@ err: static void __exit mcast_cleanup(void) { ib_unregister_client(&mcast_client); + ib_sa_unregister_client(&sa_client); destroy_workqueue(mcast_wq); } Index: core/cma.c =================================================================== --- core/cma.c (revision 8928) +++ core/cma.c (working copy) @@ -61,6 +61,7 @@ static struct ib_client cma_client = { .remove = cma_remove_one }; +struct ib_sa_client sa_client; static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); @@ -1272,7 +1273,7 @@ static int cma_query_ib_route(struct rdm path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; - id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->query_id = ib_sa_path_rec_get(&sa_client, id_priv->id.device, id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, @@ -2367,12 +2368,15 @@ static int cma_init(void) if (!cma_wq) return -ENOMEM; + ib_sa_register_client(&sa_client); + ret = ib_register_client(&cma_client); if (ret) goto err; return 0; err: + ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); return ret; } @@ -2380,6 +2384,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); idr_destroy(&tcp_ps); Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 8928) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -91,6 +91,8 @@ static struct ib_client ipoib_client = { .remove = ipoib_remove_one }; +static struct ib_sa_client ipoib_sa_client; + int ipoib_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -462,7 +464,7 @@ static int path_rec_start(struct net_dev init_completion(&path->done); path->query_id = - ib_sa_path_rec_get(priv->ca, priv->port, + ib_sa_path_rec_get(&ipoib_sa_client, priv->ca, priv->port, &path->pathrec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | @@ -1187,13 +1189,16 @@ static int __init ipoib_init_module(void goto err_fs; } + ib_sa_register_client(&ipoib_sa_client); + ret = ib_register_client(&ipoib_client); if (ret) - goto err_wq; + goto err_sa; return 0; -err_wq: +err_sa: + ib_sa_unregister_client(&ipoib_sa_client); destroy_workqueue(ipoib_workqueue); err_fs: @@ -1205,6 +1210,7 @@ err_fs: static void __exit ipoib_cleanup_module(void) { ib_unregister_client(&ipoib_client); + ib_sa_unregister_client(&ipoib_sa_client); ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); } Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 8928) +++ ulp/srp/ib_srp.c (working copy) @@ -103,6 +103,8 @@ static struct ib_client srp_client = { .remove = srp_remove_one }; +static struct ib_sa_client srp_sa_client; + static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) { return (struct srp_target_port *) host->hostdata; @@ -274,7 +276,8 
@@ static int srp_lookup_path(struct srp_ta init_completion(&target->done); - target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev->dev, + target->path_query_id = ib_sa_path_rec_get(&srp_sa_client, + target->srp_host->dev->dev, target->srp_host->port, &target->path, IB_SA_PATH_REC_DGID | @@ -2002,9 +2005,12 @@ static int __init srp_init_module(void) return ret; } + ib_sa_register_client(&srp_sa_client); + ret = ib_register_client(&srp_client); if (ret) { printk(KERN_ERR PFX "couldn't register IB client\n"); + ib_sa_unregister_client(&srp_sa_client); class_unregister(&srp_class); return ret; } @@ -2015,6 +2021,7 @@ static int __init srp_init_module(void) static void __exit srp_cleanup_module(void) { ib_unregister_client(&srp_client); + ib_sa_unregister_client(&srp_sa_client); class_unregister(&srp_class); } From sean.hefty at intel.com Mon Aug 21 17:18:37 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 17:18:37 -0700 Subject: [openib-general] [PATCH 1/2] ib_sa: add generic RMPP query interface Message-ID: <000601c6c580$8343eb30$8698070a@amr.corp.intel.com> The following patch adds a generic interface to send MADs to the SA. The primary motivation of adding these calls is to expand the SA query interface to include RMPP responses for users wanting more than a single attribute returned from a query (e.g. multipath record queries). The implementation of existing SA query routines was layered on top of the generic query interface. Signed-off-by: Sean Hefty --- This patch applies on top of the SA registration patch: http://openib.org/pipermail/openib-general/2006-August/025267.html --- infiniband/include/rdma/ib_sa.h 2006-08-21 16:37:12.700292000 -0700 +++ infiniband.user/include/rdma/ib_sa.h 2006-08-21 16:37:52.126298336 -0700 @@ -82,6 +82,32 @@ enum { IB_SA_ATTR_INFORM_INFO_REC = 0xf3 }; +/* Length of SA attributes on the wire */ +enum { + IB_SA_ATTR_CLASS_PORTINFO_LEN = 72, + IB_SA_ATTR_NOTICE_LEN = 80, + IB_SA_ATTR_INFORM_INFO_LEN = 36, + IB_SA_ATTR_NODE_REC_LEN = 108, + IB_SA_ATTR_PORT_INFO_REC_LEN = 58, + IB_SA_ATTR_SL2VL_REC_LEN = 16, + IB_SA_ATTR_SWITCH_REC_LEN = 21, + IB_SA_ATTR_LINEAR_FDB_REC_LEN = 72, + IB_SA_ATTR_RANDOM_FDB_REC_LEN = 72, + IB_SA_ATTR_MCAST_FDB_REC_LEN = 72, + IB_SA_ATTR_SM_INFO_REC_LEN = 25, + IB_SA_ATTR_LINK_REC_LEN = 6, + IB_SA_ATTR_GUID_INFO_REC_LEN = 72, + IB_SA_ATTR_SERVICE_REC_LEN = 176, + IB_SA_ATTR_PARTITION_REC_LEN = 72, + IB_SA_ATTR_PATH_REC_LEN = 64, + IB_SA_ATTR_VL_ARB_REC_LEN = 72, + IB_SA_ATTR_MC_MEMBER_REC_LEN = 52, + IB_SA_ATTR_TRACE_REC_LEN = 46, + IB_SA_ATTR_MULTI_PATH_REC_LEN = 56, + IB_SA_ATTR_SERVICE_ASSOC_REC_LEN= 80, + IB_SA_ATTR_INFORM_INFO_REC_LEN = 60 +}; + enum ib_sa_selector { IB_SA_GTE = 0, IB_SA_LTE = 1, @@ -270,10 +296,83 @@ void ib_sa_register_client(struct ib_sa_ */ void ib_sa_unregister_client(struct ib_sa_client *client); +struct ib_sa_iter; + +/** + * ib_sa_iter_create - Create an iterator that may be used to walk through + * a list of returned SA records. + * @mad_recv_wc: A received response from the SA. + * + * This call allocates an iterator that is used to walk through a list of + * SA records. Users must free the iterator by calling ib_sa_iter_free. + */ +struct ib_sa_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_sa_iter_free - Release an iterator. + * @iter: The iterator to free. + */ +void ib_sa_iter_free(struct ib_sa_iter *iter); + +/** + * ib_sa_iter_next - Move an iterator to reference the next attribute and + * return the attribute. 
+ * @iter: The iterator to move. + * + * The referenced attribute will be in wire format. The funtion returns NULL + * if there are no more attributes to return. + */ +void *ib_sa_iter_next(struct ib_sa_iter *iter); + +/** + * ib_sa_attr_size - Return the length of an SA attribute on the wire. + * @attr_id: Attribute identifier. + */ +int ib_sa_attr_size(__be16 attr_id); + struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); +/** + * ib_sa_send_mad - Send a MAD to the SA. + * @client:SA client + * @device:device to send query on + * @port_num: port number to send query on + * @method:MAD method to use in the send. + * @attr:Reference to attribute in wire format to send in MAD. + * @attr_id:Attribute type identifier. + * @comp_mask:component mask to send in MAD + * @timeout_ms:time to wait for response, if one is expected + * @retries:number of times to retry request + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send a message to the SA. If a response is expected (timeout_ms is + * non-zero), the callback function will be called when the query completes. + * Status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. Mad_recv_wc will reference any returned + * response from the SA. It is the responsibility of the caller to free + * mad_recv_wc by call ib_free_recv_mad() if it is non-NULL. + * + * If the return value of ib_sa_send_mad() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. 
+ */ +int ib_sa_send_mad(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + int method, void *attr, __be16 attr_id, + ib_sa_comp_mask comp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context), + void *context, struct ib_sa_query **query); + int ib_sa_path_rec_get(struct ib_sa_client *client, struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, --- infiniband/core/sa_query.c 2006-08-21 16:37:05.053454496 -0700 +++ infiniband.user/core/sa_query.c 2006-08-21 16:37:42.200807240 -0700 @@ -73,31 +73,42 @@ struct ib_sa_device { }; struct ib_sa_query { - void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); - void (*release)(struct ib_sa_query *); + void (*callback)(int, struct ib_mad_recv_wc *, void *); struct ib_sa_client *client; struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; + void *context; int id; }; struct ib_sa_service_query { void (*callback)(int, struct ib_sa_service_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; }; struct ib_sa_path_query { void (*callback)(int, struct ib_sa_path_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; }; struct ib_sa_mcmember_query { void (*callback)(int, struct ib_sa_mcmember_rec *, void *); void *context; - struct ib_sa_query sa_query; + struct ib_sa_query *sa_query; +}; + +struct ib_sa_iter { + struct ib_mad_recv_wc *recv_wc; + struct ib_mad_recv_buf *recv_buf; + int attr_size; + int attr_offset; + int data_offset; + int data_left; + void *attr; + u8 attr_data[0]; }; static void ib_sa_add_one(struct ib_device *device); @@ -538,9 +549,17 @@ EXPORT_SYMBOL(ib_init_ah_from_mcmember); int ib_sa_pack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { + case IB_SA_ATTR_SERVICE_REC: + ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), + src, dst); + break; case IB_SA_ATTR_PATH_REC: ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); break; + case IB_SA_ATTR_MC_MEMBER_REC: + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + src, dst); + break; default: return -EINVAL; } @@ -551,9 +570,17 @@ EXPORT_SYMBOL(ib_sa_pack_attr); int ib_sa_unpack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { + case IB_SA_ATTR_SERVICE_REC: + ib_unpack(service_rec_table, ARRAY_SIZE(service_rec_table), + src, dst); + break; case IB_SA_ATTR_PATH_REC: ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); break; + case IB_SA_ATTR_MC_MEMBER_REC: + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + src, dst); + break; default: return -EINVAL; } @@ -561,15 +588,100 @@ int ib_sa_unpack_attr(void *dst, void *s } EXPORT_SYMBOL(ib_sa_unpack_attr); -static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +/* Return size of SA attributes on the wire. 
*/ +int ib_sa_attr_size(__be16 attr_id) { - unsigned long flags; + int size; + + switch (be16_to_cpu(attr_id)) { + case IB_SA_ATTR_CLASS_PORTINFO: + size = IB_SA_ATTR_CLASS_PORTINFO_LEN; + break; + case IB_SA_ATTR_NOTICE: + size = IB_SA_ATTR_NOTICE_LEN; + break; + case IB_SA_ATTR_INFORM_INFO: + size = IB_SA_ATTR_INFORM_INFO_LEN; + break; + case IB_SA_ATTR_NODE_REC: + size = IB_SA_ATTR_NODE_REC_LEN; + break; + case IB_SA_ATTR_PORT_INFO_REC: + size = IB_SA_ATTR_PORT_INFO_REC_LEN; + break; + case IB_SA_ATTR_SL2VL_REC: + size = IB_SA_ATTR_SL2VL_REC_LEN; + break; + case IB_SA_ATTR_SWITCH_REC: + size = IB_SA_ATTR_SWITCH_REC_LEN; + break; + case IB_SA_ATTR_LINEAR_FDB_REC: + size = IB_SA_ATTR_LINEAR_FDB_REC_LEN; + break; + case IB_SA_ATTR_RANDOM_FDB_REC: + size = IB_SA_ATTR_RANDOM_FDB_REC_LEN; + break; + case IB_SA_ATTR_MCAST_FDB_REC: + size = IB_SA_ATTR_MCAST_FDB_REC_LEN; + break; + case IB_SA_ATTR_SM_INFO_REC: + size = IB_SA_ATTR_SM_INFO_REC_LEN; + break; + case IB_SA_ATTR_LINK_REC: + size = IB_SA_ATTR_LINK_REC_LEN; + break; + case IB_SA_ATTR_GUID_INFO_REC: + size = IB_SA_ATTR_GUID_INFO_REC_LEN; + break; + case IB_SA_ATTR_SERVICE_REC: + size = IB_SA_ATTR_SERVICE_REC_LEN; + break; + case IB_SA_ATTR_PARTITION_REC: + size = IB_SA_ATTR_PARTITION_REC_LEN; + break; + case IB_SA_ATTR_PATH_REC: + size = IB_SA_ATTR_PATH_REC_LEN; + break; + case IB_SA_ATTR_VL_ARB_REC: + size = IB_SA_ATTR_VL_ARB_REC_LEN; + break; + case IB_SA_ATTR_MC_MEMBER_REC: + size = IB_SA_ATTR_MC_MEMBER_REC_LEN; + break; + case IB_SA_ATTR_TRACE_REC: + size = IB_SA_ATTR_TRACE_REC_LEN; + break; + case IB_SA_ATTR_MULTI_PATH_REC: + size = IB_SA_ATTR_MULTI_PATH_REC_LEN; + break; + case IB_SA_ATTR_SERVICE_ASSOC_REC: + size = IB_SA_ATTR_SERVICE_ASSOC_REC_LEN; + break; + case IB_SA_ATTR_INFORM_INFO_REC: + size = IB_SA_ATTR_INFORM_INFO_REC_LEN; + break; + default: + size = 0; + break; + } + return size; +} +EXPORT_SYMBOL(ib_sa_attr_size); - memset(mad, 0, sizeof *mad); +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent, + int method, void *attr, __be16 attr_id, + ib_sa_comp_mask comp_mask) +{ + unsigned long flags; mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = attr_id; + mad->sa_hdr.comp_mask = comp_mask; + + memcpy(mad->data, attr, ib_sa_attr_size(attr_id)); spin_lock_irqsave(&tid_lock, flags); mad->mad_hdr.tid = @@ -623,31 +735,161 @@ retry: return ret ? 
ret : id; } -static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +struct ib_sa_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_iter *iter; + struct ib_sa_mad *mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + int attr_size, attr_offset; + + attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + attr_size = ib_sa_attr_size(mad->mad_hdr.attr_id); + if (!attr_size || attr_offset < attr_size) + return ERR_PTR(-EINVAL); + + iter = kzalloc(sizeof *iter + attr_size, GFP_KERNEL); + if (!iter) + return ERR_PTR(-ENOMEM); + + iter->data_left = mad_recv_wc->mad_len - IB_MGMT_SA_HDR; + iter->recv_wc = mad_recv_wc; + iter->recv_buf = &mad_recv_wc->recv_buf; + iter->attr_offset = attr_offset; + iter->attr_size = attr_size; + return iter; +} +EXPORT_SYMBOL(ib_sa_iter_create); + +void ib_sa_iter_free(struct ib_sa_iter *iter) { - struct ib_sa_path_query *query = - container_of(sa_query, struct ib_sa_path_query, sa_query); + kfree(iter); +} +EXPORT_SYMBOL(ib_sa_iter_free); + +void *ib_sa_iter_next(struct ib_sa_iter *iter) +{ + struct ib_sa_mad *mad; + int left, offset = 0; + + while (iter->data_left >= iter->attr_offset) { + while (iter->data_offset < IB_MGMT_SA_DATA) { + mad = (struct ib_sa_mad *) iter->recv_buf->mad; + + left = IB_MGMT_SA_DATA - iter->data_offset; + if (left < iter->attr_size) { + /* copy first piece of the attribute */ + iter->attr = &iter->attr_data; + memcpy(iter->attr, + &mad->data[iter->data_offset], left); + offset = left; + break; + } else if (offset) { + /* copy the second piece of the attribute */ + memcpy(iter->attr + offset, &mad->data[0], + iter->attr_size - offset); + iter->data_offset = iter->attr_size - offset; + offset = 0; + } else { + iter->attr = &mad->data[iter->data_offset]; + iter->data_offset += iter->attr_size; + } + + iter->data_left -= iter->attr_offset; + goto out; + } + iter->data_offset = 0; + iter->recv_buf = list_entry(iter->recv_buf->list.next, + struct ib_mad_recv_buf, list); + } + iter->attr = NULL; +out: + return iter->attr; +} +EXPORT_SYMBOL(ib_sa_iter_next); + +int ib_sa_send_mad(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + int method, void *attr, __be16 attr_id, + ib_sa_comp_mask comp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context), + void *context, struct ib_sa_query **query) +{ + struct ib_sa_query *sa_query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port; + struct ib_mad_agent *agent; + int ret; + + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + + sa_query = kmalloc(sizeof *sa_query, gfp_mask); + if (!sa_query) + return -ENOMEM; + + sa_query->mad_buf = ib_create_send_mad(agent, 1, 0, + method == IB_SA_METHOD_GET_MULTI, + IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, + gfp_mask); + if (!sa_query->mad_buf) { + ret = -ENOMEM; + goto err1; + } + + sa_query->port = port; + sa_query->callback = callback; + sa_query->context = context; + + init_mad(sa_query->mad_buf->mad, agent, method, attr, attr_id, + comp_mask); + + ib_sa_client_get(sa_query, client); + ret = send_mad(sa_query, timeout_ms, retries, gfp_mask); + if (ret < 0) + goto err2; - if (mad) { - struct ib_sa_path_rec rec; + *query = sa_query; + return ret; - ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - 
query->callback(status, NULL, query->context); +err2: + ib_sa_client_put(sa_query); + ib_free_send_mad(sa_query->mad_buf); +err1: + kfree(query); + return ret; } +EXPORT_SYMBOL(ib_sa_send_mad); -static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +static void ib_sa_path_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) { - kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); + struct ib_sa_path_query *query = context; + + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_path_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } /** * ib_sa_path_rec_get - Start a Path get query - * @client:SA client * @device:device to send query on * @port_num: port number to send query on * @rec:Path Record to send in query @@ -683,90 +925,54 @@ int ib_sa_path_rec_get(struct ib_sa_clie struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; + u8 path[IB_SA_ATTR_PATH_REC_LEN]; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - - ib_sa_client_get(&query->sa_query, client); query->callback = callback; query->context = context; - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; - query->sa_query.release = ib_sa_path_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = IB_MGMT_METHOD_GET; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); - mad->sa_hdr.comp_mask = comp_mask; - - ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, path); + ret = ib_sa_send_mad(client, device, port_num, IB_MGMT_METHOD_GET, path, + cpu_to_be16(IB_SA_ATTR_PATH_REC), comp_mask, + timeout_ms, retries, gfp_mask, + ib_sa_path_rec_callback, query, &query->sa_query); if (ret < 0) - goto err2; + kfree(query); return ret; - -err2: - *sa_query = NULL; - ib_sa_client_put(&query->sa_query); - ib_free_send_mad(query->sa_query.mad_buf); - -err1: - kfree(query); - return ret; } EXPORT_SYMBOL(ib_sa_path_rec_get); -static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +static void ib_sa_service_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) { - struct ib_sa_service_query *query = - container_of(sa_query, struct ib_sa_service_query, sa_query); + struct ib_sa_service_query *query = context; - if (mad) { - struct ib_sa_service_rec rec; - - ib_unpack(service_rec_table, ARRAY_SIZE(service_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - query->callback(status, NULL, query->context); -} - -static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) -{ - kfree(container_of(sa_query, struct ib_sa_service_query, sa_query)); + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_service_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(service_rec_table, + ARRAY_SIZE(service_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } /** * ib_sa_service_rec_query - Start Service Record operation - * @client:SA client * @device:device to send request on * @port_num: port number to send request on * @method:SA method - should be get, set, or delete @@ -804,97 +1010,56 @@ int ib_sa_service_rec_query(struct ib_sa struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; + u8 service[IB_SA_ATTR_SERVICE_REC_LEN]; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - - if (method != IB_MGMT_METHOD_GET && - method != IB_MGMT_METHOD_SET && - method != IB_SA_METHOD_DELETE) - return -EINVAL; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - - ib_sa_client_get(&query->sa_query, client); query->callback = callback; query->context = context; - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - query->sa_query.callback = callback ? 
ib_sa_service_rec_callback : NULL; - query->sa_query.release = ib_sa_service_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = method; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); - mad->sa_hdr.comp_mask = comp_mask; - - ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), - rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), rec, service); + ret = ib_sa_send_mad(client, device, port_num, method, service, + cpu_to_be16(IB_SA_ATTR_SERVICE_REC), comp_mask, + timeout_ms, retries, gfp_mask, + ib_sa_service_rec_callback, query, + &query->sa_query); if (ret < 0) - goto err2; - - return ret; + kfree(query); -err2: - *sa_query = NULL; - ib_sa_client_put(&query->sa_query); - ib_free_send_mad(query->sa_query.mad_buf); - -err1: - kfree(query); return ret; } EXPORT_SYMBOL(ib_sa_service_rec_query); -static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, - int status, - struct ib_sa_mad *mad) +static void ib_sa_mcmember_rec_callback(int status, + struct ib_mad_recv_wc *mad_recv_wc, + void *context) { - struct ib_sa_mcmember_query *query = - container_of(sa_query, struct ib_sa_mcmember_query, sa_query); - - if (mad) { - struct ib_sa_mcmember_rec rec; - - ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - mad->data, &rec); - query->callback(status, &rec, query->context); - } else - query->callback(status, NULL, query->context); -} + struct ib_sa_mcmember_query *query = context; -static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) -{ - kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); + if (query->callback) { + if (mad_recv_wc) { + struct ib_sa_mad *mad; + struct ib_sa_mcmember_rec rec; + + mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(mcmember_rec_table, + ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); + } + if (mad_recv_wc) + ib_free_recv_mad(mad_recv_wc); + kfree(query); } int ib_sa_mcmember_rec_query(struct ib_sa_client *client, struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_mcmember_rec *rec, + u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, @@ -904,63 +1069,27 @@ int ib_sa_mcmember_rec_query(struct ib_s struct ib_sa_query **sa_query) { struct ib_sa_mcmember_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port; - struct ib_mad_agent *agent; - struct ib_sa_mad *mad; + u8 mcmember[IB_SA_ATTR_MC_MEMBER_REC_LEN]; int ret; - if (!sa_dev) - return -ENODEV; - - port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; - goto err1; - } - - ib_sa_client_get(&query->sa_query, client); query->callback = callback; query->context = context; - mad = query->sa_query.mad_buf->mad; - init_mad(mad, agent); - - query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; - query->sa_query.release = ib_sa_mcmember_rec_release; - query->sa_query.port = port; - mad->mad_hdr.method = method; - mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); - mad->sa_hdr.comp_mask = comp_mask; - ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - rec, mad->data); - - *sa_query = &query->sa_query; - - ret = send_mad(&query->sa_query, timeout_ms, retries, gfp_mask); + rec, mcmember); + ret = ib_sa_send_mad(client, device, port_num, method, mcmember, + cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC), comp_mask, + timeout_ms, retries, gfp_mask, + ib_sa_mcmember_rec_callback, query, + &query->sa_query); if (ret < 0) - goto err2; + kfree(query); return ret; - -err2: - *sa_query = NULL; - ib_sa_client_put(&query->sa_query); - ib_free_send_mad(query->sa_query.mad_buf); - -err1: - kfree(query); - return ret; } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); @@ -976,13 +1105,13 @@ static void send_handler(struct ib_mad_a /* No callback -- already got recv */ break; case IB_WC_RESP_TIMEOUT_ERR: - query->callback(query, -ETIMEDOUT, NULL); + query->callback(-ETIMEDOUT, NULL, query->context); break; case IB_WC_WR_FLUSH_ERR: - query->callback(query, -EINTR, NULL); + query->callback(-EINTR, NULL, query->context); break; default: - query->callback(query, -EIO, NULL); + query->callback(-EIO, NULL, query->context); break; } @@ -993,7 +1122,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(mad_send_wc->send_buf); kref_put(&query->sm_ah->ref, free_sm_ah); ib_sa_client_put(query); - query->release(query); + kfree(query); } static void recv_handler(struct ib_mad_agent *mad_agent, @@ -1005,17 +1134,11 @@ static void recv_handler(struct ib_mad_a mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; query = mad_buf->context[0]; - if (query->callback) { - if (mad_recv_wc->wc->status == IB_WC_SUCCESS) - query->callback(query, - mad_recv_wc->recv_buf.mad->mad_hdr.status ? - -EINVAL : 0, - (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); - else - query->callback(query, -EIO, NULL); - } - - ib_free_recv_mad(mad_recv_wc); + if (query->callback) + query->callback(mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, mad_recv_wc, query->context); + else + ib_free_recv_mad(mad_recv_wc); } static void ib_sa_add_one(struct ib_device *device) @@ -1049,8 +1172,9 @@ static void ib_sa_add_one(struct ib_devi sa_dev->port[i].agent = ib_register_mad_agent(device, i + s, IB_QPT_GSI, - NULL, 0, send_handler, - recv_handler, sa_dev); + NULL, IB_MGMT_RMPP_VERSION, + send_handler, recv_handler, + sa_dev); if (IS_ERR(sa_dev->port[i].agent)) goto err; From sean.hefty at intel.com Mon Aug 21 17:20:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 17:20:21 -0700 Subject: [openib-general] [PATCH 2/2] ib_local_sa: use SA iterator routines to walk RMPP response In-Reply-To: <000601c6c580$8343eb30$8698070a@amr.corp.intel.com> Message-ID: <000701c6c580$c116f1f0$8698070a@amr.corp.intel.com> Convert local SA to use the new SA iterator routines for walking a list of attributes in an RMPP response returned by the SA. This replaces a local SA specific implementation. Signed-off-by: Sean Hefty --- --- infiniband/core/local_sa.c 2006-08-21 16:40:23.760246472 -0700 +++ infiniband.user/core/local_sa.c 2006-08-21 16:48:28.403569488 -0700 @@ -107,16 +107,6 @@ struct sa_db_device { struct sa_db_port port[0]; }; -/* Define path record format to enable needed checks against MAD data. 
*/ -struct ib_path_rec { - u8 reserved[8]; - u8 dgid[16]; - u8 sgid[16]; - __be16 dlid; - __be16 slid; - u8 reserved2[20]; -}; - struct ib_sa_cursor { struct ib_sa_cursor *next; }; @@ -194,60 +184,29 @@ static int insert_attr(struct index_root static void update_path_rec(struct sa_db_port *port, struct ib_mad_recv_wc *mad_recv_wc) { - struct ib_mad_recv_buf *recv_buf; - struct ib_sa_mad *mad = (void *) mad_recv_wc->recv_buf.mad; + struct ib_sa_iter *iter; struct ib_path_rec_info *path_info; - struct ib_path_rec ib_path, *path = NULL; - int i, attr_size, left, offset = 0; + void *attr; - attr_size = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; - if (attr_size < sizeof ib_path) + iter = ib_sa_iter_create(mad_recv_wc); + if (IS_ERR(iter)) return; down_write(&lock); port->update++; - list_for_each_entry(recv_buf, &mad_recv_wc->rmpp_list, list) { - for (i = 0; i < IB_MGMT_SA_DATA;) { - mad = (struct ib_sa_mad *) recv_buf->mad; - - left = IB_MGMT_SA_DATA - i; - if (left < sizeof ib_path) { - /* copy first piece of the attribute */ - memcpy(&ib_path, &mad->data[i], left); - path = &ib_path; - offset = left; - break; - } else if (offset) { - /* copy the second piece of the attribute */ - memcpy((void*) path + offset, &mad->data[i], - sizeof ib_path - offset); - i += attr_size - offset; - offset = 0; - } else { - path = (void *) &mad->data[i]; - i += attr_size; - } - - if (!path->slid) - goto unlock; - - path_info = kmalloc(sizeof *path_info, GFP_KERNEL); - if (!path_info) - goto unlock; - - ib_sa_unpack_attr(&path_info->rec, path, - IB_SA_ATTR_PATH_REC); - - if (insert_attr(&port->index, port->update, - path_info->rec.dgid.raw, - &path_info->cursor)) { - kfree(path_info); - goto unlock; - } + while ((attr = ib_sa_iter_next(iter)) && + (path_info = kmalloc(sizeof *path_info, GFP_KERNEL))) { + + ib_sa_unpack_attr(&path_info->rec, attr, IB_SA_ATTR_PATH_REC); + if (insert_attr(&port->index, port->update, + path_info->rec.dgid.raw, + &path_info->cursor)) { + kfree(path_info); + break; } } -unlock: up_write(&lock); + ib_sa_iter_free(iter); } static void recv_handler(struct ib_mad_agent *mad_agent, From sean.hefty at intel.com Mon Aug 21 17:26:48 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 17:26:48 -0700 Subject: [openib-general] [PATCH] ib_usa: support userspace SA queries and multicast Message-ID: <000801c6c581$a8381aa0$8698070a@amr.corp.intel.com> Add support for userspace SA queries and multicast join operations. This allows a userspace library to issue SA queries and join IB multicast groups. Signed-off-by: Sean Hefty --- This patch depends on the generic RMPP query interface: http://openib.org/pipermail/openib-general/2006-August/025268.html Index: include/rdma/ib_usa.h =================================================================== --- include/rdma/ib_usa.h (revision 0) +++ include/rdma/ib_usa.h (revision 0) @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
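As a reading aid for the conversion above: the iterator interface (ib_sa_iter_create, ib_sa_iter_next, ib_sa_iter_free) reduces a receive handler to a simple loop. The sketch below is not part of the patch; the signatures are inferred from their use in these diffs and the handler name is made up.

    /* Minimal consumer sketch of the SA iterator interface (illustration only). */
    static void example_paths_recv(struct ib_mad_recv_wc *mad_recv_wc)
    {
            struct ib_sa_iter *iter;
            struct ib_sa_path_rec rec;
            void *attr;

            /* The iterator hides RMPP segment boundaries and attr_offset math. */
            iter = ib_sa_iter_create(mad_recv_wc);
            if (IS_ERR(iter))
                    return;

            /* Each call returns one packed attribute, or NULL at the end. */
            while ((attr = ib_sa_iter_next(iter))) {
                    ib_sa_unpack_attr(&rec, attr, IB_SA_ATTR_PATH_REC);
                    /* consume rec here */
            }

            ib_sa_iter_free(iter);
    }

This mirrors what update_path_rec() above now does, minus the insert_attr() bookkeeping.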
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef IB_USA_H +#define IB_USA_H + +#include +#include + +#define IB_USA_ABI_VERSION 1 + +#define IB_USA_EVENT_DATA 256 + +enum { + IB_USA_CMD_SEND_MAD, + IB_USA_CMD_GET_EVENT, + IB_USA_CMD_GET_DATA, + IB_USA_CMD_JOIN_MCAST, + IB_USA_CMD_FREE_ID, + IB_USA_CMD_GET_MCAST +}; + +enum { + IB_USA_EVENT_MAD, + IB_USA_EVENT_MCAST +}; + +struct ib_usa_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct ib_usa_send_mad { + __u64 response; /* unused - reserved */ + __u64 uid; + __u64 node_guid; + __u64 comp_mask; + __u64 attr; + __u8 port_num; + __u8 method; + __be16 attr_id; + __u32 timeout_ms; + __u32 retries; +}; + +struct ib_usa_join_mcast { + __u64 response; + __u64 uid; + __u64 node_guid; + __u64 comp_mask; + __u64 mcmember_rec; + __u8 port_num; +}; + +struct ib_usa_id_resp { + __u32 id; +}; + +struct ib_usa_free_resp { + __u32 events_reported; +}; + +struct ib_usa_free_id { + __u64 response; + __u32 id; +}; + +struct ib_usa_get_event { + __u64 response; +}; + +struct ib_usa_event_resp { + __u64 uid; + __u32 id; + __u32 event; + __u32 status; + __u32 data_len; + __u8 data[IB_USA_EVENT_DATA]; +}; + +struct ib_usa_get_data { + __u64 response; + __u32 id; +}; + +struct ib_usa_get_mcast { + __u64 response; + __u64 node_guid; + __u8 mgid[16]; + __u8 port_num; +}; + +#endif /* IB_USA_H */ Index: core/usa.c =================================================================== --- core/usa.c (revision 0) +++ core/usa.c (revision 0) @@ -0,0 +1,772 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("IB userspace SA query"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void usa_add_one(struct ib_device *device); +static void usa_remove_one(struct ib_device *device); + +static struct ib_client usa_client = { + .name = "ib_usa", + .add = usa_add_one, + .remove = usa_remove_one +}; + +struct usa_device { + struct list_head list; + struct ib_device *device; + struct completion comp; + atomic_t refcount; + int start_port; + int end_port; +}; + +struct usa_file { + struct mutex file_mutex; + struct file *filp; + struct ib_sa_client sa_client; + struct list_head event_list; + struct list_head data_list; + struct list_head mcast_list; + wait_queue_head_t poll_wait; + int event_id; +}; + +struct usa_event { + struct usa_file *file; + struct list_head list; + struct ib_usa_event_resp resp; + struct ib_mad_recv_wc *mad_recv_wc; +}; + +struct usa_multicast { + struct usa_event event; + struct list_head list; + struct ib_multicast *multicast; + int events_reported; +}; + +static DEFINE_MUTEX(usa_mutex); +static LIST_HEAD(dev_list); +static DEFINE_IDR(usa_idr); + +static struct usa_device *acquire_dev(__be64 guid, __u8 port_num) +{ + struct usa_device *dev; + + mutex_lock(&usa_mutex); + list_for_each_entry(dev, &dev_list, list) { + if (dev->device->node_guid == guid) { + if (port_num < dev->start_port || + port_num > dev->end_port) + break; + atomic_inc(&dev->refcount); + mutex_unlock(&usa_mutex); + return dev; + } + } + mutex_unlock(&usa_mutex); + return NULL; +} + +static void deref_dev(struct usa_device *dev) +{ + if (atomic_dec_and_test(&dev->refcount)) + complete(&dev->comp); +} + +static int insert_obj(void *obj, int *id) +{ + int ret; + + do { + ret = idr_pre_get(&usa_idr, GFP_KERNEL); + if (!ret) + break; + + mutex_lock(&usa_mutex); + ret = idr_get_new(&usa_idr, obj, id); + mutex_unlock(&usa_mutex); + } while (ret == -EAGAIN); + + return ret; +} + +static void remove_obj(int id) +{ + mutex_lock(&usa_mutex); + idr_remove(&usa_idr, id); + mutex_unlock(&usa_mutex); +} + +static void finish_event(struct usa_event *event) +{ + struct usa_multicast *mcast; + + switch (event->resp.event) { + case IB_USA_EVENT_MAD: + list_del(&event->list); + if (event->resp.data_len > IB_USA_EVENT_DATA) + list_add_tail(&event->list, &event->file->data_list); + else + kfree(event); + break; + case IB_USA_EVENT_MCAST: + list_del_init(&event->list); + mcast = container_of(event, struct usa_multicast, event); + mcast->events_reported++; + break; + default: + break; + } +} + +static ssize_t usa_get_event(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_usa_get_event cmd; + struct usa_event *event; + int ret = 0; + DEFINE_WAIT(wait); + + if 
(out_len < sizeof(struct ib_usa_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + mutex_lock(&file->file_mutex); + while (list_empty(&file->event_list)) { + if (file->filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + break; + } + + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + mutex_unlock(&file->file_mutex); + schedule(); + mutex_lock(&file->file_mutex); + finish_wait(&file->poll_wait, &wait); + } + + if (ret) + goto done; + + event = list_entry(file->event_list.next, struct usa_event, list); + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &event->resp, sizeof(event->resp))) { + ret = -EFAULT; + goto done; + } + + finish_event(event); +done: + mutex_unlock(&file->file_mutex); + return ret; +} + +static struct usa_event *get_event_data(struct usa_file *file, __u32 id) +{ + struct usa_event *event; + + mutex_lock(&file->file_mutex); + list_for_each_entry(event, &file->data_list, list) { + if (event->resp.id == id) { + list_del(&event->list); + mutex_unlock(&file->file_mutex); + return event; + } + } + mutex_unlock(&file->file_mutex); + return NULL; +} + +static int copy_event_data(struct usa_event *event, __u64 response) +{ + struct ib_sa_mad *mad; + struct ib_sa_iter *iter; + int attr_offset, ret = 0; + void *attr; + + mad = (struct ib_sa_mad *) event->mad_recv_wc->recv_buf.mad; + attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + + iter = ib_sa_iter_create(event->mad_recv_wc); + while ((attr = ib_sa_iter_next(iter))) { + if (copy_to_user((void __user *) (unsigned long) response, + attr, attr_offset)) { + ret = -EFAULT; + break; + } + response += attr_offset; + } + + ib_sa_iter_free(iter); + return ret; +} + +static ssize_t usa_get_data(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_usa_get_data cmd; + struct usa_event *event; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + event = get_event_data(file, cmd.id); + if (!event) + return -EINVAL; + + if (out_len < event->resp.data_len) { + ret = -ENOSPC; + goto out; + } + + ret = copy_event_data(event, cmd.response); +out: + ib_free_recv_mad(event->mad_recv_wc); + kfree(event); + return ret; +} + +static void usa_req_handler(int status, struct ib_mad_recv_wc *mad_recv_wc, + void *context) +{ + struct usa_event *event = context; + + if (mad_recv_wc) { + event->resp.data_len = mad_recv_wc->mad_len; + + if (event->resp.data_len <= IB_USA_EVENT_DATA) { + memcpy(event->resp.data, mad_recv_wc->recv_buf.mad, + event->resp.data_len); + ib_free_recv_mad(mad_recv_wc); + } else { + event->mad_recv_wc = mad_recv_wc; + memcpy(event->resp.data, mad_recv_wc->recv_buf.mad, + IB_USA_EVENT_DATA); + } + } + + event->resp.status = status; + + mutex_lock(&event->file->file_mutex); + list_add_tail(&event->list, &event->file->event_list); + wake_up_interruptible(&event->file->poll_wait); + mutex_unlock(&event->file->file_mutex); +} + +static int is_send_req(__u8 method) +{ + switch (method) { + case IB_MGMT_METHOD_GET: + case IB_MGMT_METHOD_SEND: + case IB_SA_METHOD_GET_TABLE: + case IB_SA_METHOD_GET_MULTI: + case IB_SA_METHOD_GET_TRACE_TBL: + return 1; + default: + return 0; + } +} + +static ssize_t usa_send_mad(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct usa_device *dev; + struct usa_event *event; + struct ib_usa_send_mad cmd; + struct ib_sa_query *query; + int attr_size, ret; 
+ + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + attr_size = ib_sa_attr_size(cmd.attr_id); + if (!attr_size || !is_send_req(cmd.method)) + return -EINVAL; + + dev = acquire_dev(cmd.node_guid, cmd.port_num); + if (!dev) + return -ENODEV; + + event = kzalloc(sizeof *event, GFP_KERNEL); + if (!event) { + ret = -ENOMEM; + goto deref; + } + + if (copy_from_user(event->resp.data, + (void __user *) (unsigned long) cmd.attr, + attr_size)) { + ret = -EFAULT; + goto free; + } + + event->file = file; + event->resp.event = IB_USA_EVENT_MAD; + event->resp.uid = cmd.uid; + + mutex_lock(&file->file_mutex); + event->resp.id = file->event_id++; + mutex_unlock(&file->file_mutex); + + ret = ib_sa_send_mad(&file->sa_client, dev->device, cmd.port_num, + cmd.method, event->resp.data, cmd.attr_id, + (ib_sa_comp_mask) cmd.comp_mask, + cmd.timeout_ms, cmd.retries, GFP_KERNEL, + usa_req_handler, event, &query); + if (ret < 0) + goto free; + + deref_dev(dev); + return 0; +free: + kfree(event); +deref: + deref_dev(dev); + return ret; +} + +/* + * We can get up to two events for a single multicast member. A second event + * only occurs if there's an error on an existing multicast membership. + * Report only the last event. + */ +static int multicast_handler(int status, struct ib_multicast *multicast) +{ + struct usa_multicast *mcast = multicast->context; + + if (!status) { + mcast->event.resp.data_len = IB_SA_ATTR_MC_MEMBER_REC_LEN; + ib_sa_pack_attr(mcast->event.resp.data, &multicast->rec, + IB_SA_ATTR_MC_MEMBER_REC); + } + + mutex_lock(&mcast->event.file->file_mutex); + mcast->event.resp.status = status; + + list_del(&mcast->event.list); + list_add_tail(&mcast->event.list, &mcast->event.file->event_list); + wake_up_interruptible(&mcast->event.file->poll_wait); + mutex_unlock(&mcast->event.file->file_mutex); + return 0; +} + +static ssize_t usa_join_mcast(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct usa_device *dev; + struct usa_multicast *mcast; + struct ib_usa_join_mcast cmd; + struct ib_usa_id_resp resp; + struct ib_sa_mcmember_rec rec; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + dev = acquire_dev(cmd.node_guid, cmd.port_num); + if (!dev) + return -ENODEV; + + mcast = kzalloc(sizeof *mcast, GFP_KERNEL); + if (!mcast) { + ret = -ENOMEM; + goto deref; + } + INIT_LIST_HEAD(&mcast->event.list); + mcast->event.file = file; + mcast->event.resp.event = IB_USA_EVENT_MCAST; + mcast->event.resp.uid = cmd.uid; + + ret = insert_obj(mcast, &mcast->event.resp.id); + if (ret) + goto free; + + resp.id = mcast->event.resp.id; + + mutex_lock(&file->file_mutex); + list_add_tail(&mcast->list, &file->mcast_list); + mutex_unlock(&file->file_mutex); + + if (copy_from_user(mcast->event.resp.data, + (void __user *) (unsigned long) cmd.mcmember_rec, + IB_SA_ATTR_MC_MEMBER_REC_LEN)) { + ret = -EFAULT; + goto remove; + } + + ib_sa_unpack_attr(&rec, mcast->event.resp.data, + IB_SA_ATTR_MC_MEMBER_REC); + mcast->multicast = ib_join_multicast(dev->device, cmd.port_num, &rec, + (ib_sa_comp_mask) cmd.comp_mask, + GFP_KERNEL, multicast_handler, + mcast); + if (IS_ERR(mcast->multicast)) { + ret = PTR_ERR(mcast->multicast); + goto remove; + } + + deref_dev(dev); + return 0; +remove: + mutex_lock(&file->file_mutex); + list_del(&mcast->list); + mutex_unlock(&file->file_mutex); + remove_obj(mcast->event.resp.id); +free: + kfree(mcast); +deref: + deref_dev(dev); + return ret; +} + +static ssize_t usa_free_id(struct usa_file *file, const char __user 
*inbuf, + int in_len, int out_len) +{ + struct ib_usa_free_id cmd; + struct ib_usa_free_resp resp; + struct usa_multicast *mcast; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + mutex_lock(&usa_mutex); + mcast = idr_find(&usa_idr, cmd.id); + if (!mcast) + mcast = ERR_PTR(-ENOENT); + else if (mcast->event.file != file) + mcast = ERR_PTR(-EINVAL); + else + idr_remove(&usa_idr, mcast->event.resp.id); + mutex_unlock(&usa_mutex); + + if (IS_ERR(mcast)) + return PTR_ERR(mcast); + + ib_free_multicast(mcast->multicast); + mutex_lock(&file->file_mutex); + list_del(&mcast->list); + mutex_unlock(&file->file_mutex); + + resp.events_reported = mcast->events_reported; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; + + kfree(mcast); + return ret; +} + +static ssize_t usa_get_mcast(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct usa_device *dev; + struct ib_usa_get_mcast cmd; + struct ib_sa_mcmember_rec rec; + u8 mcmember_rec[IB_SA_ATTR_MC_MEMBER_REC_LEN]; + int ret; + + if (out_len < sizeof(IB_SA_ATTR_MC_MEMBER_REC_LEN)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + dev = acquire_dev(cmd.node_guid, cmd.port_num); + if (!dev) + return -ENODEV; + + ret = ib_get_mcmember_rec(dev->device, cmd.port_num, + (union ib_gid *) cmd.mgid, &rec); + if (!ret) { + ib_sa_pack_attr(mcmember_rec, &rec, IB_SA_ATTR_MC_MEMBER_REC); + if (copy_to_user((void __user *) (unsigned long) cmd.response, + mcmember_rec, IB_SA_ATTR_MC_MEMBER_REC_LEN)) + ret = -EFAULT; + } + + deref_dev(dev); + return ret; +} + +static ssize_t (*usa_cmd_table[])(struct usa_file *file, + const char __user *inbuf, + int in_len, int out_len) = { + [IB_USA_CMD_SEND_MAD] = usa_send_mad, + [IB_USA_CMD_GET_EVENT] = usa_get_event, + [IB_USA_CMD_GET_DATA] = usa_get_data, + [IB_USA_CMD_JOIN_MCAST] = usa_join_mcast, + [IB_USA_CMD_FREE_ID] = usa_free_id, + [IB_USA_CMD_GET_MCAST] = usa_get_mcast +}; + + +static ssize_t usa_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct usa_file *file = filp->private_data; + struct ib_usa_cmd_hdr hdr; + ssize_t ret; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(usa_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + ret = usa_cmd_table[hdr.cmd](file, buf + sizeof(hdr), hdr.in, hdr.out); + if (!ret) + ret = len; + + return ret; +} + +static unsigned int usa_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct usa_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + if (!list_empty(&file->event_list)) + mask = POLLIN | POLLRDNORM; + + return mask; +} + +static int usa_open(struct inode *inode, struct file *filp) +{ + struct usa_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + ib_sa_register_client(&file->sa_client); + + INIT_LIST_HEAD(&file->event_list); + INIT_LIST_HEAD(&file->data_list); + INIT_LIST_HEAD(&file->mcast_list); + init_waitqueue_head(&file->poll_wait); + mutex_init(&file->file_mutex); + + filp->private_data = file; + file->filp = filp; + return 0; +} + +static void cleanup_events(struct list_head *list) +{ + struct usa_event *event; + + while (!list_empty(list)) { + event = list_entry(list->next, struct usa_event, list); + list_del(&event->list); 
+ + if (event->mad_recv_wc) + ib_free_recv_mad(event->mad_recv_wc); + + kfree(event); + } +} + +static void cleanup_mcast(struct usa_file *file) +{ + struct usa_multicast *mcast; + + while (!list_empty(&file->mcast_list)) { + mcast = list_entry(file->mcast_list.next, + struct usa_multicast, list); + list_del(&mcast->list); + + remove_obj(mcast->event.resp.id); + + ib_free_multicast(mcast->multicast); + + /* + * Other members may still be generating events, so we need + * to lock the event list to avoid corrupting it. + */ + mutex_lock(&file->file_mutex); + list_del(&mcast->event.list); + mutex_unlock(&file->file_mutex); + + kfree(mcast); + } +} + +static int usa_close(struct inode *inode, struct file *filp) +{ + struct usa_file *file = filp->private_data; + + ib_sa_unregister_client(&file->sa_client); + cleanup_mcast(file); + + cleanup_events(&file->event_list); + cleanup_events(&file->data_list); + kfree(file); + return 0; +} + +static void usa_add_one(struct ib_device *device) +{ + struct usa_device *dev; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + dev = kmalloc(sizeof *dev, GFP_KERNEL); + if (!dev) + return; + + dev->device = device; + if (device->node_type == RDMA_NODE_IB_SWITCH) + dev->start_port = dev->end_port = 0; + else { + dev->start_port = 1; + dev->end_port = device->phys_port_cnt; + } + + init_completion(&dev->comp); + atomic_set(&dev->refcount, 1); + ib_set_client_data(device, &usa_client, dev); + + mutex_lock(&usa_mutex); + list_add_tail(&dev->list, &dev_list); + mutex_unlock(&usa_mutex); +} + +static void usa_remove_one(struct ib_device *device) +{ + struct usa_device *dev; + + dev = ib_get_client_data(device, &usa_client); + if (!dev) + return; + + mutex_lock(&usa_mutex); + list_del(&dev->list); + mutex_unlock(&usa_mutex); + + deref_dev(dev); + wait_for_completion(&dev->comp); + kfree(dev); +} + +static struct file_operations usa_fops = { + .owner = THIS_MODULE, + .open = usa_open, + .release = usa_close, + .write = usa_write, + .poll = usa_poll, +}; + +static struct miscdevice usa_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "ib_usa", + .fops = &usa_fops, +}; + +static ssize_t show_abi_version(struct class_device *class_dev, char *buf) +{ + return sprintf(buf, "%d\n", IB_USA_ABI_VERSION); +} +static CLASS_DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static int __init usa_init(void) +{ + int ret; + + ret = misc_register(&usa_misc); + if (ret) + return ret; + + ret = class_device_create_file(usa_misc.class, + &class_device_attr_abi_version); + if (ret) { + printk(KERN_ERR "ib_usa: couldn't create abi_version attr\n"); + goto err1; + } + + ret = ib_register_client(&usa_client); + if (ret) + goto err2; + return 0; + +err2: + class_device_remove_file(usa_misc.class, + &class_device_attr_abi_version); +err1: + misc_deregister(&usa_misc); + return ret; +} + +static void __exit usa_cleanup(void) +{ + ib_unregister_client(&usa_client); + class_device_remove_file(usa_misc.class, + &class_device_attr_abi_version); + misc_deregister(&usa_misc); + idr_destroy(&usa_idr); +} + +module_init(usa_init); +module_exit(usa_cleanup); Index: Kconfig =================================================================== --- Kconfig (revision 8928) +++ Kconfig (working copy) @@ -17,15 +17,15 @@ config INFINIBAND_USER_MAD need libibumad from . 
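For reference, here is a rough userspace sketch of the write()/poll() command ABI defined above. It is illustrative only: the userspace library that consumes this ABI is posted separately, the /dev/ib_usa node name is assumed from the misc device registration, the ib_usa_* structures are taken from the proposed include/rdma/ib_usa.h and are assumed visible to userspace, the constants 0x01 (MAD Get method) and 0x35 (PathRecord attribute) come from the MAD/SA management classes, and error handling is mostly omitted.

    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <poll.h>
    #include <arpa/inet.h>

    #include <rdma/ib_usa.h>            /* assumed visible to userspace */

    #define MAD_METHOD_GET      0x01    /* IB_MGMT_METHOD_GET */
    #define SA_ATTR_PATH_REC    0x35    /* PathRecord attribute ID */

    /* Issue a PathRecord query; path_rec is a packed 64-byte wire-format
     * record and comp_mask is already in network byte order. */
    int example_path_query(uint64_t node_guid, uint8_t port_num,
                           void *path_rec, uint64_t comp_mask)
    {
            struct { struct ib_usa_cmd_hdr hdr; struct ib_usa_send_mad cmd; } req;
            struct { struct ib_usa_cmd_hdr hdr; struct ib_usa_get_event cmd; } get;
            struct ib_usa_event_resp event;
            struct pollfd pfd;
            int fd, ret;

            fd = open("/dev/ib_usa", O_RDWR);
            if (fd < 0)
                    return fd;

            memset(&req, 0, sizeof req);
            req.hdr.cmd = IB_USA_CMD_SEND_MAD;
            req.hdr.in = sizeof req.cmd;
            req.cmd.uid = 1;                     /* cookie echoed back in the event */
            req.cmd.node_guid = node_guid;
            req.cmd.port_num = port_num;
            req.cmd.method = MAD_METHOD_GET;
            req.cmd.attr_id = htons(SA_ATTR_PATH_REC);
            req.cmd.attr = (uintptr_t) path_rec;
            req.cmd.comp_mask = comp_mask;
            req.cmd.timeout_ms = 1000;
            req.cmd.retries = 2;
            ret = write(fd, &req, sizeof req);
            if (ret != sizeof req)
                    goto out;

            pfd.fd = fd;
            pfd.events = POLLIN;
            poll(&pfd, 1, -1);                   /* wait for the completion event */

            memset(&get, 0, sizeof get);
            get.hdr.cmd = IB_USA_CMD_GET_EVENT;
            get.hdr.in = sizeof get.cmd;
            get.hdr.out = sizeof event;
            get.cmd.response = (uintptr_t) &event;
            ret = write(fd, &get, sizeof get);

            /*
             * event.status carries the query status; the reply is inline in
             * event.data when data_len <= IB_USA_EVENT_DATA, otherwise it has
             * to be fetched with IB_USA_CMD_GET_DATA using event.id.
             */
    out:
            close(fd);
            return ret < 0 ? ret : 0;
    }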
config INFINIBAND_USER_ACCESS - tristate "InfiniBand userspace access (verbs and CM)" + tristate "InfiniBand userspace access (verbs, CM, SA client)" depends on INFINIBAND ---help--- Userspace InfiniBand access support. This enables the - kernel side of userspace verbs and the userspace - communication manager (CM). This allows userspace processes - to set up connections and directly access InfiniBand + kernel side of userspace verbs, the userspace communication + manager (CM), and userspace SA client. This allows userspace + processes to set up connections and directly access InfiniBand hardware for fast-path operations. You will also need - libibverbs, libibcm and a hardware driver library from + libibverbs, libibcm, libibsa, and a hardware driver library from . config INFINIBAND_ADDR_TRANS Index: core/Makefile =================================================================== --- core/Makefile (revision 8928) +++ core/Makefile (working copy) @@ -7,7 +7,8 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o i ib_sa.o $(infiniband-y) \ findex.o ib_multicast.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o -obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o $(user_access-y) +obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_usa.o \ + $(user_access-y) findex-y := index.o @@ -39,3 +40,5 @@ ib_uverbs-y := uverbs_main.o uverbs_cm ib_ucm-y := ucm.o +ib_usa-y := usa.o + From sean.hefty at intel.com Mon Aug 21 17:34:06 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 21 Aug 2006 17:34:06 -0700 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support Message-ID: <000901c6c582$ad09f890$8698070a@amr.corp.intel.com> Add a userspace library to support SA queries and joining IB multicast groups. Signed-off-by: Sean Hefty --- Index: libibsa/libibsa.spec.in =================================================================== --- libibsa/libibsa.spec.in (revision 0) +++ libibsa/libibsa.spec.in (revision 0) @@ -0,0 +1,68 @@ +# $Id: $ + +%define ver @VERSION@ +%define RELEASE 1 +%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} + +Summary: Userspace SA client. +Name: libibsa +Version: %ver +Release: %rel%{?dist} +License: GPL/BSD +Group: System Environment/Libraries +BuildRoot: %{_tmppath}/%{name}-%{version}-root +Source: http://openib.org/downloads/%{name}-%{version}.tar.gz +Url: http://openib.org/ + +%description +Along with the OpenIB kernel drivers, libibsa provides a userspace +SA client API. + +%package devel +Summary: Development files for the libibsa library +Group: System Environment/Libraries +Requires: %{name} = %{version}-%{release} + +%description devel +Development files for the libibsa library. + +%package utils +Summary: Utilities for the libibsa library +Group: System Environment/Base +Requires: %{name} = %{version}-%{release} + +%description utils +Utilities for the libibsa library. 
+ +%prep +%setup -q + +%build +%configure +make + +%install +make DESTDIR=${RPM_BUILD_ROOT} install +# remove unpackaged files from the buildroot +rm -f $RPM_BUILD_ROOT%{_libdir}/*.la +cd $RPM_BUILD_ROOT%{_libdir} +mv libibsa.so libibsa.so.%{ver} +ln -s libibsa.so.%{ver} libibsa.so + +%clean +rm -rf $RPM_BUILD_ROOT + +%files +%defattr(-,root,root) +%{_libdir}/libibsa*.so.* +%doc AUTHORS COPYING ChangeLog NEWS README + +%files devel +%defattr(-,root,root) +%{_libdir}/libibsa.so +%{_includedir}/infiniband/*.h + +%files utils +%defattr(-,root,root) +%{_bindir}/satest +%{_bindir}/mchammer Index: libibsa/include/infiniband/sa_net.h =================================================================== --- libibsa/include/infiniband/sa_net.h (revision 0) +++ libibsa/include/infiniband/sa_net.h (revision 0) @@ -0,0 +1,378 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ + +#if !defined(SA_NET_H) +#define SA_NET_H + +#include + +#include + +enum { + IBV_SA_METHOD_GET = 0x01, + IBV_SA_METHOD_SET = 0x02, + IBV_SA_METHOD_GET_RESP = 0x81, + IBV_SA_METHOD_SEND = 0x03, + IBV_SA_METHOD_TRAP = 0x05, + IBV_SA_METHOD_REPORT = 0x06, + IBV_SA_METHOD_REPORT_RESP = 0x86, + IBV_SA_METHOD_TRAP_REPRESS = 0x07, + IBV_SA_METHOD_GET_TABLE = 0x12, + IBV_SA_METHOD_GET_TABLE_RESP = 0x92, + IBV_SA_METHOD_DELETE = 0x15, + IBV_SA_METHOD_DELETE_RESP = 0x95, + IBV_SA_METHOD_GET_MULTI = 0x14, + IBV_SA_METHOD_GET_MULTI_RESP = 0x94, + IBV_SA_METHOD_GET_TRACE_TBL = 0x13 +}; + +enum { + IBV_SA_MAD_HEADER_SIZE = 56, + IBV_SA_MAD_DATA_SIZE = 200 +}; + +struct ibv_sa_mad { + /* common MAD header */ + uint8_t base_version; + uint8_t mgmt_class; + uint8_t class_version; + uint8_t r_method; + uint16_t status; + uint16_t class_specific; + uint64_t transaction_id; + uint16_t attribute_id; + uint16_t rsvd1; + uint32_t attribute_modifier; + /* RMPP header */ + uint8_t rmpp_version; + uint8_t rmpp_type; + uint8_t rmpp_resptime_flags; + uint8_t rmpp_status; + uint32_t rmpp_data1; + uint32_t rmpp_data2; + /* SA header */ + uint32_t sm_key1; /* define sm_key for 64-bit alignment */ + uint32_t sm_key2; + uint16_t attribute_offset; + uint16_t rsvd2; + uint64_t comp_mask; + uint8_t sa_data[IBV_SA_MAD_DATA_SIZE]; +}; + +enum { + IBV_SA_ATTR_CLASS_PORTINFO = __constant_cpu_to_be16(0x01), + IBV_SA_ATTR_NOTICE = __constant_cpu_to_be16(0x02), + IBV_SA_ATTR_INFORM_INFO = __constant_cpu_to_be16(0x03), + IBV_SA_ATTR_NODE_REC = __constant_cpu_to_be16(0x11), + IBV_SA_ATTR_PORT_INFO_REC = __constant_cpu_to_be16(0x12), + IBV_SA_ATTR_SL2VL_REC = __constant_cpu_to_be16(0x13), + IBV_SA_ATTR_SWITCH_REC = __constant_cpu_to_be16(0x14), + IBV_SA_ATTR_LINEAR_FDB_REC = __constant_cpu_to_be16(0x15), + IBV_SA_ATTR_RANDOM_FDB_REC = __constant_cpu_to_be16(0x16), + IBV_SA_ATTR_MCAST_FDB_REC = __constant_cpu_to_be16(0x17), + IBV_SA_ATTR_SM_INFO_REC = __constant_cpu_to_be16(0x18), + IBV_SA_ATTR_LINK_REC = __constant_cpu_to_be16(0x20), + IBV_SA_ATTR_GUID_INFO_REC = __constant_cpu_to_be16(0x30), + IBV_SA_ATTR_SERVICE_REC = __constant_cpu_to_be16(0x31), + IBV_SA_ATTR_PARTITION_REC = __constant_cpu_to_be16(0x33), + IBV_SA_ATTR_PATH_REC = __constant_cpu_to_be16(0x35), + IBV_SA_ATTR_VL_ARB_REC = __constant_cpu_to_be16(0x36), + IBV_SA_ATTR_MC_MEMBER_REC = __constant_cpu_to_be16(0x38), + IBV_SA_ATTR_TRACE_REC = __constant_cpu_to_be16(0x39), + IBV_SA_ATTR_MULTI_PATH_REC = __constant_cpu_to_be16(0x3a), + IBV_SA_ATTR_SERVICE_ASSOC_REC = __constant_cpu_to_be16(0x3b), + IBV_SA_ATTR_INFORM_INFO_REC = __constant_cpu_to_be16(0xf3) +}; + +/* Length of SA attributes on the wire */ +enum { + IBV_SA_ATTR_CLASS_PORTINFO_LEN = 72, + IBV_SA_ATTR_NOTICE_LEN = 80, + IBV_SA_ATTR_INFORM_INFO_LEN = 36, + IBV_SA_ATTR_NODE_REC_LEN = 108, + IBV_SA_ATTR_PORT_INFO_REC_LEN = 58, + IBV_SA_ATTR_SL2VL_REC_LEN = 16, + IBV_SA_ATTR_SWITCH_REC_LEN = 21, + IBV_SA_ATTR_LINEAR_FDB_REC_LEN = 72, + IBV_SA_ATTR_RANDOM_FDB_REC_LEN = 72, + IBV_SA_ATTR_MCAST_FDB_REC_LEN = 72, + IBV_SA_ATTR_SM_INFO_REC_LEN = 25, + IBV_SA_ATTR_LINK_REC_LEN = 6, + IBV_SA_ATTR_GUID_INFO_REC_LEN = 72, + IBV_SA_ATTR_SERVICE_REC_LEN = 176, + IBV_SA_ATTR_PARTITION_REC_LEN = 72, + IBV_SA_ATTR_PATH_REC_LEN = 64, + IBV_SA_ATTR_VL_ARB_REC_LEN = 72, + IBV_SA_ATTR_MC_MEMBER_REC_LEN = 52, + IBV_SA_ATTR_TRACE_REC_LEN = 46, + IBV_SA_ATTR_MULTI_PATH_REC_LEN = 56, + IBV_SA_ATTR_SERVICE_ASSOC_REC_LEN = 80, + IBV_SA_ATTR_INFORM_INFO_REC_LEN = 60 +}; + +#define IBV_SA_COMP_MASK(n) __constant_cpu_to_be64(1ull << 
n) + +struct ibv_sa_net_service_rec { + uint64_t service_id; + uint8_t service_gid[16]; + uint16_t service_pkey; + uint16_t rsvd; + uint32_t service_lease; + uint8_t service_key[16]; + uint8_t service_name[64]; + uint8_t service_data8[16]; + uint16_t service_data16[8]; + uint32_t service_data32[4]; + uint64_t service_data64[2]; +}; + +enum { + IBV_SA_SERVICE_REC_SERVICE_ID = IBV_SA_COMP_MASK(0), + IBV_SA_SERVICE_REC_SERVICE_GID = IBV_SA_COMP_MASK(1), + IBV_SA_SERVICE_REC_SERVICE_PKEY = IBV_SA_COMP_MASK(2), + /* reserved: 3 */ + IBV_SA_SERVICE_REC_SERVICE_LEASE = IBV_SA_COMP_MASK(4), + IBV_SA_SERVICE_REC_SERVICE_KEY = IBV_SA_COMP_MASK(5), + IBV_SA_SERVICE_REC_SERVICE_NAME = IBV_SA_COMP_MASK(6), + IBV_SA_SERVICE_REC_SERVICE_DATA8_0 = IBV_SA_COMP_MASK(7), + IBV_SA_SERVICE_REC_SERVICE_DATA8_1 = IBV_SA_COMP_MASK(8), + IBV_SA_SERVICE_REC_SERVICE_DATA8_2 = IBV_SA_COMP_MASK(9), + IBV_SA_SERVICE_REC_SERVICE_DATA8_3 = IBV_SA_COMP_MASK(10), + IBV_SA_SERVICE_REC_SERVICE_DATA8_4 = IBV_SA_COMP_MASK(11), + IBV_SA_SERVICE_REC_SERVICE_DATA8_5 = IBV_SA_COMP_MASK(12), + IBV_SA_SERVICE_REC_SERVICE_DATA8_6 = IBV_SA_COMP_MASK(13), + IBV_SA_SERVICE_REC_SERVICE_DATA8_7 = IBV_SA_COMP_MASK(14), + IBV_SA_SERVICE_REC_SERVICE_DATA8_8 = IBV_SA_COMP_MASK(15), + IBV_SA_SERVICE_REC_SERVICE_DATA8_9 = IBV_SA_COMP_MASK(16), + IBV_SA_SERVICE_REC_SERVICE_DATA8_10 = IBV_SA_COMP_MASK(17), + IBV_SA_SERVICE_REC_SERVICE_DATA8_11 = IBV_SA_COMP_MASK(18), + IBV_SA_SERVICE_REC_SERVICE_DATA8_12 = IBV_SA_COMP_MASK(19), + IBV_SA_SERVICE_REC_SERVICE_DATA8_13 = IBV_SA_COMP_MASK(20), + IBV_SA_SERVICE_REC_SERVICE_DATA8_14 = IBV_SA_COMP_MASK(21), + IBV_SA_SERVICE_REC_SERVICE_DATA8_15 = IBV_SA_COMP_MASK(22), + IBV_SA_SERVICE_REC_SERVICE_DATA16_0 = IBV_SA_COMP_MASK(23), + IBV_SA_SERVICE_REC_SERVICE_DATA16_1 = IBV_SA_COMP_MASK(24), + IBV_SA_SERVICE_REC_SERVICE_DATA16_2 = IBV_SA_COMP_MASK(25), + IBV_SA_SERVICE_REC_SERVICE_DATA16_3 = IBV_SA_COMP_MASK(26), + IBV_SA_SERVICE_REC_SERVICE_DATA16_4 = IBV_SA_COMP_MASK(27), + IBV_SA_SERVICE_REC_SERVICE_DATA16_5 = IBV_SA_COMP_MASK(28), + IBV_SA_SERVICE_REC_SERVICE_DATA16_6 = IBV_SA_COMP_MASK(29), + IBV_SA_SERVICE_REC_SERVICE_DATA16_7 = IBV_SA_COMP_MASK(30), + IBV_SA_SERVICE_REC_SERVICE_DATA32_0 = IBV_SA_COMP_MASK(31), + IBV_SA_SERVICE_REC_SERVICE_DATA32_1 = IBV_SA_COMP_MASK(32), + IBV_SA_SERVICE_REC_SERVICE_DATA32_2 = IBV_SA_COMP_MASK(33), + IBV_SA_SERVICE_REC_SERVICE_DATA32_3 = IBV_SA_COMP_MASK(34), + IBV_SA_SERVICE_REC_SERVICE_DATA64_0 = IBV_SA_COMP_MASK(35), + IBV_SA_SERVICE_REC_SERVICE_DATA64_1 = IBV_SA_COMP_MASK(36) +}; + +struct ibv_sa_net_path_rec { + uint32_t rsvd1; + uint32_t rsvd2; + uint8_t dgid[16]; + uint8_t sgid[16]; + uint16_t dlid; + uint16_t slid; + /* RawTraffic: 1:352, Rsvd: 3:353, FlowLabel: 20:356, HopLimit: 8:376 */ + uint32_t raw_flow_hop; + uint8_t tclass; + /* Reversible: 1:392, NumbPath: 7:393 */ + uint8_t reversible_numbpath; + uint16_t pkey; + /* Rsvd: 12:416, SL: 4:428 */ + uint16_t sl; + /* MtuSelector: 2:432, MTU: 6:434 */ + uint8_t mtu_info; + /* RateSelector: 2:440, Rate: 6:442 */ + uint8_t rate_info; + /* PacketLifeTimeSelector: 2:448, PacketLifeTime: 6:450 */ + uint8_t packetlifetime_info; + uint8_t preference; + uint8_t rsvd3[3]; +}; + +enum { + IBV_SA_PATH_REC_RAW_TRAFFIC_OFFSET = 352, + IBV_SA_PATH_REC_RAW_TRAFFIC_LENGTH = 1, + IBV_SA_PATH_REC_FLOW_LABEL_OFFSET = 356, + IBV_SA_PATH_REC_FLOW_LABEL_LENGTH = 20, + IBV_SA_PATH_REC_HOP_LIMIT_OFFSET = 376, + IBV_SA_PATH_REC_HOP_LIMIT_LENGTH = 8, + IBV_SA_PATH_REC_REVERSIBLE_OFFSET = 392, + IBV_SA_PATH_REC_REVERSIBLE_LENGTH = 
1, + IBV_SA_PATH_REC_NUMB_PATH_OFFSET = 393, + IBV_SA_PATH_REC_NUMB_PATH_LENGTH = 7, + IBV_SA_PATH_REC_SL_OFFSET = 428, + IBV_SA_PATH_REC_SL_LENGTH = 4, + IBV_SA_PATH_REC_MTU_SELECTOR_OFFSET = 324, + IBV_SA_PATH_REC_MTU_SELECTOR_LENGTH = 2, + IBV_SA_PATH_REC_MTU_OFFSET = 434, + IBV_SA_PATH_REC_MTU_LENGTH = 6, + IBV_SA_PATH_REC_RATE_SELECTOR_OFFSET = 440, + IBV_SA_PATH_REC_RATE_SELECTOR_LENGTH = 2, + IBV_SA_PATH_REC_RATE_OFFSET = 442, + IBV_SA_PATH_REC_RATE_LENGTH = 6, + IBV_SA_PATH_REC_PACKETLIFE_SELECTOR_OFFSET = 448, + IBV_SA_PATH_REC_PACKETLIFE_SELECTOR_LENGTH = 2, + IBV_SA_PATH_REC_PACKETLIFE_OFFSET = 450, + IBV_SA_PATH_REC_PACKETLIFE_LENGTH = 6 +}; + +enum { + /* reserved: 0 */ + /* reserved: 1 */ + IBV_SA_PATH_REC_DGID = IBV_SA_COMP_MASK(2), + IBV_SA_PATH_REC_SGID = IBV_SA_COMP_MASK(3), + IBV_SA_PATH_REC_DLID = IBV_SA_COMP_MASK(4), + IBV_SA_PATH_REC_SLID = IBV_SA_COMP_MASK(5), + IBV_SA_PATH_REC_RAW_TRAFFIC = IBV_SA_COMP_MASK(6), + /* reserved: 7 */ + IBV_SA_PATH_REC_FLOW_LABEL = IBV_SA_COMP_MASK(8), + IBV_SA_PATH_REC_HOP_LIMIT = IBV_SA_COMP_MASK(9), + IBV_SA_PATH_REC_TRAFFIC_CLASS = IBV_SA_COMP_MASK(10), + IBV_SA_PATH_REC_REVERSIBLE = IBV_SA_COMP_MASK(11), + IBV_SA_PATH_REC_NUMB_PATH = IBV_SA_COMP_MASK(12), + IBV_SA_PATH_REC_PKEY = IBV_SA_COMP_MASK(13), + /* reserved: 14 */ + IBV_SA_PATH_REC_SL = IBV_SA_COMP_MASK(15), + IBV_SA_PATH_REC_MTU_SELECTOR = IBV_SA_COMP_MASK(16), + IBV_SA_PATH_REC_MTU = IBV_SA_COMP_MASK(17), + IBV_SA_PATH_REC_RATE_SELECTOR = IBV_SA_COMP_MASK(18), + IBV_SA_PATH_REC_RATE = IBV_SA_COMP_MASK(19), + IBV_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR = IBV_SA_COMP_MASK(20), + IBV_SA_PATH_REC_PACKET_LIFE_TIME = IBV_SA_COMP_MASK(21), + IBV_SA_PATH_REC_PREFERENCE = IBV_SA_COMP_MASK(22) +}; + +struct ibv_sa_net_mcmember_rec { + uint8_t mgid[16]; + uint8_t port_gid[16]; + uint32_t qkey; + uint16_t mlid; + /* MtuSelector: 2:304, MTU: 6:306 */ + uint8_t mtu_info; + uint8_t tclass; + uint16_t pkey; + /* RateSelector: 2:336, Rate: 6:338 */ + uint8_t rate_info; + /* PacketLifeTimeSelector: 2:344, PacketLifeTime: 6:346 */ + uint8_t packetlifetime_info; + /* SL: 4:352, FlowLabel: 20:356, HopLimit: 8:376 */ + uint32_t sl_flow_hop; + /* Scope: 4:384, JoinState: 4:388 */ + uint8_t scope_join; + /* ProxyJoin: 1:392, rsvd: 7:393 */ + uint8_t proxy_join; + uint8_t rsvd[2]; +}; + +enum { + IBV_SA_MCMEMBER_REC_MTU_SELECTOR_OFFSET = 304, + IBV_SA_MCMEMBER_REC_MTU_SELECTOR_LENGTH = 2, + IBV_SA_MCMEMBER_REC_MTU_OFFSET = 306, + IBV_SA_MCMEMBER_REC_MTU_LENGTH = 6, + IBV_SA_MCMEMBER_REC_RATE_SELECTOR_OFFSET = 336, + IBV_SA_MCMEMBER_REC_RATE_SELECTOR_LENGTH = 2, + IBV_SA_MCMEMBER_REC_RATE_OFFSET = 338, + IBV_SA_MCMEMBER_REC_RATE_LENGTH = 6, + IBV_SA_MCMEMBER_REC_PACKETLIFE_SELECTOR_OFFSET = 344, + IBV_SA_MCMEMBER_REC_PACKETLIFE_SELECTOR_LENGTH = 2, + IBV_SA_MCMEMBER_REC_PACKETLIFE_OFFSET = 346, + IBV_SA_MCMEMBER_REC_PACKETLIFE_LENGTH = 6, + IBV_SA_MCMEMBER_REC_SL_OFFSET = 352, + IBV_SA_MCMEMBER_REC_SL_LENGTH = 4, + IBV_SA_MCMEMBER_REC_FLOW_LABEL_OFFSET = 356, + IBV_SA_MCMEMBER_REC_FLOW_LABEL_LENGTH = 20, + IBV_SA_MCMEMBER_REC_HOP_LIMIT_OFFSET = 376, + IBV_SA_MCMEMBER_REC_HOP_LIMIT_LENGTH = 8, + IBV_SA_MCMEMBER_REC_SCOPE_OFFSET = 384, + IBV_SA_MCMEMBER_REC_SCOPE_LENGTH = 4, + IBV_SA_MCMEMBER_REC_JOIN_STATE_OFFSET = 388, + IBV_SA_MCMEMBER_REC_JOIN_STATE_LENGTH = 4, + IBV_SA_MCMEMBER_REC_PROXY_JOIN_OFFSET = 392, + IBV_SA_MCMEMBER_REC_PROXY_JOIN_LENGTH = 1 +}; + +enum { + IBV_SA_MCMEMBER_REC_MGID = IBV_SA_COMP_MASK(0), + IBV_SA_MCMEMBER_REC_PORT_GID = IBV_SA_COMP_MASK(1), + IBV_SA_MCMEMBER_REC_QKEY = 
IBV_SA_COMP_MASK(2), + IBV_SA_MCMEMBER_REC_MLID = IBV_SA_COMP_MASK(3), + IBV_SA_MCMEMBER_REC_MTU_SELECTOR = IBV_SA_COMP_MASK(4), + IBV_SA_MCMEMBER_REC_MTU = IBV_SA_COMP_MASK(5), + IBV_SA_MCMEMBER_REC_TRAFFIC_CLASS = IBV_SA_COMP_MASK(6), + IBV_SA_MCMEMBER_REC_PKEY = IBV_SA_COMP_MASK(7), + IBV_SA_MCMEMBER_REC_RATE_SELECTOR = IBV_SA_COMP_MASK(8), + IBV_SA_MCMEMBER_REC_RATE = IBV_SA_COMP_MASK(9), + IBV_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR = IBV_SA_COMP_MASK(10), + IBV_SA_MCMEMBER_REC_PACKET_LIFE_TIME = IBV_SA_COMP_MASK(11), + IBV_SA_MCMEMBER_REC_SL = IBV_SA_COMP_MASK(12), + IBV_SA_MCMEMBER_REC_FLOW_LABEL = IBV_SA_COMP_MASK(13), + IBV_SA_MCMEMBER_REC_HOP_LIMIT = IBV_SA_COMP_MASK(14), + IBV_SA_MCMEMBER_REC_SCOPE = IBV_SA_COMP_MASK(15), + IBV_SA_MCMEMBER_REC_JOIN_STATE = IBV_SA_COMP_MASK(16), + IBV_SA_MCMEMBER_REC_PROXY_JOIN = IBV_SA_COMP_MASK(17) +}; + +/* + * ibv_sa_pack_attr - Copy an attribute from a host defined structure + * to a packed network structure + * ibv_sa_unpack_attr - Copy an attribute from a packed network structure + * to a host defined structure. + */ +void ibv_sa_unpack_path_rec(struct ibv_sa_path_rec *rec, + struct ibv_sa_net_path_rec *net_rec); +void ibv_sa_pack_mcmember_rec(struct ibv_sa_net_mcmember_rec *net_rec, + struct ibv_sa_mcmember_rec *rec); +void ibv_sa_unpack_mcmember_rec(struct ibv_sa_mcmember_rec *rec, + struct ibv_sa_net_mcmember_rec *net_rec); + +/** + * ibv_sa_get_field - Extract a bit field value from a structure. + * @data: Pointer to the start of the structure. + * @offset: Bit offset of field from start of structure. + * @size: Size of field, in bits. + * + * The structure must be in network-byte order. The returned value is in + * host-byte order. + */ +uint32_t ibv_sa_get_field(void *data, int offset, int size); + +/** + * ibv_sa_set_field - Set a bit field value in a structure. + * @data: Pointer to the start of the structure. + * @value: Value to assign to field. + * @offset: Bit offset of field from start of structure. + * @size: Size of field, in bits. + * + * The structure must be in network-byte order. The value to set is in + * host-byte order. + */ +void ibv_sa_set_field(void *data, uint32_t value, int offset, int size); + +#endif /* SA_NET_H */ Property changes on: libibsa/include/infiniband/sa_net.h ___________________________________________________________________ Name: svn:executable + * Index: libibsa/include/infiniband/sa_client_abi.h =================================================================== --- libibsa/include/infiniband/sa_client_abi.h (revision 0) +++ libibsa/include/infiniband/sa_client_abi.h (revision 0) @@ -0,0 +1,127 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
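The get/set helpers above exist because several PathRecord and MCMemberRecord fields are not byte aligned on the wire. A small illustration, assuming sa_net.h is installed as <infiniband/sa_net.h>; the function and variable names here are invented.

    #include <stdint.h>
    #include <infiniband/sa_net.h>

    static void example_bitfields(struct ibv_sa_net_path_rec *path,
                                  struct ibv_sa_net_mcmember_rec *mc)
    {
            uint32_t sl;

            /* SL is a 4-bit field at bit offset 428 of the wire PathRecord */
            sl = ibv_sa_get_field(path, IBV_SA_PATH_REC_SL_OFFSET,
                                  IBV_SA_PATH_REC_SL_LENGTH);

            /* request a full-member join: JoinState is 4 bits at offset 388 */
            ibv_sa_set_field(mc, 1, IBV_SA_MCMEMBER_REC_JOIN_STATE_OFFSET,
                             IBV_SA_MCMEMBER_REC_JOIN_STATE_LENGTH);
            (void) sl;
    }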
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef SA_CLIENT_ABI_H +#define SA_CLIENT_ABI_H + +#include + +/* + * This file must be kept in sync with the kernel's version of ib_usa.h + */ + +#define IB_USA_MIN_ABI_VERSION 1 +#define IB_USA_MAX_ABI_VERSION 1 + +#define IB_USA_EVENT_DATA 256 + +enum { + USA_CMD_SEND_MAD, + USA_CMD_GET_EVENT, + USA_CMD_GET_DATA, + USA_CMD_JOIN_MCAST, + USA_CMD_FREE_ID, + USA_CMD_GET_MCAST +}; + +enum { + USA_EVENT_MAD, + USA_EVENT_MCAST +}; + +struct usa_abi_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct usa_abi_send_mad { + __u64 response; /* unused - reserved */ + __u64 uid; + __u64 node_guid; + __u64 comp_mask; + __u64 attr; + __u8 port_num; + __u8 method; + __u16 attr_id; + __u32 timeout_ms; + __u32 retries; +}; + +struct usa_abi_join_mcast { + __u64 response; + __u64 uid; + __u64 node_guid; + __u64 comp_mask; + __u64 mcmember_rec; + __u8 port_num; +}; + +struct usa_abi_id_resp { + __u32 id; +}; + +struct usa_abi_free_resp { + __u32 events_reported; +}; + +struct usa_abi_free_id { + __u64 response; + __u32 id; +}; + +struct usa_abi_get_event { + __u64 response; +}; + +struct usa_abi_event_resp { + __u64 uid; + __u32 id; + __u32 event; + __u32 status; + __u32 data_len; + __u8 data[IB_USA_EVENT_DATA]; +}; + +struct usa_abi_get_data { + __u64 response; + __u32 id; +}; + +struct usa_abi_get_mcast { + __u64 response; + __u64 node_guid; + __u8 mgid[16]; + __u8 port_num; +}; + +#endif /* SA_CLIENT_ABI_H */ Property changes on: libibsa/include/infiniband/sa_client_abi.h ___________________________________________________________________ Name: svn:executable + * Index: libibsa/include/infiniband/sa_client.h =================================================================== --- libibsa/include/infiniband/sa_client.h (revision 0) +++ libibsa/include/infiniband/sa_client.h (revision 0) @@ -0,0 +1,192 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ + +#if !defined(SA_CLIENT_H) +#define SA_CLIENT_H + +#include +#include + +struct ibv_sa_event_channel { + int fd; +}; + +enum ibv_sa_event_type { + IBV_SA_EVENT_MAD, + IBV_SA_EVENT_MULTICAST +}; + +struct ibv_sa_event { + void *context; + enum ibv_sa_event_type event; + int status; + int attr_count; + int attr_size; + int attr_offset; + uint16_t attr_id; + void *attr; +}; + +/** + * ibv_sa_create_event_channel - Open a channel used to report events. + */ +struct ibv_sa_event_channel *ibv_sa_create_event_channel(void); + +/** + * ibv_sa_destroy_event_channel - Close the event channel. + * @channel: The channel to destroy. + */ +void ibv_sa_destroy_event_channel(struct ibv_sa_event_channel *channel); + +/** + * ibv_sa_send_mad - Send a MAD to the SA. + * @channel: Event channel to report completion to. + * @device: Device to send over. + * @port_num: Port number to send over. + * @method: MAD method to use in the send. + * @attr: Reference to attribute in wire format to send in MAD. + * @attr_id: Attribute type identifier. + * @comp_mask: Component mask to send in MAD. + * @timeout_ms: Time to wait for response, if one is expected. + * @retries: Number of times to retry request. + * @context: User-defined context associated with request. + * + * Send a message to the SA. All values should be in network-byte order. + */ +int ibv_sa_send_mad(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + uint8_t method, void *attr, uint16_t attr_id, + uint64_t comp_mask, int timeout_ms, int retries, + void *context); + +/** + * ibv_sa_get_event - Retrieves the next pending event, if no event is + * pending waits for an event. + * @channel: Event channel to check for events. + * @event: Allocated information about the next event. + * Event should be freed using ibv_sa_ack_event() + */ +int ibv_sa_get_event(struct ibv_sa_event_channel *channel, + struct ibv_sa_event **event); + +/** + * ibv_sa_ack_event - Free an event. + * @event: Event to be released. + * + * All events which are allocated by ibv_sa_get_event() must be released, + * there should be a one-to-one correspondence between successful gets + * and acks. + */ +int ibv_sa_ack_event(struct ibv_sa_event *event); + +/** + * ibv_sa_attr_size - Return the length of an SA attribute on the wire. + * @attr_id: Attribute identifier, in network-byte order. + */ +int ibv_sa_attr_size(uint16_t attr_id); + +static inline void *ibv_sa_get_attr(struct ibv_sa_event *event, int index) +{ + return event->attr + event->attr_offset * index; +} + +/** + * ibv_sa_init_ah_from_path - Initialize address handle attributes. + * @device: Source device. + * @port_num: Source port number. + * @path_rec: Network defined path record. + * @ah_attr: Destination address handle attributes. + */ +int ibv_sa_init_ah_from_path(struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_path_rec *path_rec, + struct ibv_ah_attr *ah_attr); + +/** + * ibv_sa_init_ah_from_mcmember - Initialize address handle attributes. + * @device: Source device. + * @port_num: Source port number. + * @mc_rec: Network defined multicast member record. + * @ah_attr: Destination address handle attributes. + */ +int ibv_sa_init_ah_from_mcmember(struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_mcmember_rec *mc_rec, + struct ibv_ah_attr *ah_attr); + +struct ibv_sa_multicast; + +/** + * ibv_sa_join_multicast - Initiates a join request to the specified multicast + * group. + * @channel: Event channel to report completion to. 
+ * @device: Device to send over. + * @port_num: Port number to send over. + * @rec: SA multicast member record specifying group attributes. + * @comp_mask: Component mask to send in MAD. + * @context: User-defined context associated with join. + * @multicast: Reference to store multicast pointer. + * + * This call initiates a multicast join request with the SA for the specified + * multicast group. If the join operation is started successfully, it returns + * an ibv_sa_multicast structure that is used to track the multicast operation. + * Users must free this structure by calling ibv_sa_free_multicast, even if the + * join operation later fails. + */ +int ibv_sa_join_multicast(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_mcmember_rec *rec, + uint64_t comp_mask, void *context, + struct ibv_sa_multicast **multicast); + +/** + * ibv_sa_free_multicast - Frees the multicast tracking structure, and releases + * any reference on the multicast group. + * @multicast: Multicast tracking structure allocated by ibv_sa_join_multicast. + */ +int ibv_sa_free_multicast(struct ibv_sa_multicast *multicast); + +/** + * ibv_sa_get_mcmember_rec - Looks up a multicast member record by its MGID and + * returns it if found. + * @channel: Event channel to issue query on. + * @device: Device associated with record. + * @port_num: Port number of record. + * @mgid: optional MGID of multicast group. + * @rec: Location to copy SA multicast member record. + * + * If an MGID is specified, returns an existing multicast member record if + * one is found for the local port. If no MGID is specified, or the specified + * MGID is 0, returns a multicast member record filled in with default values + * that may be used to create a new multicast group. + */ +int ibv_sa_get_mcmember_rec(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + union ibv_gid *mgid, + struct ibv_sa_net_mcmember_rec *rec); + +#endif /* SA_CLIENT_H */ Property changes on: libibsa/include/infiniband/sa_client.h ___________________________________________________________________ Name: svn:executable + * Index: libibsa/AUTHORS =================================================================== --- libibsa/AUTHORS (revision 0) +++ libibsa/AUTHORS (revision 0) @@ -0,0 +1 @@ +Sean Hefty Index: libibsa/configure.in =================================================================== --- libibsa/configure.in (revision 0) +++ libibsa/configure.in (revision 0) @@ -0,0 +1,50 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_PREREQ(2.57) +AC_INIT(libibsa, 0.9.0, openib-general at openib.org) +AC_CONFIG_SRCDIR([src/sa_client.c]) +AC_CONFIG_AUX_DIR(config) +AM_CONFIG_HEADER(config.h) +AM_INIT_AUTOMAKE(libibsa, 0.9.0) +AC_DISABLE_STATIC +AM_PROG_LIBTOOL + +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + +dnl Checks for programs +AC_PROG_CC + +dnl Checks for typedefs, structures, and compiler characteristics. +AC_C_CONST +AC_CHECK_SIZEOF(long) + +dnl Checks for libraries +if test "$disable_libcheck" != "yes" +then +AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], + AC_MSG_ERROR([ibv_get_device_list() not found. libibsa requires libibverbs.])) +fi + +dnl Checks for header files. +if test "$disable_libcheck" != "yes" +then +AC_CHECK_HEADER(infiniband/verbs.h, [], + AC_MSG_ERROR([ not found. 
Is libibverbs installed?])) +fi +AC_HEADER_STDC + +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, + if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then + ac_cv_version_script=yes + else + ac_cv_version_script=no + fi) + +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") + +AC_CONFIG_FILES([Makefile libibsa.spec]) +AC_OUTPUT Index: libibsa/INSTALL =================================================================== Index: libibsa/src/libibsa.map =================================================================== --- libibsa/src/libibsa.map (revision 0) +++ libibsa/src/libibsa.map (revision 0) @@ -0,0 +1,21 @@ +IB_SA_1.0 { + global: + ibv_sa_create_event_channel; + ibv_sa_destroy_event_channel; + ibv_sa_send_mad; + ibv_sa_get_event; + ibv_sa_ack_event; + ibv_sa_attr_size; + ibv_sa_get_attr; + ibv_sa_init_ah_from_path; + ibv_sa_init_ah_from_mcmember; + ibv_sa_join_multicast; + ibv_sa_free_multicast; + ibv_sa_get_mcmember_rec; + ibv_sa_get_field; + ibv_sa_set_field; + ibv_sa_unpack_path_rec; + ibv_sa_pack_mcmember_rec; + ibv_sa_unpack_mcmember_rec; + local: *; +}; Index: libibsa/src/sa_client.c =================================================================== --- libibsa/src/sa_client.c (revision 0) +++ libibsa/src/sa_client.c (revision 0) @@ -0,0 +1,552 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: cm.c 3453 2005-09-15 21:43:21Z sean.hefty $ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include + +#define PFX "libibsa: " + +#define container_of(ptr, type, field) \ + ((type *) ((void *) ptr - offsetof(type, field))) + +struct sa_event_tracking { + uint32_t events_completed; + pthread_cond_t cond; + pthread_mutex_t mut; +}; + +struct sa_event { + struct ibv_sa_event event; + struct ibv_sa_event_channel *channel; + void *data; + struct sa_event_tracking *event_tracking; +}; + +struct ibv_sa_multicast { + struct ibv_sa_event_channel *channel; + void *context; + uint32_t id; + struct sa_event_tracking event_tracking; +}; + +static int abi_ver; +static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER; + +#define USA_CREATE_MSG_CMD_RESP(msg, cmd, resp, type, size) \ +do { \ + struct usa_abi_cmd_hdr *hdr; \ + \ + size = sizeof(*hdr) + sizeof(*cmd); \ + msg = alloca(size); \ + if (!msg) \ + return ENOMEM; \ + hdr = msg; \ + cmd = msg + sizeof(*hdr); \ + hdr->cmd = type; \ + hdr->in = sizeof(*cmd); \ + hdr->out = sizeof(*resp); \ + memset(cmd, 0, sizeof(*cmd)); \ + resp = alloca(sizeof(*resp)); \ + if (!resp) \ + return ENOMEM; \ + cmd->response = (uintptr_t)resp; \ +} while (0) + +#define USA_CREATE_MSG_CMD(msg, cmd, type, size) \ +do { \ + struct usa_abi_cmd_hdr *hdr; \ + \ + size = sizeof(*hdr) + sizeof(*cmd); \ + msg = alloca(size); \ + if (!msg) \ + return ENOMEM; \ + hdr = msg; \ + cmd = msg + sizeof(*hdr); \ + hdr->cmd = type; \ + hdr->in = sizeof(*cmd); \ + hdr->out = 0; \ + memset(cmd, 0, sizeof(*cmd)); \ +} while (0) + +static int check_abi_version(void) +{ + char value[8]; + + if (ibv_read_sysfs_file(ibv_get_sysfs_path(), + "class/misc/ib_usa/abi_version", + value, sizeof value) < 0) { + /* + * Older version of Linux do not have class/misc. To support + * backports, assume the most recent version of the ABI. If + * we're wrong, we'll simply fail later when calling the ABI. 
+ */ + abi_ver = IB_USA_MAX_ABI_VERSION; + fprintf(stderr, PFX "couldn't read ABI version, assuming: %d\n", + abi_ver); + return 0; + } + + abi_ver = strtol(value, NULL, 10); + if (abi_ver < IB_USA_MIN_ABI_VERSION || + abi_ver > IB_USA_MAX_ABI_VERSION) { + fprintf(stderr, PFX "kernel ABI version %d " + "doesn't match library version %d.\n", + abi_ver, IB_USA_MAX_ABI_VERSION); + return -1; + } + return 0; +} + +static int usa_init(void) +{ + int ret = 0; + + pthread_mutex_lock(&mut); + if (!abi_ver) + ret = check_abi_version(); + pthread_mutex_unlock(&mut); + + return ret; +} + +struct ibv_sa_event_channel *ibv_sa_create_event_channel(void) +{ + struct ibv_sa_event_channel *channel; + + if (usa_init()) + return NULL; + + channel = malloc(sizeof *channel); + if (!channel) + return NULL; + + channel->fd = open("/dev/infiniband/ib_usa", O_RDWR); + if (channel->fd < 0) { + fprintf(stderr, PFX "unable to open /dev/infiniband/ib_usa\n"); + goto err; + } + return channel; +err: + free(channel); + return NULL; +} + +void ibv_sa_destroy_event_channel(struct ibv_sa_event_channel *channel) +{ + close(channel->fd); + free(channel); +} + +static int init_event_tracking(struct sa_event_tracking *event_tracking) +{ + pthread_mutex_init(&event_tracking->mut, NULL); + return pthread_cond_init(&event_tracking->cond, NULL); +} + +static void cleanup_event_tracking(struct sa_event_tracking *event_tracking) +{ + pthread_cond_destroy(&event_tracking->cond); + pthread_mutex_destroy(&event_tracking->mut); +} + +static void wait_for_events(struct sa_event_tracking *event_tracking, + int events_reported) +{ + pthread_mutex_lock(&event_tracking->mut); + while (event_tracking->events_completed < events_reported) + pthread_cond_wait(&event_tracking->cond, &event_tracking->mut); + pthread_mutex_unlock(&event_tracking->mut); +} + +static void complete_event(struct sa_event_tracking *event_tracking) +{ + pthread_mutex_lock(&event_tracking->mut); + event_tracking->events_completed++; + pthread_cond_signal(&event_tracking->cond); + pthread_mutex_unlock(&event_tracking->mut); +} + +int ibv_sa_send_mad(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + uint8_t method, void *attr, uint16_t attr_id, + uint64_t comp_mask, int timeout_ms, int retries, + void *context) +{ + struct usa_abi_send_mad *cmd; + void *msg; + int ret, size; + + USA_CREATE_MSG_CMD(msg, cmd, USA_CMD_SEND_MAD, size); + cmd->uid = (uintptr_t) context; + cmd->node_guid = ibv_get_device_guid(device->device); + cmd->comp_mask = comp_mask; + cmd->attr = (uintptr_t) attr; + cmd->port_num = port_num; + cmd->method = method; + cmd->attr_id = attr_id; + cmd->timeout_ms = timeout_ms; + cmd->retries = retries; + + ret = write(channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? 
ENODATA : ret; + + return 0; +} + +static void copy_event_attr(struct sa_event *evt, + struct usa_abi_event_resp *resp) +{ + struct ibv_sa_mad *mad; + int size; + + size = resp->data_len - IBV_SA_MAD_HEADER_SIZE; + if (size <= 0) + return; + + evt->data = malloc(size); + if (!evt->data) + return; + + mad = (struct ibv_sa_mad *) resp->data; + memcpy(evt->data, mad->sa_data, size); + evt->event.attr = evt->data; + evt->event.attr_id = mad->attribute_id; + evt->event.attr_size = ibv_sa_attr_size(mad->attribute_id); + evt->event.attr_offset = ntohs(mad->attribute_offset) * 8; + if (evt->event.attr_offset) + evt->event.attr_count = size / evt->event.attr_offset; +} + +static int get_event_attr(struct sa_event *evt, + struct usa_abi_event_resp *resp) +{ + struct ibv_sa_mad *mad; + struct usa_abi_get_data *cmd; + void *msg; + int ret, size; + + USA_CREATE_MSG_CMD(msg, cmd, USA_CMD_GET_DATA, size); + cmd->id = resp->id; + + evt->data = malloc(resp->data_len); + if (evt->data) { + cmd->response = (uintptr_t) evt->data; + ((struct usa_abi_cmd_hdr *) msg)->out = resp->data_len; + } + + ret = write(evt->channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? ENODATA : ret; + + mad = (struct ibv_sa_mad *) resp->data; + evt->event.attr = evt->data; + evt->event.attr_id = mad->attribute_id; + evt->event.attr_size = ibv_sa_attr_size(mad->attribute_id); + evt->event.attr_offset = ntohs(mad->attribute_offset) * 8; + if (evt->event.attr_offset) + evt->event.attr_count = (resp->data_len - + IBV_SA_MAD_HEADER_SIZE) / + evt->event.attr_offset; + return 0; +} + +static void process_mad_event(struct sa_event *evt, + struct usa_abi_event_resp *resp) +{ + evt->event.context = (void *) (uintptr_t) resp->uid; + if (resp->data_len <= IB_USA_EVENT_DATA) + copy_event_attr(evt, resp); + else + get_event_attr(evt, resp); +} + +static void process_mcast_event(struct sa_event *evt, + struct usa_abi_event_resp *resp) +{ + struct ibv_sa_multicast *multicast; + + multicast = (void *) (uintptr_t) resp->uid; + evt->event.context = multicast->context; + evt->event_tracking = &multicast->event_tracking; + multicast->id = resp->id; + + evt->data = malloc(IBV_SA_ATTR_MC_MEMBER_REC_LEN); + if (!evt->data) + return; + + memcpy(evt->data, resp->data, IBV_SA_ATTR_MC_MEMBER_REC_LEN); + evt->event.attr = evt->data; + evt->event.attr_id = IBV_SA_ATTR_MC_MEMBER_REC; + evt->event.attr_size = IBV_SA_ATTR_MC_MEMBER_REC_LEN; + evt->event.attr_offset = IBV_SA_ATTR_MC_MEMBER_REC_LEN; + evt->event.attr_count = 1; +} + +int ibv_sa_get_event(struct ibv_sa_event_channel *channel, + struct ibv_sa_event **event) +{ + struct usa_abi_get_event *cmd; + struct usa_abi_event_resp *resp; + struct sa_event *evt; + void *msg; + int ret, size; + + evt = malloc(sizeof *evt); + if (!evt) + return ENOMEM; + memset(evt, 0, sizeof *evt); + + USA_CREATE_MSG_CMD_RESP(msg, cmd, resp, USA_CMD_GET_EVENT, size); + ret = write(channel->fd, msg, size); + if (ret != size) { + ret = (ret > 0) ? 
ENODATA : ret; + goto err; + } + + evt->channel = channel; + evt->event.event = resp->event; + evt->event.status = resp->status; + + switch (resp->event) { + case USA_EVENT_MAD: + process_mad_event(evt, resp); + break; + case USA_EVENT_MCAST: + process_mcast_event(evt, resp); + break; + default: + break; + } + + *event = &evt->event; + return 0; +err: + free(evt); + return ret; +} + +int ibv_sa_ack_event(struct ibv_sa_event *event) +{ + struct sa_event *evt = container_of(event, struct sa_event, event); + + if (evt->data) + free(evt->data); + + if (evt->event_tracking) + complete_event(evt->event_tracking); + + free(event); + return 0; +} + +int ibv_sa_join_multicast(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_mcmember_rec *rec, + uint64_t comp_mask, void *context, + struct ibv_sa_multicast **multicast) +{ + struct usa_abi_join_mcast *cmd; + struct usa_abi_id_resp *resp; + struct ibv_sa_multicast *mcast; + void *msg; + int ret, size; + + mcast = malloc(sizeof *mcast); + if (!mcast) + return ENOMEM; + memset(mcast, 0, sizeof *mcast); + + mcast->channel = channel; + mcast->context = context; + ret = init_event_tracking(&mcast->event_tracking); + if (ret) + goto err; + + USA_CREATE_MSG_CMD_RESP(msg, cmd, resp, USA_CMD_JOIN_MCAST, size); + cmd->uid = (uintptr_t) mcast; + cmd->node_guid = ibv_get_device_guid(device->device); + cmd->comp_mask = comp_mask; + cmd->mcmember_rec = (uintptr_t) rec; + cmd->port_num = port_num; + + ret = write(channel->fd, msg, size); + if (ret != size) { + ret = (ret > 0) ? ENODATA : ret; + goto err; + } + + mcast->id = resp->id; + *multicast = mcast; + return 0; +err: + cleanup_event_tracking(&mcast->event_tracking); + free(mcast); + return ret; +} + +int ibv_sa_free_multicast(struct ibv_sa_multicast *multicast) +{ + struct usa_abi_free_id *cmd; + struct usa_abi_free_resp *resp; + void *msg; + int ret, size; + + USA_CREATE_MSG_CMD_RESP(msg, cmd, resp, USA_CMD_FREE_ID, size); + ret = write(multicast->channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? ENODATA : ret; + + wait_for_events(&multicast->event_tracking, resp->events_reported); + + cleanup_event_tracking(&multicast->event_tracking); + free(multicast); + return 0; +} + +int ibv_sa_get_mcmember_rec(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + union ibv_gid *mgid, + struct ibv_sa_net_mcmember_rec *rec) +{ + struct usa_abi_get_mcast *cmd; + void *msg; + int ret, size; + + USA_CREATE_MSG_CMD(msg, cmd, USA_CMD_GET_MCAST, size); + cmd->node_guid = ibv_get_device_guid(device->device); + cmd->port_num = port_num; + cmd->response = (uintptr_t) rec; + ((struct usa_abi_cmd_hdr *) msg)->out = sizeof *rec; + if (mgid) + memcpy(cmd->mgid, mgid->raw, sizeof *mgid); + + ret = write(channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? 
ENODATA : ret; + + return 0; +} + +static int get_gid_index(struct ibv_context *device, uint8_t port_num, + union ibv_gid *sgid) +{ + union ibv_gid gid; + int i, ret; + + for (i = 0, ret = 0; !ret; i++) { + ret = ibv_query_gid(device, port_num, i, &gid); + if (!ret && !memcmp(sgid, &gid, sizeof gid)) { + ret = i; + break; + } + } + return ret; +} + +int ibv_sa_init_ah_from_path(struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_path_rec *path_rec, + struct ibv_ah_attr *ah_attr) +{ + struct ibv_sa_path_rec rec; + int ret; + + ibv_sa_unpack_path_rec(&rec, path_rec); + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = ntohs(rec.dlid); + ah_attr->sl = rec.sl; + ah_attr->src_path_bits = ntohs(rec.slid) & 0x7F; + ah_attr->port_num = port_num; + + if (rec.hop_limit > 1) { + ah_attr->is_global = 1; + ah_attr->grh.dgid = rec.dgid; + ret = get_gid_index(device, port_num, &rec.sgid); + if (ret < 0) + return ret; + + ah_attr->grh.sgid_index = (uint8_t) ret; + ah_attr->grh.flow_label = ntohl(rec.flow_label); + ah_attr->grh.hop_limit = rec.hop_limit; + ah_attr->grh.traffic_class = rec.traffic_class; + } + return 0; +} + +int ibv_sa_init_ah_from_mcmember(struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_mcmember_rec *mc_rec, + struct ibv_ah_attr *ah_attr) +{ + struct ibv_sa_mcmember_rec rec; + int ret; + + ibv_sa_unpack_mcmember_rec(&rec, mc_rec); + + ret = get_gid_index(device, port_num, &rec.port_gid); + if (ret < 0) + return ret; + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = ntohs(rec.mlid); + ah_attr->sl = rec.sl; + ah_attr->port_num = port_num; + ah_attr->static_rate = rec.rate; + + ah_attr->is_global = 1; + ah_attr->grh.dgid = rec.mgid; + + ah_attr->grh.sgid_index = (uint8_t) ret; + ah_attr->grh.flow_label = ntohl(rec.flow_label); + ah_attr->grh.hop_limit = rec.hop_limit; + ah_attr->grh.traffic_class = rec.traffic_class; + return 0; +} Property changes on: libibsa/src/sa_client.c ___________________________________________________________________ Name: svn:executable + * Index: libibsa/src/sa_net.c =================================================================== --- libibsa/src/sa_net.c (revision 0) +++ libibsa/src/sa_net.c (revision 0) @@ -0,0 +1,265 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: cm.c 3453 2005-09-15 21:43:21Z sean.hefty $ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include + +#include + +int ibv_sa_attr_size(uint16_t attr_id) +{ + int size; + + switch (attr_id) { + case IBV_SA_ATTR_CLASS_PORTINFO: + size = IBV_SA_ATTR_CLASS_PORTINFO_LEN; + break; + case IBV_SA_ATTR_NOTICE: + size = IBV_SA_ATTR_NOTICE_LEN; + break; + case IBV_SA_ATTR_INFORM_INFO: + size = IBV_SA_ATTR_INFORM_INFO_LEN; + break; + case IBV_SA_ATTR_NODE_REC: + size = IBV_SA_ATTR_NODE_REC_LEN; + break; + case IBV_SA_ATTR_PORT_INFO_REC: + size = IBV_SA_ATTR_PORT_INFO_REC_LEN; + break; + case IBV_SA_ATTR_SL2VL_REC: + size = IBV_SA_ATTR_SL2VL_REC_LEN; + break; + case IBV_SA_ATTR_SWITCH_REC: + size = IBV_SA_ATTR_SWITCH_REC_LEN; + break; + case IBV_SA_ATTR_LINEAR_FDB_REC: + size = IBV_SA_ATTR_LINEAR_FDB_REC_LEN; + break; + case IBV_SA_ATTR_RANDOM_FDB_REC: + size = IBV_SA_ATTR_RANDOM_FDB_REC_LEN; + break; + case IBV_SA_ATTR_MCAST_FDB_REC: + size = IBV_SA_ATTR_MCAST_FDB_REC_LEN; + break; + case IBV_SA_ATTR_SM_INFO_REC: + size = IBV_SA_ATTR_SM_INFO_REC_LEN; + break; + case IBV_SA_ATTR_LINK_REC: + size = IBV_SA_ATTR_LINK_REC_LEN; + break; + case IBV_SA_ATTR_GUID_INFO_REC: + size = IBV_SA_ATTR_GUID_INFO_REC_LEN; + break; + case IBV_SA_ATTR_SERVICE_REC: + size = IBV_SA_ATTR_SERVICE_REC_LEN; + break; + case IBV_SA_ATTR_PARTITION_REC: + size = IBV_SA_ATTR_PARTITION_REC_LEN; + break; + case IBV_SA_ATTR_PATH_REC: + size = IBV_SA_ATTR_PATH_REC_LEN; + break; + case IBV_SA_ATTR_VL_ARB_REC: + size = IBV_SA_ATTR_VL_ARB_REC_LEN; + break; + case IBV_SA_ATTR_MC_MEMBER_REC: + size = IBV_SA_ATTR_MC_MEMBER_REC_LEN; + break; + case IBV_SA_ATTR_TRACE_REC: + size = IBV_SA_ATTR_TRACE_REC_LEN; + break; + case IBV_SA_ATTR_MULTI_PATH_REC: + size = IBV_SA_ATTR_MULTI_PATH_REC_LEN; + break; + case IBV_SA_ATTR_SERVICE_ASSOC_REC: + size = IBV_SA_ATTR_SERVICE_ASSOC_REC_LEN; + break; + case IBV_SA_ATTR_INFORM_INFO_REC: + size = IBV_SA_ATTR_INFORM_INFO_REC_LEN; + break; + default: + size = 0; + break; + } + return size; +} + +uint32_t ibv_sa_get_field(void *data, int offset, int size) +{ + uint32_t value, left_offset; + + left_offset = offset & 0x07; + if (size <= 8) { + value = ((uint8_t *) data)[offset / 8]; + value = ((value << left_offset) & 0xFF) >> (8 - size); + } else if (size <= 16) { + value = ntohs(((uint16_t *) data)[offset / 16]); + value = ((value << left_offset) & 0xFFFF) >> (16 - size); + } else { + value = ntohl(((uint32_t *) data)[offset / 32]); + value = (value << left_offset) >> (32 - size); + } + return value; +} + +void ibv_sa_set_field(void *data, uint32_t value, int offset, int size) +{ + uint32_t left_value, right_value; + uint32_t left_offset, right_offset; + uint32_t field_size; + + if (size <= 8) + field_size = 8; + else if (size <= 16) + field_size = 16; + else + field_size = 32; + + left_offset = offset & 0x07; + right_offset = field_size - left_offset - size; + + left_value = left_offset ? ibv_sa_get_field(data, offset - left_offset, + left_offset) : 0; + right_value = right_offset ? 
ibv_sa_get_field(data, offset + size, + right_offset) : 0; + + value = (left_value << (size + right_offset)) | + (value << right_offset) | right_value; + + if (field_size == 8) + ((uint8_t *) data)[offset / 8] = (uint8_t) value; + else if (field_size == 16) + ((uint16_t *) data)[offset / 16] = htons((uint16_t) value); + else + ((uint32_t *) data)[offset / 32] = htonl((uint32_t) value); +} + +void ibv_sa_unpack_path_rec(struct ibv_sa_path_rec *rec, + struct ibv_sa_net_path_rec *net_rec) +{ + memcpy(rec->dgid.raw, net_rec->dgid, sizeof net_rec->dgid); + memcpy(rec->sgid.raw, net_rec->sgid, sizeof net_rec->sgid); + rec->dlid = net_rec->dlid; + rec->slid = net_rec->slid; + + rec->raw_traffic = ibv_sa_get_field(net_rec, 352, 1); + rec->flow_label = htonl(ibv_sa_get_field(net_rec, 356, 20)); + rec->hop_limit = (uint8_t) ibv_sa_get_field(net_rec, 376, 8); + rec->traffic_class = net_rec->tclass; + + rec->reversible = htonl(ibv_sa_get_field(net_rec, 392, 1)); + rec->numb_path = (uint8_t) ibv_sa_get_field(net_rec, 393, 7); + rec->pkey = net_rec->pkey; + rec->sl = (uint8_t) ibv_sa_get_field(net_rec, 428, 4); + + rec->mtu_selector = (uint8_t) ibv_sa_get_field(net_rec, 432, 2); + rec->mtu = (uint8_t) ibv_sa_get_field(net_rec, 434, 6); + + rec->rate_selector = (uint8_t) ibv_sa_get_field(net_rec, 440, 2); + rec->rate = (uint8_t) ibv_sa_get_field(net_rec, 442, 6); + + rec->packet_life_time_selector = (uint8_t) ibv_sa_get_field(net_rec, + 448, 2); + rec->packet_life_time = (uint8_t) ibv_sa_get_field(net_rec, 450, 6); + + rec->preference = net_rec->preference; +} + +void ibv_sa_pack_mcmember_rec(struct ibv_sa_net_mcmember_rec *net_rec, + struct ibv_sa_mcmember_rec *rec) +{ + memcpy(net_rec->mgid, rec->mgid.raw, sizeof net_rec->mgid); + memcpy(net_rec->port_gid, rec->port_gid.raw, sizeof net_rec->port_gid); + net_rec->qkey = rec->qkey; + net_rec->mlid = rec->mlid; + + ibv_sa_set_field(net_rec, rec->mtu_selector, 304, 2); + ibv_sa_set_field(net_rec, rec->mtu, 306, 6); + + net_rec->tclass = rec->traffic_class; + net_rec->pkey = rec->pkey; + + ibv_sa_set_field(net_rec, rec->rate_selector, 336, 2); + ibv_sa_set_field(net_rec, rec->rate, 338, 6); + + ibv_sa_set_field(net_rec, rec->packet_life_time_selector, 344, 2); + ibv_sa_set_field(net_rec, rec->packet_life_time, 346, 6); + + ibv_sa_set_field(net_rec, rec->sl, 352, 4); + ibv_sa_set_field(net_rec, ntohl(rec->flow_label), 356, 20); + ibv_sa_set_field(net_rec, rec->hop_limit, 376, 8); + + ibv_sa_set_field(net_rec, rec->scope, 384, 4); + ibv_sa_set_field(net_rec, rec->join_state, 388, 4); + + ibv_sa_set_field(net_rec, rec->proxy_join, 392, 1); +} + +void ibv_sa_unpack_mcmember_rec(struct ibv_sa_mcmember_rec *rec, + struct ibv_sa_net_mcmember_rec *net_rec) +{ + memcpy(rec->mgid.raw, net_rec->mgid, sizeof rec->mgid); + memcpy(rec->port_gid.raw, net_rec->port_gid, sizeof rec->port_gid); + rec->qkey = net_rec->qkey; + rec->mlid = net_rec->mlid; + + rec->mtu_selector = (uint8_t) ibv_sa_get_field(net_rec, 304, 2); + rec->mtu = (uint8_t) ibv_sa_get_field(net_rec, 306, 6); + + rec->traffic_class = net_rec->tclass; + rec->pkey = net_rec->pkey; + + rec->rate_selector = (uint8_t) ibv_sa_get_field(net_rec, 336, 2); + rec->rate = (uint8_t) ibv_sa_get_field(net_rec, 338, 6); + + rec->packet_life_time_selector = (uint8_t) ibv_sa_get_field(net_rec, + 344, 2); + rec->packet_life_time = (uint8_t) ibv_sa_get_field(net_rec, 346, 6); + + rec->sl = (uint8_t) ibv_sa_get_field(net_rec, 352, 4); + rec->flow_label = htonl(ibv_sa_get_field(net_rec, 356, 20)); + rec->hop_limit = 
ibv_sa_get_field(net_rec, 376, 8); + + rec->scope = (uint8_t) ibv_sa_get_field(net_rec, 384, 4); + rec->join_state = (uint8_t) ibv_sa_get_field(net_rec, 388, 4); + + rec->proxy_join = ibv_sa_get_field(net_rec, 392, 1); +} Property changes on: libibsa/src/sa_net.c ___________________________________________________________________ Name: svn:executable + * Index: libibsa/ChangeLog =================================================================== Index: libibsa/COPYING =================================================================== --- libibsa/COPYING (revision 0) +++ libibsa/COPYING (revision 0) @@ -0,0 +1,378 @@ +This software is available to you under a choice of one of two +licenses. You may choose to be licensed under the terms of the the +OpenIB.org BSD license or the GNU General Public License (GPL) Version +2, both included below. + +Copyright (c) 2005 Intel Corporation. All rights reserved. + +================================================================== + + OpenIB.org BSD license + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + + * Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + * Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials provided + with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS +FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE +COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, +INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, +BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN +ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +================================================================== + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. 
Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. 
+ +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. 
(This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. 
+ +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. 
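Before the build files that follow, it may help to see how the sa_client.h API declared earlier in this patch hangs together: a query is issued with ibv_sa_send_mad(), the response arrives as an event whose attr/attr_offset/attr_count fields describe a packed array of records, ibv_sa_get_attr() indexes into that array, and every event returned by ibv_sa_get_event() must be released with ibv_sa_ack_event(). The sketch below is illustrative only -- it is not part of the patch, and is essentially a condensed form of what examples/satest.c further down does. It assumes <infiniband/sa_client.h>, <infiniband/sa_net.h>, <arpa/inet.h> and <stdio.h> are included, and that the caller already holds an open verbs context, an event channel, and a path record with its SLID filled in.

/* Illustration only -- not from the patch.  Walk a GetTable path
 * record response; records are returned in wire (network) format. */
static int walk_path_records(struct ibv_sa_event_channel *channel,
                             struct ibv_context *verbs,
                             struct ibv_sa_net_path_rec *path_rec)
{
        struct ibv_sa_event *event;
        struct ibv_sa_net_path_rec *rec;
        int i, ret, status;

        ret = ibv_sa_send_mad(channel, verbs, 1, IBV_SA_METHOD_GET_TABLE,
                              path_rec, IBV_SA_ATTR_PATH_REC,
                              IBV_SA_PATH_REC_SLID, 3000, 3, NULL);
        if (ret)
                return ret;

        ret = ibv_sa_get_event(channel, &event);
        if (ret)
                return ret;

        status = event->status;
        for (i = 0; !status && i < event->attr_count; i++) {
                rec = ibv_sa_get_attr(event, i);
                /* fields are big-endian; sa_net.c's ibv_sa_get_field()
                 * and the unpack helpers convert them */
                printf("path %d: dlid 0x%x\n", i, ntohs(rec->dlid));
        }

        ibv_sa_ack_event(event);        /* one ack per successful get */
        return status;
}
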
Index: libibsa/Makefile.am =================================================================== --- libibsa/Makefile.am (revision 0) +++ libibsa/Makefile.am (revision 0) @@ -0,0 +1,40 @@ +# $Id: Makefile.am 3373 2005-09-12 16:34:20Z roland $ +INCLUDES = -I$(srcdir)/include + +AM_CFLAGS = -g -Wall -D_GNU_SOURCE + +ibsalibdir = $(libdir) + +ibsalib_LTLIBRARIES = src/libibsa.la + +src_ibsa_la_CFLAGS = -g -Wall -D_GNU_SOURCE + +if HAVE_LD_VERSION_SCRIPT + ibsa_version_script = -Wl,--version-script=$(srcdir)/src/libibsa.map +else + ibsa_version_script = +endif + +src_libibsa_la_SOURCES = src/sa_client.c src/sa_net.c +src_libibsa_la_LDFLAGS = -avoid-version $(ibsa_version_script) + +bin_PROGRAMS = examples/satest examples/mchammer +examples_satest_SOURCES = examples/satest.c +examples_satest_LDADD = $(top_builddir)/src/libibsa.la +examples_mchammer_SOURCES = examples/mchammer.c +examples_mchammer_LDADD = $(top_builddir)/src/libibsa.la + +libibsaincludedir = $(includedir)/infiniband + +libibsainclude_HEADERS = include/infiniband/sa_client_abi.h \ + include/infiniband/sa_client.h \ + include/infiniband/sa_net.h + +EXTRA_DIST = include/infiniband/sa_client_abi.h \ + include/infiniband/sa_client.h \ + include/infiniband/sa_net.h \ + src/libibsa.map \ + libibsa.spec.in + +dist-hook: libibsa.spec + cp libibsa.spec $(distdir) Index: libibsa/autogen.sh =================================================================== --- libibsa/autogen.sh (revision 0) +++ libibsa/autogen.sh (revision 0) @@ -0,0 +1,8 @@ +#! /bin/sh + +set -x +aclocal -I config +libtoolize --force --copy +autoheader +automake --foreign --add-missing --copy +autoconf Property changes on: libibsa/autogen.sh ___________________________________________________________________ Name: svn:executable + * Index: libibsa/NEWS =================================================================== Index: libibsa/README =================================================================== --- libibsa/README (revision 0) +++ libibsa/README (revision 0) @@ -0,0 +1,16 @@ +This README is for userspace SA client library. + +Building + +To make this directory, run: +./autogen.sh && ./configure && make && make install + +Typically the autogen and configure steps only need be done the first +time unless configure.in or Makefile.am changes. + +Libraries are installed by default at /usr/local/lib. + +Device files + +The userspace SA client uses a single device file under misc devices +regardless of the number of adapters or ports present. Index: libibsa/examples/satest.c =================================================================== --- libibsa/examples/satest.c (revision 0) +++ libibsa/examples/satest.c (revision 0) @@ -0,0 +1,225 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include + +#include +#include + +/* + * To execute: + * satest slid dlid + */ + +struct ibv_context *verbs; +struct ibv_sa_event_channel *channel; +uint16_t slid; +uint16_t dlid; + +static int init(void) +{ + struct ibv_device **dev_list; + int ret = 0; + + dev_list = ibv_get_device_list(NULL); + if (!dev_list[0]) + return -1; + + verbs = ibv_open_device(dev_list[0]); + ibv_free_device_list(dev_list); + if (!verbs) + return -1; + + channel = ibv_sa_create_event_channel(); + if (!channel) { + printf("ibv_sa_create_event_channel failed\n"); + ibv_close_device(verbs); + ret = 1; + } + + return ret; +} + +static void cleanup(void) +{ + ibv_sa_destroy_event_channel(channel); + ibv_close_device(verbs); +} + +static int query_one_path(struct ibv_sa_net_path_rec *path_rec) +{ + struct ibv_sa_event *event; + int ret; + + path_rec->slid = slid; + path_rec->dlid = dlid; + ibv_sa_set_field(path_rec, 1, IBV_SA_PATH_REC_NUMB_PATH_OFFSET, + IBV_SA_PATH_REC_NUMB_PATH_LENGTH); + ret = ibv_sa_send_mad(channel, verbs, 1, IBV_SA_METHOD_GET, + path_rec, IBV_SA_ATTR_PATH_REC, + IBV_SA_PATH_REC_SLID | + IBV_SA_PATH_REC_DLID | + IBV_SA_PATH_REC_NUMB_PATH, 3000, 3, NULL); + if (ret) { + printf("query_one_path ibv_sa_send_mad failed: %d\n", ret); + return ret; + } + + ret = ibv_sa_get_event(channel, &event); + if (ret) { + printf("query_one_path ibv_sa_get_event failed: %d\n", ret); + return ret; + } + + if (event->status) { + printf("query_one_path: status = %d\n", event->status); + ret = event->status; + goto out; + } + + memcpy(path_rec, event->attr, event->attr_size); +out: + ibv_sa_ack_event(event); + return ret; +} + +static int query_many_paths(struct ibv_sa_event **event) +{ + struct ibv_sa_net_path_rec path_rec; + int ret; + + path_rec.slid = slid; + ret = ibv_sa_send_mad(channel, verbs, 1, IBV_SA_METHOD_GET_TABLE, + &path_rec, IBV_SA_ATTR_PATH_REC, + IBV_SA_PATH_REC_SLID, 3000, 3, NULL); + if (ret) { + printf("query_many_paths ibv_sa_send_mad failed: %d\n", ret); + return ret; + } + + ret = ibv_sa_get_event(channel, event); + if (ret) { + printf("query_many_paths ibv_sa_get_event failed: %d\n", ret); + return ret; + } + + if ((*event)->status) { + printf("query_many_paths: status = %d\n", (*event)->status); + ret = (*event)->status; + goto err; + } + + return 0; +err: + ibv_sa_ack_event(*event); + return ret; +} + +static int verify_paths(struct ibv_sa_net_path_rec *path_rec, + struct ibv_sa_event *event) +{ + struct ibv_sa_net_path_rec *rec; + int i, ret = -1; + + if (path_rec->slid != slid || path_rec->dlid != dlid) { + printf("path_rec slid or dlid does not match request\n"); + return -1; + } + + for (i = 0; i < event->attr_count; i++) { + rec = ibv_sa_get_attr(event, i); + + if (rec->slid != slid) { + printf("rec slid does 
not match request\n"); + return -1; + } + + if (path_rec->dlid == rec->dlid && + !memcmp(path_rec, rec, sizeof *rec)) + ret = 0; + } + + if (ret) + printf("path_rec not found in returned list\n"); + + return ret; +} + +static int run_path_query(void) +{ + struct ibv_sa_net_path_rec path_rec; + struct ibv_sa_event *event; + int ret; + + ret = query_one_path(&path_rec); + if (ret) + return ret; + + ret = query_many_paths(&event); + if (ret) + return ret; + + ret = verify_paths(&path_rec, event); + + ibv_sa_ack_event(event); + return ret; +} + +int main(int argc, char **argv) +{ + int ret; + + if (argc != 3) { + printf("usage: %s slid dlid\n", argv[0]); + exit(1); + } + + slid = htons((uint16_t) atoi(argv[1])); + dlid = htons((uint16_t) atoi(argv[2])); + + if (init()) + exit(1); + + ret = run_path_query(); + + printf("test complete\n"); + cleanup(); + printf("return status %d\n", ret); + return ret; +} Property changes on: libibsa/examples/satest.c ___________________________________________________________________ Name: svn:executable + * Index: libibsa/examples/mchammer.c =================================================================== --- libibsa/examples/mchammer.c (revision 0) +++ libibsa/examples/mchammer.c (revision 0) @@ -0,0 +1,391 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +/* + * To execute: + * mchammer {r | s} + */ + +struct ibv_context *verbs; +struct ibv_pd *pd; +struct ibv_cq *cq; +struct ibv_qp *qp; +struct ibv_mr *mr; +void *msgs; + +static int message_count = 10; +static int message_size = 100; +static int sender; + +static int post_recvs(void) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + + sge.length = message_size + sizeof(struct ibv_grh);; + sge.lkey = mr->lkey; + sge.addr = (uintptr_t) msgs; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int create_qp(void) +{ + struct ibv_qp_init_attr init_qp_attr; + struct ibv_qp_attr qp_attr; + int ret; + + memset(&init_qp_attr, 0, sizeof init_qp_attr); + init_qp_attr.cap.max_send_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_recv_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_send_sge = 1; + init_qp_attr.cap.max_recv_sge = 1; + init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.send_cq = cq; + init_qp_attr.recv_cq = cq; + qp = ibv_create_qp(pd, &init_qp_attr); + if (!qp) { + printf("unable to create QP\n"); + return -1; + } + + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.pkey_index = 0; + qp_attr.port_num = 1; + qp_attr.qkey = 0x01234567; + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | + IBV_QP_PORT | IBV_QP_QKEY); + if (ret) { + printf("failed to modify QP to INIT\n"); + goto err; + } + + qp_attr.qp_state = IBV_QPS_RTR; + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE); + if (ret) { + printf("failed to modify QP to RTR\n"); + goto err; + } + + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 0; + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_SQ_PSN); + if (ret) { + printf("failed to modify QP to RTS\n"); + goto err; + } + return 0; +err: + ibv_destroy_qp(qp); + return ret; +} + +static int create_messages(void) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + msgs = malloc(message_size + sizeof(struct ibv_grh)); + if (!msgs) { + printf("failed message allocation\n"); + return -1; + } + mr = ibv_reg_mr(pd, msgs, message_size + sizeof(struct ibv_grh), + IBV_ACCESS_LOCAL_WRITE); + if (!mr) { + printf("failed to reg MR\n"); + free(msgs); + return -1; + } + return 0; +} + +static void destroy_messages(void) +{ + if (!message_count) + return; + + ibv_dereg_mr(mr); + free(msgs); +} + +static int init(void) +{ + struct ibv_device **dev_list; + int ret; + + dev_list = ibv_get_device_list(NULL); + if (!dev_list[0]) + return -1; + + verbs = ibv_open_device(dev_list[0]); + ibv_free_device_list(dev_list); + if (!verbs) + return -1; + + pd = ibv_alloc_pd(verbs); + if (!pd) { + printf("unable to alloc PD\n"); + return -1; + } + + ret = create_messages(); + if (ret) { + printf("unable to create test messages\n"); + goto err1; + } + + cq = ibv_create_cq(verbs, message_count, NULL, NULL, 0); + if (!cq) { + printf("unable to create CQ\n"); + ret = -1; + goto err2; + } + + ret = create_qp(); + if (ret) { + printf("unable to create QP\n"); + goto err3; + } + return 0; + +err3: + ibv_destroy_cq(cq); +err2: + destroy_messages(); +err1: + ibv_dealloc_pd(pd); + return -1; +} + +static void cleanup(void) +{ + 
ibv_destroy_qp(qp); + ibv_destroy_cq(cq); + destroy_messages(); + ibv_dealloc_pd(pd); + ibv_close_device(verbs); +} + +static int send_msgs(struct ibv_ah *ah) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + struct ibv_sge sge; + int i, ret = 0; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND; + send_wr.send_flags = 0; + + send_wr.wr.ud.ah = ah; + send_wr.wr.ud.remote_qpn = 0xFFFFFF; + send_wr.wr.ud.remote_qkey = 0x01234567; + + sge.length = message_size; + sge.lkey = mr->lkey; + sge.addr = (uintptr_t) msgs; + + for (i = 0; i < message_count && !ret; i++) { + ret = ibv_post_send(qp, &send_wr, &bad_send_wr); + if (ret) + printf("failed to post sends: %d\n", ret); + } + return ret; +} + +static int poll_cq(void) +{ + struct ibv_wc wc[8]; + int done, ret; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(cq, 8, wc); + if (ret < 0) { + printf("failed polling CQ: %d\n", ret); + return ret; + } + } + + return 0; +} + +static int run(void) +{ + struct ibv_sa_event_channel *channel; + struct ibv_sa_multicast *mcast; + struct ibv_sa_event *event; + struct ibv_sa_net_mcmember_rec mc_rec, *rec; + struct ibv_ah_attr ah_attr; + struct ibv_ah *ah; + uint64_t comp_mask; + int ret; + + channel = ibv_sa_create_event_channel(); + if (!channel) { + printf("ibv_sa_create_event_channel failed\n"); + return -1; + } + + ret = ibv_sa_get_mcmember_rec(channel, verbs, 1, NULL, &mc_rec); + if (ret) { + printf("ibv_sa_get_mcmember_rec failed\n"); + goto out1; + } + + printf("joining multicast group\n"); + mc_rec.mgid[0] = 0xFF; /* multicast GID */ + mc_rec.mgid[1] = 0x12; /* not permanent (7:4), link-local (3:0) */ + strcpy(&mc_rec.mgid[2], "7471"); /* our GID */ + mc_rec.qkey = htonl(0x01234567); + comp_mask = IBV_SA_MCMEMBER_REC_MGID | IBV_SA_MCMEMBER_REC_PORT_GID | + IBV_SA_MCMEMBER_REC_PKEY | IBV_SA_MCMEMBER_REC_JOIN_STATE | + IBV_SA_MCMEMBER_REC_QKEY | IBV_SA_MCMEMBER_REC_SL | + IBV_SA_MCMEMBER_REC_FLOW_LABEL | + IBV_SA_MCMEMBER_REC_TRAFFIC_CLASS; + ret = ibv_sa_join_multicast(channel, verbs, 1, &mc_rec, + comp_mask, NULL, &mcast); + if (ret) { + printf("ibv_sa_join_multicast failed\n"); + goto out1; + } + + ret = ibv_sa_get_event(channel, &event); + if (ret) { + printf("ibv_sa_get_event failed\n"); + goto out2; + } + + if (event->status) { + printf("join failed: %d\n", event->status); + ret = event->status; + goto out3; + } + + rec = (struct ibv_sa_net_mcmember_rec *) event->attr; + ibv_sa_init_ah_from_mcmember(verbs, 1, rec, &ah_attr); + ah = ibv_create_ah(pd, &ah_attr); + if (!ah) { + printf("ibv_create_ah failed\n"); + ret = -1; + goto out3; + } + + ret = ibv_attach_mcast(qp, (union ibv_gid *) rec->mgid, + htons(rec->mlid)); + if (ret) { + printf("ibv_attach_mcast failed\n"); + goto out4; + } + + /* + * Pause to give SM chance to configure switches. We don't want to + * handle reliability issue in this simple test program. 
+ */ + sleep(2); + + if (sender) { + printf("initiating data transfers\n"); + ret = send_msgs(ah); + sleep(1); + } else { + printf("receiving data transfers\n"); + ret = post_recvs(); + if (!ret) + ret = poll_cq(); + } + printf("data transfers complete\n"); + + ibv_detach_mcast(qp, (union ibv_gid *) rec->mgid, htons(rec->mlid)); +out4: + ibv_destroy_ah(ah); +out3: + ibv_sa_ack_event(event); +out2: + ibv_sa_free_multicast(mcast); +out1: + ibv_sa_destroy_event_channel(channel); + return ret; +} + +int main(int argc, char **argv) +{ + int ret; + + if (argc != 2) { + printf("usage: %s {r | s}\n", argv[0]); + exit(1); + } + + sender = (argv[1][0] == 's'); + if (init()) + exit(1); + + ret = run(); + + printf("test complete\n"); + cleanup(); + printf("return status %d\n", ret); + return ret; +} Property changes on: libibsa/examples/mchammer.c ___________________________________________________________________ Name: svn:executable + * From zhushisongzhu at yahoo.com Mon Aug 21 20:42:58 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Mon, 21 Aug 2006 20:42:58 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060817113921.GH2630@mellanox.co.il> Message-ID: <20060822034258.34346.qmail@web36901.mail.mud.yahoo.com> Do you find the same problem? zhu __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From krkumar2 at in.ibm.com Mon Aug 21 22:14:56 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 22 Aug 2006 10:44:56 +0530 Subject: [openib-general] [PATCH v3 0/6] Tranport Neutral Verbs Proposal. In-Reply-To: Message-ID: Hi Roland, > Krishna> Hi Roland & Sean, What is your opinion on this patch set > Krishna> ? Anything else needs to be done for acceptance ? > > It's a very low priority for me, since it's a pain to merge and a pain > to maintain I understand. If you feel that I can reduce the pain by sending an up-todate patch set, let me know. thanks, - KK > I'll try to get to it after everything else I want for libibverbs 1.1 > is done (expose device type, memory windows, reregister memory region > at least) > > - R. From zhushisongzhu at yahoo.com Tue Aug 22 02:36:17 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Tue, 22 Aug 2006 02:36:17 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <44E482F0.6010203@mellanox.co.il> Message-ID: <20060822093617.24643.qmail@web36903.mail.mud.yahoo.com> Do you have any progress? zhu --- Tziporet Koren wrote: > > > zhu shi song wrote: > > I have changed SDP_RX_SIZE from 0x40 to 1 and > rebuilt > > ib_sdp.ko. But kernel always crashed. > > zhu > > > > Hi Zhu, > Can you send us instructions of the test/application > you are running so > we can try to reproduce it here too? > > We also need to know the system & kernel you are > using > > Thanks, > Tziporet > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From ogerlitz at voltaire.com Tue Aug 22 02:46:31 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 22 Aug 2006 12:46:31 +0300 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: References: <000001c6c47c$1ffe0dd0$35cd180a@amr.corp.intel.com> Message-ID: <44EAD277.6010802@voltaire.com> Roland Dreier wrote: > Sean> If we record a base offset, we can start at any random > Sean> number. 
We just need to always add/subtract the base when > Sean> getting a value from the IDR. > > Good point -- or better still, we could XOR in a random bit pattern. > That way we don't have to keep straight when to add and when to subtract. Cool, I would go for XOR-ing a random value with the **local id**. Sean, my understanding is that this can be narrowed down to doing it in: 1) cm_alloc_id() after calling idr_get_new_above() 2) cm_free_id() before calling idr_remove() 3) cm_get_id() before calling idr_find() and initializing the random value we XOR with in ib_cm_init(). What do you think? Or. From dotanb at mellanox.co.il Tue Aug 22 04:06:49 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 22 Aug 2006 14:06:49 +0300 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302B9A8F5@mtlexch01.mtl.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty > Sent: Friday, August 18, 2006 7:06 PM > To: Bub Thomas > Cc: openib-general at openib.org > Subject: Re: [openib-general] libibcm can't open /dev/infiniband/ucm0 > > > Bub Thomas wrote: > > It seems as if the problem I had there was not in my code but the > > libibcm not being able to open the device /dev/infiniband/ucm0. > > You will need to load ib_ucm, which exports the IB CM to userspace. This issue wasn't solved for me either ... ib_ucm was loaded, yet this device file wasn't created for me ... I tried to understand why we have this failure (distribution / arch / kernel / udev version), and the failure seems to be random (two machines with the same attributes: one of them has this file and on the other the file wasn't created ...). Dotan From mst at mellanox.co.il Tue Aug 22 04:09:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 22 Aug 2006 14:09:03 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060818072556.83097.qmail@web36909.mail.mud.yahoo.com> References: <20060818072556.83097.qmail@web36909.mail.mud.yahoo.com> Message-ID: <20060822110903.GF13782@mellanox.co.il> Quoting r. zhu shi song : > --- "Michael S. Tsirkin" wrote: > > > Quoting r. zhu shi song : > > > (3) one time linux kernel on the client crashed. I > > > copy the output from the screen. > > > Process sdp (pid:4059, threadinfo 0000010036384000 > > > task 000001003ea10030) > > > Call > > > > > Trace:{:ib_sdp:sdp_destroy_workto} > > > {:ib_sdp:sdp_destroy_qp+77} > > > > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > > > > {worker_thread+419}{default_wake_function+0} > > > > > > {default_wake_function+0}{keventd_create_kthread+0} > > > > > > {worker_thread+0}{keventd_create_kthread+0} > > > > > > {kthread+200}{child_rip+8} > > > > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 31 ff > > 45 > > > 31 ed 4c 89 > > > > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > > CR2:0000000000000004 > > > <0>kernel panic-not syncing:Oops > > > > > > zhu > > > > Hmm, the stack dump does not match my sources. Is > > this OFED rc1? > > Could you send me the sdp_main.o and sdp_main.c > > files from your system please? --- > Subject: Re: why sdp connections cost so much memory > > please see the attachment. > zhu Ugh, so it's crashing inside sdp_bcopy ... By the way, could you please re-test with OFED rc2? We've solved a couple of bugs there ...
If this still crashes, could you please post the whole sdp directory, with .o and .c files? Thanks, -- MST From bugzilla-daemon at openib.org Tue Aug 22 04:25:00 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 22 Aug 2006 04:25:00 -0700 (PDT) Subject: [openib-general] [Bug 203] New: Crash on shutdown, timer callback, build 459 Message-ID: <20060822112500.CA00F2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=203 Summary: Crash on shutdown, timer callback, build 459 Product: OpenFabrics Windows Version: unspecified Platform: X86 OS/Version: Other Status: NEW Severity: major Priority: P2 Component: Core AssignedTo: bugzilla at openib.org ReportedBy: jbottorff at xsigo.com While trying to debug some of the shutdown hangs I see, I configured a pair of 32-bit W2k3 sp1 systems back to back with no switch. I then ran opensm on one (free OS build, checked IB drivers), and had a script cycle reboots on the other (checked OS build, driver verifer, checked IB drivers). After just a few reboots, I had a very curious crash (which I had never seen before), which seemed to be repeatable every few reboots. The crash would occur in a timer callback when trying to dereference a garbage context value (it always had the value 0x1). I suspect what may be happening is some IB object that contains a cl_timer_t object is getting deallocated while the timer is still active. The memory containing the cl_timer_t is overwritten (reallocated?) and the context value for the timer callback dpc is destroyed. So now all the details: Here is the initial crash analysis: 0: kd> !analyze -v DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1) An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If kernel debugger is available get stack backtrace. Arguments: Arg1: 00000051, memory referenced Arg2: 00000002, IRQL Arg3: 00000001, value 0 = read operation, 1 = write operation Arg4: baa4b1c8, address which referenced memory Debugging Details: ------------------ OVERLAPPED_MODULE: Address regions for 'Fips' and 'imapi.sys' overlap WRITE_ADDRESS: 00000051 CURRENT_IRQL: 2 FAULTING_IP: ibbus!__timer_callback+8 [k:\windows-openib\src\winib-459\core\complib\kernel\cl_timer.c @ 48] baa4b1c8 c7405000000000 mov dword ptr [eax+0x50],0x0 DEFAULT_BUCKET_ID: DRIVER_FAULT BUGCHECK_STR: 0xD1 LAST_CONTROL_TRANSFER: from 8063717b to 8075cc0c STACK_TEXT: f78a2a5c 8063717b 00000003 00000000 0000000a nt!RtlpBreakWithStatusInstruction f78a2aa8 806380d8 00000003 00000051 baa4b1c8 nt!KiBugCheckDebugBreak+0x19 f78a2e40 8077f6ef 0000000a 00000051 00000002 nt!KeBugCheck2+0x5b2 f78a2e40 baa4b1c8 0000000a 00000051 00000002 nt!KiTrap0E+0x2af f78a2ed4 8064858a 88e76fc0 88e76e40 f1e309be ibbus!__timer_callback+0x8 [k:\windows-openib\src\winib-459\core\complib\kernel\cl_timer.c @ 48] f78a2f9c 80648a46 00000000 00000000 025ed741 nt!KiTimerExpiration+0x660 f78a2ff4 80780d8f f78da208 00000000 00000000 nt!KiRetireDpcList+0x62 f78a2ff8 f78da208 00000000 00000000 00000000 nt!KiDispatchInterrupt+0x3f WARNING: Frame IP not in any known module. Following frames may be wrong. 
80780d8f 00000000 0000000a bb837775 00000128 0xf78da208 STACK_COMMAND: .bugcheck ; kb FOLLOWUP_IP: ibbus!__timer_callback+8 [k:\windows-openib\src\winib-459\core\complib\kernel\cl_timer.c @ 48] baa4b1c8 c7405000000000 mov dword ptr [eax+0x50],0x0 FAULTING_SOURCE_CODE: 44: UNUSED_PARAM( p_dpc ); 45: UNUSED_PARAM( arg1 ); 46: UNUSED_PARAM( arg2 ); 47: > 48: p_timer->timeout_time = 0; 49: 50: (p_timer->pfn_callback)( (void*)p_timer->context ); 51: } 52: 53: SYMBOL_STACK_INDEX: 4 FOLLOWUP_NAME: MachineOwner SYMBOL_NAME: ibbus!__timer_callback+8 MODULE_NAME: ibbus IMAGE_NAME: ibbus.sys DEBUG_FLR_IMAGE_TIMESTAMP: 44e65c89 FAILURE_BUCKET_ID: 0xD1_W_VRF_ibbus!__timer_callback+8 BUCKET_ID: 0xD1_W_VRF_ibbus!__timer_callback+8 Followup: MachineOwner --------- The direct cause of the crash is that the local variable p_timer has a value of 0x1. A dump of the dpc object confirms the invalid context value: 0: kd> dt p_dpc Local var @ 0xf78a2edc Type _KDPC* 0x88e76fc0 +0x000 Type : 0x13 '' +0x001 Importance : 0x1 '' +0x002 Number : 0 '' +0x003 Expedite : 0 '' +0x004 DpcListEntry : _LIST_ENTRY [ 0x0 - 0x0 ] +0x00c DeferredRoutine : 0xbaa4b1c0 ibbus!__timer_callback+0 +0x010 DeferredContext : 0x00000001 +0x014 SystemArgument1 : (null) +0x018 SystemArgument2 : (null) +0x01c DpcData : (null) A little digging suggests the dpc object should be contained in whatever p_timer points at, so we get a pool dump of the dpc object, which should tell us the allocation it is contained in: 0: kd> !pool 0x88e76fc0 Pool page 88e76fc0 region is Special pool *88e76e38 size: 1c8 non-paged special pool, Tag is Ddk Pooltag Ddk : Default for driver allocated memory (user's of ntddk.h) Since we know the dpc address, and we know the offset of the dpc inside the cl_timer_t, we can calculate what p_timer should have been, and get a structured dump, which says: 0: kd> dt p_timer Local var @ 0xf78a2ee0 Type _cl_timer* 0x88e76f98 +0x000 timer : _KTIMER +0x028 dpc : _KDPC +0x048 pfn_callback : 0xba9dba60 ibbus!__recv_timer_cb+0 +0x04c context : 0x88e76e38 +0x050 timeout_time : 0x42cbcd5 Since we also know from !pool where the allocation started, and where the cl_timer_t is, we calculate its offset as 0x160 (+/- some pool header). I see the context value also now matches what !pool said was the allocation start. Offhand, an al_mad_svc_t looks like a potential candidate for the parent object, although I don't know if those really have a size of 0x1c8. This suggests that a timer was not canceled when a MAD object was destroyed. Or maybe the MAD was waiting for a reply when it was destroyed, and didn't get correctly cleaned up. I have a crash dump written to a file, and matching sources (459 rev) and symbols, if we need to dig some more. This crash also seems very reproducible at the moment. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From ogerlitz at voltaire.com Tue Aug 22 04:33:58 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 22 Aug 2006 14:33:58 +0300 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <000201c6c47f$555ee190$35cd180a@amr.corp.intel.com> References: <000201c6c47f$555ee190$35cd180a@amr.corp.intel.com> Message-ID: <44EAEBA6.3010405@voltaire.com> Sean Hefty wrote: > When a new REQ is received, we enter its timewait structure into two trees: one > sorted by remote ID, one sorted by remote QPN.
If the REQ is new, both would > succeed, and timewait_info would be NULL. Since timewait_info is not NULL, we > are dealing with a REQ that re-uses the same remote ID or same remote QPN. If > the new REQ has the same remote ID (get_cm_id() returns non-NULL), we treat it > as a duplicate, otherwise it's marked as stale. OK, thanks for clarifying this. Or. From ogerlitz at voltaire.com Tue Aug 22 04:39:20 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 22 Aug 2006 14:39:20 +0300 Subject: [openib-general] InfiniBand merge plans for 2.6.19 In-Reply-To: References: Message-ID: <44EAECE8.4070802@voltaire.com> Roland Dreier wrote: > Here's a short summary of what I plan to merge for 2.6.19. Some of > this is already in infiniband.git[1], while some still needs to be > merged up. Sean, What about pushing the char device to support user space CMA? I recall that you mentioned the API was not mature enough when the 2.6.18 feature merge window was open. Also, there were a few CM issues for which you were suggesting patches, specifically the comm_established thing and the randomization of the initial local ID. Or. From thomas.bub at thomson.net Tue Aug 22 05:52:44 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Tue, 22 Aug 2006 14:52:44 +0200 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 Message-ID: I'm not quite sure what you are talking about. Found the Wiki entry: https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3&highlight=udev As I'm not running the referenced kernel version, I'm not sure if I should only apply the udev part of the link. Or is there another way/description? Additionally, I didn't find the udev rules for the already existing /dev/infiniband devices in the /etc/udev/udev.rules file. Thanks Thomas -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Monday, August 21, 2006 6:48 PM To: Bub Thomas Cc: openib-general at openib.org Subject: Re: [openib-general] libibcm can't open /dev/infiniband/ucm0 Bub Thomas wrote: > Here is the list of all loaded ib modules and their dependencies: > > ib_rds 37656 0 > ib_ucm 21512 0 Did you update udev rules to create the device? - Sean From tziporet at mellanox.co.il Tue Aug 22 06:48:29 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 22 Aug 2006 16:48:29 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060822093617.24643.qmail@web36903.mail.mud.yahoo.com> References: <20060822093617.24643.qmail@web36903.mail.mud.yahoo.com> Message-ID: <44EB0B2D.4060906@mellanox.co.il> zhu shi song wrote: > Do you have any progress? > zhu > Please try 1.1-rc2 as Michael asked.
Tziporet From halr at voltaire.com Tue Aug 22 07:50:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Aug 2006 10:50:47 -0400 Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_sa_informinfo.c: In osm_infr_rcv_process_set_method, release lock on request with unknown LID Message-ID: <1156258245.7983.19887.camel@hal.voltaire.com> OpenSM/osm_sa_informinfo.c: In osm_infr_rcv_process_set_method, release lock on request with unknown LID Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 9057) +++ opensm/osm_sa_informinfo.c (working copy) @@ -387,6 +387,8 @@ osm_infr_rcv_process_set_method( &inform_info_rec.inform_record.subscriber_gid ); if ( res != IB_SUCCESS ) { + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_infr_rcv_process_set_method: ERR 4308 " "Subscribe Request from unknown LID: 0x%04X\n", From sean.hefty at intel.com Tue Aug 22 08:37:54 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 22 Aug 2006 08:37:54 -0700 Subject: [openib-general] InfiniBand merge plans for 2.6.19 In-Reply-To: <44EAECE8.4070802@voltaire.com> Message-ID: <000101c6c600$efad4ed0$b6cc180a@amr.corp.intel.com> >What about pushing the char device to support user space CMA, i recall >that you have mentioned the API was not mature enough when the 2.6.18 >feature merge window was open. I will look at doing this. I need to verify what functionality (RC, UD, multicast) of the kernel RDMA CM we want merged upstream for 2.6.19 and create a patch for exposing that to userspace. >Also, there were few CM issues you were suggesting patches, specifically >the comm_established thing and the randomization of the initial local id. I haven't forgotten about these, and will restart the comm_established thread. - Sean From sean.hefty at intel.com Tue Aug 22 08:47:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 22 Aug 2006 08:47:11 -0700 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 In-Reply-To: Message-ID: <000201c6c602$3b7cb2a0$b6cc180a@amr.corp.intel.com> >https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T >3&highlight=udev The udev information for this link looks correct. >Or is there another way/description? You can run mknod to manually create the file. (See the README file in the libibcm directory.) >Additionally I didn't find the udev rules for the already exsisting >/dev/infinibadn devices in the /etc/udev/udev.rules file. Is there any chance these were created manually at some point? What files are in the directory? (I would expect something like issm*, umad*, ucm*, uverbs*.) - Sean From sean.hefty at intel.com Tue Aug 22 08:52:02 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 22 Aug 2006 08:52:02 -0700 Subject: [openib-general] Question about QP's in timewait state and CM stale conn rejects In-Reply-To: <44EAD277.6010802@voltaire.com> Message-ID: <000301c6c602$e901ee90$b6cc180a@amr.corp.intel.com> >Cool, I would go for XOR-ing a random value with the **local id** . > >Sean, my understanding it can be narrowed for doing so in: > >1) cm_alloc_id() after calling idr_get_new_above() >2) cm_free_id() before calling idr_remove() >3) cm_get_id() before calling idr_find() > >and initializing the random value we XOR in ib_cm_init() > >What do you think? I like this approach as well. I need to see what else I have in my queue first, but will work on a patch, since it seems straightforward. 
- Sean From robert.j.woodruff at intel.com Tue Aug 22 10:22:48 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 22 Aug 2006 10:22:48 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready Message-ID: Tziporet wrote, >OFED 1.1-RC2 is avilable on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ >File: OFED-1.1-rc2.tgz >Please report any issues in bugzilla http://openib.org/bugzilla/ I tried to install this on my RedHat EL4 - update 4 system and got the following error. DRPM/BUILD/openib-1.1/drivers/infiniband/core/.tmp_addr.o /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/addr.c In file included from include/linux/slab.h:15, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/slab.h:4, from include/linux/percpu.h:4, from include/linux/sched.h:31, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/sched.h:4, from include/linux/mm.h:4, from include/linux/skbuff.h:26, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/skbuff.h:4, from include/linux/if_ether.h:107, from include/linux/netdevice.h:29, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/netdevice.h:4, from /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/addr.c:31: include/linux/gfp.h:133: error: redefinition of typedef 'gfp_t' /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/types.h:7: error: previous declaration of 'gfp_t' was here In file included from include/linux/percpu.h:4, from include/linux/sched.h:31, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/sched.h:4, from include/linux/mm.h:4, from include/linux/skbuff.h:26, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/skbuff.h:4, from include/linux/if_ether.h:107, from include/linux/netdevice.h:29, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/netdevice.h:4, from /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/addr.c:31: /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/sched.h:8: warning: static declaration of 'wait_for_completion_timeout' follows non-static declaration include/linux/completion.h:32: warning: previous declaration of 'wait_for_completion_timeout' was here make[3]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/addr.o] Error 1 make[2]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core] Error 2 make[1]: *** [_module_/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband] Error 2 make[1]: Leaving directory `/usr/src/kernels/2.6.9-42.EL-smp-x86_64' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.46002 (%install) RPM build errors: user vlad does not exist - using root group mtl does not exist - using root user vlad does not exist - using root group mtl does not exist - using root Bad exit status from /var/tmp/rpm-tmp.46002 (%install) From sean.hefty at intel.com Tue Aug 22 10:30:00 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 22 Aug 2006 10:30:00 -0700 Subject: [openib-general] [PATCH] ib_cm: randomize starting local comm IDs Message-ID: <000001c6c610$988e0b20$c7cc180a@amr.corp.intel.com> Randomize the starting local comm ID to avoid getting a rejected connection due to a stale connection after a system reboot or reloading of the ib_cm. Signed-off-by: Sean Hefty --- Index: cm.c =================================================================== --- cm.c (revision 8928) +++ cm.c (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004-2006 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. 
All rights reserved. * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. @@ -41,6 +41,7 @@ #include #include #include +#include #include #include #include @@ -73,6 +74,7 @@ static struct ib_cm { struct rb_root remote_id_table; struct rb_root remote_sidr_table; struct idr local_id_table; + __be32 random_id_operand; struct workqueue_struct *wq; } cm; @@ -300,15 +302,17 @@ static int cm_init_av_by_path(struct ib_ static int cm_alloc_id(struct cm_id_private *cm_id_priv) { unsigned long flags; - int ret; + int ret, id; static int next_id; do { spin_lock_irqsave(&cm.lock, flags); - ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, next_id++, - (__force int *) &cm_id_priv->id.local_id); + ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, + next_id++, &id); spin_unlock_irqrestore(&cm.lock, flags); } while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) ); + + cm_id_priv->id.local_id = (__force __be32) (id ^ cm.random_id_operand); return ret; } @@ -317,7 +321,8 @@ static void cm_free_id(__be32 local_id) unsigned long flags; spin_lock_irqsave(&cm.lock, flags); - idr_remove(&cm.local_id_table, (__force int) local_id); + idr_remove(&cm.local_id_table, + (__force int) (local_id ^ cm.random_id_operand)); spin_unlock_irqrestore(&cm.lock, flags); } @@ -325,7 +330,8 @@ static struct cm_id_private * cm_get_id( { struct cm_id_private *cm_id_priv; - cm_id_priv = idr_find(&cm.local_id_table, (__force int) local_id); + cm_id_priv = idr_find(&cm.local_id_table, + (__force int) (local_id ^ cm.random_id_operand)); if (cm_id_priv) { if (cm_id_priv->id.remote_id == remote_id) atomic_inc(&cm_id_priv->refcount); @@ -2083,8 +2089,9 @@ static struct cm_id_private * cm_acquire spin_unlock_irqrestore(&cm.lock, flags); return NULL; } - cm_id_priv = idr_find(&cm.local_id_table, - (__force int) timewait_info->work.local_id); + cm_id_priv = idr_find(&cm.local_id_table, (__force int) + (timewait_info->work.local_id ^ + cm.random_id_operand)); if (cm_id_priv) { if (cm_id_priv->id.remote_id == remote_id) atomic_inc(&cm_id_priv->refcount); @@ -3391,6 +3398,7 @@ static int __init ib_cm_init(void) cm.remote_qp_table = RB_ROOT; cm.remote_sidr_table = RB_ROOT; idr_init(&cm.local_id_table); + get_random_bytes(&cm.random_id_operand, sizeof cm.random_id_operand); idr_pre_get(&cm.local_id_table, GFP_KERNEL); cm.wq = create_workqueue("ib_cm"); From halr at voltaire.com Tue Aug 22 10:35:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Aug 2006 13:35:58 -0400 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: References: Message-ID: <1156268156.17858.801.camel@hal.voltaire.com> Hi Yevgeny, On Tue, 2006-08-22 at 11:41, Yevgeny Kliteynik wrote: > Hi Hal > > This patch implements first item of the OSM todo list. Thanks! Am I correct in assuming this is both for trunk and 1.1 ? > OpenSM opens a thread that is listening for events on the SM's port. > The events that are being taken care of are IBV_EVENT_DEVICE_FATAL and > IBV_EVENT_PORT_ERROR. > > In case of IBV_EVENT_DEVICE_FATAL, osm is forced to exit. > in case of IBV_EVENT_PORT_ERROR, osm initiates heavy sweep. Some minor comments below. Let me know what you think. You don't have to resubmit for these. 
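For context on the mechanism under review: the patch quoted below is built around the standard libibverbs async-event loop. A minimal, self-contained sketch of that pattern (not OpenSM code; the function name and the sm_port parameter are illustrative) looks like this:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Minimal sketch of a libibverbs async-event loop (illustrative only). */
static void watch_sm_port(struct ibv_context *ctx, int sm_port)
{
	struct ibv_async_event event;

	while (1) {
		/* Blocks until the driver reports an asynchronous event */
		if (ibv_get_async_event(ctx, &event))
			break;

		if (event.event_type == IBV_EVENT_DEVICE_FATAL) {
			/* Unrecoverable HCA error - ack and bail out */
			ibv_ack_async_event(&event);
			break;
		}

		if (event.event_type == IBV_EVENT_PORT_ERR &&
		    event.element.port_num == sm_port)
			printf("port error on SM port - request a heavy sweep\n");

		/* Every returned event must be acknowledged */
		ibv_ack_async_event(&event);
	}
}

The patch wires this same loop into the osm_vendor object as a separate thread and forwards the two events of interest (IBV_EVENT_DEVICE_FATAL and IBV_EVENT_PORT_ERR) to the SM through a registered callback.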
> Yevgeny > > Signed-off-by: Yevgeny Kliteynik > > Index: include/opensm/osm_sm_mad_ctrl.h > =================================================================== > -- include/opensm/osm_sm_mad_ctrl.h (revision 8998) > +++ include/opensm/osm_sm_mad_ctrl.h (working copy) > @@ -109,6 +109,7 @@ typedef struct _osm_sm_mad_ctrl > osm_mad_pool_t *p_mad_pool; > osm_vl15_t *p_vl15; > osm_vendor_t *p_vendor; > + struct _osm_state_mgr *p_state_mgr; > osm_bind_handle_t h_bind; > cl_plock_t *p_lock; > cl_dispatcher_t *p_disp; > @@ -130,6 +131,9 @@ typedef struct _osm_sm_mad_ctrl > * p_vendor > * Pointer to the vendor specific interfaces object. > * > +* p_state_mgr > +* Pointer to the state manager object. > +* > * h_bind > * Bind handle returned by the transport layer. > * > @@ -233,6 +237,7 @@ osm_sm_mad_ctrl_init( > IN osm_mad_pool_t* const p_mad_pool, > IN osm_vl15_t* const p_vl15, > IN osm_vendor_t* const p_vendor, > + IN struct _osm_state_mgr* const p_state_mgr, > IN osm_log_t* const p_log, > IN osm_stats_t* const p_stats, > IN cl_plock_t* const p_lock, > @@ -251,6 +256,9 @@ osm_sm_mad_ctrl_init( > * p_vendor > * [in] Pointer to the vendor specific interfaces object. > * > +* p_state_mgr > +* [in] Pointer to the state manager object. > +* > * p_log > * [in] Pointer to the log object. > * > Index: include/vendor/osm_vendor_ibumad.h > =================================================================== > -- include/vendor/osm_vendor_ibumad.h (revision 8998) > +++ include/vendor/osm_vendor_ibumad.h (working copy) > @@ -74,6 +74,8 @@ BEGIN_C_DECLS > #define OSM_UMAD_MAX_CAS 32 > #define OSM_UMAD_MAX_PORTS_PER_CA 2 > > +#define OSM_VENDOR_SUPPORT_EVENTS > + I prefer this as an additional flag turned on in the build for OpenIB. > /* OpenIB gen2 doesn't support RMPP yet */ > > /****s* OpenSM: Vendor UMAD/osm_ca_info_t > @@ -179,6 +181,10 @@ typedef struct _osm_vendor > int umad_port_id; > void *receiver; > int issmfd; > + cl_thread_t events_thread; > + void * events_callback; > + void * sm_context; > + struct ibv_context * ibv_context; > } osm_vendor_t; > > #define OSM_BIND_INVALID_HANDLE 0 > Index: include/vendor/osm_vendor_api.h > =================================================================== > -- include/vendor/osm_vendor_api.h (revision 8998) > +++ include/vendor/osm_vendor_api.h (working copy) > @@ -526,6 +526,110 @@ osm_vendor_set_debug( > * SEE ALSO > *********/ > > +#ifdef OSM_VENDOR_SUPPORT_EVENTS > + > +#define OSM_EVENT_FATAL 1 > +#define OSM_EVENT_PORT_ERR 2 > + > +/****s* OpenSM Vendor API/osm_vend_events_callback_t > +* NAME > +* osm_vend_events_callback_t > +* > +* DESCRIPTION > +* Function prototype for the vendor events callback. > +* The vendor layer calls this function on driver events. > +* > +* SYNOPSIS > +*/ > +typedef void > +(*osm_vend_events_callback_t)( > + IN int events_mask, > + IN void * const context ); > +/* > +* PARAMETERS > +* events_mask > +* [in] The received event(s). > +* > +* context > +* [in] Context supplied as the "sm_context" argument in > +* the osm_vendor_unreg_events_cb call > +* > +* RETURN VALUES > +* None. 
> +* > +* NOTES > +* > +* SEE ALSO > +* osm_vendor_reg_events_cb osm_vendor_unreg_events_cb > +*********/ > + > +/****f* OpenSM Vendor API/osm_vendor_reg_events_cb > +* NAME > +* osm_vendor_reg_events_cb > +* > +* DESCRIPTION > +* Registers the events callback function and start the events > +* thread > +* > +* SYNOPSIS > +*/ > +int > +osm_vendor_reg_events_cb( > + IN osm_vendor_t * const p_vend, > + IN void * const sm_callback, > + IN void * const sm_context); > +/* > +* PARAMETERS > +* p_vend > +* [in] vendor handle. > +* > +* sm_callback > +* [in] Callback function that should be called when > +* the event is received. > +* > +* sm_context > +* [in] Context supplied as the "context" argument in > +* the subsequenct calls to the sm_callback function > +* > +* RETURN VALUE > +* IB_SUCCESS if OK. > +* > +* NOTES > +* > +* SEE ALSO > +* osm_vend_events_callback_t osm_vendor_unreg_events_cb > +*********/ > + > +/****f* OpenSM Vendor API/osm_vendor_unreg_events_cb > +* NAME > +* osm_vendor_unreg_events_cb > +* > +* DESCRIPTION > +* Un-Registers the events callback function and stops the events > +* thread > +* > +* SYNOPSIS > +*/ > +void > +osm_vendor_unreg_events_cb( > + IN osm_vendor_t * const p_vend); > +/* > +* PARAMETERS > +* p_vend > +* [in] vendor handle. > +* > +* > +* RETURN VALUE > +* None. > +* > +* NOTES > +* > +* SEE ALSO > +* osm_vend_events_callback_t osm_vendor_reg_events_cb > +*********/ > + > +#endif /* OSM_VENDOR_SUPPORT_EVENTS */ > + > END_C_DECLS > > #endif /* _OSM_VENDOR_API_H_ */ > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > -- libvendor/osm_vendor_ibumad.c (revision 8998) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -72,6 +72,7 @@ > #include > #include > #include > +#include > > /****s* OpenSM: Vendor AL/osm_umad_bind_info_t > * NAME > @@ -441,6 +442,91 @@ Exit: > > /********************************************************************** > **********************************************************************/ > +static void > +umad_events_thread( > + IN void * vend_context) > +{ > + int res = 0; > + osm_vendor_t * p_vend = (osm_vendor_t *) vend_context; > + struct ibv_async_event event; > + > + OSM_LOG_ENTER( p_vend->p_log, umad_events_thread ); > + > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Device %s, async event FD: %d\n", > + p_vend->umad_port.ca_name, p_vend->ibv_context->async_fd); > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Listening for events on device %s, port %d\n", > + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); > + > + while (1) { > + > + res = ibv_get_async_event(p_vend->ibv_context, &event); > + if (res) > + { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "umad_events_thread: ERR 5450: " > + "Failed getting async event (device %s, port %d)\n", > + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); > + goto Exit; > + } > + > + if (!p_vend->events_callback) > + { > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Events callback has been unregistered\n"); > + ibv_ack_async_event(&event); > + goto Exit; > + } > + /* > + * We're listening to events on the SM's port only > + */ > + if ( event.element.port_num == p_vend->umad_port.portnum ) > + { > + switch (event.event_type) > + { > + case IBV_EVENT_DEVICE_FATAL: > + osm_log(p_vend->p_log, OSM_LOG_INFO, > + "umad_events_thread: Received IBV_EVENT_DEVICE_FATAL\n"); > + ((osm_vend_events_callback_t) > + (p_vend->events_callback))(OSM_EVENT_FATAL, 
p_vend->sm_context); > + > + ibv_ack_async_event(&event); > + goto Exit; > + break; > + > + case IBV_EVENT_PORT_ERR: > + osm_log(p_vend->p_log, OSM_LOG_VERBOSE, > + "umad_events_thread: Received IBV_EVENT_PORT_ERR\n"); > + ((osm_vend_events_callback_t) > + (p_vend->events_callback))(OSM_EVENT_PORT_ERR, p_vend->sm_context); > + break; > + > + default: > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Received event #%d on port %d - Ignoring\n", > + event.event_type, event.element.port_num); > + } > + } > + else > + { > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Received event #%d on port %d - Ignoring\n", > + event.event_type, event.element.port_num); > + } > + > + ibv_ack_async_event(&event); > + } > + > + Exit: > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Terminating thread\n"); > + OSM_LOG_EXIT(p_vend->p_log); > + return; > +} > + > +/********************************************************************** > + **********************************************************************/ > ib_api_status_t > osm_vendor_init( > IN osm_vendor_t* const p_vend, > @@ -456,6 +542,7 @@ osm_vendor_init( > p_vend->max_retries = OSM_DEFAULT_RETRY_COUNT; > cl_spinlock_construct( &p_vend->cb_lock ); > cl_spinlock_construct( &p_vend->match_tbl_lock ); > + cl_thread_construct( &p_vend->events_thread ); > p_vend->umad_port_id = -1; > p_vend->issmfd = -1; > > @@ -1217,4 +1304,114 @@ osm_vendor_set_debug( > umad_debug(level); > } > > +/********************************************************************** > + **********************************************************************/ > +int > +osm_vendor_reg_events_cb( > + IN osm_vendor_t * const p_vend, > + IN void * const sm_callback, > + IN void * const sm_context) > +{ > + ib_api_status_t status = IB_SUCCESS; > + struct ibv_device ** dev_list; > + struct ibv_device * device; > + > + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_reg_events_cb ); > + > + p_vend->events_callback = sm_callback; > + p_vend->sm_context = sm_context; > + > + dev_list = ibv_get_device_list(NULL); > + if (!dev_list || !(*dev_list)) { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5440: " > + "No IB devices found\n"); > + status = IB_ERROR; > + goto Exit; > + } > + > + if (!p_vend->umad_port.ca_name || !p_vend->umad_port.ca_name[0]) > + { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5441: " > + "Vendor initialization is not completed yet\n"); > + status = IB_ERROR; > + goto Exit; > + } > + > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "osm_vendor_reg_events_cb: Registering on device %s\n", > + p_vend->umad_port.ca_name); > + > + /* > + * find device whos name matches the SM's device > + */ > + for ( device = *dev_list; > + (device != NULL) && > + (strcmp(p_vend->umad_port.ca_name, ibv_get_device_name(device)) != 0); > + device += sizeof(struct ibv_device *) ) > + ; > + if (!device) > + { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5442: " > + "Device %s hasn't been found in the device list\n" > + ,p_vend->umad_port.ca_name); > + status = IB_ERROR; > + goto Exit; > + } > + > + p_vend->ibv_context = ibv_open_device(device); > + if (!p_vend->ibv_context) { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5443: " > + "Couldn't get context for %s\n", > + p_vend->umad_port.ca_name); > + status = IB_ERROR; > + goto Exit; > + } > + > + /* > + * Initiate the events thread > + */ > + if 
(cl_thread_init(&p_vend->events_thread, > + umad_events_thread, > + p_vend, > + "ibumad events thread") != CL_SUCCESS) { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5444: " > + "Failed initiating event listening thread\n"); > + status = IB_ERROR; > + goto Exit; > + } > + > + Exit: > + if (status != IB_SUCCESS) > + { > + p_vend->events_callback = NULL; > + p_vend->sm_context = NULL; > + p_vend->ibv_context = NULL; > + p_vend->events_callback = NULL; > + } > + OSM_LOG_EXIT( p_vend->p_log ); > + return status; > +} > + > +/********************************************************************** > + **********************************************************************/ > +void > +osm_vendor_unreg_events_cb( > + IN osm_vendor_t * const p_vend) > +{ > + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_unreg_events_cb ); > + p_vend->events_callback = NULL; > + p_vend->sm_context = NULL; > + p_vend->ibv_context = NULL; > + p_vend->events_callback = NULL; > + OSM_LOG_EXIT( p_vend->p_log ); > +} > + > +/********************************************************************** > + **********************************************************************/ > + > #endif /* OSM_VENDOR_INTF_OPENIB */ > Index: libvendor/libosmvendor.map > =================================================================== > -- libvendor/libosmvendor.map (revision 8998) > +++ libvendor/libosmvendor.map (working copy) > @@ -1,4 +1,4 @@ > -OSMVENDOR_2.0 { > +OSMVENDOR_2.1 { > global: > umad_receiver; > osm_vendor_init; > @@ -23,5 +23,7 @@ OSMVENDOR_2.0 { > osmv_bind_sa; > osmv_query_sa; > osm_vendor_get_guid_ca_and_port; > + osm_vendor_reg_events_cb; > + osm_vendor_unreg_events_cb; > local: *; > }; > Index: opensm/osm_sm.c > =================================================================== > -- opensm/osm_sm.c (revision 8998) > +++ opensm/osm_sm.c (working copy) > @@ -313,6 +313,7 @@ osm_sm_init( > p_sm->p_mad_pool, > p_sm->p_vl15, > p_sm->p_vendor, > + &p_sm->state_mgr, > p_log, p_stats, p_lock, p_disp ); > if( status != IB_SUCCESS ) > goto Exit; > Index: opensm/osm_sm_mad_ctrl.c > =================================================================== > -- opensm/osm_sm_mad_ctrl.c (revision 8998) > +++ opensm/osm_sm_mad_ctrl.c (working copy) > @@ -59,6 +59,7 @@ > #include > #include > #include > +#include > > /****f* opensm: SM/__osm_sm_mad_ctrl_retire_trans_mad > * NAME > @@ -953,6 +954,7 @@ osm_sm_mad_ctrl_init( > IN osm_mad_pool_t* const p_mad_pool, > IN osm_vl15_t* const p_vl15, > IN osm_vendor_t* const p_vendor, > + IN struct _osm_state_mgr* const p_state_mgr, > IN osm_log_t* const p_log, > IN osm_stats_t* const p_stats, > IN cl_plock_t* const p_lock, > @@ -969,6 +971,7 @@ osm_sm_mad_ctrl_init( > p_ctrl->p_disp = p_disp; > p_ctrl->p_mad_pool = p_mad_pool; > p_ctrl->p_vendor = p_vendor; > + p_ctrl->p_state_mgr = p_state_mgr; > p_ctrl->p_stats = p_stats; > p_ctrl->p_lock = p_lock; > p_ctrl->p_vl15 = p_vl15; > @@ -995,6 +998,47 @@ osm_sm_mad_ctrl_init( > > /********************************************************************** > **********************************************************************/ > +void > +__osm_vend_events_callback( > + IN int events_mask, > + IN void * const context ) Shouldn't this be conditionalized on OSM_VENDOR_SUPPORT_EVENTS ? 
> +{ > + osm_sm_mad_ctrl_t * const p_ctrl = (osm_sm_mad_ctrl_t * const) context; > + > + OSM_LOG_ENTER(p_ctrl->p_log, __osm_vend_events_callback); > + > + if (events_mask & OSM_EVENT_FATAL) > + { > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > + "__osm_vend_events_callback: " > + "Events callback got OSM_EVENT_FATAL\n"); > + osm_log(p_ctrl->p_log, OSM_LOG_SYS, > + "Fatal HCA error - forcing OpenSM exit\n"); > + osm_exit_flag = 1; > + OSM_LOG_EXIT(p_ctrl->p_log); > + return; > + } > + > + if (events_mask & OSM_EVENT_PORT_ERR) > + { > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > + "__osm_vend_events_callback: " > + "Events callback got OSM_EVENT_PORT_ERR - forcing heavy sweep\n"); > + p_ctrl->p_subn->force_immediate_heavy_sweep = TRUE; > + osm_state_mgr_process((osm_state_mgr_t * const)p_ctrl->p_state_mgr, > + OSM_SIGNAL_SWEEP); > + OSM_LOG_EXIT(p_ctrl->p_log); > + return; > + } > + > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > + "__osm_vend_events_callback: " > + "Events callback got event mask of %d - No action taken\n"); > + OSM_LOG_EXIT(p_ctrl->p_log); > +} > + > +/********************************************************************** > + **********************************************************************/ > ib_api_status_t > osm_sm_mad_ctrl_bind( > IN osm_sm_mad_ctrl_t* const p_ctrl, > @@ -1044,6 +1088,17 @@ osm_sm_mad_ctrl_bind( > goto Exit; > } > > + if ( osm_vendor_reg_events_cb(p_ctrl->p_vendor, > + __osm_vend_events_callback, > + p_ctrl) ) > + { > + status = IB_ERROR; > + osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > + "osm_sm_mad_ctrl_bind: ERR 3120: " > + "Vendor failed to register for events\n" ); > + goto Exit; > + } > + This should be conditionalized on OSM_VENDOR_SUPPORT_EVENTS. > Exit: > OSM_LOG_EXIT( p_ctrl->p_log ); > return( status ); > Index: config/osmvsel.m4 > =================================================================== > -- config/osmvsel.m4 (revision 8998) > +++ config/osmvsel.m4 (working copy) > @@ -63,9 +63,9 @@ if test $with_osmv = "openib"; then > OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" > OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" > if test "x$with_umad_libs" = "x"; then > - OSMV_LDADD="-libumad" > + OSMV_LDADD="-libumad -libverbs" > else > - OSMV_LDADD="-L$with_umad_libs -libumad" > + OSMV_LDADD="-L$with_umad_libs -libumad -libverbs" > fi > > if test "x$with_umad_includes" != "x"; then > @@ -137,6 +137,8 @@ if test "$disable_libcheck" != "yes"; th > LDFLAGS="$LDFLAGS $OSMV_LDADD" > AC_CHECK_LIB(ibumad, umad_init, [], > AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibumad.])) > + AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], > + AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibverbs.])) Cut and paste error: Error message should indicate ibv_get_device_list rather than umad_init. > LD_FLAGS=$osmv_save_ldflags > elif test $with_osmv = "sim" ; then > LDFLAGS="$LDFLAGS -L$with_sim/lib" -- Hal From mshefty at ichips.intel.com Tue Aug 22 10:55:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 22 Aug 2006 10:55:37 -0700 Subject: [openib-general] [libibcm] does the libibcm support multithreaded applications? 
In-Reply-To: <200608080911.34796.dotanb@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302A3C659@mtlexch01.mtl.com> <200608061650.52846.dotanb@mellanox.co.il> <44D77A58.9020904@ichips.intel.com> <200608080911.34796.dotanb@mellanox.co.il> Message-ID: <44EB4519.1090306@ichips.intel.com> Dotan Barak wrote: >>I understand what the problem is, and I think you're right. If >>ib_cm_get_device() returned a new ib_cm_device, you could more easily control >>event processing. I will fix this up when I remove the dependency on libsysfs >>from the libibcm. I am probably at least 2 weeks away from starting on this though. > > When you have the new code, it will have at least one customer ... FYI - I am working on this now. - Sean From mst at mellanox.co.il Tue Aug 22 11:23:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 22 Aug 2006 21:23:59 +0300 Subject: [openib-general] InfiniBand merge plans for 2.6.19 In-Reply-To: <000101c6c600$efad4ed0$b6cc180a@amr.corp.intel.com> References: <000101c6c600$efad4ed0$b6cc180a@amr.corp.intel.com> Message-ID: <20060822182359.GA8889@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: InfiniBand merge plans for 2.6.19 > > >What about pushing the char device to support user space CMA? I recall > >that you mentioned the API was not mature enough when the 2.6.18 > >feature merge window was open. > > I will look at doing this. I need to verify what functionality (RC, UD, > multicast) of the kernel RDMA CM we want merged upstream for 2.6.19 and create a > patch for exposing that to userspace. If you like the version that only supports RC, you can take that directly from the ofed tarball. AFAIK the ofed version is enough for uDAPL. HTH, -- MST From tziporet at mellanox.co.il Tue Aug 22 12:10:47 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 22 Aug 2006 22:10:47 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7767@mtlexch01.mtl.com> Hi Woody, I wrote in the release mail that RHEL4 up4 is not working: Limitations and known issues: ============================= 1. SDP: For Mellanox Sinai HCAs one must use the latest FW version (1.1.000). 2. SDP: Get peer name is not working properly 3. SDP: Scalability issue when many connections are opened 4. ipath driver does not compile on SLES9 SP3 >>> 5. RHEL4 up4 is not supported due to problems in the backport patches.
DRPM/BUILD/openib-1.1/drivers/infiniband/core/.tmp_addr.o /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/addr.c In file included from include/linux/slab.h:15, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/slab.h:4, from include/linux/percpu.h:4, from include/linux/sched.h:31, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/sched.h:4, from include/linux/mm.h:4, from include/linux/skbuff.h:26, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/skbuff.h:4, from include/linux/if_ether.h:107, from include/linux/netdevice.h:29, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/netdevice.h:4, from /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/addr.c:31: include/linux/gfp.h:133: error: redefinition of typedef 'gfp_t' /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/types.h:7: error: previous declaration of 'gfp_t' was here In file included from include/linux/percpu.h:4, from include/linux/sched.h:31, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/sched.h:4, from include/linux/mm.h:4, from include/linux/skbuff.h:26, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/skbuff.h:4, from include/linux/if_ether.h:107, from include/linux/netdevice.h:29, from /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/netdevice.h:4, from /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/addr.c:31: /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/sched.h:8: warning: static declaration of 'wait_for_completion_timeout' follows non-static declaration include/linux/completion.h:32: warning: previous declaration of 'wait_for_completion_timeout' was here make[3]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/addr.o] Error 1 make[2]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core] Error 2 make[1]: *** [_module_/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband] Error 2 make[1]: Leaving directory `/usr/src/kernels/2.6.9-42.EL-smp-x86_64' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.46002 (%install) RPM build errors: user vlad does not exist - using root group mtl does not exist - using root user vlad does not exist - using root group mtl does not exist - using root Bad exit status from /var/tmp/rpm-tmp.46002 (%install) From mst at mellanox.co.il Tue Aug 22 12:45:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 22 Aug 2006 22:45:06 +0300 Subject: [openib-general] [PATCH] IB/mthca: update latest firmware revisions Message-ID: <20060822194505.GA9153@mellanox.co.il> Please consider the following for 2.6.18 - hopefully this will reduce the number of support requests from people with old Sinai firmware. --- Make sure people running Sinai firmware older than 1.1.0 get a message suggesting firmware upgrade. Do this for Arbel as well while we are at it. Signed-off-by: Michael S. 
Tsirkin diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 557cde3..7b82c19 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -967,12 +967,12 @@ static struct { } mthca_hca_table[] = { [TAVOR] = { .latest_fw = MTHCA_FW_VER(3, 4, 0), .flags = 0 }, - [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 400), + [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 600), .flags = MTHCA_FLAG_PCIE }, - [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), + [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 400), .flags = MTHCA_FLAG_MEMFREE | MTHCA_FLAG_PCIE }, - [SINAI] = { .latest_fw = MTHCA_FW_VER(1, 0, 800), + [SINAI] = { .latest_fw = MTHCA_FW_VER(1, 1, 0), .flags = MTHCA_FLAG_MEMFREE | MTHCA_FLAG_PCIE | MTHCA_FLAG_SINAI_OPT } -- MST From robert.j.woodruff at intel.com Tue Aug 22 13:09:40 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 22 Aug 2006 13:09:40 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready Message-ID: Tziporet wrote, >Hi Woody, >I wrote in the release mail RHEL4 up4 does not working: >Limitations and known issues: Not a problem, I will test on RHEL4 - U3 for now. woody From sean.hefty at intel.com Tue Aug 22 13:56:41 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 22 Aug 2006 13:56:41 -0700 Subject: [openib-general] [PATCH] libibcm: modify API to support multi-threaded event processing In-Reply-To: <44EB4519.1090306@ichips.intel.com> Message-ID: <000101c6c62d$78431cd0$c7cc180a@amr.corp.intel.com> Modify the libibcm API to provide better support for multi-threaded event processing. CM devices are no longer tied to verb devices and hidden from the user. This should allow an application to direct events to specific threads for processing. This patch also removes the libibcm's dependency on libsysfs. The changes do not break the kernel ABI, but do break the library's API in such a way that requires (hopefully minor) changes to all existing users. Signed-off-by: Sean Hefty --- Index: include/infiniband/cm.h =================================================================== --- include/infiniband/cm.h (revision 8215) +++ include/infiniband/cm.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004-2006 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004 Voltaire Corporation. All rights reserved. * @@ -78,13 +78,12 @@ enum ib_cm_data_size { }; struct ib_cm_device { - uint64_t guid; - int fd; + struct ibv_context *device_context; + int fd; }; struct ib_cm_id { void *context; - struct ibv_context *device_context; struct ib_cm_device *device; uint32_t handle; }; @@ -270,8 +269,8 @@ int ib_cm_get_event(struct ib_cm_device int ib_cm_ack_event(struct ib_cm_event *event); /** - * ib_cm_get_device - Returns the device the CM uses to submit requests - * and retrieve events that corresponds to the specified verbs device. + * ib_cm_open_device - Returns the device the CM uses to submit requests + * and retrieve events, corresponding to the specified verbs device. * * The CM device contains the file descriptor that the CM uses to * communicate with the kernel CM component. The primary use of the @@ -282,7 +281,13 @@ int ib_cm_ack_event(struct ib_cm_event * * descriptor, it will likely result in an error or unexpected * results. 
*/ -struct ib_cm_device* ib_cm_get_device(struct ibv_context *device_context); +struct ib_cm_device* ib_cm_open_device(struct ibv_context *device_context); + +/** + * ib_cm_close_device - Close a CM device. + * @device: Device to close. + */ +void ib_cm_close_device(struct ib_cm_device *device); /** * ib_cm_create_id - Allocate a communication identifier. @@ -290,7 +295,7 @@ struct ib_cm_device* ib_cm_get_device(st * Communication identifiers are used to track connection states, service * ID resolution requests, and listen requests. */ -int ib_cm_create_id(struct ibv_context *device_context, +int ib_cm_create_id(struct ib_cm_device *device, struct ib_cm_id **cm_id, void *context); /** Index: src/cm.c =================================================================== --- src/cm.c (revision 8215) +++ src/cm.c (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -39,32 +39,21 @@ #include #include -#include #include #include #include -#include -#include #include #include -#include -#include - -#include #include #include +#include #include #define PFX "libibcm: " -#if __BYTE_ORDER == __LITTLE_ENDIAN -static inline uint64_t htonll(uint64_t x) { return bswap_64(x); } -static inline uint64_t ntohll(uint64_t x) { return bswap_64(x); } -#else -static inline uint64_t htonll(uint64_t x) { return x; } -static inline uint64_t ntohll(uint64_t x) { return x; } -#endif +static int abi_ver; +static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER; #define CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, type, size) \ do { \ @@ -109,177 +98,79 @@ struct cm_id_private { pthread_mutex_t mut; }; -static struct dlist *device_list; - #define container_of(ptr, type, field) \ ((type *) ((void *)ptr - offsetof(type, field))) static int check_abi_version(void) { - char path[256]; - struct sysfs_attribute *attr; - int abi_ver; - int ret = -1; - - if (sysfs_get_mnt_path(path, sizeof path)) { - fprintf(stderr, PFX "couldn't find sysfs mount.\n"); - return -1; - } - - strncat(path, "/class/infiniband_cm/abi_version", sizeof path); + char value[8]; - attr = sysfs_open_attribute(path); - if (!attr) { - fprintf(stderr, PFX "couldn't open ucm ABI version.\n"); - return -1; - } - - if (sysfs_read_attribute(attr)) { - fprintf(stderr, PFX "couldn't read ucm ABI version.\n"); - goto out; + if (ibv_read_sysfs_file(ibv_get_sysfs_path(), + "class/infiniband_cm/abi_version", + value, sizeof value) < 0) { + fprintf(stderr, PFX "couldn't read ABI version\n"); + return 0; } - abi_ver = strtol(attr->value, NULL, 10); + abi_ver = strtol(value, NULL, 10); if (abi_ver < IB_USER_CM_MIN_ABI_VERSION || abi_ver > IB_USER_CM_MAX_ABI_VERSION) { fprintf(stderr, PFX "kernel ABI version %d " - "doesn't match library version %d.\n", - abi_ver, IB_USER_CM_MAX_ABI_VERSION); - goto out; + "doesn't match library version %d.\n", + abi_ver, IB_USER_CM_MAX_ABI_VERSION); + return -1; } - - ret = 0; - -out: - sysfs_close_attribute(attr); - return ret; + return 0; } -static uint64_t get_device_guid(struct sysfs_class_device *ibdev) +static int ucm_init(void) { - struct sysfs_attribute *attr; - uint64_t guid = 0; - uint16_t parts[4]; - int i; + int ret = 0; - attr = sysfs_get_classdev_attr(ibdev, "node_guid"); - if (!attr) - return 0; - - if 
(sscanf(attr->value, "%hx:%hx:%hx:%hx", - parts, parts + 1, parts + 2, parts + 3) != 4) - return 0; - - for (i = 0; i < 4; ++i) - guid = (guid << 16) | parts[i]; + pthread_mutex_lock(&mut); + if (!abi_ver) + ret = check_abi_version(); + pthread_mutex_unlock(&mut); - return htonll(guid); + return ret; } -static struct ib_cm_device* open_device(struct sysfs_class_device *cm_dev) +struct ib_cm_device* ib_cm_open_device(struct ibv_context *device_context) { - struct sysfs_class_device *ib_dev; - struct sysfs_attribute *attr; struct ib_cm_device *dev; - char ibdev_name[64]; - char *devpath; + char *dev_path; + + if (ucm_init()) + return NULL; dev = malloc(sizeof *dev); if (!dev) return NULL; - attr = sysfs_get_classdev_attr(cm_dev, "ibdev"); - if (!attr) { - fprintf(stderr, PFX "no ibdev class attr for %s\n", - cm_dev->name); - goto err; - } - - sscanf(attr->value, "%63s", ibdev_name); - ib_dev = sysfs_open_class_device("infiniband", ibdev_name); - if (!ib_dev) - goto err; + dev->device_context = device_context; - dev->guid = get_device_guid(ib_dev); - sysfs_close_class_device(ib_dev); - if (!dev->guid) - goto err; + asprintf(&dev_path, "/dev/infiniband/ucm%s", + device_context->device->dev_name + sizeof("uverbs") - 1); - asprintf(&devpath, "/dev/infiniband/%s", cm_dev->name); - dev->fd = open(devpath, O_RDWR); + dev->fd = open(dev_path, O_RDWR); if (dev->fd < 0) { - fprintf(stderr, PFX "error <%d:%d> opening device <%s>\n", - dev->fd, errno, devpath); + fprintf(stderr, PFX "unable to open %s\n", dev_path); goto err; } + + free(dev_path); return dev; + err: + free(dev_path); free(dev); return NULL; } -static void __attribute__((constructor)) ib_cm_init(void) -{ - struct sysfs_class *cls; - struct dlist *cm_dev_list; - struct sysfs_class_device *cm_dev; - struct ib_cm_device *dev; - - device_list = dlist_new(sizeof(struct ib_cm_device)); - if (!device_list) { - fprintf(stderr, PFX "couldn't allocate device list.\n"); - abort(); - } - - cls = sysfs_open_class("infiniband_cm"); - if (!cls) { - fprintf(stderr, PFX "couldn't open 'infiniband_cm'.\n"); - goto err; - } - - if (check_abi_version()) - goto err; - - cm_dev_list = sysfs_get_class_devices(cls); - if (!cm_dev_list) { - fprintf(stderr, PFX "no class devices found.\n"); - goto err; - } - - dlist_for_each_data(cm_dev_list, cm_dev, struct sysfs_class_device) { - dev = open_device(cm_dev); - if (dev) - dlist_push(device_list, dev); - } - return; -err: - sysfs_close_class(cls); -} - -static void __attribute__((destructor)) ib_cm_fini(void) +void ib_cm_close_device(struct ib_cm_device *device) { - struct ib_cm_device *dev; - - if (!device_list) - return; - - dlist_for_each_data(device_list, dev, struct ib_cm_device) - close(dev->fd); - - dlist_destroy(device_list); -} - -struct ib_cm_device* ib_cm_get_device(struct ibv_context *device_context) -{ - struct ib_cm_device *dev; - uint64_t guid; - - guid = ibv_get_device_guid(device_context->device); - dlist_for_each_data(device_list, dev, struct ib_cm_device) - if (dev->guid == guid) - return dev; - - return NULL; + close(device->fd); + free(device); } static void ib_cm_free_id(struct cm_id_private *cm_id_priv) @@ -289,7 +180,7 @@ static void ib_cm_free_id(struct cm_id_p free(cm_id_priv); } -static struct cm_id_private *ib_cm_alloc_id(struct ibv_context *device_context, +static struct cm_id_private *ib_cm_alloc_id(struct ib_cm_device *device, void *context) { struct cm_id_private *cm_id_priv; @@ -299,23 +190,19 @@ static struct cm_id_private *ib_cm_alloc return NULL; memset(cm_id_priv, 0, sizeof 
*cm_id_priv); - cm_id_priv->id.device_context = device_context; + cm_id_priv->id.device = device; cm_id_priv->id.context = context; pthread_mutex_init(&cm_id_priv->mut, NULL); if (pthread_cond_init(&cm_id_priv->cond, NULL)) goto err; - cm_id_priv->id.device = ib_cm_get_device(device_context); - if (!cm_id_priv->id.device) - goto err; - return cm_id_priv; err: ib_cm_free_id(cm_id_priv); return NULL; } -int ib_cm_create_id(struct ibv_context *device_context, +int ib_cm_create_id(struct ib_cm_device *device, struct ib_cm_id **cm_id, void *context) { struct cm_abi_create_id_resp *resp; @@ -325,14 +212,14 @@ int ib_cm_create_id(struct ibv_context * int result; int size; - cm_id_priv = ib_cm_alloc_id(device_context, context); + cm_id_priv = ib_cm_alloc_id(device, context); if (!cm_id_priv) return -ENOMEM; CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_CREATE_ID, size); cmd->uid = (uintptr_t) cm_id_priv; - result = write(cm_id_priv->id.device->fd, msg, size); + result = write(device->fd, msg, size); if (result != size) goto err; @@ -934,7 +821,7 @@ int ib_cm_get_event(struct ib_cm_device switch (evt->event) { case IB_CM_REQ_RECEIVED: evt->param.req_rcvd.listen_id = evt->cm_id; - cm_id_priv = ib_cm_alloc_id(evt->cm_id->device_context, + cm_id_priv = ib_cm_alloc_id(evt->cm_id->device, evt->cm_id->context); if (!cm_id_priv) { result = -ENOMEM; @@ -972,7 +859,7 @@ int ib_cm_get_event(struct ib_cm_device break; case IB_CM_SIDR_REQ_RECEIVED: evt->param.sidr_req_rcvd.listen_id = evt->cm_id; - cm_id_priv = ib_cm_alloc_id(evt->cm_id->device_context, + cm_id_priv = ib_cm_alloc_id(evt->cm_id->device, evt->cm_id->context); if (!cm_id_priv) { result = -ENOMEM; Index: src/libibcm.map =================================================================== --- src/libibcm.map (revision 8215) +++ src/libibcm.map (working copy) @@ -1,9 +1,9 @@ IBCM_4.0 { global: - + ib_cm_open_device; + ib_cm_close_device; ib_cm_get_event; ib_cm_ack_event; - ib_cm_get_device; ib_cm_create_id; ib_cm_destroy_id; ib_cm_attr_id; From halr at voltaire.com Tue Aug 22 14:05:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Aug 2006 17:05:00 -0400 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <44EB37B2.40906@mellanox.co.il> References: <44EB37B2.40906@mellanox.co.il> Message-ID: <1156280698.17858.4748.camel@hal.voltaire.com> On Tue, 2006-08-22 at 12:58, Eitan Zahavi wrote: > I did not see this on the reflector. It made it. > We did have some mailer problems. So I am resending to the list > > One more thing to add: > The only other event we considered was PORT_ACTIVE. > But as it turns out the event is only generated when the port moves into ACTIVE state > which means an SM already handled it... Not sure what you would do with ACTIVE. There are some port selection rules about port state for umad. -- Hal > EZ From sashak at voltaire.com Tue Aug 22 14:18:55 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 23 Aug 2006 00:18:55 +0300 Subject: [openib-general] [PATCH] opensm: option to limit size of OpenSM log file Message-ID: <20060822211855.GD10446@sashak.voltaire.com> Hi Hal, There is new option which specified max size of OpenSM log file. The default is '0' (not-limited). Please note osm_log_init() has new parameter now. We already saw the problems with FS overflowing in real life - we may want those related fixes in OFED too. Sasha opensm: option to limit size of OpenSM log file New option '-L' will limit size of OpenSM log file. 
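(For example, and using the documented default log path, "opensm -f /var/log/osm.log -L 100" would cap the log file at 100 MB; the 100 here is just an illustrative value, and the default of 0 leaves the log unlimited.)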
If specified the log file will be truncated upon reaching this limit. Signed-off-by: Sasha Khapyorsky --- osm/complib/cl_event_wheel.c | 2 osm/include/opensm/osm_log.h | 44 +---------- osm/include/opensm/osm_subnet.h | 6 + osm/opensm/libopensm.map | 1 osm/opensm/main.c | 13 +++ osm/opensm/osm_log.c | 163 ++++++++++++++++++++++++++------------- osm/opensm/osm_opensm.c | 3 - osm/opensm/osm_subnet.c | 9 ++ osm/osmtest/osmtest.c | 2 9 files changed, 143 insertions(+), 100 deletions(-) diff --git a/osm/complib/cl_event_wheel.c b/osm/complib/cl_event_wheel.c index 46c1f8e..a215f40 100644 --- a/osm/complib/cl_event_wheel.c +++ b/osm/complib/cl_event_wheel.c @@ -610,7 +610,7 @@ main () cl_event_wheel_construct( &event_wheel ); /* init */ - osm_log_init( &log, TRUE, 0xff, NULL, FALSE); + osm_log_init( &log, TRUE, 0xff, NULL, 0, FALSE); cl_event_wheel_init( &event_wheel, &log ); /* Start Playing */ diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h index f5bffd1..5bfaef5 100644 --- a/osm/include/opensm/osm_log.h +++ b/osm/include/opensm/osm_log.h @@ -121,6 +121,8 @@ typedef struct _osm_log { osm_log_level_t level; cl_spinlock_t lock; + unsigned long count; + unsigned long max_size; boolean_t flush; FILE* out_port; } osm_log_t; @@ -211,50 +213,14 @@ osm_log_destroy( * * SYNOPSIS */ -static inline ib_api_status_t +ib_api_status_t osm_log_init( IN osm_log_t* const p_log, IN const boolean_t flush, IN const uint8_t log_flags, IN const char *log_file, - IN const boolean_t accum_log_file ) -{ - p_log->level = log_flags; - p_log->flush = flush; - - if (log_file == NULL || !strcmp(log_file, "-") || - !strcmp(log_file, "stdout")) - { - p_log->out_port = stdout; - } - else if (!strcmp(log_file, "stderr")) - { - p_log->out_port = stderr; - } - else - { - if (accum_log_file) - p_log->out_port = fopen(log_file, "a+"); - else - p_log->out_port = fopen(log_file, "w+"); - - if (!p_log->out_port) - { - if (accum_log_file) - printf("Cannot open %s for appending. Permission denied\n", log_file); - else - printf("Cannot open %s for writing. Permission denied\n", log_file); - - return(IB_UNKNOWN_ERROR); - } - } - openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); - - if (cl_spinlock_init( &p_log->lock ) == CL_SUCCESS) - return IB_SUCCESS; - else - return IB_ERROR; -} + IN const unsigned long max_size, + IN const boolean_t accum_log_file ); /* * PARAMETERS * p_log diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index b45e5b6..650391b 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -261,6 +261,7 @@ typedef struct _osm_subn_opt uint8_t log_flags; char * dump_files_dir; char * log_file; + unsigned long log_max_size; char * partition_config_file; boolean_t no_partition_enforcement; boolean_t no_qos; @@ -388,6 +389,11 @@ typedef struct _osm_subn_opt * log_file * Name of the log file (or NULL) for stdout. * +* log_limit +* This option defines maximal log file size in MB. When +* specified the log file will be truncated upon reaching +* this limit. +* * accum_log_file * If TRUE (default) - the log file will be accumulated. * If FALSE - the log file will be erased before starting current opensm run. 
diff --git a/osm/opensm/libopensm.map b/osm/opensm/libopensm.map index 2b45b5d..c5bc0ab 100644 --- a/osm/opensm/libopensm.map +++ b/osm/opensm/libopensm.map @@ -2,6 +2,7 @@ OPENSM_1.1 { global: osm_log; osm_is_debug; + osm_log_init; osm_mad_pool_construct; osm_mad_pool_destroy; osm_mad_pool_init; diff --git a/osm/opensm/main.c b/osm/opensm/main.c index b429626..d5d8211 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -228,6 +228,11 @@ show_usage(void) " This option defines the log to be the given file.\n" " By default the log goes to /var/log/osm.log.\n" " For the log to go to standard output use -f stdout.\n\n"); + printf( "-L \n" + "--log_limit \n" + " This option defines maximal log file size in MB. When\n" + " specified the log file will be truncated upon reaching\n" + " this limit.\n\n"); printf( "-e\n" "--erase_log_file\n" " This option will cause deletion of the log file\n" @@ -527,7 +532,7 @@ #endif boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:R:U:P:NQvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:L:s:t:a:R:U:P:NQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -547,6 +552,7 @@ #endif { "verbose", 0, NULL, 'v'}, { "D", 1, NULL, 'D'}, { "log_file", 1, NULL, 'f'}, + { "log_limit", 1, NULL, 'L'}, { "erase_log_file",0, NULL, 'e'}, { "Pconfig", 1, NULL, 'P'}, { "no_part_enforce",0,NULL, 'N'}, @@ -731,6 +737,11 @@ #endif opt.log_file = optarg; break; + case 'L': + opt.log_max_size = strtoul(optarg, NULL, 0) * (1024*1024); + printf(" Log file max size is %lu bytes\n", opt.log_max_size); + break; + case 'e': opt.accum_log_file = FALSE; printf(" Creating new log file\n"); diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c index 6efbc5a..eba3cb6 100644 --- a/osm/opensm/osm_log.c +++ b/osm/opensm/osm_log.c @@ -83,6 +83,18 @@ #endif /* ndef WIN32 */ static int log_exit_count = 0; +static void truncate_log_file(osm_log_t* const p_log) +{ + int fd = fileno(p_log->out_port); + if (ftruncate(fd, 0) < 0) + fprintf(stderr, "truncate_log_file: cannot truncate: %s\n", + strerror(errno)); + if (lseek(fd, 0, SEEK_SET) < 0) + fprintf(stderr, "truncate_log_file: cannot rewind: %s\n", + strerror(errno)); + p_log->count = 0; +} + void osm_log( IN osm_log_t* const p_log, @@ -110,84 +122,67 @@ #else #endif /* WIN32 */ /* If this is a call to syslog - always print it */ - if ( verbosity & OSM_LOG_SYS ) + if ( verbosity & (OSM_LOG_SYS|p_log->level)) { - /* this is a call to the syslog */ va_start( args, p_str ); vsprintf( buffer, p_str, args ); va_end(args); - cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); - /* SYSLOG should go to stdout too */ - if (p_log->out_port != stdout) - { - printf("%s\n", buffer); - fflush( stdout ); - } + /* this is a call to the syslog */ + if (verbosity & OSM_LOG_SYS) { + cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); - /* send it also to the log file */ -#ifdef WIN32 - GetLocalTime(&st); - fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", - st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, - pid, buffer); -#else - fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s\n", - (result.tm_mon < 12 ? 
month_str[result.tm_mon] : "???"), - result.tm_mday, result.tm_hour, - result.tm_min, result.tm_sec, - usecs, pid, buffer); - fflush( p_log->out_port ); -#endif - } + /* SYSLOG should go to stdout too */ + if (p_log->out_port != stdout) + { + printf("%s\n", buffer); + fflush( stdout ); + } + } - /* SYS messages go to the log anyways */ - if (p_log->level & verbosity) - { - - va_start( args, p_str ); - vsprintf( buffer, p_str, args ); - va_end(args); - /* regular log to default out_port */ cl_spinlock_acquire( &p_log->lock ); + + if (p_log->max_size && p_log->count > p_log->max_size) { + /* truncate here */ + fprintf(stderr, "osm_log: log file exceeds the limit %lu. Truncating.\n", + p_log->max_size); + truncate_log_file(p_log); + } + #ifdef WIN32 GetLocalTime(&st); _retry: - ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", + ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, pid, buffer); - #else pid = pthread_self(); - tim = time(NULL); _retry: - ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", - ((result.tm_mon < 12) && (result.tm_mon >= 0) ? - month_str[result.tm_mon] : "???"), - result.tm_mday, result.tm_hour, - result.tm_min, result.tm_sec, - usecs, pid, buffer); -#endif /* WIN32 */ - - if (ret >= 0) - log_exit_count = 0; - else if (errno == ENOSPC && log_exit_count < 3) { - int fd = fileno(p_log->out_port); + ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s\n", + (result.tm_mon < 12 ? month_str[result.tm_mon] : "???"), + result.tm_mday, result.tm_hour, + result.tm_min, result.tm_sec, + usecs, pid, buffer); +#endif + + /* flush log */ + if (ret > 0 && (p_log->flush || (verbosity & OSM_LOG_ERROR)) && + fflush( p_log->out_port ) < 0) + ret = -1; + + if (ret < 0 && errno == ENOSPC && log_exit_count < 3) { fprintf(stderr, "osm_log write failed: %s. Truncating log file.\n", strerror(errno)); - ftruncate(fd, 0); - lseek(fd, 0, SEEK_SET); + truncate_log_file(p_log); log_exit_count++; goto _retry; } - - /* - Flush log on errors too. - */ - if( p_log->flush || (verbosity & OSM_LOG_ERROR) ) - fflush( p_log->out_port ); - + else { + log_exit_count = 0; + p_log->count += ret; + } + cl_spinlock_release( &p_log->lock ); } } @@ -221,3 +216,59 @@ #else return FALSE; #endif /* defined( _DEBUG_ ) */ } + +ib_api_status_t +osm_log_init( + IN osm_log_t* const p_log, + IN const boolean_t flush, + IN const uint8_t log_flags, + IN const char *log_file, + IN const unsigned long max_size, + IN const boolean_t accum_log_file ) +{ + struct stat st; + + p_log->level = log_flags; + p_log->flush = flush; + p_log->count = 0; + p_log->max_size = 0; + + if (log_file == NULL || !strcmp(log_file, "-") || + !strcmp(log_file, "stdout")) + { + p_log->out_port = stdout; + } + else if (!strcmp(log_file, "stderr")) + { + p_log->out_port = stderr; + } + else + { + if (accum_log_file) + p_log->out_port = fopen(log_file, "a+"); + else + p_log->out_port = fopen(log_file, "w+"); + + if (!p_log->out_port) + { + if (accum_log_file) + printf("Cannot open %s for appending. Permission denied\n", log_file); + else + printf("Cannot open %s for writing. 
Permission denied\n", log_file); + + return(IB_UNKNOWN_ERROR); + } + + if (fstat(fileno(p_log->out_port), &st) == 0) + p_log->count = st.st_size; + + p_log->max_size = max_size; + } + + openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); + + if (cl_spinlock_init( &p_log->lock ) == CL_SUCCESS) + return IB_SUCCESS; + else + return IB_ERROR; +} diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 6704e98..0b39d13 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -181,7 +181,8 @@ osm_opensm_init( osm_opensm_construct( p_osm ); status = osm_log_init( &p_osm->log, p_opt->force_log_flush, - p_opt->log_flags, p_opt->log_file, p_opt->accum_log_file ); + p_opt->log_flags, p_opt->log_file, + p_opt->log_max_size, p_opt->accum_log_file ); if( status != IB_SUCCESS ) return ( status ); diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index bb5067a..395d71b 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -477,6 +477,7 @@ osm_subn_set_default_opt( p_opt->dump_files_dir = OSM_DEFAULT_TMP_DIR; p_opt->log_file = OSM_DEFAULT_LOG_FILE; + p_opt->log_max_size = 0; p_opt->partition_config_file = OSM_DEFAULT_PARTITION_CONFIG_FILE; p_opt->no_partition_enforcement = FALSE; p_opt->no_qos = FALSE; @@ -923,6 +924,10 @@ osm_subn_parse_conf_file( __osm_subn_opts_unpack_charp( "log_file", p_key, p_val, &p_opts->log_file); + __osm_subn_opts_unpack_uint32( + "log_max_size", + p_key, p_val, (uint32_t *)&p_opts->log_max_size); + __osm_subn_opts_unpack_charp( "partition_config_file", p_key, p_val, &p_opts->partition_config_file); @@ -1173,7 +1178,8 @@ osm_subn_write_conf_file( "# Force flush of the log file after each log message\n" "force_log_flush %s\n\n" "# Log file to be used\n" - "log_file %s\n\n" + "log_file %s\n\n" + "log_max_size %lu\n\n" "accum_log_file %s\n\n" "# The directory to hold the file OpenSM dumps\n" "dump_files_dir %s\n\n" @@ -1186,6 +1192,7 @@ osm_subn_write_conf_file( p_opts->log_flags, p_opts->force_log_flush ? "TRUE" : "FALSE", p_opts->log_file, + p_opts->log_max_size, p_opts->accum_log_file ? "TRUE" : "FALSE", p_opts->dump_files_dir, p_opts->no_multicast_option ? "TRUE" : "FALSE", diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index f0f29d3..4f41e38 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -521,7 +521,7 @@ osmtest_init( IN osmtest_t * const p_osm osmtest_construct( p_osmt ); status = osm_log_init( &p_osmt->log, p_opt->force_log_flush, - 0x0001, p_opt->log_file, TRUE ); + 0x0001, p_opt->log_file, 0, TRUE ); if( status != IB_SUCCESS ) return ( status ); From halr at voltaire.com Tue Aug 22 14:25:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Aug 2006 17:25:40 -0400 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <1156268156.17858.801.camel@hal.voltaire.com> References: <1156268156.17858.801.camel@hal.voltaire.com> Message-ID: <1156281936.17858.5168.camel@hal.voltaire.com> Hi again Yevgeny, I've been working on integrating this patch and have some more comments on it: On Tue, 2006-08-22 at 13:35, Hal Rosenstock wrote: > Hi Yevgeny, > > On Tue, 2006-08-22 at 11:41, Yevgeny Kliteynik wrote: > > Hi Hal > > > > This patch implements first item of the OSM todo list. > > Thanks! > > Am I correct in assuming this is both for trunk and 1.1 ? > > > OpenSM opens a thread that is listening for events on the SM's port. > > The events that are being taken care of are IBV_EVENT_DEVICE_FATAL and > > IBV_EVENT_PORT_ERROR. 
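(For readers who do not want to dig through the whole diff below: the listener being added boils down to a small libibverbs loop along the following lines. This is only a sketch of the mechanism, not the patch itself; "ctx" and "sm_port" are placeholders for the ibv_context and port number OpenSM is bound to.)

	#include <infiniband/verbs.h>

	/* Minimal sketch: wait for asynchronous events on the SM's HCA and
	 * pick out the two events discussed in this thread. */
	static void sm_port_event_loop(struct ibv_context *ctx, int sm_port)
	{
		struct ibv_async_event event;

		while (!ibv_get_async_event(ctx, &event)) {
			if (event.event_type == IBV_EVENT_DEVICE_FATAL) {
				/* fatal HCA error: OpenSM has no choice but to exit */
				ibv_ack_async_event(&event);
				break;
			}
			if (event.event_type == IBV_EVENT_PORT_ERR &&
			    event.element.port_num == sm_port) {
				/* the SM's port went down: request a heavy sweep */
			}
			ibv_ack_async_event(&event);
		}
	}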
> > > > In case of IBV_EVENT_DEVICE_FATAL, osm is forced to exit. > > in case of IBV_EVENT_PORT_ERROR, osm initiates heavy sweep. What (and how) were this tested ? How can these events be generated ? > Some minor comments below. Let me know what you think. You don't have to > resubmit for these. > > > Yevgeny > > > > Signed-off-by: Yevgeny Kliteynik > > > > Index: include/opensm/osm_sm_mad_ctrl.h > > =================================================================== > > -- include/opensm/osm_sm_mad_ctrl.h (revision 8998) > > +++ include/opensm/osm_sm_mad_ctrl.h (working copy) > > @@ -109,6 +109,7 @@ typedef struct _osm_sm_mad_ctrl > > osm_mad_pool_t *p_mad_pool; > > osm_vl15_t *p_vl15; > > osm_vendor_t *p_vendor; > > + struct _osm_state_mgr *p_state_mgr; > > osm_bind_handle_t h_bind; > > cl_plock_t *p_lock; > > cl_dispatcher_t *p_disp; > > @@ -130,6 +131,9 @@ typedef struct _osm_sm_mad_ctrl > > * p_vendor > > * Pointer to the vendor specific interfaces object. > > * > > +* p_state_mgr > > +* Pointer to the state manager object. > > +* > > * h_bind > > * Bind handle returned by the transport layer. > > * > > @@ -233,6 +237,7 @@ osm_sm_mad_ctrl_init( > > IN osm_mad_pool_t* const p_mad_pool, > > IN osm_vl15_t* const p_vl15, > > IN osm_vendor_t* const p_vendor, > > + IN struct _osm_state_mgr* const p_state_mgr, > > IN osm_log_t* const p_log, > > IN osm_stats_t* const p_stats, > > IN cl_plock_t* const p_lock, > > @@ -251,6 +256,9 @@ osm_sm_mad_ctrl_init( > > * p_vendor > > * [in] Pointer to the vendor specific interfaces object. > > * > > +* p_state_mgr > > +* [in] Pointer to the state manager object. > > +* > > * p_log > > * [in] Pointer to the log object. > > * > > Index: include/vendor/osm_vendor_ibumad.h > > =================================================================== > > -- include/vendor/osm_vendor_ibumad.h (revision 8998) > > +++ include/vendor/osm_vendor_ibumad.h (working copy) > > @@ -74,6 +74,8 @@ BEGIN_C_DECLS > > #define OSM_UMAD_MAX_CAS 32 > > #define OSM_UMAD_MAX_PORTS_PER_CA 2 > > > > +#define OSM_VENDOR_SUPPORT_EVENTS > > + > > I prefer this as an additional flag turned on in the build for OpenIB. I take that back. I think that this define should not be present at all as this is a change to the vendor layer API. If so, then libosmvendor.ver needs to be bumped. Also, what about any of the other vendor layers supported ? They would need at least some stub to get past the linking. > > /* OpenIB gen2 doesn't support RMPP yet */ > > > > /****s* OpenSM: Vendor UMAD/osm_ca_info_t > > @@ -179,6 +181,10 @@ typedef struct _osm_vendor > > int umad_port_id; > > void *receiver; > > int issmfd; > > + cl_thread_t events_thread; > > + void * events_callback; > > + void * sm_context; > > + struct ibv_context * ibv_context; > > } osm_vendor_t; > > > > #define OSM_BIND_INVALID_HANDLE 0 > > Index: include/vendor/osm_vendor_api.h > > =================================================================== > > -- include/vendor/osm_vendor_api.h (revision 8998) > > +++ include/vendor/osm_vendor_api.h (working copy) > > @@ -526,6 +526,110 @@ osm_vendor_set_debug( > > * SEE ALSO > > *********/ > > > > +#ifdef OSM_VENDOR_SUPPORT_EVENTS > > + > > +#define OSM_EVENT_FATAL 1 > > +#define OSM_EVENT_PORT_ERR 2 > > + > > +/****s* OpenSM Vendor API/osm_vend_events_callback_t > > +* NAME > > +* osm_vend_events_callback_t > > +* > > +* DESCRIPTION > > +* Function prototype for the vendor events callback. > > +* The vendor layer calls this function on driver events. 
> > +* > > +* SYNOPSIS > > +*/ > > +typedef void > > +(*osm_vend_events_callback_t)( > > + IN int events_mask, > > + IN void * const context ); > > +/* > > +* PARAMETERS > > +* events_mask > > +* [in] The received event(s). > > +* > > +* context > > +* [in] Context supplied as the "sm_context" argument in > > +* the osm_vendor_unreg_events_cb call > > +* > > +* RETURN VALUES > > +* None. > > +* > > +* NOTES > > +* > > +* SEE ALSO > > +* osm_vendor_reg_events_cb osm_vendor_unreg_events_cb > > +*********/ > > + > > +/****f* OpenSM Vendor API/osm_vendor_reg_events_cb > > +* NAME > > +* osm_vendor_reg_events_cb > > +* > > +* DESCRIPTION > > +* Registers the events callback function and start the events > > +* thread > > +* > > +* SYNOPSIS > > +*/ > > +int > > +osm_vendor_reg_events_cb( > > + IN osm_vendor_t * const p_vend, > > + IN void * const sm_callback, > > + IN void * const sm_context); > > +/* > > +* PARAMETERS > > +* p_vend > > +* [in] vendor handle. > > +* > > +* sm_callback > > +* [in] Callback function that should be called when > > +* the event is received. > > +* > > +* sm_context > > +* [in] Context supplied as the "context" argument in > > +* the subsequenct calls to the sm_callback function > > +* > > +* RETURN VALUE > > +* IB_SUCCESS if OK. > > +* > > +* NOTES > > +* > > +* SEE ALSO > > +* osm_vend_events_callback_t osm_vendor_unreg_events_cb > > +*********/ > > + > > +/****f* OpenSM Vendor API/osm_vendor_unreg_events_cb > > +* NAME > > +* osm_vendor_unreg_events_cb > > +* > > +* DESCRIPTION > > +* Un-Registers the events callback function and stops the events > > +* thread > > +* > > +* SYNOPSIS > > +*/ > > +void > > +osm_vendor_unreg_events_cb( > > + IN osm_vendor_t * const p_vend); > > +/* > > +* PARAMETERS > > +* p_vend > > +* [in] vendor handle. > > +* > > +* > > +* RETURN VALUE > > +* None. 
> > +* > > +* NOTES > > +* > > +* SEE ALSO > > +* osm_vend_events_callback_t osm_vendor_reg_events_cb > > +*********/ > > + > > +#endif /* OSM_VENDOR_SUPPORT_EVENTS */ > > + > > END_C_DECLS > > > > #endif /* _OSM_VENDOR_API_H_ */ > > Index: libvendor/osm_vendor_ibumad.c > > =================================================================== > > -- libvendor/osm_vendor_ibumad.c (revision 8998) > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > @@ -72,6 +72,7 @@ > > #include > > #include > > #include > > +#include > > > > /****s* OpenSM: Vendor AL/osm_umad_bind_info_t > > * NAME > > @@ -441,6 +442,91 @@ Exit: > > > > /********************************************************************** > > **********************************************************************/ > > +static void > > +umad_events_thread( > > + IN void * vend_context) > > +{ > > + int res = 0; > > + osm_vendor_t * p_vend = (osm_vendor_t *) vend_context; > > + struct ibv_async_event event; > > + > > + OSM_LOG_ENTER( p_vend->p_log, umad_events_thread ); > > + > > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > > + "umad_events_thread: Device %s, async event FD: %d\n", > > + p_vend->umad_port.ca_name, p_vend->ibv_context->async_fd); > > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > > + "umad_events_thread: Listening for events on device %s, port %d\n", > > + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); > > + > > + while (1) { > > + > > + res = ibv_get_async_event(p_vend->ibv_context, &event); > > + if (res) > > + { > > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > > + "umad_events_thread: ERR 5450: " > > + "Failed getting async event (device %s, port %d)\n", > > + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); > > + goto Exit; > > + } > > + > > + if (!p_vend->events_callback) > > + { > > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > > + "umad_events_thread: Events callback has been unregistered\n"); > > + ibv_ack_async_event(&event); > > + goto Exit; > > + } > > + /* > > + * We're listening to events on the SM's port only > > + */ > > + if ( event.element.port_num == p_vend->umad_port.portnum ) > > + { > > + switch (event.event_type) > > + { > > + case IBV_EVENT_DEVICE_FATAL: > > + osm_log(p_vend->p_log, OSM_LOG_INFO, > > + "umad_events_thread: Received IBV_EVENT_DEVICE_FATAL\n"); > > + ((osm_vend_events_callback_t) > > + (p_vend->events_callback))(OSM_EVENT_FATAL, p_vend->sm_context); > > + > > + ibv_ack_async_event(&event); > > + goto Exit; > > + break; > > + > > + case IBV_EVENT_PORT_ERR: > > + osm_log(p_vend->p_log, OSM_LOG_VERBOSE, > > + "umad_events_thread: Received IBV_EVENT_PORT_ERR\n"); > > + ((osm_vend_events_callback_t) > > + (p_vend->events_callback))(OSM_EVENT_PORT_ERR, p_vend->sm_context); > > + break; > > + > > + default: > > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > > + "umad_events_thread: Received event #%d on port %d - Ignoring\n", > > + event.event_type, event.element.port_num); > > + } > > + } > > + else > > + { > > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > > + "umad_events_thread: Received event #%d on port %d - Ignoring\n", > > + event.event_type, event.element.port_num); > > + } > > + > > + ibv_ack_async_event(&event); > > + } > > + > > + Exit: > > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > > + "umad_events_thread: Terminating thread\n"); > > + OSM_LOG_EXIT(p_vend->p_log); > > + return; > > +} > > + > > +/********************************************************************** > > + **********************************************************************/ > > ib_api_status_t > > osm_vendor_init( > > IN 
osm_vendor_t* const p_vend, > > @@ -456,6 +542,7 @@ osm_vendor_init( > > p_vend->max_retries = OSM_DEFAULT_RETRY_COUNT; > > cl_spinlock_construct( &p_vend->cb_lock ); > > cl_spinlock_construct( &p_vend->match_tbl_lock ); > > + cl_thread_construct( &p_vend->events_thread ); > > p_vend->umad_port_id = -1; > > p_vend->issmfd = -1; > > > > @@ -1217,4 +1304,114 @@ osm_vendor_set_debug( > > umad_debug(level); > > } > > > > +/********************************************************************** > > + **********************************************************************/ > > +int > > +osm_vendor_reg_events_cb( > > + IN osm_vendor_t * const p_vend, > > + IN void * const sm_callback, > > + IN void * const sm_context) > > +{ > > + ib_api_status_t status = IB_SUCCESS; > > + struct ibv_device ** dev_list; > > + struct ibv_device * device; > > + > > + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_reg_events_cb ); > > + > > + p_vend->events_callback = sm_callback; > > + p_vend->sm_context = sm_context; > > + > > + dev_list = ibv_get_device_list(NULL); > > + if (!dev_list || !(*dev_list)) { > > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > > + "osm_vendor_reg_events_cb: ERR 5440: " > > + "No IB devices found\n"); > > + status = IB_ERROR; > > + goto Exit; > > + } > > + > > + if (!p_vend->umad_port.ca_name || !p_vend->umad_port.ca_name[0]) > > + { > > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > > + "osm_vendor_reg_events_cb: ERR 5441: " > > + "Vendor initialization is not completed yet\n"); > > + status = IB_ERROR; > > + goto Exit; > > + } > > + > > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > > + "osm_vendor_reg_events_cb: Registering on device %s\n", > > + p_vend->umad_port.ca_name); > > + > > + /* > > + * find device whos name matches the SM's device > > + */ > > + for ( device = *dev_list; > > + (device != NULL) && > > + (strcmp(p_vend->umad_port.ca_name, ibv_get_device_name(device)) != 0); > > + device += sizeof(struct ibv_device *) ) > > + ; > > + if (!device) > > + { > > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > > + "osm_vendor_reg_events_cb: ERR 5442: " > > + "Device %s hasn't been found in the device list\n" > > + ,p_vend->umad_port.ca_name); > > + status = IB_ERROR; > > + goto Exit; > > + } > > + > > + p_vend->ibv_context = ibv_open_device(device); > > + if (!p_vend->ibv_context) { > > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > > + "osm_vendor_reg_events_cb: ERR 5443: " > > + "Couldn't get context for %s\n", > > + p_vend->umad_port.ca_name); > > + status = IB_ERROR; > > + goto Exit; > > + } > > + > > + /* > > + * Initiate the events thread > > + */ > > + if (cl_thread_init(&p_vend->events_thread, > > + umad_events_thread, > > + p_vend, > > + "ibumad events thread") != CL_SUCCESS) { > > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > > + "osm_vendor_reg_events_cb: ERR 5444: " > > + "Failed initiating event listening thread\n"); > > + status = IB_ERROR; > > + goto Exit; > > + } > > + > > + Exit: > > + if (status != IB_SUCCESS) > > + { > > + p_vend->events_callback = NULL; > > + p_vend->sm_context = NULL; > > + p_vend->ibv_context = NULL; > > + p_vend->events_callback = NULL; > > + } > > + OSM_LOG_EXIT( p_vend->p_log ); > > + return status; > > +} > > + > > +/********************************************************************** > > + **********************************************************************/ > > +void > > +osm_vendor_unreg_events_cb( > > + IN osm_vendor_t * const p_vend) > > +{ > > + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_unreg_events_cb ); > > + p_vend->events_callback = NULL; > > + p_vend->sm_context 
= NULL; > > + p_vend->ibv_context = NULL; > > + p_vend->events_callback = NULL; > > + OSM_LOG_EXIT( p_vend->p_log ); > > +} > > + > > +/********************************************************************** > > + **********************************************************************/ > > + > > #endif /* OSM_VENDOR_INTF_OPENIB */ > > Index: libvendor/libosmvendor.map > > =================================================================== > > -- libvendor/libosmvendor.map (revision 8998) > > +++ libvendor/libosmvendor.map (working copy) > > @@ -1,4 +1,4 @@ > > -OSMVENDOR_2.0 { > > +OSMVENDOR_2.1 { > > global: > > umad_receiver; > > osm_vendor_init; > > @@ -23,5 +23,7 @@ OSMVENDOR_2.0 { > > osmv_bind_sa; > > osmv_query_sa; > > osm_vendor_get_guid_ca_and_port; > > + osm_vendor_reg_events_cb; > > + osm_vendor_unreg_events_cb; These are not present for all vendor layers. > > local: *; > > }; > > Index: opensm/osm_sm.c > > =================================================================== > > -- opensm/osm_sm.c (revision 8998) > > +++ opensm/osm_sm.c (working copy) > > @@ -313,6 +313,7 @@ osm_sm_init( > > p_sm->p_mad_pool, > > p_sm->p_vl15, > > p_sm->p_vendor, > > + &p_sm->state_mgr, > > p_log, p_stats, p_lock, p_disp ); > > if( status != IB_SUCCESS ) > > goto Exit; > > Index: opensm/osm_sm_mad_ctrl.c > > =================================================================== > > -- opensm/osm_sm_mad_ctrl.c (revision 8998) > > +++ opensm/osm_sm_mad_ctrl.c (working copy) > > @@ -59,6 +59,7 @@ > > #include > > #include > > #include > > +#include > > > > /****f* opensm: SM/__osm_sm_mad_ctrl_retire_trans_mad > > * NAME > > @@ -953,6 +954,7 @@ osm_sm_mad_ctrl_init( > > IN osm_mad_pool_t* const p_mad_pool, > > IN osm_vl15_t* const p_vl15, > > IN osm_vendor_t* const p_vendor, > > + IN struct _osm_state_mgr* const p_state_mgr, > > IN osm_log_t* const p_log, > > IN osm_stats_t* const p_stats, > > IN cl_plock_t* const p_lock, > > @@ -969,6 +971,7 @@ osm_sm_mad_ctrl_init( > > p_ctrl->p_disp = p_disp; > > p_ctrl->p_mad_pool = p_mad_pool; > > p_ctrl->p_vendor = p_vendor; > > + p_ctrl->p_state_mgr = p_state_mgr; > > p_ctrl->p_stats = p_stats; > > p_ctrl->p_lock = p_lock; > > p_ctrl->p_vl15 = p_vl15; > > @@ -995,6 +998,47 @@ osm_sm_mad_ctrl_init( > > > > /********************************************************************** > > **********************************************************************/ > > +void > > +__osm_vend_events_callback( > > + IN int events_mask, > > + IN void * const context ) > > Shouldn't this be conditionalized on OSM_VENDOR_SUPPORT_EVENTS ? 
> > +{ > > + osm_sm_mad_ctrl_t * const p_ctrl = (osm_sm_mad_ctrl_t * const) context; > > + > > + OSM_LOG_ENTER(p_ctrl->p_log, __osm_vend_events_callback); > > + > > + if (events_mask & OSM_EVENT_FATAL) > > + { > > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > > + "__osm_vend_events_callback: " > > + "Events callback got OSM_EVENT_FATAL\n"); > > + osm_log(p_ctrl->p_log, OSM_LOG_SYS, > > + "Fatal HCA error - forcing OpenSM exit\n"); > > + osm_exit_flag = 1; > > + OSM_LOG_EXIT(p_ctrl->p_log); > > + return; > > + } > > + > > + if (events_mask & OSM_EVENT_PORT_ERR) > > + { > > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > > + "__osm_vend_events_callback: " > > + "Events callback got OSM_EVENT_PORT_ERR - forcing heavy sweep\n"); > > + p_ctrl->p_subn->force_immediate_heavy_sweep = TRUE; > > + osm_state_mgr_process((osm_state_mgr_t * const)p_ctrl->p_state_mgr, > > + OSM_SIGNAL_SWEEP); > > + OSM_LOG_EXIT(p_ctrl->p_log); > > + return; > > + } > > + > > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > > + "__osm_vend_events_callback: " > > + "Events callback got event mask of %d - No action taken\n"); > > + OSM_LOG_EXIT(p_ctrl->p_log); > > +} > > + > > +/********************************************************************** > > + **********************************************************************/ > > ib_api_status_t > > osm_sm_mad_ctrl_bind( > > IN osm_sm_mad_ctrl_t* const p_ctrl, > > @@ -1044,6 +1088,17 @@ osm_sm_mad_ctrl_bind( > > goto Exit; > > } > > > > + if ( osm_vendor_reg_events_cb(p_ctrl->p_vendor, > > + __osm_vend_events_callback, > > + p_ctrl) ) > > + { Is an osm_vendor_unbind needed here or is this handled elsewhere ? > > + status = IB_ERROR; > > + osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > > + "osm_sm_mad_ctrl_bind: ERR 3120: " > > + "Vendor failed to register for events\n" ); > > + goto Exit; > > + } > > + Also, should osm_m_mad_ctrl_unbind unregister the events callback ? -- Hal > This should be conditionalized on OSM_VENDOR_SUPPORT_EVENTS. > > > Exit: > > OSM_LOG_EXIT( p_ctrl->p_log ); > > return( status ); > > Index: config/osmvsel.m4 > > =================================================================== > > -- config/osmvsel.m4 (revision 8998) > > +++ config/osmvsel.m4 (working copy) > > @@ -63,9 +63,9 @@ if test $with_osmv = "openib"; then > > OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" > > OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" > > if test "x$with_umad_libs" = "x"; then > > - OSMV_LDADD="-libumad" > > + OSMV_LDADD="-libumad -libverbs" > > else > > - OSMV_LDADD="-L$with_umad_libs -libumad" > > + OSMV_LDADD="-L$with_umad_libs -libumad -libverbs" > > fi > > > > if test "x$with_umad_includes" != "x"; then > > @@ -137,6 +137,8 @@ if test "$disable_libcheck" != "yes"; th > > LDFLAGS="$LDFLAGS $OSMV_LDADD" > > AC_CHECK_LIB(ibumad, umad_init, [], > > AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibumad.])) > > + AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], > > + AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibverbs.])) > > Cut and paste error: Error message should indicate ibv_get_device_list > rather than umad_init. 
> > > LD_FLAGS=$osmv_save_ldflags > > elif test $with_osmv = "sim" ; then > > LDFLAGS="$LDFLAGS -L$with_sim/lib" > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Tue Aug 22 14:54:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Aug 2006 17:54:55 -0400 Subject: [openib-general] [PATCH] opensm: option to limit size of OpenSM log file In-Reply-To: <20060822211855.GD10446@sashak.voltaire.com> References: <20060822211855.GD10446@sashak.voltaire.com> Message-ID: <1156283694.17858.5780.camel@hal.voltaire.com> Hi Sasha, On Tue, 2006-08-22 at 17:18, Sasha Khapyorsky wrote: > Hi Hal, > > There is new option which specified max size of OpenSM log file. The > default is '0' (not-limited). Please note osm_log_init() has new > parameter now. So libopensm.ver needs to be bumped (and this is not backward compatible). > We already saw the problems with FS overflowing in real life - we may > want those related fixes in OFED too. Yes, I think this is appropriate for both the trunk and OFED 1.1. > Sasha > > > opensm: option to limit size of OpenSM log file > > New option '-L' will limit size of OpenSM log file. If specified the log > file will be truncated upon reaching this limit. > > Signed-off-by: Sasha Khapyorsky > --- > > osm/complib/cl_event_wheel.c | 2 > osm/include/opensm/osm_log.h | 44 +---------- > osm/include/opensm/osm_subnet.h | 6 + > osm/opensm/libopensm.map | 1 > osm/opensm/main.c | 13 +++ > osm/opensm/osm_log.c | 163 ++++++++++++++++++++++++++------------- > osm/opensm/osm_opensm.c | 3 - > osm/opensm/osm_subnet.c | 9 ++ > osm/osmtest/osmtest.c | 2 > 9 files changed, 143 insertions(+), 100 deletions(-) I will make the requisite change to the man page when the time comes for this. -- Hal [snip...] From sashak at voltaire.com Tue Aug 22 15:22:30 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 23 Aug 2006 01:22:30 +0300 Subject: [openib-general] [PATCH] opensm: option to limit size of OpenSM log file In-Reply-To: <1156283694.17858.5780.camel@hal.voltaire.com> References: <20060822211855.GD10446@sashak.voltaire.com> <1156283694.17858.5780.camel@hal.voltaire.com> Message-ID: <20060822222230.GG10446@sashak.voltaire.com> On 17:54 Tue 22 Aug , Hal Rosenstock wrote: > Hi Sasha, > > On Tue, 2006-08-22 at 17:18, Sasha Khapyorsky wrote: > > Hi Hal, > > > > There is new option which specified max size of OpenSM log file. The > > default is '0' (not-limited). Please note osm_log_init() has new > > parameter now. > > So libopensm.ver needs to be bumped (and this is not backward > compatible). We may. I'm not sure it is necessary - in this patch I've changed all occurrences of osm_log_init() under osm/ (in opensm and osmtest). So this can be important only if there are osm_log "external" users. > > > We already saw the problems with FS overflowing in real life - we may > > want those related fixes in OFED too. > > Yes, I think this is appropriate for both the trunk and OFED 1.1. > > > Sasha > > > > > > opensm: option to limit size of OpenSM log file > > > > New option '-L' will limit size of OpenSM log file. If specified the log > > file will be truncated upon reaching this limit. 
> > > > Signed-off-by: Sasha Khapyorsky > > --- > > > > osm/complib/cl_event_wheel.c | 2 > > osm/include/opensm/osm_log.h | 44 +---------- > > osm/include/opensm/osm_subnet.h | 6 + > > osm/opensm/libopensm.map | 1 > > osm/opensm/main.c | 13 +++ > > osm/opensm/osm_log.c | 163 ++++++++++++++++++++++++++------------- > > osm/opensm/osm_opensm.c | 3 - > > osm/opensm/osm_subnet.c | 9 ++ > > osm/osmtest/osmtest.c | 2 > > 9 files changed, 143 insertions(+), 100 deletions(-) > > I will make the requisite change to the man page when the time comes for > this. Great. Thanks. Sasha > > -- Hal > > [snip...] > > From halr at voltaire.com Tue Aug 22 15:26:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Aug 2006 18:26:42 -0400 Subject: [openib-general] [PATCH] opensm: option to limit size of OpenSM log file In-Reply-To: <20060822222230.GG10446@sashak.voltaire.com> References: <20060822211855.GD10446@sashak.voltaire.com> <1156283694.17858.5780.camel@hal.voltaire.com> <20060822222230.GG10446@sashak.voltaire.com> Message-ID: <1156285600.17858.6451.camel@hal.voltaire.com> On Tue, 2006-08-22 at 18:22, Sasha Khapyorsky wrote: > On 17:54 Tue 22 Aug , Hal Rosenstock wrote: > > Hi Sasha, > > > > On Tue, 2006-08-22 at 17:18, Sasha Khapyorsky wrote: > > > Hi Hal, > > > > > > There is new option which specified max size of OpenSM log file. The > > > default is '0' (not-limited). Please note osm_log_init() has new > > > parameter now. > > > > So libopensm.ver needs to be bumped (and this is not backward > > compatible). > > We may. I'm not sure it is necessary - in this patch I've changed all > occurrences of osm_log_init() under osm/ (in opensm and osmtest). So > this can be important only if there are osm_log "external" users. There may be so I will do this. -- Hal From sashak at voltaire.com Tue Aug 22 15:34:09 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 23 Aug 2006 01:34:09 +0300 Subject: [openib-general] [PATCH] opensm: option to limit size of OpenSM log file In-Reply-To: <1156285600.17858.6451.camel@hal.voltaire.com> References: <20060822211855.GD10446@sashak.voltaire.com> <1156283694.17858.5780.camel@hal.voltaire.com> <20060822222230.GG10446@sashak.voltaire.com> <1156285600.17858.6451.camel@hal.voltaire.com> Message-ID: <20060822223409.GH10446@sashak.voltaire.com> On 18:26 Tue 22 Aug , Hal Rosenstock wrote: > On Tue, 2006-08-22 at 18:22, Sasha Khapyorsky wrote: > > On 17:54 Tue 22 Aug , Hal Rosenstock wrote: > > > Hi Sasha, > > > > > > On Tue, 2006-08-22 at 17:18, Sasha Khapyorsky wrote: > > > > Hi Hal, > > > > > > > > There is new option which specified max size of OpenSM log file. The > > > > default is '0' (not-limited). Please note osm_log_init() has new > > > > parameter now. > > > > > > So libopensm.ver needs to be bumped (and this is not backward > > > compatible). > > > > We may. I'm not sure it is necessary - in this patch I've changed all > > occurrences of osm_log_init() under osm/ (in opensm and osmtest). So > > this can be important only if there are osm_log "external" users. > > There may be so I will do this. Ok. Thanks. Sasha > > -- Hal > > > From mshefty at ichips.intel.com Tue Aug 22 16:02:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 22 Aug 2006 16:02:06 -0700 Subject: [openib-general] IB CM and the case of the lost RTU: was a bunch of other topics... 
In-Reply-To: <44D8513C.8000801@voltaire.com> References: <000101c6b821$6f783be0$8698070a@amr.corp.intel.com> <44D71B6B.3000007@voltaire.com> <44D776AD.3080606@ichips.intel.com> <44D8513C.8000801@voltaire.com> Message-ID: <44EB8CEE.9010708@ichips.intel.com> Or Gerlitz wrote: > Indeed, let's see if we can get some input from the ULP people working on > passive side / targets (eg NFS/Lustre/iSER/SDP). To recap (since it's been a couple of weeks), we have two general solutions for how to support the passive/server/target side of a connection: 1. One method requires that the passive side queue send WRs until they get a connection establish event. 2. An alternative allows sending immediately after receiving a response, but may require the user to manually transition the connection to established. Failure to do so will cause the connection to tear down if the RTU is never received (even after retries). Without target developer input, I'm guessing at the right solution. But my expectation is that it is likely that the passive side will process receive completions before the connection is established, but highly unlikely that the RTU will never be received in this case. - Sean From robert.j.woodruff at intel.com Tue Aug 22 16:46:31 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 22 Aug 2006 16:46:31 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready Message-ID: Woody wrote, >Tziporet wrote, >Hi Woody, >I wrote in the release mail RHEL4 up4 does not working: >Limitations and known issues: >Not a problem, I will test on RHEL4 - U3 for now. >woody Ok, I was able to install the RC2 on EL4-U3 and get Intel MPI working on uDAPL. Did have one issue with the install that maybe you could fix for the next RC. It appears that the rdma_ucm and rdma_cm are not being loaded at startup time and I had to manually modprobe rdma_ucm. After that, Intel MPI and uDAPL seemed to work fine with the initial tests I have done so far; I will continue to stress test it and let you know if we see any other issues. woody From johnt1johnt2 at gmail.com Tue Aug 22 22:13:53 2006 From: johnt1johnt2 at gmail.com (john t) Date: Wed, 23 Aug 2006 10:43:53 +0530 Subject: [openib-general] basic IB doubt Message-ID: Hi I have a very basic doubt. Suppose Host A is doing RDMA write (say 8 MB) to Host B. When data is copied into Host B's local buffer, is it guaranteed that data will be copied starting from the first location (first buffer address) to the last location (last buffer address)? or it could be in any order? Also does this transfer involve the DMA engine (residing on HCAs) of both hosts or just one host? Regards, John T From mst at mellanox.co.il Tue Aug 22 22:37:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 23 Aug 2006 08:37:14 +0300 Subject: [openib-general] IB CM and the case of the lost RTU: was a bunch of other topics... In-Reply-To: <44EB8CEE.9010708@ichips.intel.com> References: <44EB8CEE.9010708@ichips.intel.com> Message-ID: <20060823053714.GB4550@mellanox.co.il> Quoting r. Sean Hefty : > Subject: IB CM and the case of the lost RTU: was a bunch of other topics... > > Or Gerlitz wrote: > > Indeed, let's see if we can get some input from the ULP people working on > > passive side / targets (eg NFS/Lustre/iSER/SDP). > > To recap (since it's been a couple of weeks), we have two general solutions for > how to support the passive/server/target side of a connection: > > 1.
One method requires that the passive side queue send WRs until they get a > connection establish event. > > 2. An alternative allows sending immediately after receiving a response, but may > require the user to manually transition the connection to established. Failure > to do so will cause the connection to tear down if the RTU is never received > (even after retries). > > Without target developer input, I'm guessing at the right solution. But my > expectation is that it is likely that the passive side will process receive > completions before the connection is established, but highly unlikely that the > RTU will never be received in this case. I think SDP would simply queue receive WRs and never send any WRs until established event. -- MST From eitan at mellanox.co.il Tue Aug 22 22:38:23 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 23 Aug 2006 08:38:23 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <1156280698.17858.4748.camel@hal.voltaire.com> References: <44EB37B2.40906@mellanox.co.il> <1156280698.17858.4748.camel@hal.voltaire.com> Message-ID: <44EBE9CF.5050808@mellanox.co.il> Hal Rosenstock wrote: > On Tue, 2006-08-22 at 12:58, Eitan Zahavi wrote: > >>I did not see this on the reflector. > > > It made it. > > >>We did have some mailer problems. So I am resending to the list >> >>One more thing to add: >>The only other event we considered was PORT_ACTIVE. >>But as it turns out the event is only generated when the port moves into ACTIVE state >>which means an SM already handled it... > > > Not sure what you would do with ACTIVE. There are some port selection > rules about port state for umad. I was only saying that although such an event exists we did not know what to do with it. So I agree with you we do not have anything to do with it. > > -- Hal > > >>EZ > > From mst at mellanox.co.il Tue Aug 22 22:45:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 23 Aug 2006 08:45:41 +0300 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <20060823054541.GC4550@mellanox.co.il> Quoting r. john t : > Subject: basic IB doubt > > Hi > > I have a very basic doubt. Suppose Host A is doing RDMA write (say 8 MB) to Host B. When data is copied into Host B's local buffer, is it guaranteed that data will be copied starting from the first location (first buffer address) to the last location (last > buffer address)? or it could be in any order? Once B gets a completion (e.g. of a subsequent send), data in its buffer matches that of A, byte for byte. -- MST From eitan at mellanox.co.il Tue Aug 22 23:00:18 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 23 Aug 2006 09:00:18 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <1156281936.17858.5168.camel@hal.voltaire.com> References: <1156268156.17858.801.camel@hal.voltaire.com> <1156281936.17858.5168.camel@hal.voltaire.com> Message-ID: <44EBEEF2.9050206@mellanox.co.il> Hi Hal, Yevgeny is out for few days so I will try and answer: The idea behind using OSM_VENDOR_SUPPORT_EVENTS was that some vendors might not implement it. For these vendor we need osm_vendor_api.h not to define the new APIs (osm_vendor_reg_events_cb and osm_vendor_unreg_events_cb). One way to do that is to have this "define" such that: 1. A vendor that supports this extended functionality will define this flag. It can be done using gcc command line (and find its way into osmvsel.m4), or be defined within the vendor specific include file osm_vendor_ibumad.h. 
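To put this in code form (a sketch only; the prototypes are the ones from the patch, shortened here):

	/* osm_vendor_ibumad.h - a vendor layer that implements events */
	#define OSM_VENDOR_SUPPORT_EVENTS

	/* osm_vendor_api.h - it already includes the vendor specific header,
	 * so the new entry points are only declared when the flag is defined */
	#ifdef OSM_VENDOR_SUPPORT_EVENTS
	int osm_vendor_reg_events_cb( IN osm_vendor_t * const p_vend,
	                              IN void * const sm_callback,
	                              IN void * const sm_context );
	void osm_vendor_unreg_events_cb( IN osm_vendor_t * const p_vend );
	#endif /* OSM_VENDOR_SUPPORT_EVENTS */

The same OSM_VENDOR_SUPPORT_EVENTS guard can then be used around the callback and the registration call in osm_sm_mad_ctrl.c, so vendor layers that never define the flag still compile and link.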
I think it make more sense to have each vendor declare its supported extension in the H file but this is a matter of taste. 2. osm_vendor_api.h already includes the osm_vendor.h which includes the specific osm_vendor_.h - so the supported extensions are "known" or better say "defined". osm_vendoe_api.h will only define these APIs if the OSM_VENDOR_SUPPORT_EVENTS is defined. 3. osm_sm_mad_ctrl.c should use OSM_VENDOR_SUPPORT_EVENTS as a qualifier for including the extra functionality of registering for the events. I see that was missing from the original patch. Regarding testing: Yevgeny did not test with FATAL event yet. Instead he forced the reported event to FATAL (and plugged out the port). So he did observe OpenSM exit due to fatal. I think there is some way an HCA can be forced into such event. To make a full test (mostly of the IBV layer not OpenSM) we will need to do that too. I have answered the rest of the questions below EZ Hal Rosenstock wrote: > Hi again Yevgeny, > > I've been working on integrating this patch and have some more comments > on it: > > On Tue, 2006-08-22 at 13:35, Hal Rosenstock wrote: > >>Hi Yevgeny, >> >>On Tue, 2006-08-22 at 11:41, Yevgeny Kliteynik wrote: >> >>>Hi Hal >>> >>>This patch implements first item of the OSM todo list. >> >>Thanks! >> >>Am I correct in assuming this is both for trunk and 1.1 ? Yes both. >> >> >>>OpenSM opens a thread that is listening for events on the SM's port. >>>The events that are being taken care of are IBV_EVENT_DEVICE_FATAL and >>>IBV_EVENT_PORT_ERROR. >>> >>>In case of IBV_EVENT_DEVICE_FATAL, osm is forced to exit. >>>in case of IBV_EVENT_PORT_ERROR, osm initiates heavy sweep. > > > What (and how) were this tested ? How can these events be generated ? By taking the local port down. > > >>Some minor comments below. Let me know what you think. You don't have to >>resubmit for these. >> >> >>>Yevgeny >>> >>>Signed-off-by: Yevgeny Kliteynik >>> >>>Index: include/opensm/osm_sm_mad_ctrl.h >>>=================================================================== >>>-- include/opensm/osm_sm_mad_ctrl.h (revision 8998) >>>+++ include/opensm/osm_sm_mad_ctrl.h (working copy) >>>@@ -109,6 +109,7 @@ typedef struct _osm_sm_mad_ctrl >>> osm_mad_pool_t *p_mad_pool; >>> osm_vl15_t *p_vl15; >>> osm_vendor_t *p_vendor; >>>+ struct _osm_state_mgr *p_state_mgr; >>> osm_bind_handle_t h_bind; >>> cl_plock_t *p_lock; >>> cl_dispatcher_t *p_disp; >>>@@ -130,6 +131,9 @@ typedef struct _osm_sm_mad_ctrl >>> * p_vendor >>> * Pointer to the vendor specific interfaces object. >>> * >>>+* p_state_mgr >>>+* Pointer to the state manager object. >>>+* >>> * h_bind >>> * Bind handle returned by the transport layer. >>> * >>>@@ -233,6 +237,7 @@ osm_sm_mad_ctrl_init( >>> IN osm_mad_pool_t* const p_mad_pool, >>> IN osm_vl15_t* const p_vl15, >>> IN osm_vendor_t* const p_vendor, >>>+ IN struct _osm_state_mgr* const p_state_mgr, >>> IN osm_log_t* const p_log, >>> IN osm_stats_t* const p_stats, >>> IN cl_plock_t* const p_lock, >>>@@ -251,6 +256,9 @@ osm_sm_mad_ctrl_init( >>> * p_vendor >>> * [in] Pointer to the vendor specific interfaces object. >>> * >>>+* p_state_mgr >>>+* [in] Pointer to the state manager object. >>>+* >>> * p_log >>> * [in] Pointer to the log object. 
>>> * >>>Index: include/vendor/osm_vendor_ibumad.h >>>=================================================================== >>>-- include/vendor/osm_vendor_ibumad.h (revision 8998) >>>+++ include/vendor/osm_vendor_ibumad.h (working copy) >>>@@ -74,6 +74,8 @@ BEGIN_C_DECLS >>> #define OSM_UMAD_MAX_CAS 32 >>> #define OSM_UMAD_MAX_PORTS_PER_CA 2 >>> >>>+#define OSM_VENDOR_SUPPORT_EVENTS >>>+ >> >>I prefer this as an additional flag turned on in the build for OpenIB. Why? OSM_VENDOR_SUPPORT_RMPP is not > > > I take that back. I think that this define should not be present at all > as this is a change to the vendor layer API. If so, then > libosmvendor.ver needs to be bumped. We might want to bump the API version when compiling with ibumad vendor. > > Also, what about any of the other vendor layers supported ? They would > need at least some stub to get past the linking. Not if the above method of using OSM_VENDOR_SUPPORT_EVENTS is used as qualifier for the inclusion of the extra API in osm_vendor_api.h > > >>> /* OpenIB gen2 doesn't support RMPP yet */ >>> >>> /****s* OpenSM: Vendor UMAD/osm_ca_info_t >>>@@ -179,6 +181,10 @@ typedef struct _osm_vendor >>> int umad_port_id; >>> void *receiver; >>> int issmfd; >>>+ cl_thread_t events_thread; >>>+ void * events_callback; >>>+ void * sm_context; >>>+ struct ibv_context * ibv_context; >>> } osm_vendor_t; >>> >>> #define OSM_BIND_INVALID_HANDLE 0 >>>Index: include/vendor/osm_vendor_api.h >>>=================================================================== >>>-- include/vendor/osm_vendor_api.h (revision 8998) >>>+++ include/vendor/osm_vendor_api.h (working copy) >>>@@ -526,6 +526,110 @@ osm_vendor_set_debug( >>> * SEE ALSO >>> *********/ >>> >>>+#ifdef OSM_VENDOR_SUPPORT_EVENTS >>>+ >>>+#define OSM_EVENT_FATAL 1 >>>+#define OSM_EVENT_PORT_ERR 2 >>>+ >>>+/****s* OpenSM Vendor API/osm_vend_events_callback_t >>>+* NAME >>>+* osm_vend_events_callback_t >>>+* >>>+* DESCRIPTION >>>+* Function prototype for the vendor events callback. >>>+* The vendor layer calls this function on driver events. >>>+* >>>+* SYNOPSIS >>>+*/ >>>+typedef void >>>+(*osm_vend_events_callback_t)( >>>+ IN int events_mask, >>>+ IN void * const context ); >>>+/* >>>+* PARAMETERS >>>+* events_mask >>>+* [in] The received event(s). >>>+* >>>+* context >>>+* [in] Context supplied as the "sm_context" argument in >>>+* the osm_vendor_unreg_events_cb call >>>+* >>>+* RETURN VALUES >>>+* None. >>>+* >>>+* NOTES >>>+* >>>+* SEE ALSO >>>+* osm_vendor_reg_events_cb osm_vendor_unreg_events_cb >>>+*********/ >>>+ >>>+/****f* OpenSM Vendor API/osm_vendor_reg_events_cb >>>+* NAME >>>+* osm_vendor_reg_events_cb >>>+* >>>+* DESCRIPTION >>>+* Registers the events callback function and start the events >>>+* thread >>>+* >>>+* SYNOPSIS >>>+*/ >>>+int >>>+osm_vendor_reg_events_cb( >>>+ IN osm_vendor_t * const p_vend, >>>+ IN void * const sm_callback, >>>+ IN void * const sm_context); >>>+/* >>>+* PARAMETERS >>>+* p_vend >>>+* [in] vendor handle. >>>+* >>>+* sm_callback >>>+* [in] Callback function that should be called when >>>+* the event is received. >>>+* >>>+* sm_context >>>+* [in] Context supplied as the "context" argument in >>>+* the subsequenct calls to the sm_callback function >>>+* >>>+* RETURN VALUE >>>+* IB_SUCCESS if OK. 
>>>+* >>>+* NOTES >>>+* >>>+* SEE ALSO >>>+* osm_vend_events_callback_t osm_vendor_unreg_events_cb >>>+*********/ >>>+ >>>+/****f* OpenSM Vendor API/osm_vendor_unreg_events_cb >>>+* NAME >>>+* osm_vendor_unreg_events_cb >>>+* >>>+* DESCRIPTION >>>+* Un-Registers the events callback function and stops the events >>>+* thread >>>+* >>>+* SYNOPSIS >>>+*/ >>>+void >>>+osm_vendor_unreg_events_cb( >>>+ IN osm_vendor_t * const p_vend); >>>+/* >>>+* PARAMETERS >>>+* p_vend >>>+* [in] vendor handle. >>>+* >>>+* >>>+* RETURN VALUE >>>+* None. >>>+* >>>+* NOTES >>>+* >>>+* SEE ALSO >>>+* osm_vend_events_callback_t osm_vendor_reg_events_cb >>>+*********/ >>>+ >>>+#endif /* OSM_VENDOR_SUPPORT_EVENTS */ >>>+ >>> END_C_DECLS >>> >>> #endif /* _OSM_VENDOR_API_H_ */ >>>Index: libvendor/osm_vendor_ibumad.c >>>=================================================================== >>>-- libvendor/osm_vendor_ibumad.c (revision 8998) >>>+++ libvendor/osm_vendor_ibumad.c (working copy) >>>@@ -72,6 +72,7 @@ >>> #include >>> #include >>> #include >>>+#include >>> >>> /****s* OpenSM: Vendor AL/osm_umad_bind_info_t >>> * NAME >>>@@ -441,6 +442,91 @@ Exit: >>> >>> /********************************************************************** >>> **********************************************************************/ >>>+static void >>>+umad_events_thread( >>>+ IN void * vend_context) >>>+{ >>>+ int res = 0; >>>+ osm_vendor_t * p_vend = (osm_vendor_t *) vend_context; >>>+ struct ibv_async_event event; >>>+ >>>+ OSM_LOG_ENTER( p_vend->p_log, umad_events_thread ); >>>+ >>>+ osm_log(p_vend->p_log, OSM_LOG_DEBUG, >>>+ "umad_events_thread: Device %s, async event FD: %d\n", >>>+ p_vend->umad_port.ca_name, p_vend->ibv_context->async_fd); >>>+ osm_log(p_vend->p_log, OSM_LOG_DEBUG, >>>+ "umad_events_thread: Listening for events on device %s, port %d\n", >>>+ p_vend->umad_port.ca_name, p_vend->umad_port.portnum); >>>+ >>>+ while (1) { >>>+ >>>+ res = ibv_get_async_event(p_vend->ibv_context, &event); >>>+ if (res) >>>+ { >>>+ osm_log(p_vend->p_log, OSM_LOG_ERROR, >>>+ "umad_events_thread: ERR 5450: " >>>+ "Failed getting async event (device %s, port %d)\n", >>>+ p_vend->umad_port.ca_name, p_vend->umad_port.portnum); >>>+ goto Exit; >>>+ } >>>+ >>>+ if (!p_vend->events_callback) >>>+ { >>>+ osm_log(p_vend->p_log, OSM_LOG_DEBUG, >>>+ "umad_events_thread: Events callback has been unregistered\n"); >>>+ ibv_ack_async_event(&event); >>>+ goto Exit; >>>+ } >>>+ /* >>>+ * We're listening to events on the SM's port only >>>+ */ >>>+ if ( event.element.port_num == p_vend->umad_port.portnum ) >>>+ { >>>+ switch (event.event_type) >>>+ { >>>+ case IBV_EVENT_DEVICE_FATAL: >>>+ osm_log(p_vend->p_log, OSM_LOG_INFO, >>>+ "umad_events_thread: Received IBV_EVENT_DEVICE_FATAL\n"); >>>+ ((osm_vend_events_callback_t) >>>+ (p_vend->events_callback))(OSM_EVENT_FATAL, p_vend->sm_context); >>>+ >>>+ ibv_ack_async_event(&event); >>>+ goto Exit; >>>+ break; >>>+ >>>+ case IBV_EVENT_PORT_ERR: >>>+ osm_log(p_vend->p_log, OSM_LOG_VERBOSE, >>>+ "umad_events_thread: Received IBV_EVENT_PORT_ERR\n"); >>>+ ((osm_vend_events_callback_t) >>>+ (p_vend->events_callback))(OSM_EVENT_PORT_ERR, p_vend->sm_context); >>>+ break; >>>+ >>>+ default: >>>+ osm_log(p_vend->p_log, OSM_LOG_DEBUG, >>>+ "umad_events_thread: Received event #%d on port %d - Ignoring\n", >>>+ event.event_type, event.element.port_num); >>>+ } >>>+ } >>>+ else >>>+ { >>>+ osm_log(p_vend->p_log, OSM_LOG_DEBUG, >>>+ "umad_events_thread: Received event #%d on port %d - Ignoring\n", >>>+ event.event_type, 
event.element.port_num); >>>+ } >>>+ >>>+ ibv_ack_async_event(&event); >>>+ } >>>+ >>>+ Exit: >>>+ osm_log(p_vend->p_log, OSM_LOG_DEBUG, >>>+ "umad_events_thread: Terminating thread\n"); >>>+ OSM_LOG_EXIT(p_vend->p_log); >>>+ return; >>>+} >>>+ >>>+/********************************************************************** >>>+ **********************************************************************/ >>> ib_api_status_t >>> osm_vendor_init( >>> IN osm_vendor_t* const p_vend, >>>@@ -456,6 +542,7 @@ osm_vendor_init( >>> p_vend->max_retries = OSM_DEFAULT_RETRY_COUNT; >>> cl_spinlock_construct( &p_vend->cb_lock ); >>> cl_spinlock_construct( &p_vend->match_tbl_lock ); >>>+ cl_thread_construct( &p_vend->events_thread ); >>> p_vend->umad_port_id = -1; >>> p_vend->issmfd = -1; >>> >>>@@ -1217,4 +1304,114 @@ osm_vendor_set_debug( >>> umad_debug(level); >>> } >>> >>>+/********************************************************************** >>>+ **********************************************************************/ >>>+int >>>+osm_vendor_reg_events_cb( >>>+ IN osm_vendor_t * const p_vend, >>>+ IN void * const sm_callback, >>>+ IN void * const sm_context) >>>+{ >>>+ ib_api_status_t status = IB_SUCCESS; >>>+ struct ibv_device ** dev_list; >>>+ struct ibv_device * device; >>>+ >>>+ OSM_LOG_ENTER( p_vend->p_log, osm_vendor_reg_events_cb ); >>>+ >>>+ p_vend->events_callback = sm_callback; >>>+ p_vend->sm_context = sm_context; >>>+ >>>+ dev_list = ibv_get_device_list(NULL); >>>+ if (!dev_list || !(*dev_list)) { >>>+ osm_log(p_vend->p_log, OSM_LOG_ERROR, >>>+ "osm_vendor_reg_events_cb: ERR 5440: " >>>+ "No IB devices found\n"); >>>+ status = IB_ERROR; >>>+ goto Exit; >>>+ } >>>+ >>>+ if (!p_vend->umad_port.ca_name || !p_vend->umad_port.ca_name[0]) >>>+ { >>>+ osm_log(p_vend->p_log, OSM_LOG_ERROR, >>>+ "osm_vendor_reg_events_cb: ERR 5441: " >>>+ "Vendor initialization is not completed yet\n"); >>>+ status = IB_ERROR; >>>+ goto Exit; >>>+ } >>>+ >>>+ osm_log(p_vend->p_log, OSM_LOG_DEBUG, >>>+ "osm_vendor_reg_events_cb: Registering on device %s\n", >>>+ p_vend->umad_port.ca_name); >>>+ >>>+ /* >>>+ * find device whos name matches the SM's device >>>+ */ >>>+ for ( device = *dev_list; >>>+ (device != NULL) && >>>+ (strcmp(p_vend->umad_port.ca_name, ibv_get_device_name(device)) != 0); >>>+ device += sizeof(struct ibv_device *) ) >>>+ ; >>>+ if (!device) >>>+ { >>>+ osm_log(p_vend->p_log, OSM_LOG_ERROR, >>>+ "osm_vendor_reg_events_cb: ERR 5442: " >>>+ "Device %s hasn't been found in the device list\n" >>>+ ,p_vend->umad_port.ca_name); >>>+ status = IB_ERROR; >>>+ goto Exit; >>>+ } >>>+ >>>+ p_vend->ibv_context = ibv_open_device(device); >>>+ if (!p_vend->ibv_context) { >>>+ osm_log(p_vend->p_log, OSM_LOG_ERROR, >>>+ "osm_vendor_reg_events_cb: ERR 5443: " >>>+ "Couldn't get context for %s\n", >>>+ p_vend->umad_port.ca_name); >>>+ status = IB_ERROR; >>>+ goto Exit; >>>+ } >>>+ >>>+ /* >>>+ * Initiate the events thread >>>+ */ >>>+ if (cl_thread_init(&p_vend->events_thread, >>>+ umad_events_thread, >>>+ p_vend, >>>+ "ibumad events thread") != CL_SUCCESS) { >>>+ osm_log(p_vend->p_log, OSM_LOG_ERROR, >>>+ "osm_vendor_reg_events_cb: ERR 5444: " >>>+ "Failed initiating event listening thread\n"); >>>+ status = IB_ERROR; >>>+ goto Exit; >>>+ } >>>+ >>>+ Exit: >>>+ if (status != IB_SUCCESS) >>>+ { >>>+ p_vend->events_callback = NULL; >>>+ p_vend->sm_context = NULL; >>>+ p_vend->ibv_context = NULL; >>>+ p_vend->events_callback = NULL; >>>+ } >>>+ OSM_LOG_EXIT( p_vend->p_log ); >>>+ return status; >>>+} >>>+ 
>>>+/********************************************************************** >>>+ **********************************************************************/ >>>+void >>>+osm_vendor_unreg_events_cb( >>>+ IN osm_vendor_t * const p_vend) >>>+{ >>>+ OSM_LOG_ENTER( p_vend->p_log, osm_vendor_unreg_events_cb ); >>>+ p_vend->events_callback = NULL; >>>+ p_vend->sm_context = NULL; >>>+ p_vend->ibv_context = NULL; >>>+ p_vend->events_callback = NULL; >>>+ OSM_LOG_EXIT( p_vend->p_log ); >>>+} >>>+ >>>+/********************************************************************** >>>+ **********************************************************************/ >>>+ >>> #endif /* OSM_VENDOR_INTF_OPENIB */ >>>Index: libvendor/libosmvendor.map >>>=================================================================== >>>-- libvendor/libosmvendor.map (revision 8998) >>>+++ libvendor/libosmvendor.map (working copy) >>>@@ -1,4 +1,4 @@ >>>-OSMVENDOR_2.0 { >>>+OSMVENDOR_2.1 { >>> global: >>> umad_receiver; >>> osm_vendor_init; >>>@@ -23,5 +23,7 @@ OSMVENDOR_2.0 { >>> osmv_bind_sa; >>> osmv_query_sa; >>> osm_vendor_get_guid_ca_and_port; >>>+ osm_vendor_reg_events_cb; >>>+ osm_vendor_unreg_events_cb; > > > These are not present for all vendor layers. Yes but when tested did not pose any error from the linker. > > >>> local: *; >>> }; >>>Index: opensm/osm_sm.c >>>=================================================================== >>>-- opensm/osm_sm.c (revision 8998) >>>+++ opensm/osm_sm.c (working copy) >>>@@ -313,6 +313,7 @@ osm_sm_init( >>> p_sm->p_mad_pool, >>> p_sm->p_vl15, >>> p_sm->p_vendor, >>>+ &p_sm->state_mgr, >>> p_log, p_stats, p_lock, p_disp ); >>> if( status != IB_SUCCESS ) >>> goto Exit; >>>Index: opensm/osm_sm_mad_ctrl.c >>>=================================================================== >>>-- opensm/osm_sm_mad_ctrl.c (revision 8998) >>>+++ opensm/osm_sm_mad_ctrl.c (working copy) >>>@@ -59,6 +59,7 @@ >>> #include >>> #include >>> #include >>>+#include >>> >>> /****f* opensm: SM/__osm_sm_mad_ctrl_retire_trans_mad >>> * NAME >>>@@ -953,6 +954,7 @@ osm_sm_mad_ctrl_init( >>> IN osm_mad_pool_t* const p_mad_pool, >>> IN osm_vl15_t* const p_vl15, >>> IN osm_vendor_t* const p_vendor, >>>+ IN struct _osm_state_mgr* const p_state_mgr, >>> IN osm_log_t* const p_log, >>> IN osm_stats_t* const p_stats, >>> IN cl_plock_t* const p_lock, >>>@@ -969,6 +971,7 @@ osm_sm_mad_ctrl_init( >>> p_ctrl->p_disp = p_disp; >>> p_ctrl->p_mad_pool = p_mad_pool; >>> p_ctrl->p_vendor = p_vendor; >>>+ p_ctrl->p_state_mgr = p_state_mgr; >>> p_ctrl->p_stats = p_stats; >>> p_ctrl->p_lock = p_lock; >>> p_ctrl->p_vl15 = p_vl15; >>>@@ -995,6 +998,47 @@ osm_sm_mad_ctrl_init( >>> >>> /********************************************************************** >>> **********************************************************************/ >>>+void >>>+__osm_vend_events_callback( >>>+ IN int events_mask, >>>+ IN void * const context ) >> >>Shouldn't this be conditionalized on OSM_VENDOR_SUPPORT_EVENTS ? Yes it should. 
> > >>>+{ >>>+ osm_sm_mad_ctrl_t * const p_ctrl = (osm_sm_mad_ctrl_t * const) context; >>>+ >>>+ OSM_LOG_ENTER(p_ctrl->p_log, __osm_vend_events_callback); >>>+ >>>+ if (events_mask & OSM_EVENT_FATAL) >>>+ { >>>+ osm_log(p_ctrl->p_log, OSM_LOG_INFO, >>>+ "__osm_vend_events_callback: " >>>+ "Events callback got OSM_EVENT_FATAL\n"); >>>+ osm_log(p_ctrl->p_log, OSM_LOG_SYS, >>>+ "Fatal HCA error - forcing OpenSM exit\n"); >>>+ osm_exit_flag = 1; >>>+ OSM_LOG_EXIT(p_ctrl->p_log); >>>+ return; >>>+ } >>>+ >>>+ if (events_mask & OSM_EVENT_PORT_ERR) >>>+ { >>>+ osm_log(p_ctrl->p_log, OSM_LOG_INFO, >>>+ "__osm_vend_events_callback: " >>>+ "Events callback got OSM_EVENT_PORT_ERR - forcing heavy sweep\n"); >>>+ p_ctrl->p_subn->force_immediate_heavy_sweep = TRUE; >>>+ osm_state_mgr_process((osm_state_mgr_t * const)p_ctrl->p_state_mgr, >>>+ OSM_SIGNAL_SWEEP); >>>+ OSM_LOG_EXIT(p_ctrl->p_log); >>>+ return; >>>+ } >>>+ >>>+ osm_log(p_ctrl->p_log, OSM_LOG_INFO, >>>+ "__osm_vend_events_callback: " >>>+ "Events callback got event mask of %d - No action taken\n"); >>>+ OSM_LOG_EXIT(p_ctrl->p_log); >>>+} >>>+ >>>+/********************************************************************** >>>+ **********************************************************************/ >>> ib_api_status_t >>> osm_sm_mad_ctrl_bind( >>> IN osm_sm_mad_ctrl_t* const p_ctrl, >>>@@ -1044,6 +1088,17 @@ osm_sm_mad_ctrl_bind( >>> goto Exit; >>> } >>> >>>+ if ( osm_vendor_reg_events_cb(p_ctrl->p_vendor, >>>+ __osm_vend_events_callback, >>>+ p_ctrl) ) >>>+ { > > > Is an osm_vendor_unbind needed here or is this handled elsewhere ? No - I do not think so. The unbind is called on osm_bvendor_destroy which will follow in the exit flow. > > >>>+ status = IB_ERROR; >>>+ osm_log( p_ctrl->p_log, OSM_LOG_ERROR, >>>+ "osm_sm_mad_ctrl_bind: ERR 3120: " >>>+ "Vendor failed to register for events\n" ); >>>+ goto Exit; >>>+ } >>>+ > > > Also, should osm_m_mad_ctrl_unbind unregister the events callback ? All it does is to NULL the callback. That would be enough. > > -- Hal > > >>This should be conditionalized on OSM_VENDOR_SUPPORT_EVENTS. True >> >> >>> Exit: >>> OSM_LOG_EXIT( p_ctrl->p_log ); >>> return( status ); >>>Index: config/osmvsel.m4 >>>=================================================================== >>>-- config/osmvsel.m4 (revision 8998) >>>+++ config/osmvsel.m4 (working copy) >>>@@ -63,9 +63,9 @@ if test $with_osmv = "openib"; then >>> OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" >>> OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" >>> if test "x$with_umad_libs" = "x"; then >>>- OSMV_LDADD="-libumad" >>>+ OSMV_LDADD="-libumad -libverbs" >>> else >>>- OSMV_LDADD="-L$with_umad_libs -libumad" >>>+ OSMV_LDADD="-L$with_umad_libs -libumad -libverbs" >>> fi >>> >>> if test "x$with_umad_includes" != "x"; then >>>@@ -137,6 +137,8 @@ if test "$disable_libcheck" != "yes"; th >>> LDFLAGS="$LDFLAGS $OSMV_LDADD" >>> AC_CHECK_LIB(ibumad, umad_init, [], >>> AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibumad.])) >>>+ AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], >>>+ AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibverbs.])) >> >>Cut and paste error: Error message should indicate ibv_get_device_list >>rather than umad_init. True. 
>> >> >>> LD_FLAGS=$osmv_save_ldflags >>> elif test $with_osmv = "sim" ; then >>> LDFLAGS="$LDFLAGS -L$with_sim/lib" >> >>-- Hal >> >> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> From zhushisongzhu at yahoo.com Wed Aug 23 01:38:34 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Wed, 23 Aug 2006 01:38:34 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060822110903.GF13782@mellanox.co.il> Message-ID: <20060823083834.297.qmail@web36911.mail.mud.yahoo.com> I haven't met kernel crashes using rc2. But there always occurred connection refusal when max concurrent connections set above 200. All is right when max concurrent connections is set to below 200. ( If using TCP to take the same test, there is no problem.) (1) This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0 Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/ Benchmarking www.google.com [through 193.12.10.14:3129] (be patient) Completed 100 requests Completed 200 requests apr_recv: Connection refused (111) Total of 257 requests completed (2) This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0 Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/ Benchmarking www.google.com [through 193.12.10.14:3129] (be patient) Completed 100 requests Completed 200 requests apr_recv: Connection refused (111) Total of 256 requests completed [root at IB-TEST squid.test]# zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > --- "Michael S. Tsirkin" > wrote: > > > > > Quoting r. zhu shi song > : > > > > (3) one time linux kernel on the client > crashed. I > > > > copy the output from the screen. > > > > Process sdp (pid:4059, threadinfo > 0000010036384000 > > > > task 000001003ea10030) > > > > Call > > > > > > > > Trace:{:ib_sdp:sdp_destroy_workto} > > > > {:ib_sdp:sdp_destroy_qp+77} > > > > > > > > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > > > > > > > > {worker_thread+419}{default_wake_function+0} > > > > > > > > > > {default_wake_function+0}{keventd_create_kthread+0} > > > > > > > > > > {worker_thread+0}{keventd_create_kthread+0} > > > > > > > > > > {kthread+200}{child_rip+8} > > > > > > > > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 31 > ff > > > 45 > > > > 31 ed 4c 89 > > > > > > > > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > > > CR2:0000000000000004 > > > > <0>kernel panic-not syncing:Oops > > > > > > > > zhu > > > > > > Hmm, the stack dump does not match my sources. > Is > > > this OFED rc1? > > > Could you send me the sdp_main.o and sdp_main.c > > > files from your system please? > > --- > > > Subject: Re: why sdp connections cost so much > memory > > > > please see the attachment. > > zhu > > Ugh, so its crashing inside sdp_bcopy ... > > By the way, could you please re-test with OFED rc2? > We've solved a couple of bugs there ... > > If this still crashes, could you please post the > whole > sdp directory, with .o and .c files? > > Thanks, > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com From bugzilla-daemon at openib.org Wed Aug 23 01:55:28 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 23 Aug 2006 01:55:28 -0700 (PDT) Subject: [openib-general] [Bug 203] Crash on shutdown, timer callback, build 459 Message-ID: <20060823085528.C678D2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=203 jbottorff at xsigo.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |jbottorff at xsigo.com ------- Comment #1 from jbottorff at xsigo.com 2006-08-23 01:55 ------- I've trapped a write of 0x1 to the dpc context field of a mad data structure. The stack looks like this just after the write: f797ab00 ba9de265 ibbus!ib_cancel_mad+0x6c0 [k:\windows-openib\src\winib-459\core\al\al_mad.c @ 1831] f797ab14 ba984d68 ibbus!al_cancel_sa_req+0x25 [k:\windows-openib\src\winib-459\core\al\al_query.h @ 140] f797ab28 ba82ec4c ibbus!ib_cancel_query+0x328 [k:\windows-openib\src\winib-459\core\al\al.c @ 429] f797ac00 ba7fe269 ipoib!ipoib_port_down+0x13c [k:\windows-openib\src\winib-459\ulp\ipoib\kernel\ipoib_port.c @ 5066] f797ac74 ba991da1 ipoib!__ipoib_pnp_cb+0xe89 [k:\windows-openib\src\winib-459\ulp\ipoib\kernel\ipoib_adapter.c @ 690] f797acdc ba994f92 ibbus!__pnp_notify_user+0x561 [k:\windows-openib\src\winib-459\core\al\kernel\al_pnp.c @ 523] f797ad04 ba994cb1 ibbus!__pnp_process_port_forward+0x172 [k:\windows-openib\src\winib-459\core\al\kernel\al_pnp.c @ 1230] f797ad48 ba99479a ibbus!__pnp_check_ports+0x411 [k:\windows-openib\src\winib-459\core\al\kernel\al_pnp.c @ 1433] f797ad70 ba950884 ibbus!__pnp_check_events+0x19a [k:\windows-openib\src\winib-459\core\al\kernel\al_pnp.c @ 1510] f797ad8c ba956b54 ibbus!__cl_async_proc_worker+0x94 [k:\windows-openib\src\winib-459\core\complib\cl_async_proc.c @ 153] f797ada0 ba958c0c ibbus!__cl_thread_pool_routine+0x54 [k:\windows-openib\src\winib-459\core\complib\cl_threadpool.c @ 67] f797adac 80a07678 ibbus!__thread_callback+0x2c [k:\windows-openib\src\winib-459\core\complib\kernel\cl_thread.c @ 49] f797addc 80781346 nt!PspSystemThreadStartup+0x2e 00000000 00000000 nt!KiThreadStartup+0x16 This seems to be canceling an outstanding mad query when the port goes down. An event that would happen at shutdown, and at irregular other times. The code that causes the dpc corruption is core\al\al_mad.c about line 1826: if( !p_list_item ) { cl_spinlock_release( &h_mad_svc->obj.lock ); AL_PRINT( TRACE_LEVEL_INFORMATION, AL_DBG_MAD_SVC, ("mad not found\n") ); return IB_NOT_FOUND; } /* Mark the MAD as having been canceled. */ h_send = PARENT_STRUCT( p_list_item, al_mad_send_t, pool_item ); h_send->canceled = TRUE; The local pointer h_send seems to not be pointing at the right thing, and the assignment of TRUE to the cancel field is actually corrupting the dpc context field. A structure dump of p_list_item says: 1: kd> dt p_list_item Local var @ 0xf797aafc Type _cl_list_item* 0x88e76f10 +0x000 p_next : 0x88e76f10 _cl_list_item +0x004 p_prev : 0x88e76f10 _cl_list_item +0x008 p_list : 0x88e76f10 _cl_qlist The address of this 0x88e76f10 is the same address as the send_list field in the local h_mad_svc, and believe it represents an empty list header. This suggests the test for null is an incorrect test for the list being empty. There is also another case that looks like an incorrect list test in the same source file. 
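For readers unfamiliar with the complib quick list, here is a small illustrative sketch of the access pattern involved (this is not code from al_mad.c; matches_request() is a hypothetical stand-in for whatever comparison ib_cancel_mad actually performs). It shows why a plain NULL test on the search result never fires; the corrected end-of-list test appears in comment #2 further down.

#include <complib/cl_qlist.h>

extern int matches_request(cl_list_item_t *p_item);   /* hypothetical predicate */

static cl_list_item_t *find_send(cl_qlist_t *p_send_list)
{
        cl_list_item_t *p_item;

        for (p_item = cl_qlist_head(p_send_list);
             p_item != cl_qlist_end(p_send_list);
             p_item = cl_qlist_next(p_item)) {
                if (matches_request(p_item))
                        return p_item;
        }

        /*
         * Not found: p_item now equals cl_qlist_end(p_send_list), i.e. the
         * address of the list head embedded in the owning object, not NULL.
         * A test of the form "if (!p_item)" therefore never triggers, and
         * applying PARENT_STRUCT() to this sentinel yields a bogus
         * al_mad_send_t whose "canceled" field overlays unrelated memory,
         * consistent with the DPC context corruption described above.
         */
        return p_item;
}

The change proposed in comment #2 below, comparing the result against cl_qlist_end() instead of NULL, closes exactly this gap.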
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mst at mellanox.co.il Wed Aug 23 02:06:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 23 Aug 2006 12:06:53 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060823083834.297.qmail@web36911.mail.mud.yahoo.com> References: <20060823083834.297.qmail@web36911.mail.mud.yahoo.com> Message-ID: <20060823090653.GA5877@mellanox.co.il> Yes, I have reproduced the connection refusal problem and I am looking into it. Thanks! MST Quoting r. zhu shi song : Subject: Re: why sdp connections cost so much memory I haven't met kernel crashes using rc2. But there always occurred connection refusal when max concurrent connections set above 200. All is right when max concurrent connections is set to below 200. ( If using TCP to take the same test, there is no problem.) (1) This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0 Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/ Benchmarking www.google.com [through 193.12.10.14:3129] (be patient) Completed 100 requests Completed 200 requests apr_recv: Connection refused (111) Total of 257 requests completed (2) This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0 Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/ Benchmarking www.google.com [through 193.12.10.14:3129] (be patient) Completed 100 requests Completed 200 requests apr_recv: Connection refused (111) Total of 256 requests completed [root at IB-TEST squid.test]# zhu --- "Michael S. Tsirkin" wrote: > Quoting r. zhu shi song : > > --- "Michael S. Tsirkin" > wrote: > > > > > Quoting r. zhu shi song > : > > > > (3) one time linux kernel on the client > crashed. I > > > > copy the output from the screen. > > > > Process sdp (pid:4059, threadinfo > 0000010036384000 > > > > task 000001003ea10030) > > > > Call > > > > > > > > Trace:{:ib_sdp:sdp_destroy_workto} > > > > {:ib_sdp:sdp_destroy_qp+77} > > > > > > > > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > > > > > > > > {worker_thread+419}{default_wake_function+0} > > > > > > > > > > {default_wake_function+0}{keventd_create_kthread+0} > > > > > > > > > > {worker_thread+0}{keventd_create_kthread+0} > > > > > > > > > > {kthread+200}{child_rip+8} > > > > > > > > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 31 > ff > > > 45 > > > > 31 ed 4c 89 > > > > > > > > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > > > CR2:0000000000000004 > > > > <0>kernel panic-not syncing:Oops > > > > > > > > zhu > > > > > > Hmm, the stack dump does not match my sources. > Is > > > this OFED rc1? > > > Could you send me the sdp_main.o and sdp_main.c > > > files from your system please? > > --- > > > Subject: Re: why sdp connections cost so much > memory > > > > please see the attachment. > > zhu > > Ugh, so its crashing inside sdp_bcopy ... > > By the way, could you please re-test with OFED rc2? > We've solved a couple of bugs there ... > > If this still crashes, could you please post the > whole > sdp directory, with .o and .c files? > > Thanks, > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com -- MST From bugzilla-daemon at openib.org Wed Aug 23 03:06:59 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 23 Aug 2006 03:06:59 -0700 (PDT) Subject: [openib-general] [Bug 203] Memory corruption in mad processing Message-ID: <20060823100659.3E3032283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=203 jbottorff at xsigo.com changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|major |critical Summary|Crash on shutdown, timer |Memory corruption in mad |callback, build 459 |processing ------- Comment #2 from jbottorff at xsigo.com 2006-08-23 03:06 ------- A fix seems to be replacing the null test of p_list_item in al_mad.c (two places) with something like below. A text pattern scan of the sources for any other occurrences might be appropriate too. if( p_list_item == cl_qlist_end( &h_mad_svc->send_list ) ) // bad if( !p_list_item ) I'll report back in a day or two on how this affects shutdown stability. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eli at mellanox.co.il Wed Aug 23 04:31:31 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 23 Aug 2006 14:31:31 +0300 Subject: [openib-general] [PATCH] huge pages support In-Reply-To: References: Message-ID: <1156332691.11332.11.camel@localhost> On Fri, 2006-08-18 at 15:23 +0200, Robert Rex wrote: > Hello, > > I've also worked on the same topic. Here is what I've done so far as I > successfully tested it on mthca and ehca. I'd appreciate for comments and > suggestions. > > + down_read(&current->mm->mmap_sem); > + if (is_vm_hugetlb_page(find_vma(current->mm, (unsigned long) addr))) { > + use_hugepages = 1; > + region_page_mask = HPAGE_MASK; > + region_page_size = HPAGE_SIZE; This might cause a kernel oops if the address passed by the user does not belong to the process's address space. In that case find_vma() might return NULL and is_vm_hugetlb_page() will crash. And even if find_vma() returns a non-NULL value, that still does not guarantee that the vma returned is the one that contains that address. You need to check that the address is greater than or equal to vma->vm_start. From mst at mellanox.co.il Wed Aug 23 05:11:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 23 Aug 2006 15:11:01 +0300 Subject: [openib-general] question: ib_umem page_size In-Reply-To: References: Message-ID: <20060823121101.GA6300@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: question: ib_umem page_size > > Michael> Roland, could you please clarify what does the page_size > Michael> field in struct ib_mem do? > > It gives the page size for the user memory described by the struct. > The idea was that if/when someone tries to optimize for huge pages, > then the low-level driver can know that a region is using huge pages > without having to walk through the page list and search for the > minimum physically contiguous size. Hmm, mthca_reg_user_mr seems to do: len = sg_dma_len(&chunk->page_list[j]) >> shift which means that dma_len must be a multiple of page size. Is this intentional? -- MST From johnt1johnt2 at gmail.com Wed Aug 23 05:34:29 2006 From: johnt1johnt2 at gmail.com (john t) Date: Wed, 23 Aug 2006 18:04:29 +0530 Subject: [openib-general] CONFIG_INFINIBAND_ADDR_TRANS Message-ID: Hi, What does the config option "CONFIG_INFINIBAND_ADDR_TRANS=y" indicate? The code does not seem to use this. 
Regards, John T. -------------- next part -------------- An HTML attachment was scrubbed... URL: From or.gerlitz at gmail.com Wed Aug 23 06:07:00 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 23 Aug 2006 16:07:00 +0300 Subject: [openib-general] [PATCH] ib_cm: randomize starting local comm IDs In-Reply-To: <000001c6c610$988e0b20$c7cc180a@amr.corp.intel.com> References: <000001c6c610$988e0b20$c7cc180a@amr.corp.intel.com> Message-ID: <15ddcffd0608230607v73c30652re15254ef5dcb9097@mail.gmail.com> On 8/22/06, Sean Hefty wrote: > Randomize the starting local comm ID to avoid getting a rejected connection > due to a stale connection after a system reboot or reloading of the ib_cm. Hi Sean, I have tested the patch against an iser target based on our Gen1 CM - it works as expected. So the CM at the target side rejects the first REQ after the client reboot with STALE reason (and deliveres a disconnect event to the ULP). The second REQ is processed fine and a new connection is established. Without the patch, since the REQ had as this of an existing connection, it was just silently dropped and a target reboot was a must to let the initiator reconnect ! Or. From mst at mellanox.co.il Wed Aug 23 06:04:20 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 23 Aug 2006 16:04:20 +0300 Subject: [openib-general] CONFIG_INFINIBAND_ADDR_TRANS In-Reply-To: References: Message-ID: <20060823130420.GB6300@mellanox.co.il> Quoting r. john t : > Subject: CONFIG_INFINIBAND_ADDR_TRANS > > Hi, > > What does the config option "CONFIG_INFINIBAND_ADDR_TRANS=y" indicate? The code does not seem to use this. > > Regards, > John T. Builds CMA modules. -- MST From or.gerlitz at gmail.com Wed Aug 23 06:11:38 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 23 Aug 2006 16:11:38 +0300 Subject: [openib-general] CONFIG_INFINIBAND_ADDR_TRANS In-Reply-To: References: Message-ID: <15ddcffd0608230611q4c25b26bhcfec8d7fc881af0d@mail.gmail.com> On 8/23/06, john t wrote: > What does the config option > "CONFIG_INFINIBAND_ADDR_TRANS=y" indicate? The code does > not seem to use this. No, the code does use it to build and install the rdma_cm (CMA) and the ib_addr modules, see linux-2.6.18-rc4/drivers/infiniband/core/Makefile infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o However, i find the config name quite confusing... Or. From or.gerlitz at gmail.com Wed Aug 23 06:14:18 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 23 Aug 2006 16:14:18 +0300 Subject: [openib-general] IB CM and the case of the lost RTU: was a bunch of other topics... In-Reply-To: <44EB8CEE.9010708@ichips.intel.com> References: <000101c6b821$6f783be0$8698070a@amr.corp.intel.com> <44D71B6B.3000007@voltaire.com> <44D776AD.3080606@ichips.intel.com> <44D8513C.8000801@voltaire.com> <44EB8CEE.9010708@ichips.intel.com> Message-ID: <15ddcffd0608230614x2dc572bg68de28f40a74299c@mail.gmail.com> On 8/23/06, Sean Hefty wrote: > To recap (since it's been a couple of weeks), we have two general solutions for > how to support the passive/server/target side of a connection: > > 1. One method requires that the passive side queue send WRs until they get a > connection establish event. > > 2. An alternative allows sending immediately after receiving a response, but may > require the user to manually transition the connection to established. Failure > to do so will cause the connection to tear down if the RTU is never received > (even after retries). 
The Voltaire iSER target implementation follows a variant of the first method, namely it defers RX processing till getting a connection established event. This is ensured in the current product by the Gen1 Voltaire CM riding on the the IB comm_established async event and synthesizing an RTU and would be the same in the iser target we code over the Gen2 stack (CMA) if the first method is implemented. As typically there is some ULP level handshake when a connection starts, there would be very little to queue (eg in iSER its the login-request). Or. From or.gerlitz at gmail.com Wed Aug 23 06:15:47 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 23 Aug 2006 16:15:47 +0300 Subject: [openib-general] InfiniBand merge plans for 2.6.19 In-Reply-To: <000101c6c600$efad4ed0$b6cc180a@amr.corp.intel.com> References: <44EAECE8.4070802@voltaire.com> <000101c6c600$efad4ed0$b6cc180a@amr.corp.intel.com> Message-ID: <15ddcffd0608230615u56d20736u740ae72d2d7d862@mail.gmail.com> On 8/22/06, Sean Hefty wrote: > >What about pushing the char device to support user space CMA, i recall > >that you have mentioned the API was not mature enough when the 2.6.18 > >feature merge window was open. > > I will look at doing this. I need to verify what functionality (RC, UD, > multicast) of the kernel RDMA CM we want merged upstream for 2.6.19 and create a > patch for exposing that to userspace. OK. Now, if this (RC, UD, MCAST) turns to be too much for your schedule before 2.6.19 opens up, does it make sense for you to push a char device which supports only the CMA RC functionality and the UD and multicast in the future? Or. From eitan at mellanox.co.il Wed Aug 23 06:34:32 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 23 Aug 2006 16:34:32 +0300 Subject: [openib-general] [PATCH] osm: fix memory leak in vendor ibumad Message-ID: <86veoj60ev.fsf@mtl066.yok.mtl.com> Hi Hal These are two trivial fixes for memory leaks in the ibumad vendor. Thanks Eitan Signed-off-by: Eitan Zahavi Index: libvendor/osm_vendor_ibumad_sa.c =================================================================== --- libvendor/osm_vendor_ibumad_sa.c (revision 9087) +++ libvendor/osm_vendor_ibumad_sa.c (working copy) @@ -180,6 +180,11 @@ __osmv_sa_mad_rcv_cb( /* free the copied query request if found */ if (p_query_req_copy) free(p_query_req_copy); + + /* put back the request madw */ + if (p_req_madw) + osm_mad_pool_put(p_bind->p_mad_pool, p_req_madw); + OSM_LOG_EXIT( p_bind->p_log ); } Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 9087) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -617,6 +617,7 @@ osm_vendor_get_all_port_attr( *p_lid = ca.ports[j]->base_lid; *p_linkstates = ca.ports[j]->state; *p_portnum = ca.ports[j]->portnum; + free(ca.ports[j]); } p_lid++; p_linkstates++; From mst at mellanox.co.il Wed Aug 23 06:38:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 23 Aug 2006 16:38:49 +0300 Subject: [openib-general] [PATCH] ib_cm: randomize starting local comm IDs In-Reply-To: <15ddcffd0608230607v73c30652re15254ef5dcb9097@mail.gmail.com> References: <15ddcffd0608230607v73c30652re15254ef5dcb9097@mail.gmail.com> Message-ID: <20060823133849.GC6300@mellanox.co.il> Quoting r. 
Or Gerlitz : > Subject: Re: [PATCH] ib_cm: randomize starting local comm IDs > > On 8/22/06, Sean Hefty wrote: > > Randomize the starting local comm ID to avoid getting a rejected connection > > due to a stale connection after a system reboot or reloading of the ib_cm. > > Hi Sean, > > I have tested the patch against an iser target based on our Gen1 CM - > it works as expected. > > So the CM at the target side rejects the first REQ after the client > reboot with STALE reason (and deliveres a disconnect event to the > ULP). The second REQ is processed fine and a new connection is > established. > > Without the patch, since the REQ had as this of an > existing connection, it was just silently dropped and a target reboot > was a must to let the initiator reconnect ! > > Or. Hmm. Might this still be a concern for users such as SDP which don't retry connections? -- MST From ogerlitz at voltaire.com Wed Aug 23 06:45:56 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 23 Aug 2006 16:45:56 +0300 Subject: [openib-general] [PATCH] ib_cm: randomize starting local comm IDs In-Reply-To: <000001c6c610$988e0b20$c7cc180a@amr.corp.intel.com> References: <000001c6c610$988e0b20$c7cc180a@amr.corp.intel.com> Message-ID: <44EC5C14.5090309@voltaire.com> Sean Hefty wrote: > Randomize the starting local comm ID to avoid getting a rejected connection > due to a stale connection after a system reboot or reloading of the ib_cm. Hi Sean, As i wrote you patching the CM it works fine and the problem i could reproduce with our iser target is solved. However, we wonder what is your opinion (and if positive, work estimation...) to make the CM get "GID OUT" traps, which are generated by the SM when a node IB restarts (eg a node reboot). Once the CM gets the trap, it can scan the internal data structures and emulate a disconnect for all relevant (*) connections ??? (*) there are some technical issues here: the GID OUT is on a PORT GID and the CM uses NODE GUIDS, also does openib stack has the means for a client to register on getting **traps** ??? Other then the CM there might be more potential consumers to this trap, and also to "GID IN". Or. From or.gerlitz at gmail.com Wed Aug 23 07:02:55 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 23 Aug 2006 17:02:55 +0300 Subject: [openib-general] [PATCH] ib_cm: randomize starting local comm IDs In-Reply-To: <20060823133849.GC6300@mellanox.co.il> References: <15ddcffd0608230607v73c30652re15254ef5dcb9097@mail.gmail.com> <20060823133849.GC6300@mellanox.co.il> Message-ID: <15ddcffd0608230702h295d81b9s277969bc00d2ca51@mail.gmail.com> On 8/23/06, Michael S. Tsirkin wrote: > Quoting r. Or Gerlitz : > > So the CM at the target side rejects the first REQ after the client > > reboot with STALE reason (and deliveres a disconnect event to the > > ULP). The second REQ is processed fine and a new connection is > > established. > Hmm. Might this still be a concern for users such as SDP > which don't retry connections? I don't know if "this" in your email referes to the quote above but what i discribe there is what stated in the IB spec ch. 12 re stale connections. So you need to either rely on the SDP consumer to reconnect or when getting a STALE reject attempt to reconnect from SDP. Or. From mst at mellanox.co.il Wed Aug 23 07:26:20 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 23 Aug 2006 17:26:20 +0300 Subject: [openib-general] [PATCH] ib_cm: randomize starting local comm IDs In-Reply-To: <15ddcffd0608230702h295d81b9s277969bc00d2ca51@mail.gmail.com> References: <15ddcffd0608230702h295d81b9s277969bc00d2ca51@mail.gmail.com> Message-ID: <20060823142620.GA8192@mellanox.co.il> Quoting r. Or Gerlitz : > Subject: Re: [PATCH] ib_cm: randomize starting local comm IDs > > On 8/23/06, Michael S. Tsirkin wrote: > > Quoting r. Or Gerlitz : > > > > So the CM at the target side rejects the first REQ after the client > > > reboot with STALE reason (and deliveres a disconnect event to the > > > ULP). The second REQ is processed fine and a new connection is > > > established. > > > Hmm. Might this still be a concern for users such as SDP > > which don't retry connections? > > I don't know if "this" in your email referes to the quote above I am speaking more or less about this quote from your message: > > Without the patch, since the REQ had as this of an > > existing connection, it was just silently dropped and a target reboot > > was a must to let the initiator reconnect ! the spec says: > > If a CM receives a REQ/REP as described above, if the REQ/REP has the > > same Local Communication ID and Remote Communication ID as are present > > in the existing connection and if the REQ/REP arrives within the window > > of time during which the active/passive side could be legally > > retransmitting REQ/REP, the CM should treat the REQ/REP as a retry and > > not initiate stale connection processing as described above. so I am wandering why is it not sufficient to wait for the window of time as described above to expire? Is something broken in CM that this patch is papering over? > but what > i discribe there is what stated in the IB spec ch. 12 re stale connections. I know. I am just wandering aloud whether this is relevant for SDP. Why won't the window expire as described above? > So you need to either rely on the SDP consumer to reconnect or when > getting a STALE reject attempt to reconnect from SDP. I'm not sure SDP needs to do anything - the port is busy, after all. Retrying seems to be against the SDP spec. -- MST From mst at mellanox.co.il Wed Aug 23 08:01:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 23 Aug 2006 18:01:32 +0300 Subject: [openib-general] Rollup patch for ipath and OFED In-Reply-To: <1156175117.18663.32.camel@chalcedony.pathscale.com> References: <1156175117.18663.32.camel@chalcedony.pathscale.com> Message-ID: <20060823150132.GB8192@mellanox.co.il> Guys, I just looked at ipath-fixes.patch in ofed. With 36 files changed, 4623 insertions, 4774 deletions, it's quite a biggie with no description what it does whatsoever. Can't this be split to smaller chunks doing one thing at a time, please? You'll have to do this for upstream inclusion anyway, so why not for OFED? Oh well. 
However, this baby also does for example: diff -r d2661c9eff49 -r 198ed6310295 drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Thu Aug 10 16:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Wed Aug 16 11:01:29 2006 -0700 @@ -1,36 +1,34 @@ EXTRA_CFLAGS += -DIPATH_IDSTR='"QLogic k EXTRA_CFLAGS += -DIPATH_IDSTR='"QLogic kernel.org driver"' \ -DIPATH_KERN_TYPE=0 -obj-$(CONFIG_IPATH_CORE) += ipath_core.o obj-$(CONFIG_INFINIBAND_IPATH) += ib_ipath.o -ipath_core-y := \ +ib_ipath-y := \ + ipath_cq.o \ ipath_diag.o \ ipath_driver.o \ ipath_eeprom.o \ ipath_file_ops.o \ ipath_fs.o \ - ipath_ht400.o \ + ipath_iba6110.o \ + ipath_iba6120.o \ ipath_init_chip.o \ ipath_intr.o \ + ipath_keys.o \ ipath_layer.o \ - ipath_pe800.o \ - ipath_stats.o \ - ipath_sysfs.o \ - ipath_user_pages.o - -ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o - -ib_ipath-y := \ - ipath_cq.o \ - ipath_keys.o \ ipath_mad.o \ ipath_mr.o \ ipath_qp.o \ ipath_rc.o \ ipath_ruc.o \ ipath_srq.o \ + ipath_stats.o \ + ipath_sysfs.o \ ipath_uc.o \ ipath_ud.o \ - ipath_verbs.o \ - ipath_verbs_mcast.o + ipath_user_pages.o \ + ipath_verbs_mcast.o \ + ipath_verbs.o + +ib_ipath-$(CONFIG_X86_64) += ipath_wc_x86_64.o +ib_ipath-$(CONFIG_PPC64) += ipath_wc_ppc64.o So this seems to be ripping out chunks of upstream code (ipath_ht400) replacing them with something else (ipath_iba6110, ipath_iba6120.o) This might be a good change for all I know. But I'd like to ask What exactly does this fixes patch, fix? Can there be a list of things it provides at the top? How about a Signed-off-by line? Was this posted on openib-general even once? There's a single unmerged ipath patch posted on openib-general: mmap()ed userspace work queues for ipath. So where does the rest come from? Googling for ipath_iba6110 gets no hits. Thanks, -- MST From bos at pathscale.com Wed Aug 23 08:22:30 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 23 Aug 2006 08:22:30 -0700 Subject: [openib-general] Rollup patch for ipath and OFED In-Reply-To: <20060823150132.GB8192@mellanox.co.il> References: <1156175117.18663.32.camel@chalcedony.pathscale.com> <20060823150132.GB8192@mellanox.co.il> Message-ID: <1156346550.19868.9.camel@chalcedony.pathscale.com> On Wed, 2006-08-23 at 18:01 +0300, Michael S. Tsirkin wrote: > Guys, I just looked at ipath-fixes.patch in ofed. With 36 files changed, 4623 > insertions, 4774 deletions, it's quite a biggie with no description what it does > whatsoever. Can't this be split to smaller chunks doing one thing at a time, > please? You'll have to do this for upstream inclusion anyway, so why not for > OFED? We were in a rush to get a working patch out, is all. I've been splitting that monster up into sensibly-sized chunks in the usual way. Hi Eitan, I'm not getting openib-general email today but got this off a web site: Index: libvendor/osm_vendor_ibumad_sa.c =================================================================== --- libvendor/osm_vendor_ibumad_sa.c (revision 9087) +++ libvendor/osm_vendor_ibumad_sa.c (working copy) @@ -180,6 +180,11 @@ __osmv_sa_mad_rcv_cb( /* free the copied query request if found */ if (p_query_req_copy) free(p_query_req_copy); + + /* put back the request madw */ + if (p_req_madw) + osm_mad_pool_put(p_bind->p_mad_pool, p_req_madw); + OSM_LOG_EXIT( p_bind->p_log ); } There's an additional minor change needed to this routine as there is a case where the request MAD is already free'd. I'm in the process of looking at the osm_vendor_ibumad.c change too. 
-- Hal From eitan at mellanox.co.il Wed Aug 23 08:57:54 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 23 Aug 2006 18:57:54 +0300 Subject: [openib-general] [PATCH] osm: fix memory leak in vendor ibumad Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302C464FD@mtlexch01.mtl.com> Who is freeing the request MAD? If it is NULL then the flow aborts earlier. > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, August 23, 2006 6:35 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [PATCH] osm: fix memory leak in vendor ibumad > > Hi Eitan, > > I'm not getting openib-general email today but got this off a web site: > > Index: libvendor/osm_vendor_ibumad_sa.c > ================================================================ > === > --- libvendor/osm_vendor_ibumad_sa.c (revision 9087) > +++ libvendor/osm_vendor_ibumad_sa.c (working copy) > @@ -180,6 +180,11 @@ __osmv_sa_mad_rcv_cb( > > /* free the copied query request if found */ > if (p_query_req_copy) free(p_query_req_copy); > + > + /* put back the request madw */ > + if (p_req_madw) > + osm_mad_pool_put(p_bind->p_mad_pool, p_req_madw); > + > OSM_LOG_EXIT( p_bind->p_log ); > } > > There's an additional minor change needed to this routine as there is a case > where the request MAD is already free'd. > > I'm in the process of looking at the osm_vendor_ibumad.c change too. > > -- Hal From jackm at mellanox.co.il Wed Aug 23 08:54:03 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 23 Aug 2006 18:54:03 +0300 Subject: [openib-general] [PATCH] mthca: various bug fixes for mthca_query_qp Message-ID: <200608231854.03936.jackm@mellanox.co.il> Fixed various bugs in mthca_query_qp: 1. correct port_num was not being returned for unconnected QPs. 2. Incorrect number of bits was taken for static_rate field. 3. When default static rate is returned for Tavor, forgot to translate it to an ib rate value. 4. Return sq signalling attribute in query-qp. 5. Return the send_cq, receive cq and srq handles. ib_query_qp() needs them (required by IB Spec). ibv_query_qp() overwrites these values in user-space with appropriate user-space values. Signed-off-by: Jack Morgenstein Index: ofed_1_1/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-08-23 10:33:04.000000000 +0300 +++ ofed_1_1/drivers/infiniband/hw/mthca/mthca_qp.c 2006-08-23 18:46:08.330885000 +0300 @@ -391,6 +391,12 @@ static int to_ib_qp_access_flags(int mth return ib_flags; } +static enum ib_sig_type to_ib_qp_sq_signal(int params1) +{ + return (params1 & MTHCA_QP_BIT_SSC) ? + IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR; +} + static void to_ib_ah_attr(struct mthca_dev *dev, struct ib_ah_attr *ib_ah_attr, struct mthca_qp_path *path) { @@ -404,7 +410,7 @@ static void to_ib_ah_attr(struct mthca_d ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28; ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f; ib_ah_attr->static_rate = mthca_rate_to_ib(dev, - path->static_rate & 0x7, + path->static_rate & 0xf, ib_ah_attr->port_num); ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? 
IB_AH_GRH : 0; if (ib_ah_attr->ah_flags) { @@ -468,10 +474,14 @@ int mthca_query_qp(struct ib_qp *ibqp, s if (qp->transport == RC || qp->transport == UC) { to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + qp_attr->alt_pkey_index = + be32_to_cpu(context->alt_path.port_pkey) & 0x7f; + qp_attr->alt_port_num = qp_attr->alt_ah_attr.port_num; } - qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; - qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; + qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; + qp_attr->port_num = + (be32_to_cpu(context->pri_path.port_pkey) >> 24) & 0x3; /* qp_attr->en_sqd_async_notify is only applicable in modify qp */ qp_attr->sq_draining = mthca_state == MTHCA_QP_STATE_DRAINING; @@ -482,13 +492,16 @@ int mthca_query_qp(struct ib_qp *ibqp, s 1 << ((be32_to_cpu(context->params2) >> 21) & 0x7); qp_attr->min_rnr_timer = (be32_to_cpu(context->rnr_nextrecvpsn) >> 24) & 0x1f; - qp_attr->port_num = qp_attr->ah_attr.port_num; qp_attr->timeout = context->pri_path.ackto >> 3; qp_attr->retry_cnt = (be32_to_cpu(context->params1) >> 16) & 0x7; qp_attr->rnr_retry = context->pri_path.rnr_retry >> 5; - qp_attr->alt_port_num = qp_attr->alt_ah_attr.port_num; qp_attr->alt_timeout = context->alt_path.ackto >> 3; qp_init_attr->cap = qp_attr->cap; + qp_init_attr->sq_sig_type = + to_ib_qp_sq_signal(be32_to_cpu(context->params1)); + qp_init_attr->send_cq = ibqp->send_cq; + qp_init_attr->recv_cq = ibqp->recv_cq; + qp_init_attr->srq = ibqp->srq; out: mthca_free_mailbox(dev, mailbox); Index: ofed_1_1/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-08-03 14:30:21.000000000 +0300 +++ ofed_1_1/drivers/infiniband/hw/mthca/mthca_av.c 2006-08-23 17:53:01.227220000 +0300 @@ -90,7 +90,7 @@ static enum ib_rate tavor_rate_to_ib(u8 case MTHCA_RATE_TAVOR_1X: return IB_RATE_2_5_GBPS; case MTHCA_RATE_TAVOR_1X_DDR: return IB_RATE_5_GBPS; case MTHCA_RATE_TAVOR_4X: return IB_RATE_10_GBPS; - default: return port_rate; + default: return mult_to_ib_rate(port_rate); } } From robert.j.woodruff at intel.com Wed Aug 23 08:59:12 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 23 Aug 2006 08:59:12 -0700 Subject: [openib-general] OFED 1.1-rc2 bug - Could not read ABI version Message-ID: With the OFED1.1-rc2 when I run the RDMA CM on RedHat EL4 - Update 3 I get the following warning. The application seems to work OK, but the warnings are concerning. librdmacm: couldn't read ABI version. librdmacm: assuming: 1 It appears that the backport of the kernel module does not create the abi_version in /sys/class/misc as the userspace code expects. I suggest either fix the backport patch to create the abi_version or to avoid confusion, remove the check or the warning message from the userspace code if the ABI does actually matches the kernel code. woody From mshefty at ichips.intel.com Wed Aug 23 09:29:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 09:29:18 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <44EC825E.2030709@ichips.intel.com> john t wrote: > I have a very basic doubt. Suppose Host A is doing RDMA write (say 8 > MB) to Host B. 
When data is copied into Host B's local buffer, is it > guaranteed that data will be copied starting from the first location > (first buffer address) to the last location (last buffer address)? or it > could be in any order? I don't believe that there is any ordering guarantee by the architecture. However, specific adapters may behave this way, and I've seen applications make use of this by polling the last memory byte for a completion, for example. > Also does this transfer involve DMA engine (residing on HCAs) of both > the hosts or just one host? If I'm understanding the question correctly, the DMA engines of both HCAs are used. - Sean From greg.lindahl at qlogic.com Wed Aug 23 09:07:25 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Wed, 23 Aug 2006 09:07:25 -0700 Subject: [openib-general] Rollup patch for ipath and OFED In-Reply-To: <20060823150132.GB8192@mellanox.co.il> References: <1156175117.18663.32.camel@chalcedony.pathscale.com> <20060823150132.GB8192@mellanox.co.il> Message-ID: <20060823160725.GF1441@greglaptop.t-mobile.de> On Wed, Aug 23, 2006 at 06:01:32PM +0300, Michael S. Tsirkin wrote: > So this seems to be ripping out chunks of upstream code (ipath_ht400) > replacing them with something else (ipath_iba6110, ipath_iba6120.o) To answer this piece of the question, we were acquired last April, and of course we have to rename all our devices. Since patch doesn't have a rename feature, this looks much worse than it really is. -- greg From mshefty at ichips.intel.com Wed Aug 23 09:30:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 09:30:44 -0700 Subject: [openib-general] InfiniBand merge plans for 2.6.19 In-Reply-To: <15ddcffd0608230615u56d20736u740ae72d2d7d862@mail.gmail.com> References: <44EAECE8.4070802@voltaire.com> <000101c6c600$efad4ed0$b6cc180a@amr.corp.intel.com> <15ddcffd0608230615u56d20736u740ae72d2d7d862@mail.gmail.com> Message-ID: <44EC82B4.6010106@ichips.intel.com> Or Gerlitz wrote: > OK. Now, if this (RC, UD, MCAST) turns to be too much for your > schedule before 2.6.19 opens up, does it make sense for you to push a > char device which supports only the CMA RC functionality and the UD > and multicast in the future? Yes - and the fact that I can pull the OFED code for this helps. - Sean From mshefty at ichips.intel.com Wed Aug 23 09:47:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 09:47:40 -0700 Subject: [openib-general] [PATCH] ib_cm: randomize starting local comm IDs In-Reply-To: <20060823142620.GA8192@mellanox.co.il> References: <15ddcffd0608230702h295d81b9s277969bc00d2ca51@mail.gmail.com> <20060823142620.GA8192@mellanox.co.il> Message-ID: <44EC86AC.5060701@ichips.intel.com> Michael S. Tsirkin wrote: > so I am wandering why is it not sufficient to wait for > the window of time as described above to expire? > Is something broken in CM that this patch is papering over? Yes. There are a couple of issues. The CM doesn't time when a REQ was received, and the local comm IDs need rework. This is a fix that should avoid the issues most of the time. There's also an issue that a user can allocate a QP that's likely to be in time wait. 
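A rough sketch of the randomization idea under discussion in this thread (an illustration only, not the patch that was committed; the function names are invented, the plain counter stands in for the CM's real ID allocator, and locking is omitted):

#include <linux/random.h>
#include <linux/types.h>
#include <asm/byteorder.h>

static u32 cm_rand_operand;     /* random value chosen once at module load */
static u32 cm_next_id;          /* stand-in for the CM's real ID allocator */

static void cm_randomize_init(void)
{
        /* Seed the operand from the kernel's entropy pool. */
        get_random_bytes(&cm_rand_operand, sizeof cm_rand_operand);
}

static __be32 cm_alloc_local_id(void)
{
        /*
         * XOR keeps the IDs unique within this boot while decorrelating
         * them from the IDs a peer may still associate with connections
         * created before a reboot.
         */
        return cpu_to_be32(cm_next_id++ ^ cm_rand_operand);
}

However the randomization is done, the effect described earlier in the thread is that a REQ sent right after a reboot no longer reuses a local comm ID the passive side still holds for a stale connection, so the passive side can reject it as stale instead of silently dropping it.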
- Sean From caitlinb at broadcom.com Wed Aug 23 09:47:33 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 23 Aug 2006 09:47:33 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <20060823054541.GC4550@mellanox.co.il> Message-ID: <54AD0F12E08D1541B826BE97C98F99F189E8D5@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Quoting r. john t : >> Subject: basic IB doubt >> >> Hi >> >> I have a very basic doubt. Suppose Host A is doing RDMA write (say 8 >> MB) to Host B. When data is copied into Host B's local > buffer, is it guaranteed that data will be copied starting > from the first location (first buffer address) to the last > location (last buffer address)? or it could be in any order? > > Once B gets a completion (e.g. of a subsequent send), data in > its buffer matches that of A, byte for byte. An excellent and concise answer. That is exactly what the application should rely upon, and nothing else. With iWARP this is very explicit, because portions of the message not only MAY be placed out of order, they SHOULD be when packets have been re-ordered by the network. But for *any* RDMA adapter there is no guarantee on what order the adapter flushes things to host memory or particularly when old contents that may be cached are invalidated or updated. The role of the completion is to limit the frequency with which the RDMA adapter MUST guarantee coherency with application visible buffers. The completion not only indicates that the entire message was received, but that it has been entirely delivered to host memory. From mshefty at ichips.intel.com Wed Aug 23 10:00:52 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 10:00:52 -0700 Subject: [openib-general] [PATCH] ib_cm: randomize starting local comm IDs In-Reply-To: <15ddcffd0608230607v73c30652re15254ef5dcb9097@mail.gmail.com> References: <000001c6c610$988e0b20$c7cc180a@amr.corp.intel.com> <15ddcffd0608230607v73c30652re15254ef5dcb9097@mail.gmail.com> Message-ID: <44EC89C4.5000802@ichips.intel.com> Or Gerlitz wrote: > I have tested the patch against an iser target based on our Gen1 CM - > it works as expected. This has been committed in 9088. - Sean From halr at voltaire.com Wed Aug 23 10:05:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Aug 2006 13:05:28 -0400 Subject: [openib-general] [PATCH] osm: fix memory leak in vendor ibumad Message-ID: <1156352707.17858.31115.camel@hal.voltaire.com> Hi Eitan, > Who is freeing the request MAD? > If it is NULL then the flow aborts earlier. Sorry; My bad :-( -- Hal From ralphc at pathscale.com Wed Aug 23 10:14:53 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 23 Aug 2006 10:14:53 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F189E8D5@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F189E8D5@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1156353294.25846.32.camel@brick.pathscale.com> On Wed, 2006-08-23 at 09:47 -0700, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > > Quoting r. john t : > >> Subject: basic IB doubt > >> > >> Hi > >> > >> I have a very basic doubt. Suppose Host A is doing RDMA write (say 8 > >> MB) to Host B. When data is copied into Host B's local > > buffer, is it guaranteed that data will be copied starting > > from the first location (first buffer address) to the last > > location (last buffer address)? or it could be in any order? > > > > Once B gets a completion (e.g. 
of a subsequent send), data in > > its buffer matches that of A, byte for byte. > > An excellent and concise answer. That is exactly what the application > should rely upon, and nothing else. With iWARP this is very explicit, > because portions of the message not only MAY be placed out of > order, they SHOULD be when packets have been re-ordered by the > network. But for *any* RDMA adapter there is no guarantee on > what order the adapter flushes things to host memory or particularly > when old contents that may be cached are invalidated or updated. > The role of the completion is to limit the frequency with which > the RDMA adapter MUST guarantee coherency with application visible > buffers. The completion not only indicates that the entire message > was received, but that it has been entirely delivered to host memory. Actually, A knows the data is in B's memory when A gets the completion notice. B can't rely on anything unless A uses the RDMA write with immediate which puts a completion event in B's CQ. Most applications on B ignore this requirement and test for the last memory location being modified which usually works but doesn't guarantee that all the data is in memory. From mshefty at ichips.intel.com Wed Aug 23 10:23:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 10:23:28 -0700 Subject: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <000001c6ad0d$e483a290$e598070a@amr.corp.intel.com> References: <000001c6ad0d$e483a290$e598070a@amr.corp.intel.com> Message-ID: <44EC8F10.5050806@ichips.intel.com> Sean Hefty wrote: > The following set of patches forwards communication related events to the IB CM > for processing. Communication events of interest are communication established > and path migration, with only the former is currently handled by the IB CM. > > This removes the need for users to trap for these events and pass the > information onto IB CM. Communication established events can be handled by the > ib_cm_establish() routine, but no mechanism exists to notify the IB CM of path > migration. This adds the framework for doing so. > > Signed-off-by: Sean Hefty Based on feedback from Or and Michael: http://openib.org/pipermail/openib-general/2006-August/025320.html http://openib.org/pipermail/openib-general/2006-August/025306.html This patch set appears to be the preferred approach. Any objection to committing this? - Sean From robert.j.woodruff at intel.com Wed Aug 23 10:24:04 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 23 Aug 2006 10:24:04 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready Message-ID: Woody wrote, >Ok, I was able to install the RC2 on EL4-U3 and get intel MPI working on uDAPL. >Did have one issue with the install that maybe you could fix for the next >RC. It appears that the rdma_ucm and rdma_cm are not being loaded at startup >time and I had to manually modprobe rdma_ucm, after that, Intel MPI and uDAPL >seemed to work fine with the intial tests I have done so far, will continue >to stress test it and let you know if we see any other issues. >woody It appears that what is happening on this one is that the /etc/init.d/openibd script is failing because the ipath driver is not loading, which is expected in rc2 as their latest patches are not in rc2, If I disable loading of the ipath driver in /etc/infiniband/openib.conf the script continues and loads rdma_cm and rdma_ucm. 
The question is, if a driver like the ipath driver fails to load, should the script go on anyway and load the other drivers like rdma_cm/rdma_ucm ? or is it best to leave it as it is. Anyway, once the ipath driver is fixed, this issue should go away. woody From changquing.tang at hp.com Wed Aug 23 10:28:22 2006 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 23 Aug 2006 12:28:22 -0500 Subject: [openib-general] basic IB doubt In-Reply-To: <1156353294.25846.32.camel@brick.pathscale.com> Message-ID: > >Actually, A knows the data is in B's memory when A gets the >completion notice. B can't rely on anything unless A uses the >RDMA write with immediate which puts a completion event in B's CQ. Ralph: Can you give a few more words on 'immediate', I know A will have A completion event in its CQ, Does B receive a CQ event on the Same RDMA operation as well ? --CQ Tang >Most applications on B ignore this requirement and test for >the last memory location being modified which usually works >but doesn't guarantee that all the data is in memory. > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general > > From mshefty at ichips.intel.com Wed Aug 23 10:45:47 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 10:45:47 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <44EC944B.5050803@ichips.intel.com> Tang, Changqing wrote: > Can you give a few more words on 'immediate', I know A will have > A completion event in its CQ, Does B receive a CQ event on the > Same RDMA operation as well ? He means and RDMA write with immediate data. B will see a completion event for that operation. - Sean From robert.j.woodruff at intel.com Wed Aug 23 10:58:46 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 23 Aug 2006 10:58:46 -0700 Subject: [openib-general] Rollup patch for ipath and OFED Message-ID: Bryan wrote, >On Wed, 2006-08-23 at 18:01 +0300, Michael S. Tsirkin wrote: >> Guys, I just looked at ipath-fixes.patch in ofed. With 36 files changed, 4623 >> insertions, 4774 deletions, it's quite a biggie with no description what it does >> whatsoever. Can't this be split to smaller chunks doing one thing at a time, >> please? You'll have to do this for upstream inclusion anyway, so why not for >> OFED? >We were in a rush to get a working patch out, is all. I've been >splitting that monster up into sensibly-sized chunks in the usual way. > References: Message-ID: <1156356250.10010.22.camel@sardonyx> On Wed, 2006-08-23 at 10:58 -0700, Woodruff, Robert J wrote: > I hate to tell you I told you so, but this is exactly why you guys > should not be off working behind closed doors and then submit some > mongo patch. That's precisely what I'm working to avoid. It's not as if I didn't know this. > If you would just submit the code as you go in smaller > patches and check in the smaller patches daily to SVN, we would not > have such an integration mess at the end. SVN is not a high priority for me personally. Fixing things so that I can send meaningful patches upstream in a timely manner us. References: <1156356250.10010.22.camel@sardonyx> Message-ID: <44EC99F1.9060102@ichips.intel.com> Bryan O'Sullivan wrote: > SVN is not a high priority for me personally. Fixing things so that I > can send meaningful patches upstream in a timely manner us. 
Why not remove your code from SVN? - Sean From ralphc at pathscale.com Wed Aug 23 11:16:16 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 23 Aug 2006 11:16:16 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <1156356976.25846.42.camel@brick.pathscale.com> On Wed, 2006-08-23 at 12:28 -0500, Tang, Changqing wrote: > > > >Actually, A knows the data is in B's memory when A gets the > >completion notice. B can't rely on anything unless A uses the > >RDMA write with immediate which puts a completion event in B's CQ. > > Ralph: > > Can you give a few more words on 'immediate', I know A will have > A completion event in its CQ, Does B receive a CQ event on the > Same RDMA operation as well ? > > --CQ Tang B doesn't get a completion event for a RDMA write initiated from A unless A does something like the following: struct ib_send_wr wr; wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM; wr.imm_data = cpu_to_be32(value); ... ib_post_send(qp, &wr, NULL); B will get a completion event with the IB_WC_WITH_IMM flag set in struct ib_wc.wc_flags and ib_wc.imm_data set to the value that A sent. From changquing.tang at hp.com Wed Aug 23 12:00:13 2006 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 23 Aug 2006 14:00:13 -0500 Subject: [openib-general] basic IB doubt In-Reply-To: <1156356976.25846.42.camel@brick.pathscale.com> Message-ID: Thanks for all your replies. So my general question is, why only 4bytes immediate data can Generate completion event on B side, Why RDMA-write with any data size does not generate A completion event on B side? basic there are the same thing, the only different is, one Copy 4bytes to completion structure, the other copy all data to provided dest buffer. --CQ >-----Original Message----- >From: Ralph Campbell [mailto:ralphc at pathscale.com] >Sent: Wednesday, August 23, 2006 1:16 PM >To: Tang, Changqing >Cc: Caitlin Bestler; openib-general at openib.org >Subject: RE: [openib-general] basic IB doubt > >On Wed, 2006-08-23 at 12:28 -0500, Tang, Changqing wrote: >> > >> >Actually, A knows the data is in B's memory when A gets the >> >completion notice. B can't rely on anything unless A uses the RDMA our >> >write with immediate which puts a completion event in B's CQ. >> >> Ralph: >> >> Can you give a few more words on 'immediate', I know A will have A >> completion event in its CQ, Does B receive a CQ event on the >Same RDMA >> operation as well ? >> >> --CQ Tang > >B doesn't get a completion event for a RDMA write initiated >from A unless A does something like the following: > > struct ib_send_wr wr; > > wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM; > wr.imm_data = cpu_to_be32(value); > ... > ib_post_send(qp, &wr, NULL); > >B will get a completion event with the IB_WC_WITH_IMM flag set >in struct ib_wc.wc_flags and ib_wc.imm_data set to the value >that A sent. 
> > From sashak at voltaire.com Wed Aug 23 12:15:42 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 23 Aug 2006 22:15:42 +0300 Subject: [openib-general] [PATCH] opensm: option to limit size of OpenSM log file In-Reply-To: <20060822223409.GH10446@sashak.voltaire.com> References: <20060822211855.GD10446@sashak.voltaire.com> <1156283694.17858.5780.camel@hal.voltaire.com> <20060822222230.GG10446@sashak.voltaire.com> <1156285600.17858.6451.camel@hal.voltaire.com> <20060822223409.GH10446@sashak.voltaire.com> Message-ID: <20060823191542.GN15988@sashak.voltaire.com> Hi Hal, On 01:34 Wed 23 Aug , Sasha Khapyorsky wrote: > On 18:26 Tue 22 Aug , Hal Rosenstock wrote: > > On Tue, 2006-08-22 at 18:22, Sasha Khapyorsky wrote: > > > On 17:54 Tue 22 Aug , Hal Rosenstock wrote: > > > > Hi Sasha, > > > > > > > > On Tue, 2006-08-22 at 17:18, Sasha Khapyorsky wrote: > > > > > Hi Hal, > > > > > > > > > > There is new option which specified max size of OpenSM log file. The > > > > > default is '0' (not-limited). Please note osm_log_init() has new > > > > > parameter now. > > > > > > > > So libopensm.ver needs to be bumped (and this is not backward > > > > compatible). > > > > > > We may. I'm not sure it is necessary - in this patch I've changed all > > > occurrences of osm_log_init() under osm/ (in opensm and osmtest). So > > > this can be important only if there are osm_log "external" users. > > > > There may be so I will do this. > > Ok. Thanks. Found one more osm_log_init(). There is. Sasha Signed-off-by: Sasha Khapyorsky diff --git a/diags/src/saquery.c b/diags/src/saquery.c index f0d1299..0bb46be 100644 --- a/diags/src/saquery.c +++ b/diags/src/saquery.c @@ -443,7 +443,7 @@ get_bind_handle(void) osm_log_construct(&log_osm); if ((status = osm_log_init( &log_osm, TRUE, - 0x0001, NULL, TRUE )) != IB_SUCCESS) { + 0x0001, NULL, 0, TRUE )) != IB_SUCCESS) { fprintf(stderr, "Failed to init osm_log: %s\n", ib_get_err_str(status)); exit (-1); > > Sasha > > > > > -- Hal > > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ralphc at pathscale.com Wed Aug 23 12:30:21 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 23 Aug 2006 12:30:21 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <1156361421.25846.54.camel@brick.pathscale.com> On Wed, 2006-08-23 at 14:00 -0500, Tang, Changqing wrote: > Thanks for all your replies. So my general question is, why only 4bytes > immediate data can > Generate completion event on B side, Why RDMA-write with any data size > does not generate > A completion event on B side? basic there are the same thing, the only > different is, one > Copy 4bytes to completion structure, the other copy all data to provided > dest buffer. > > --CQ This is just the way IB was defined. Both RDMA write and RDMA write with immediate transfer up to 2 Gigabytes of data to the destination address. The latter just signals node B that the operation is complete and the former doesn't. Node B can use the immediate data as a hint to which RDMA operation just completed. 
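On B's side the write with immediate still consumes a posted receive work request, so B has to keep one outstanding and then picks the value up from the completion. A minimal kernel-verbs sketch, with error handling omitted and handle_imm() as a placeholder:

	/*
	 * No SGE is needed: the RDMA payload lands directly in the
	 * registered buffer, only the 32-bit immediate is delivered
	 * through the CQ entry.
	 */
	struct ib_recv_wr wr = { .num_sge = 0 };
	struct ib_recv_wr *bad_wr;
	struct ib_wc wc;

	ib_post_recv(qp, &wr, &bad_wr);

	/* ... after A's RDMA write with immediate arrives ... */
	if (ib_poll_cq(cq, 1, &wc) == 1 &&
	    wc.opcode == IB_WC_RECV_RDMA_WITH_IMM &&
	    (wc.wc_flags & IB_WC_WITH_IMM))
		handle_imm(be32_to_cpu(wc.imm_data));

If B has no receive WQE posted when the write with immediate arrives, it is treated like a send with no buffer available (RNR NAK on an RC connection), so the receive queue still has to be replenished.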
From rdreier at cisco.com Wed Aug 23 13:22:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 13:22:51 -0700 Subject: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <44EC8F10.5050806@ichips.intel.com> (Sean Hefty's message of "Wed, 23 Aug 2006 10:23:28 -0700") References: <000001c6ad0d$e483a290$e598070a@amr.corp.intel.com> <44EC8F10.5050806@ichips.intel.com> Message-ID: Sean> This patch set appears to be the preferred approach. Any Sean> objection to committing this? It's unfortunate that we have to add a special-case event hook for the CM, but I guess the iWARP CM changes are so ugly anyway it doesn't matter much. So I think committing this is OK. - R. From rdreier at cisco.com Wed Aug 23 13:25:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 13:25:53 -0700 Subject: [openib-general] [PATCH] mthca: various bug fixes for mthca_query_qp In-Reply-To: <200608231854.03936.jackm@mellanox.co.il> (Jack Morgenstein's message of "Wed, 23 Aug 2006 18:54:03 +0300") References: <200608231854.03936.jackm@mellanox.co.il> Message-ID: > 5. Return the send_cq, receive cq and srq handles. ib_query_qp() needs them > (required by IB Spec). ibv_query_qp() overwrites these values in user-space > with appropriate user-space values. > + qp_init_attr->send_cq = ibqp->send_cq; > + qp_init_attr->recv_cq = ibqp->recv_cq; > + qp_init_attr->srq = ibqp->srq; I really disagree with this change. It's silly to do this copying since the consumer already has the ibqp pointer. And it's especially silly to put this in a low-level driver, since there's nothing device-specific about it. - R. From rdreier at cisco.com Wed Aug 23 13:27:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 13:27:12 -0700 Subject: [openib-general] question: ib_umem page_size In-Reply-To: <20060823121101.GA6300@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 23 Aug 2006 15:11:01 +0300") References: <20060823121101.GA6300@mellanox.co.il> Message-ID: > > It gives the page size for the user memory described by the struct. > > The idea was that if/when someone tries to optimize for huge pages, > > then the low-level driver can know that a region is using huge pages > > without having to walk through the page list and search for the > > minimum physically contiguous size. > > Hmm, mthca_reg_user_mr seems to do: > > len = sg_dma_len(&chunk->page_list[j]) >> shift > > which means that dma_len must be a multiple of page size. > > Is this intentional? Yes, it's intentional I think. I'm probably missing something, but the upper layer has just told mthca_reg_user_mr() that the page size for this region is (1< (Michael S. Tsirkin's message of "Tue, 22 Aug 2006 22:45:06 +0300") References: <20060822194505.GA9153@mellanox.co.il> Message-ID: > Please consider the following for 2.6.18 - hopefully this will reduce > the number of support requests from people with old Sinai firmware. Makes sense -- this is the sort of thing we want as current as possible. Applied to svn and for-2.6.18 From rdreier at cisco.com Wed Aug 23 13:32:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 13:32:37 -0700 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <000901c6c582$ad09f890$8698070a@amr.corp.intel.com> (Sean Hefty's message of "Mon, 21 Aug 2006 17:34:06 -0700") References: <000901c6c582$ad09f890$8698070a@amr.corp.intel.com> Message-ID: What's the plan for how this would be used? 
We can't let unprivileged userspace processes talk to the SA, because they could cause problems like deleting someone else's multicast group membership. And I don't think we want to try to do some elaborate filtering in the kernel, do we? - R. From mst at mellanox.co.il Wed Aug 23 14:16:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 Aug 2006 00:16:44 +0300 Subject: [openib-general] [PATCH] mthca: various bug fixes for mthca_query_qp In-Reply-To: References: Message-ID: <20060823211644.GA11692@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: various bug fixes for mthca_query_qp > > > 5. Return the send_cq, receive cq and srq handles. > > I really disagree with this change. The other four bullets make sense however, do they not? -- MST From rdreier at cisco.com Wed Aug 23 14:20:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 14:20:54 -0700 Subject: [openib-general] [PATCH 4/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155333489.20325.429.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 11 Aug 2006 14:58:09 -0700") References: <1155333489.20325.429.camel@brick.pathscale.com> Message-ID: Thanks, applied and queued for 2.6.19. From mshefty at ichips.intel.com Wed Aug 23 14:26:13 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 14:26:13 -0700 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: References: <000901c6c582$ad09f890$8698070a@amr.corp.intel.com> Message-ID: <44ECC7F5.3000300@ichips.intel.com> Roland Dreier wrote: > What's the plan for how this would be used? We can't let unprivileged > userspace processes talk to the SA, because they could cause problems > like deleting someone else's multicast group membership. And I don't > think we want to try to do some elaborate filtering in the kernel, do we? The ibv_sa_send_mad() routine can only be used to issue the following methods: GET, SEND, GET_TABLE, GET_MULTI, and GET_TRACE_TABLE I do check for this in the kernel, but that is the extent of any filtering that's done. Multicast operations must go through the multicast join / free calls, which drop into the kernel to interface with the ib_multicast module. I would expect that other SET / DELETE type operations would be treated similar to how multicast is handled. I'm expecting that the labs will use at least the multicast interfaces based on e-mail conversations, but without path record query support, the userspace CM interface isn't all that useful. - Sean From mst at mellanox.co.il Wed Aug 23 14:32:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 Aug 2006 00:32:59 +0300 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: References: Message-ID: <20060823213259.GC11692@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] libibsa: userspace SA query and multicast support > > What's the plan for how this would be used? We can't let unprivileged > userspace processes talk to the SA, because they could cause problems > like deleting someone else's multicast group membership. And I don't > think we want to try to do some elaborate filtering in the kernel, do we? Yea I had the same question. Shouldn't interface expose just the specific queries that we need? 
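Just to make "specific queries" concrete: the filtering under discussion is essentially a method whitelist, along these lines (sketch only; the function name is illustrative, the constants are the usual ib_mad.h/ib_sa.h ones):

static int is_query_method(u8 method)
{
	switch (method) {
	case IB_MGMT_METHOD_GET:
	case IB_MGMT_METHOD_SEND:
	case IB_SA_METHOD_GET_TABLE:
	case IB_SA_METHOD_GET_MULTI:
	case IB_SA_METHOD_GET_TRACE_TBL:
		return 1;
	default:
		return 0;	/* SET, DELETE and anything unknown refused */
	}
}

Anything that modifies SA state (multicast join/leave, service record set/delete) would then have to go through a dedicated kernel interface rather than a raw MAD send.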
-- MST From rdreier at cisco.com Wed Aug 23 14:36:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 14:36:09 -0700 Subject: [openib-general] [PATCHv2] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155680172.20325.519.camel@brick.pathscale.com> (Ralph Campbell's message of "Tue, 15 Aug 2006 15:16:12 -0700") References: <1155680172.20325.519.camel@brick.pathscale.com> Message-ID: Thanks, queued for 2.6.19 From rdreier at cisco.com Wed Aug 23 14:37:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 14:37:41 -0700 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <44ECC7F5.3000300@ichips.intel.com> (Sean Hefty's message of "Wed, 23 Aug 2006 14:26:13 -0700") References: <000901c6c582$ad09f890$8698070a@amr.corp.intel.com> <44ECC7F5.3000300@ichips.intel.com> Message-ID: Sean> The ibv_sa_send_mad() routine can only be used to issue the Sean> following methods: Sean> GET, SEND, GET_TABLE, GET_MULTI, and GET_TRACE_TABLE OK, I missed that -- it's kind of hidden inside is_send_req(). - R. From rdreier at cisco.com Wed Aug 23 14:38:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 14:38:42 -0700 Subject: [openib-general] [PATCH] mthca: various bug fixes for mthca_query_qp In-Reply-To: <20060823211644.GA11692@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 24 Aug 2006 00:16:44 +0300") References: <20060823211644.GA11692@mellanox.co.il> Message-ID: Michael> The other four bullets make sense however, do they not? I guess so, although I wonder if anyone will ever care about the sq_sig_type() field. - R. From mst at mellanox.co.il Wed Aug 23 14:41:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 Aug 2006 00:41:53 +0300 Subject: [openib-general] Rollup patch for ipath and OFED In-Reply-To: <20060823160725.GF1441@greglaptop.t-mobile.de> References: <20060823160725.GF1441@greglaptop.t-mobile.de> Message-ID: <20060823214153.GD11692@mellanox.co.il> Quoting r. Greg Lindahl : > Subject: Re: Rollup patch for ipath and OFED > > On Wed, Aug 23, 2006 at 06:01:32PM +0300, Michael S. Tsirkin wrote: > > > So this seems to be ripping out chunks of upstream code (ipath_ht400) > > replacing them with something else (ipath_iba6110, ipath_iba6120.o) > > To answer this piece of the question, we were acquired last April, and > of course we have to rename all our devices. Since patch doesn't have > a rename feature, this looks much worse than it really is. Fine, but I wander why rush this cosmetic change for ofed 1.1? Anyway, so is that right that there's basically just the mmap enhancement plus a lot of file renames? -- MST From mshefty at ichips.intel.com Wed Aug 23 14:48:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 14:48:17 -0700 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <20060823213259.GC11692@mellanox.co.il> References: <20060823213259.GC11692@mellanox.co.il> Message-ID: <44ECCD21.6050601@ichips.intel.com> > Yea I had the same question. Shouldn't interface expose > just the specific queries that we need? I don't know what queries a user will want, and I'd rather not change the kernel ABI with every new query, but that is a possibility. Which queries are of concern? - Sean From mst at mellanox.co.il Wed Aug 23 14:47:30 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 24 Aug 2006 00:47:30 +0300 Subject: [openib-general] [PATCH] mthca: various bug fixes for mthca_query_qp In-Reply-To: References: Message-ID: <20060823214730.GE11692@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: various bug fixes for mthca_query_qp > > Michael> The other four bullets make sense however, do they not? > > I guess so, although I wonder if anyone will ever care about the > sq_sig_type() field. It's not in IB spec query QP, so that might be unlikely. However, libibverbs seems to be looking at this field: init_attr->sq_sig_all = resp.sq_sig_all; so it only seems consistent to fill this in. What do you think? Maybe its better to remove it from libibverbs? What about other stuff? -- MST From rdreier at cisco.com Wed Aug 23 14:50:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 14:50:19 -0700 Subject: [openib-general] [PATCH 2/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155333339.20325.424.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 11 Aug 2006 14:55:38 -0700") References: <1155333339.20325.424.camel@brick.pathscale.com> Message-ID: Thanks, applied. From rdreier at cisco.com Wed Aug 23 14:50:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 14:50:12 -0700 Subject: [openib-general] [PATCH 3/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155333423.20325.427.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 11 Aug 2006 14:57:02 -0700") References: <1155333423.20325.427.camel@brick.pathscale.com> Message-ID: I applied this, but I'm wondering if this: > +int ipath_resize_cq(struct ibv_cq *ibcq, int cqe) > { > + struct ipath_cq *cq = to_icq(ibcq); > + struct ibv_resize_cq cmd; > + struct ipath_resize_cq_resp resp; > + size_t size; > + int ret; > + > + pthread_spin_lock(&cq->lock); > + /* Save the old size so we can unmmap the queue. */ > + size = sizeof(struct ipath_cq_wc) + > + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); > + ret = ibv_cmd_resize_cq(ibcq, cqe, &cmd, sizeof cmd, > + &resp.ibv_resp, sizeof resp); > + if (ret) { > + pthread_spin_unlock(&cq->lock); > + return ret; > + } > + (void) munmap(cq->queue, size); > + size = sizeof(struct ipath_cq_wc) + > + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); > + cq->queue = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, > + ibcq->context->cmd_fd, resp.offset); > + ret = errno; > + pthread_spin_unlock(&cq->lock); > + if ((void *) cq->queue == MAP_FAILED) > + return ret; > + return 0; > +} works against an old kernel driver. It seems you do have this: > + if (dev->abi_version == 1) { > + context->ibv_ctx.ops.poll_cq = ibv_cmd_poll_cq; > + context->ibv_ctx.ops.post_srq_recv = ibv_cmd_post_srq_recv; > + context->ibv_ctx.ops.post_recv = ibv_cmd_post_recv; > + } so I guess you're just ignoring the failure of mmap() or something? - R. From mst at mellanox.co.il Wed Aug 23 14:52:46 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 Aug 2006 00:52:46 +0300 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <44ECCD21.6050601@ichips.intel.com> References: <44ECCD21.6050601@ichips.intel.com> Message-ID: <20060823215245.GF11692@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] libibsa: userspace SA query and multicast support > > > Yea I had the same question. Shouldn't interface expose > > just the specific queries that we need? 
> > I don't know what queries a user will want, and I'd rather not change the > kernel ABI with every new query, but that is a possibility. Which queries are > of concern? Donnu. I'm just speaking on the general principle that we should deny by default, not allow by default. Which queries do you want to perform? -- MST From rdreier at cisco.com Wed Aug 23 14:56:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 14:56:52 -0700 Subject: [openib-general] [PATCH 1/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1155333276.20325.422.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 11 Aug 2006 14:54:36 -0700") References: <1155333276.20325.422.camel@brick.pathscale.com> Message-ID: Thanks, applied. From mshefty at ichips.intel.com Wed Aug 23 14:59:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 14:59:41 -0700 Subject: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: References: <000001c6ad0d$e483a290$e598070a@amr.corp.intel.com> <44EC8F10.5050806@ichips.intel.com> Message-ID: <44ECCFCD.5020002@ichips.intel.com> Roland Dreier wrote: > It's unfortunate that we have to add a special-case event hook for the > CM, but I guess the iWARP CM changes are so ugly anyway it doesn't > matter much. So I think committing this is OK. We also have the alternative of pushing the responsibility of notifying the CM of the event to the ULPs, which is what's done today. The problem there is that gets pushed up to all userspace applications as well. And even with these proposed changes, there's a race condition where the CM can timeout a connection after data is received over it, but before this event can be processed. This makes me wonder if any of this is worth it... - Sean From ralphc at pathscale.com Wed Aug 23 15:01:10 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 23 Aug 2006 15:01:10 -0700 Subject: [openib-general] [PATCH 3/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: References: <1155333423.20325.427.camel@brick.pathscale.com> Message-ID: <1156370470.25846.61.camel@brick.pathscale.com> On Wed, 2006-08-23 at 14:50 -0700, Roland Dreier wrote: > I applied this, but I'm wondering if this: > > > +int ipath_resize_cq(struct ibv_cq *ibcq, int cqe) > > { > > + struct ipath_cq *cq = to_icq(ibcq); > > + struct ibv_resize_cq cmd; > > + struct ipath_resize_cq_resp resp; > > + size_t size; > > + int ret; > > + > > + pthread_spin_lock(&cq->lock); > > + /* Save the old size so we can unmmap the queue. */ > > + size = sizeof(struct ipath_cq_wc) + > > + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); > > + ret = ibv_cmd_resize_cq(ibcq, cqe, &cmd, sizeof cmd, > > + &resp.ibv_resp, sizeof resp); > > + if (ret) { > > + pthread_spin_unlock(&cq->lock); > > + return ret; > > + } > > + (void) munmap(cq->queue, size); > > + size = sizeof(struct ipath_cq_wc) + > > + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); > > + cq->queue = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, > > + ibcq->context->cmd_fd, resp.offset); > > + ret = errno; > > + pthread_spin_unlock(&cq->lock); > > + if ((void *) cq->queue == MAP_FAILED) > > + return ret; > > + return 0; > > +} > > works against an old kernel driver. 
It seems you do have this: > > > + if (dev->abi_version == 1) { > > + context->ibv_ctx.ops.poll_cq = ibv_cmd_poll_cq; > > + context->ibv_ctx.ops.post_srq_recv = ibv_cmd_post_srq_recv; > > + context->ibv_ctx.ops.post_recv = ibv_cmd_post_recv; > > + } > > so I guess you're just ignoring the failure of mmap() or something? > > - R. Not quite. If the kernel driver is old, libipathverbs is using the old functions which make system calls instead of doing the newer mmap stuff. libipathverbs doesn't need to attempt calling mmap() if it knows the driver doesn't support it. From rdreier at cisco.com Wed Aug 23 15:04:01 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 15:04:01 -0700 Subject: [openib-general] [PATCH 3/7] IB/ipath - performance improvements via mmap of queues In-Reply-To: <1156370470.25846.61.camel@brick.pathscale.com> (Ralph Campbell's message of "Wed, 23 Aug 2006 15:01:10 -0700") References: <1155333423.20325.427.camel@brick.pathscale.com> <1156370470.25846.61.camel@brick.pathscale.com> Message-ID: Ralph> Not quite. If the kernel driver is old, libipathverbs is Ralph> using the old functions which make system calls instead of Ralph> doing the newer mmap stuff. libipathverbs doesn't need to Ralph> attempt calling mmap() if it knows the driver doesn't Ralph> support it. You lost me -- I only see one version of ipath_resize_cq(), and it seems to do munmap()/mmap() without testing an ABI version. - R. From sean.hefty at intel.com Wed Aug 23 15:48:07 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 15:48:07 -0700 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <20060823215245.GF11692@mellanox.co.il> Message-ID: <000601c6c706$345eff50$a0d0180a@amr.corp.intel.com> >Donnu. I'm just speaking on the general principle that we should deny by >default, not allow by default. Which queries do you want to perform? At a minimum, I would expect the following queries: PathRecord MultiPathRecord MCMemberRecord ServiceRecord Support for ServiceRecord set/delete and InformInfo are likely to be added once kernel support is in place. Is it a reasonable approach to export two devices with different permissions, one that allows limited sends, and another that permits unfiltered sends? - Sean From rdreier at cisco.com Wed Aug 23 16:25:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 16:25:38 -0700 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Greg, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus to get a few fixes for the 2.6.18 tree: Jack Morgenstein: IB/core: Fix SM LID/LID change with client reregister set Michael S. 
Tsirkin: IB/mthca: Make fence flag work for send work requests IB/mthca: Update HCA firmware revisions Roland Dreier: IB/mthca: Fix potential AB-BA deadlock with CQ locks IB/mthca: No userspace SRQs if HCA doesn't have SRQ support drivers/infiniband/core/cache.c | 3 + drivers/infiniband/core/sa_query.c | 3 + drivers/infiniband/hw/mthca/mthca_main.c | 6 +-- drivers/infiniband/hw/mthca/mthca_provider.c | 11 +++-- drivers/infiniband/hw/mthca/mthca_provider.h | 4 +- drivers/infiniband/hw/mthca/mthca_qp.c | 54 +++++++++++++++++++------- 6 files changed, 55 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index e05ca2c..75313ad 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -301,7 +301,8 @@ static void ib_cache_event(struct ib_eve event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || - event->event == IB_EVENT_SM_CHANGE) { + event->event == IB_EVENT_SM_CHANGE || + event->event == IB_EVENT_CLIENT_REREGISTER) { work = kmalloc(sizeof *work, GFP_ATOMIC); if (work) { INIT_WORK(&work->work, ib_cache_task, work); diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index aeda484..d6b8422 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -405,7 +405,8 @@ static void ib_sa_event(struct ib_event_ event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || - event->event == IB_EVENT_SM_CHANGE) { + event->event == IB_EVENT_SM_CHANGE || + event->event == IB_EVENT_CLIENT_REREGISTER) { struct ib_sa_device *sa_dev; sa_dev = container_of(handler, typeof(*sa_dev), event_handler); diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 557cde3..7b82c19 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -967,12 +967,12 @@ static struct { } mthca_hca_table[] = { [TAVOR] = { .latest_fw = MTHCA_FW_VER(3, 4, 0), .flags = 0 }, - [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 400), + [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 600), .flags = MTHCA_FLAG_PCIE }, - [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), + [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 400), .flags = MTHCA_FLAG_MEMFREE | MTHCA_FLAG_PCIE }, - [SINAI] = { .latest_fw = MTHCA_FW_VER(1, 0, 800), + [SINAI] = { .latest_fw = MTHCA_FW_VER(1, 1, 0), .flags = MTHCA_FLAG_MEMFREE | MTHCA_FLAG_PCIE | MTHCA_FLAG_SINAI_OPT } diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 230ae21..265b1d1 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1287,11 +1287,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | - (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) | - (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | - (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | - (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | - (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST); dev->ib_dev.node_type = IB_NODE_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; @@ -1316,6 +1312,11 @@ int mthca_register_device(struct mthca_d dev->ib_dev.modify_srq = mthca_modify_srq; dev->ib_dev.query_srq = 
mthca_query_srq; dev->ib_dev.destroy_srq = mthca_destroy_srq; + dev->ib_dev.uverbs_cmd_mask |= + (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | + (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | + (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); if (mthca_is_memfree(dev)) dev->ib_dev.post_srq_recv = mthca_arbel_post_srq_recv; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 8de2887..9a5bece 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -136,8 +136,8 @@ struct mthca_ah { * We have one global lock that protects dev->cq/qp_table. Each * struct mthca_cq/qp also has its own lock. An individual qp lock * may be taken inside of an individual cq lock. Both cqs attached to - * a qp may be locked, with the send cq locked first. No other - * nesting should be done. + * a qp may be locked, with the cq with the lower cqn locked first. + * No other nesting should be done. * * Each struct mthca_cq/qp also has an ref count, protected by the * corresponding table lock. The pointer from the cq/qp_table to the diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index cd8b672..2e8f6f3 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -99,6 +99,10 @@ enum { MTHCA_QP_BIT_RSC = 1 << 3 }; +enum { + MTHCA_SEND_DOORBELL_FENCE = 1 << 5 +}; + struct mthca_qp_path { __be32 port_pkey; u8 rnr_retry; @@ -1259,6 +1263,32 @@ int mthca_alloc_qp(struct mthca_dev *dev return 0; } +static void mthca_lock_cqs(struct mthca_cq *send_cq, struct mthca_cq *recv_cq) +{ + if (send_cq == recv_cq) + spin_lock_irq(&send_cq->lock); + else if (send_cq->cqn < recv_cq->cqn) { + spin_lock_irq(&send_cq->lock); + spin_lock_nested(&recv_cq->lock, SINGLE_DEPTH_NESTING); + } else { + spin_lock_irq(&recv_cq->lock); + spin_lock_nested(&send_cq->lock, SINGLE_DEPTH_NESTING); + } +} + +static void mthca_unlock_cqs(struct mthca_cq *send_cq, struct mthca_cq *recv_cq) +{ + if (send_cq == recv_cq) + spin_unlock_irq(&send_cq->lock); + else if (send_cq->cqn < recv_cq->cqn) { + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + } else { + spin_unlock(&send_cq->lock); + spin_unlock_irq(&recv_cq->lock); + } +} + int mthca_alloc_sqp(struct mthca_dev *dev, struct mthca_pd *pd, struct mthca_cq *send_cq, @@ -1311,17 +1341,13 @@ int mthca_alloc_sqp(struct mthca_dev *de * Lock CQs here, so that CQ polling code can do QP lookup * without taking a lock. */ - spin_lock_irq(&send_cq->lock); - if (send_cq != recv_cq) - spin_lock(&recv_cq->lock); + mthca_lock_cqs(send_cq, recv_cq); spin_lock(&dev->qp_table.lock); mthca_array_clear(&dev->qp_table.qp, mqpn); spin_unlock(&dev->qp_table.lock); - if (send_cq != recv_cq) - spin_unlock(&recv_cq->lock); - spin_unlock_irq(&send_cq->lock); + mthca_unlock_cqs(send_cq, recv_cq); err_out: dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, @@ -1355,9 +1381,7 @@ void mthca_free_qp(struct mthca_dev *dev * Lock CQs here, so that CQ polling code can do QP lookup * without taking a lock. 
*/ - spin_lock_irq(&send_cq->lock); - if (send_cq != recv_cq) - spin_lock(&recv_cq->lock); + mthca_lock_cqs(send_cq, recv_cq); spin_lock(&dev->qp_table.lock); mthca_array_clear(&dev->qp_table.qp, @@ -1365,9 +1389,7 @@ void mthca_free_qp(struct mthca_dev *dev --qp->refcount; spin_unlock(&dev->qp_table.lock); - if (send_cq != recv_cq) - spin_unlock(&recv_cq->lock); - spin_unlock_irq(&send_cq->lock); + mthca_unlock_cqs(send_cq, recv_cq); wait_event(qp->wait, !get_qp_refcount(dev, qp)); @@ -1502,7 +1524,7 @@ int mthca_tavor_post_send(struct ib_qp * int i; int size; int size0 = 0; - u32 f0 = 0; + u32 f0; int ind; u8 op0 = 0; @@ -1686,6 +1708,8 @@ int mthca_tavor_post_send(struct ib_qp * if (!size0) { size0 = size; op0 = mthca_opcode[wr->opcode]; + f0 = wr->send_flags & IB_SEND_FENCE ? + MTHCA_SEND_DOORBELL_FENCE : 0; } ++ind; @@ -1843,7 +1867,7 @@ int mthca_arbel_post_send(struct ib_qp * int i; int size; int size0 = 0; - u32 f0 = 0; + u32 f0; int ind; u8 op0 = 0; @@ -2051,6 +2075,8 @@ int mthca_arbel_post_send(struct ib_qp * if (!size0) { size0 = size; op0 = mthca_opcode[wr->opcode]; + f0 = wr->send_flags & IB_SEND_FENCE ? + MTHCA_SEND_DOORBELL_FENCE : 0; } ++ind; From greg.lindahl at qlogic.com Wed Aug 23 16:08:12 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Wed, 23 Aug 2006 16:08:12 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <44EC825E.2030709@ichips.intel.com> References: <44EC825E.2030709@ichips.intel.com> Message-ID: <20060823230812.GD13187@greglaptop.t-mobile.de> On Wed, Aug 23, 2006 at 09:29:18AM -0700, Sean Hefty wrote: > I don't believe that there is any ordering guarantee by the architecture. > However, specific adapters may behave this way, and I've seen applications make > use of this by polling the last memory byte for a completion, for example. Actually, that leads me to a question: does the vendor of that adaptor say that this is actually safe? Just because something behaves one way most of the time doesn't mean it does it all of the time. So it it really smart to write non-standard-conforming programs unless the vendor stands behind that behavior? -- greg From rdreier at cisco.com Wed Aug 23 16:46:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 Aug 2006 16:46:52 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <20060823230812.GD13187@greglaptop.t-mobile.de> (Greg Lindahl's message of "Wed, 23 Aug 2006 16:08:12 -0700") References: <44EC825E.2030709@ichips.intel.com> <20060823230812.GD13187@greglaptop.t-mobile.de> Message-ID: Greg> Actually, that leads me to a question: does the vendor of Greg> that adaptor say that this is actually safe? Just because Greg> something behaves one way most of the time doesn't mean it Greg> does it all of the time. So it it really smart to write Greg> non-standard-conforming programs unless the vendor stands Greg> behind that behavior? Yes, Mellanox documents that it is safe to rely on the last byte of an RDMA being written last. - R. From sean.hefty at intel.com Wed Aug 23 16:47:25 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 16:47:25 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <20060823230812.GD13187@greglaptop.t-mobile.de> Message-ID: <000001c6c70e$7d437400$15d1180a@amr.corp.intel.com> >Actually, that leads me to a question: does the vendor of that adaptor >say that this is actually safe? I believe so. >most of the time doesn't mean it does it all of the time. 
So it it >really smart to write non-standard-conforming programs unless the >vendor stands behind that behavior? I'm not saying whether I consider this good computer science or not, but some applications do rely on this feature, and hardware that wants to work best with those applications will have it. - Sean From weiny2 at llnl.gov Wed Aug 23 17:21:47 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 23 Aug 2006 17:21:47 -0700 Subject: [openib-general] librdmacm ABI issues with OFED 1.1 Message-ID: <20060823172147.35daed59.weiny2@llnl.gov> I have some rdma_cm test code and when I run with the OFED 1.1 code (running on 2.6.9 U3 based kernel) I got the following error. librdmacm: couldn't read ABI version. librdmacm: assuming: 2 The code seems to run (as it really does nothing) fine but I was wondering if I could fix this just to clean up the output. I found that the following patch removes the code which creates the abi_version file. ./backport/2.6.9_U3/ucma_6607_to_2_6_9.patch So my question is: How bad is it that the user space is assuming version 2 of the interface and the modules are at version 1? Thanks, Ira From greg at kroah.com Wed Aug 23 18:09:10 2006 From: greg at kroah.com (Greg KH) Date: Wed, 23 Aug 2006 18:09:10 -0700 Subject: [openib-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <20060824010910.GC18060@kroah.com> On Wed, Aug 23, 2006 at 04:25:38PM -0700, Roland Dreier wrote: > Greg, please pull from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus Pulled from, and pushed out. thanks, greg k-h From ralphc at pathscale.com Wed Aug 23 18:32:45 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 23 Aug 2006 18:32:45 -0700 Subject: [openib-general] [PATCH] IB/libipathverbs - Fix compatibility with old ib_ipath kernel drivers Message-ID: <1156383165.25846.79.camel@brick.pathscale.com> This patch makes libipathverbs backward compatible with old ib_ipath kernel drivers. 
Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- src/userspace/libipathverbs/src/verbs.c (revision 9095) +++ src/userspace/libipathverbs/src/verbs.c (working copy) @@ -177,6 +177,29 @@ struct ibv_cq *ipath_create_cq(struct ib return &cq->ibv_cq; } +struct ibv_cq *ipath_create_cq_v1(struct ibv_context *context, int cqe, + struct ibv_comp_channel *channel, + int comp_vector) +{ + struct ibv_cq *cq; + struct ibv_create_cq cmd; + struct ibv_create_cq_resp resp; + int ret; + + cq = malloc(sizeof *cq); + if (!cq) + return NULL; + + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, + cq, &cmd, sizeof cmd, &resp, sizeof resp); + if (ret) { + free(cq); + return NULL; + } + + return cq; +} + int ipath_resize_cq(struct ibv_cq *ibcq, int cqe) { struct ipath_cq *cq = to_icq(ibcq); @@ -207,6 +230,15 @@ int ipath_resize_cq(struct ibv_cq *ibcq, return 0; } +int ipath_resize_cq_v1(struct ibv_cq *ibcq, int cqe) +{ + struct ibv_resize_cq cmd; + struct ibv_resize_cq_resp resp; + + return ibv_cmd_resize_cq(ibcq, cqe, &cmd, sizeof cmd, + &resp, sizeof resp); +} + int ipath_destroy_cq(struct ibv_cq *ibcq) { struct ipath_cq *cq = to_icq(ibcq); @@ -222,6 +254,16 @@ int ipath_destroy_cq(struct ibv_cq *ibcq return 0; } +int ipath_destroy_cq_v1(struct ibv_cq *ibcq) +{ + int ret; + + ret = ibv_cmd_destroy_cq(ibcq); + if (!ret) + free(ibcq); + return ret; +} + int ipath_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) { struct ipath_cq *cq = to_icq(ibcq); @@ -290,6 +332,28 @@ struct ibv_qp *ipath_create_qp(struct ib return &qp->ibv_qp; } +struct ibv_qp *ipath_create_qp_v1(struct ibv_pd *pd, + struct ibv_qp_init_attr *attr) +{ + struct ibv_create_qp cmd; + struct ibv_create_qp_resp resp; + struct ibv_qp *qp; + int ret; + + qp = malloc(sizeof *qp); + if (!qp) + return NULL; + + ret = ibv_cmd_create_qp(pd, qp, attr, &cmd, sizeof cmd, + &resp, sizeof resp); + if (ret) { + free(qp); + return NULL; + } + + return qp; +} + int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask, struct ibv_qp_init_attr *init_attr) @@ -330,6 +394,16 @@ int ipath_destroy_qp(struct ibv_qp *ibqp return 0; } +int ipath_destroy_qp_v1(struct ibv_qp *ibqp) +{ + int ret; + + ret = ibv_cmd_destroy_qp(ibqp); + if (!ret) + free(ibqp); + return ret; +} + static int post_recv(struct ipath_rq *rq, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr) { @@ -412,6 +486,28 @@ struct ibv_srq *ipath_create_srq(struct return &srq->ibv_srq; } +struct ibv_srq *ipath_create_srq_v1(struct ibv_pd *pd, + struct ibv_srq_init_attr *attr) +{ + struct ibv_srq *srq; + struct ibv_create_srq cmd; + struct ibv_create_srq_resp resp; + int ret; + + srq = malloc(sizeof *srq); + if (srq == NULL) + return NULL; + + ret = ibv_cmd_create_srq(pd, srq, attr, &cmd, sizeof cmd, + &resp, sizeof resp); + if (ret) { + free(srq); + return NULL; + } + + return srq; +} + int ipath_modify_srq(struct ibv_srq *ibsrq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask) @@ -456,6 +552,16 @@ int ipath_modify_srq(struct ibv_srq *ibs return 0; } +int ipath_modify_srq_v1(struct ibv_srq *ibsrq, + struct ibv_srq_attr *attr, + enum ibv_srq_attr_mask attr_mask) +{ + struct ibv_modify_srq cmd; + + return ibv_cmd_modify_srq(ibsrq, attr, attr_mask, + &cmd, sizeof cmd); +} + int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr) { struct ibv_query_srq cmd; @@ -481,6 +587,16 @@ int ipath_destroy_srq(struct ibv_srq *ib return 0; } +int 
ipath_destroy_srq_v1(struct ibv_srq *ibsrq) +{ + int ret; + + ret = ibv_cmd_destroy_srq(ibsrq); + if (!ret) + free(ibsrq); + return ret; +} + int ipath_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr) { Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 9095) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -134,8 +134,16 @@ static struct ibv_context *ipath_alloc_c context->ibv_ctx.ops = ipath_ctx_ops; dev = to_idev(ibdev); if (dev->abi_version == 1) { + context->ibv_ctx.ops.create_cq = ipath_create_cq_v1; context->ibv_ctx.ops.poll_cq = ibv_cmd_poll_cq; + context->ibv_ctx.ops.resize_cq = ipath_resize_cq_v1; + context->ibv_ctx.ops.destroy_cq = ipath_destroy_cq_v1; + context->ibv_ctx.ops.create_srq = ipath_create_srq_v1; + context->ibv_ctx.ops.destroy_srq = ipath_destroy_srq_v1; + context->ibv_ctx.ops.modify_srq = ipath_modify_srq_v1; context->ibv_ctx.ops.post_srq_recv = ibv_cmd_post_srq_recv; + context->ibv_ctx.ops.create_qp = ipath_create_qp_v1; + context->ibv_ctx.ops.destroy_qp = ipath_destroy_qp_v1; context->ibv_ctx.ops.post_recv = ibv_cmd_post_recv; } return &context->ibv_ctx; Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (revision 9095) +++ src/userspace/libipathverbs/src/ipathverbs.h (working copy) @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -202,15 +203,26 @@ struct ibv_cq *ipath_create_cq(struct ib struct ibv_comp_channel *channel, int comp_vector); +struct ibv_cq *ipath_create_cq_v1(struct ibv_context *context, int cqe, + struct ibv_comp_channel *channel, + int comp_vector); + int ipath_resize_cq(struct ibv_cq *cq, int cqe); +int ipath_resize_cq_v1(struct ibv_cq *cq, int cqe); + int ipath_destroy_cq(struct ibv_cq *cq); +int ipath_destroy_cq_v1(struct ibv_cq *cq); + int ipath_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); +struct ibv_qp *ipath_create_qp_v1(struct ibv_pd *pd, + struct ibv_qp_init_attr *attr); + int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask, struct ibv_qp_init_attr *init_attr); @@ -220,6 +232,8 @@ int ipath_modify_qp(struct ibv_qp *qp, s int ipath_destroy_qp(struct ibv_qp *qp); +int ipath_destroy_qp_v1(struct ibv_qp *qp); + int ipath_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr); @@ -229,14 +243,23 @@ int ipath_post_recv(struct ibv_qp *ibqp, struct ibv_srq *ipath_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr); +struct ibv_srq *ipath_create_srq_v1(struct ibv_pd *pd, + struct ibv_srq_init_attr *attr); + int ipath_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask); +int ipath_modify_srq_v1(struct ibv_srq *srq, + struct ibv_srq_attr *attr, + enum ibv_srq_attr_mask attr_mask); + int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr); int ipath_destroy_srq(struct ibv_srq *srq); +int ipath_destroy_srq_v1(struct ibv_srq *srq); + int ipath_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr); From sean.hefty at intel.com Wed Aug 23 22:33:05 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 23 Aug 2006 22:33:05 -0700 Subject: [openib-general] librdmacm ABI 
issues with OFED 1.1 In-Reply-To: <20060823172147.35daed59.weiny2@llnl.gov> Message-ID: <000001c6c73e$c6a798d0$20d4180a@amr.corp.intel.com> >I have some rdma_cm test code and when I run with the OFED 1.1 code (running on >2.6.9 U3 based kernel) I got the following error. > >librdmacm: couldn't read ABI version. >librdmacm: assuming: 2 The RDMA CM places the abi_version file in /sys/class/misc/rdma_cm. The misc class didn't exist in 2.6.9, which is why it was removed from the OFED code. >The code seems to run (as it really does nothing) fine but I was wondering if I >could fix this just to clean up the output. I found that the following patch >removes the code which creates the abi_version file. If you look at Woody's backport patches, I believe that he moves the RDMA CM files to /sys/class/infiniband/rdma_cm and updates the librdmacm to read the abi_version from there. Or you could just remove the prints from the library. >How bad is it that the user space is assuming version 2 of the interface and >the modules are at version 1? It should work fine for apps using RC QPs. Version 1 assumed that the port space was RDMA TCP for RC QPs. Version 2 added support for UD QPs through the RDMA's UDP port space. The port space information was added to the end of a structure, so if an older kernel is used, it simply won't read in the port space data, and will assume TCP. An application that was expecting to use UD QP will simply get an error on some operation, likely when it tries to actually connect to a remote UD QP. - Sean From mst at mellanox.co.il Wed Aug 23 22:45:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 Aug 2006 08:45:25 +0300 Subject: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <44ECCFCD.5020002@ichips.intel.com> References: <44ECCFCD.5020002@ichips.intel.com> Message-ID: <20060824054525.GA9571@mellanox.co.il> Quoting r. Sean Hefty : > And even with these proposed changes, there's a race condition where the CM > can timeout a connection after data is received over it, but before this event > can be processed. Hmm. And what happens then? -- MST From mst at mellanox.co.il Wed Aug 23 23:15:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 Aug 2006 09:15:54 +0300 Subject: [openib-general] librdmacm ABI issues with OFED 1.1 In-Reply-To: <000001c6c73e$c6a798d0$20d4180a@amr.corp.intel.com> References: <000001c6c73e$c6a798d0$20d4180a@amr.corp.intel.com> Message-ID: <20060824061554.GC9498@mellanox.co.il> Quoting r. Sean Hefty : > If you look at Woody's backport patches, I believe that he moves the RDMA CM > files to /sys/class/infiniband/rdma_cm and updates the librdmacm to read the > abi_version from there. Maybe the librdmacm part should be merged to svn? So librdmacm could try to read from misc, then from /sys/class/infiniband/rdma_cm, and then assume latest. It's good to have userspace code portable across distros ... -- MST From rpearson at systemfabricworks.com Thu Aug 24 00:03:16 2006 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Thu, 24 Aug 2006 02:03:16 -0500 Subject: [openib-general] dropped packets Message-ID: <20060824070330.RVVR1364.rrcs-fep-10.hrndva.rr.com@BOBP> Roland, I am trying to write a user level application that receives multicast UD packets at user level. I am seeing about 1-2 % packet loss between the send side and the receive side apparently independent of the packet rate for low rates. 
(Heavily traced sends and receives with very low rates still drop packets even though there are more packets posted on the receive side than are sent.) I have a couple of questions: 1. Are there any race issues with ibv_get_cq_event? The example code (ud_pingpong) seems to imply that the correct sequence is Start: Call ibv_get_cq_event Call ibv_ack_cq_event <- anywhere so long as it happens before destroy_cq Call ibv_req_notify_cq Call ibv_poll_cq <- just once not as usual until empty according to the example Goto start In the old days we called request notify and poll until poll was empty on a notify thread in order to prevent a race. 2. When I post say 500 receive buffers and send say 200 send buffers and tag the sends with a sequence number I often see one or two missing sequence numbers at the receive side at the poll_cq interface having checked at the post and poll interfaces of the send side to see that all the correct sequence numbers went out. I am not sure how this can be possible regardless of the notification scheme used. I would love for this to be a programming error in my code but I can't figure out how I can mess it up between post_send and poll_cq on the receive side. I see the same behavior between systems and with a loopback between two ports on the same HCA. Please let me know if this rings any bells. Bob Pearson -------------- next part -------------- An HTML attachment was scrubbed... URL: From jackm at mellanox.co.il Thu Aug 24 00:12:51 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 24 Aug 2006 10:12:51 +0300 Subject: [openib-general] [PATCH] mthca: various bug fixes for mthca_query_qp In-Reply-To: References: <200608231854.03936.jackm@mellanox.co.il> Message-ID: <200608241012.51318.jackm@mellanox.co.il> On Wednesday 23 August 2006 23:25, Roland Dreier wrote: > > 5. Return the send_cq, receive cq and srq handles. ib_query_qp() needs them > > (required by IB Spec). ibv_query_qp() overwrites these values in user-space > > with appropriate user-space values. > > > + qp_init_attr->send_cq = ibqp->send_cq; > > + qp_init_attr->recv_cq = ibqp->recv_cq; > > + qp_init_attr->srq = ibqp->srq; > > I really disagree with this change. It's silly to do this copying > since the consumer already has the ibqp pointer. And it's especially > silly to put this in a low-level driver, since there's nothing > device-specific about it. > Note that the same thing is done in user-space (albeit in the verbs layer): libibverbs/src/cmd.c, lines 678 ff. (procedure ibv_cmd_query_qp() ): init_attr->send_cq = qp->send_cq; init_attr->recv_cq = qp->recv_cq; init_attr->srq = qp->srq; Either return these values in both kernel and user, or remove them from the user-space verb. Regarding putting this in the low level kernel driver, I agree. A patch for core/verbs.c is given below. (P.S. IMHO, the correct thing to do is to remove the above lines from cmd.c, and not return to the user stuff that s/he already has available). - Jack ------------- Index: ofed_1_1/drivers/infiniband/core/verbs.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/core/verbs.c 2006-08-03 14:30:21.000000000 +0300 +++ ofed_1_1/drivers/infiniband/core/verbs.c 2006-08-24 10:03:51.831333000 +0300 @@ -556,9 +556,17 @@ int ib_query_qp(struct ib_qp *qp, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr) { - return qp->device->query_qp ? 
- qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : - -ENOSYS; + int rc = -ENOSYS; + if (qp->device->query_qp) { + rc = qp->device->query_qp(qp, qp_attr, + qp_attr_mask, qp_init_attr); + if(!rc) { + qp_init_attr->recv_cq = qp->recv_cq; + qp_init_attr->send_cq = qp->send_cq; + qp_init_attr->srq = qp->srq; + } + } + return rc; } EXPORT_SYMBOL(ib_query_qp); From mst at mellanox.co.il Thu Aug 24 00:34:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 Aug 2006 10:34:51 +0300 Subject: [openib-general] dropped packets In-Reply-To: <20060824070330.RVVR1364.rrcs-fep-10.hrndva.rr.com@BOBP> References: <20060824070330.RVVR1364.rrcs-fep-10.hrndva.rr.com@BOBP> Message-ID: <20060824073450.GD8192@mellanox.co.il> Quoting r. Robert Pearson : > Subject: dropped packets > > Roland, > > > > I am trying to write a user level application that receives multicast UD packets at user level. I am seeing about 1-2 % packet loss between the send side and the receive side apparently independent of the packet rate for low rates. (Heavily traced sends > and receives with very low rates still drop packets even though there are more packets posted on the receive side than are sent.) I have a couple of questions: What kind of hardware do you have? -- MST From kliteyn at mellanox.co.il Tue Aug 22 08:41:58 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 22 Aug 2006 18:41:58 +0300 Subject: [openib-general] [PATCH] osm: handle local events Message-ID: Hi Hal This patch implements first item of the OSM todo list. OpenSM opens a thread that is listening for events on the SM's port. The events that are being taken care of are IBV_EVENT_DEVICE_FATAL and IBV_EVENT_PORT_ERROR. In case of IBV_EVENT_DEVICE_FATAL, osm is forced to exit. in case of IBV_EVENT_PORT_ERROR, osm initiates heavy sweep. Yevgeny Signed-off-by: Yevgeny Kliteynik Index: include/opensm/osm_sm_mad_ctrl.h =================================================================== --- include/opensm/osm_sm_mad_ctrl.h (revision 8998) +++ include/opensm/osm_sm_mad_ctrl.h (working copy) @@ -109,6 +109,7 @@ typedef struct _osm_sm_mad_ctrl osm_mad_pool_t *p_mad_pool; osm_vl15_t *p_vl15; osm_vendor_t *p_vendor; + struct _osm_state_mgr *p_state_mgr; osm_bind_handle_t h_bind; cl_plock_t *p_lock; cl_dispatcher_t *p_disp; @@ -130,6 +131,9 @@ typedef struct _osm_sm_mad_ctrl * p_vendor * Pointer to the vendor specific interfaces object. * +* p_state_mgr +* Pointer to the state manager object. +* * h_bind * Bind handle returned by the transport layer. * @@ -233,6 +237,7 @@ osm_sm_mad_ctrl_init( IN osm_mad_pool_t* const p_mad_pool, IN osm_vl15_t* const p_vl15, IN osm_vendor_t* const p_vendor, + IN struct _osm_state_mgr* const p_state_mgr, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, IN cl_plock_t* const p_lock, @@ -251,6 +256,9 @@ osm_sm_mad_ctrl_init( * p_vendor * [in] Pointer to the vendor specific interfaces object. * +* p_state_mgr +* [in] Pointer to the state manager object. +* * p_log * [in] Pointer to the log object. 
* Index: include/vendor/osm_vendor_ibumad.h =================================================================== --- include/vendor/osm_vendor_ibumad.h (revision 8998) +++ include/vendor/osm_vendor_ibumad.h (working copy) @@ -74,6 +74,8 @@ BEGIN_C_DECLS #define OSM_UMAD_MAX_CAS 32 #define OSM_UMAD_MAX_PORTS_PER_CA 2 +#define OSM_VENDOR_SUPPORT_EVENTS + /* OpenIB gen2 doesn't support RMPP yet */ /****s* OpenSM: Vendor UMAD/osm_ca_info_t @@ -179,6 +181,10 @@ typedef struct _osm_vendor int umad_port_id; void *receiver; int issmfd; + cl_thread_t events_thread; + void * events_callback; + void * sm_context; + struct ibv_context * ibv_context; } osm_vendor_t; #define OSM_BIND_INVALID_HANDLE 0 Index: include/vendor/osm_vendor_api.h =================================================================== --- include/vendor/osm_vendor_api.h (revision 8998) +++ include/vendor/osm_vendor_api.h (working copy) @@ -526,6 +526,110 @@ osm_vendor_set_debug( * SEE ALSO *********/ +#ifdef OSM_VENDOR_SUPPORT_EVENTS + +#define OSM_EVENT_FATAL 1 +#define OSM_EVENT_PORT_ERR 2 + +/****s* OpenSM Vendor API/osm_vend_events_callback_t +* NAME +* osm_vend_events_callback_t +* +* DESCRIPTION +* Function prototype for the vendor events callback. +* The vendor layer calls this function on driver events. +* +* SYNOPSIS +*/ +typedef void +(*osm_vend_events_callback_t)( + IN int events_mask, + IN void * const context ); +/* +* PARAMETERS +* events_mask +* [in] The received event(s). +* +* context +* [in] Context supplied as the "sm_context" argument in +* the osm_vendor_unreg_events_cb call +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* osm_vendor_reg_events_cb osm_vendor_unreg_events_cb +*********/ + +/****f* OpenSM Vendor API/osm_vendor_reg_events_cb +* NAME +* osm_vendor_reg_events_cb +* +* DESCRIPTION +* Registers the events callback function and start the events +* thread +* +* SYNOPSIS +*/ +int +osm_vendor_reg_events_cb( + IN osm_vendor_t * const p_vend, + IN void * const sm_callback, + IN void * const sm_context); +/* +* PARAMETERS +* p_vend +* [in] vendor handle. +* +* sm_callback +* [in] Callback function that should be called when +* the event is received. +* +* sm_context +* [in] Context supplied as the "context" argument in +* the subsequenct calls to the sm_callback function +* +* RETURN VALUE +* IB_SUCCESS if OK. +* +* NOTES +* +* SEE ALSO +* osm_vend_events_callback_t osm_vendor_unreg_events_cb +*********/ + +/****f* OpenSM Vendor API/osm_vendor_unreg_events_cb +* NAME +* osm_vendor_unreg_events_cb +* +* DESCRIPTION +* Un-Registers the events callback function and stops the events +* thread +* +* SYNOPSIS +*/ +void +osm_vendor_unreg_events_cb( + IN osm_vendor_t * const p_vend); +/* +* PARAMETERS +* p_vend +* [in] vendor handle. +* +* +* RETURN VALUE +* None. 
+* +* NOTES +* +* SEE ALSO +* osm_vend_events_callback_t osm_vendor_reg_events_cb +*********/ + +#endif /* OSM_VENDOR_SUPPORT_EVENTS */ + END_C_DECLS #endif /* _OSM_VENDOR_API_H_ */ Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 8998) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -72,6 +72,7 @@ #include #include #include +#include /****s* OpenSM: Vendor AL/osm_umad_bind_info_t * NAME @@ -441,6 +442,91 @@ Exit: /********************************************************************** **********************************************************************/ +static void +umad_events_thread( + IN void * vend_context) +{ + int res = 0; + osm_vendor_t * p_vend = (osm_vendor_t *) vend_context; + struct ibv_async_event event; + + OSM_LOG_ENTER( p_vend->p_log, umad_events_thread ); + + osm_log(p_vend->p_log, OSM_LOG_DEBUG, + "umad_events_thread: Device %s, async event FD: %d\n", + p_vend->umad_port.ca_name, p_vend->ibv_context->async_fd); + osm_log(p_vend->p_log, OSM_LOG_DEBUG, + "umad_events_thread: Listening for events on device %s, port %d\n", + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); + + while (1) { + + res = ibv_get_async_event(p_vend->ibv_context, &event); + if (res) + { + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "umad_events_thread: ERR 5450: " + "Failed getting async event (device %s, port %d)\n", + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); + goto Exit; + } + + if (!p_vend->events_callback) + { + osm_log(p_vend->p_log, OSM_LOG_DEBUG, + "umad_events_thread: Events callback has been unregistered\n"); + ibv_ack_async_event(&event); + goto Exit; + } + /* + * We're listening to events on the SM's port only + */ + if ( event.element.port_num == p_vend->umad_port.portnum ) + { + switch (event.event_type) + { + case IBV_EVENT_DEVICE_FATAL: + osm_log(p_vend->p_log, OSM_LOG_INFO, + "umad_events_thread: Received IBV_EVENT_DEVICE_FATAL\n"); + ((osm_vend_events_callback_t) + (p_vend->events_callback))(OSM_EVENT_FATAL, p_vend->sm_context); + + ibv_ack_async_event(&event); + goto Exit; + break; + + case IBV_EVENT_PORT_ERR: + osm_log(p_vend->p_log, OSM_LOG_VERBOSE, + "umad_events_thread: Received IBV_EVENT_PORT_ERR\n"); + ((osm_vend_events_callback_t) + (p_vend->events_callback))(OSM_EVENT_PORT_ERR, p_vend->sm_context); + break; + + default: + osm_log(p_vend->p_log, OSM_LOG_DEBUG, + "umad_events_thread: Received event #%d on port %d - Ignoring\n", + event.event_type, event.element.port_num); + } + } + else + { + osm_log(p_vend->p_log, OSM_LOG_DEBUG, + "umad_events_thread: Received event #%d on port %d - Ignoring\n", + event.event_type, event.element.port_num); + } + + ibv_ack_async_event(&event); + } + + Exit: + osm_log(p_vend->p_log, OSM_LOG_DEBUG, + "umad_events_thread: Terminating thread\n"); + OSM_LOG_EXIT(p_vend->p_log); + return; +} + +/********************************************************************** + **********************************************************************/ ib_api_status_t osm_vendor_init( IN osm_vendor_t* const p_vend, @@ -456,6 +542,7 @@ osm_vendor_init( p_vend->max_retries = OSM_DEFAULT_RETRY_COUNT; cl_spinlock_construct( &p_vend->cb_lock ); cl_spinlock_construct( &p_vend->match_tbl_lock ); + cl_thread_construct( &p_vend->events_thread ); p_vend->umad_port_id = -1; p_vend->issmfd = -1; @@ -1217,4 +1304,114 @@ osm_vendor_set_debug( umad_debug(level); } +/********************************************************************** + 
**********************************************************************/ +int +osm_vendor_reg_events_cb( + IN osm_vendor_t * const p_vend, + IN void * const sm_callback, + IN void * const sm_context) +{ + ib_api_status_t status = IB_SUCCESS; + struct ibv_device ** dev_list; + struct ibv_device * device; + + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_reg_events_cb ); + + p_vend->events_callback = sm_callback; + p_vend->sm_context = sm_context; + + dev_list = ibv_get_device_list(NULL); + if (!dev_list || !(*dev_list)) { + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_reg_events_cb: ERR 5440: " + "No IB devices found\n"); + status = IB_ERROR; + goto Exit; + } + + if (!p_vend->umad_port.ca_name || !p_vend->umad_port.ca_name[0]) + { + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_reg_events_cb: ERR 5441: " + "Vendor initialization is not completed yet\n"); + status = IB_ERROR; + goto Exit; + } + + osm_log(p_vend->p_log, OSM_LOG_DEBUG, + "osm_vendor_reg_events_cb: Registering on device %s\n", + p_vend->umad_port.ca_name); + + /* + * find device whos name matches the SM's device + */ + for ( device = *dev_list; + (device != NULL) && + (strcmp(p_vend->umad_port.ca_name, ibv_get_device_name(device)) != 0); + device += sizeof(struct ibv_device *) ) + ; + if (!device) + { + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_reg_events_cb: ERR 5442: " + "Device %s hasn't been found in the device list\n" + ,p_vend->umad_port.ca_name); + status = IB_ERROR; + goto Exit; + } + + p_vend->ibv_context = ibv_open_device(device); + if (!p_vend->ibv_context) { + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_reg_events_cb: ERR 5443: " + "Couldn't get context for %s\n", + p_vend->umad_port.ca_name); + status = IB_ERROR; + goto Exit; + } + + /* + * Initiate the events thread + */ + if (cl_thread_init(&p_vend->events_thread, + umad_events_thread, + p_vend, + "ibumad events thread") != CL_SUCCESS) { + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_reg_events_cb: ERR 5444: " + "Failed initiating event listening thread\n"); + status = IB_ERROR; + goto Exit; + } + + Exit: + if (status != IB_SUCCESS) + { + p_vend->events_callback = NULL; + p_vend->sm_context = NULL; + p_vend->ibv_context = NULL; + p_vend->events_callback = NULL; + } + OSM_LOG_EXIT( p_vend->p_log ); + return status; +} + +/********************************************************************** + **********************************************************************/ +void +osm_vendor_unreg_events_cb( + IN osm_vendor_t * const p_vend) +{ + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_unreg_events_cb ); + p_vend->events_callback = NULL; + p_vend->sm_context = NULL; + p_vend->ibv_context = NULL; + p_vend->events_callback = NULL; + OSM_LOG_EXIT( p_vend->p_log ); +} + +/********************************************************************** + **********************************************************************/ + #endif /* OSM_VENDOR_INTF_OPENIB */ Index: libvendor/libosmvendor.map =================================================================== --- libvendor/libosmvendor.map (revision 8998) +++ libvendor/libosmvendor.map (working copy) @@ -1,4 +1,4 @@ -OSMVENDOR_2.0 { +OSMVENDOR_2.1 { global: umad_receiver; osm_vendor_init; @@ -23,5 +23,7 @@ OSMVENDOR_2.0 { osmv_bind_sa; osmv_query_sa; osm_vendor_get_guid_ca_and_port; + osm_vendor_reg_events_cb; + osm_vendor_unreg_events_cb; local: *; }; Index: opensm/osm_sm.c =================================================================== --- opensm/osm_sm.c (revision 8998) +++ opensm/osm_sm.c 
(working copy) @@ -313,6 +313,7 @@ osm_sm_init( p_sm->p_mad_pool, p_sm->p_vl15, p_sm->p_vendor, + &p_sm->state_mgr, p_log, p_stats, p_lock, p_disp ); if( status != IB_SUCCESS ) goto Exit; Index: opensm/osm_sm_mad_ctrl.c =================================================================== --- opensm/osm_sm_mad_ctrl.c (revision 8998) +++ opensm/osm_sm_mad_ctrl.c (working copy) @@ -59,6 +59,7 @@ #include #include #include +#include /****f* opensm: SM/__osm_sm_mad_ctrl_retire_trans_mad * NAME @@ -953,6 +954,7 @@ osm_sm_mad_ctrl_init( IN osm_mad_pool_t* const p_mad_pool, IN osm_vl15_t* const p_vl15, IN osm_vendor_t* const p_vendor, + IN struct _osm_state_mgr* const p_state_mgr, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, IN cl_plock_t* const p_lock, @@ -969,6 +971,7 @@ osm_sm_mad_ctrl_init( p_ctrl->p_disp = p_disp; p_ctrl->p_mad_pool = p_mad_pool; p_ctrl->p_vendor = p_vendor; + p_ctrl->p_state_mgr = p_state_mgr; p_ctrl->p_stats = p_stats; p_ctrl->p_lock = p_lock; p_ctrl->p_vl15 = p_vl15; @@ -995,6 +998,47 @@ osm_sm_mad_ctrl_init( /********************************************************************** **********************************************************************/ +void +__osm_vend_events_callback( + IN int events_mask, + IN void * const context ) +{ + osm_sm_mad_ctrl_t * const p_ctrl = (osm_sm_mad_ctrl_t * const) context; + + OSM_LOG_ENTER(p_ctrl->p_log, __osm_vend_events_callback); + + if (events_mask & OSM_EVENT_FATAL) + { + osm_log(p_ctrl->p_log, OSM_LOG_INFO, + "__osm_vend_events_callback: " + "Events callback got OSM_EVENT_FATAL\n"); + osm_log(p_ctrl->p_log, OSM_LOG_SYS, + "Fatal HCA error - forcing OpenSM exit\n"); + osm_exit_flag = 1; + OSM_LOG_EXIT(p_ctrl->p_log); + return; + } + + if (events_mask & OSM_EVENT_PORT_ERR) + { + osm_log(p_ctrl->p_log, OSM_LOG_INFO, + "__osm_vend_events_callback: " + "Events callback got OSM_EVENT_PORT_ERR - forcing heavy sweep\n"); + p_ctrl->p_subn->force_immediate_heavy_sweep = TRUE; + osm_state_mgr_process((osm_state_mgr_t * const)p_ctrl->p_state_mgr, + OSM_SIGNAL_SWEEP); + OSM_LOG_EXIT(p_ctrl->p_log); + return; + } + + osm_log(p_ctrl->p_log, OSM_LOG_INFO, + "__osm_vend_events_callback: " + "Events callback got event mask of %d - No action taken\n"); + OSM_LOG_EXIT(p_ctrl->p_log); +} + +/********************************************************************** + **********************************************************************/ ib_api_status_t osm_sm_mad_ctrl_bind( IN osm_sm_mad_ctrl_t* const p_ctrl, @@ -1044,6 +1088,17 @@ osm_sm_mad_ctrl_bind( goto Exit; } + if ( osm_vendor_reg_events_cb(p_ctrl->p_vendor, + __osm_vend_events_callback, + p_ctrl) ) + { + status = IB_ERROR; + osm_log( p_ctrl->p_log, OSM_LOG_ERROR, + "osm_sm_mad_ctrl_bind: ERR 3120: " + "Vendor failed to register for events\n" ); + goto Exit; + } + Exit: OSM_LOG_EXIT( p_ctrl->p_log ); return( status ); Index: config/osmvsel.m4 =================================================================== --- config/osmvsel.m4 (revision 8998) +++ config/osmvsel.m4 (working copy) @@ -63,9 +63,9 @@ if test $with_osmv = "openib"; then OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" if test "x$with_umad_libs" = "x"; then - OSMV_LDADD="-libumad" + OSMV_LDADD="-libumad -libverbs" else - OSMV_LDADD="-L$with_umad_libs -libumad" + OSMV_LDADD="-L$with_umad_libs -libumad -libverbs" fi if test "x$with_umad_includes" != "x"; then @@ -137,6 +137,8 @@ if test 
"$disable_libcheck" != "yes"; th LDFLAGS="$LDFLAGS $OSMV_LDADD" AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibumad.])) + AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], + AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibverbs.])) LD_FLAGS=$osmv_save_ldflags elif test $with_osmv = "sim" ; then LDFLAGS="$LDFLAGS -L$with_sim/lib" From eitan at mellanox.co.il Tue Aug 22 09:58:26 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 22 Aug 2006 19:58:26 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: References: Message-ID: <44EB37B2.40906@mellanox.co.il> I did not see this on the reflector. We did have some mailer problems. So I am resending to the list One more thing to add: The only other event we considered was PORT_ACTIVE. But as it turns out the event is only generated when the port moves into ACTIVE state which means an SM already handled it... EZ Yevgeny Kliteynik wrote: > Hi Hal > > This patch implements first item of the OSM todo list. > > OpenSM opens a thread that is listening for events on the SM's port. > The events that are being taken care of are IBV_EVENT_DEVICE_FATAL and > IBV_EVENT_PORT_ERROR. > > In case of IBV_EVENT_DEVICE_FATAL, osm is forced to exit. > in case of IBV_EVENT_PORT_ERROR, osm initiates heavy sweep. > > Yevgeny > > Signed-off-by: Yevgeny Kliteynik > > Index: include/opensm/osm_sm_mad_ctrl.h > =================================================================== > --- include/opensm/osm_sm_mad_ctrl.h (revision 8998) > +++ include/opensm/osm_sm_mad_ctrl.h (working copy) > @@ -109,6 +109,7 @@ typedef struct _osm_sm_mad_ctrl > osm_mad_pool_t *p_mad_pool; > osm_vl15_t *p_vl15; > osm_vendor_t *p_vendor; > + struct _osm_state_mgr *p_state_mgr; > osm_bind_handle_t h_bind; > cl_plock_t *p_lock; > cl_dispatcher_t *p_disp; > @@ -130,6 +131,9 @@ typedef struct _osm_sm_mad_ctrl > * p_vendor > * Pointer to the vendor specific interfaces object. > * > +* p_state_mgr > +* Pointer to the state manager object. > +* > * h_bind > * Bind handle returned by the transport layer. > * > @@ -233,6 +237,7 @@ osm_sm_mad_ctrl_init( > IN osm_mad_pool_t* const p_mad_pool, > IN osm_vl15_t* const p_vl15, > IN osm_vendor_t* const p_vendor, > + IN struct _osm_state_mgr* const p_state_mgr, > IN osm_log_t* const p_log, > IN osm_stats_t* const p_stats, > IN cl_plock_t* const p_lock, > @@ -251,6 +256,9 @@ osm_sm_mad_ctrl_init( > * p_vendor > * [in] Pointer to the vendor specific interfaces object. > * > +* p_state_mgr > +* [in] Pointer to the state manager object. > +* > * p_log > * [in] Pointer to the log object. 
> * > Index: include/vendor/osm_vendor_ibumad.h > =================================================================== > --- include/vendor/osm_vendor_ibumad.h (revision 8998) > +++ include/vendor/osm_vendor_ibumad.h (working copy) > @@ -74,6 +74,8 @@ BEGIN_C_DECLS > #define OSM_UMAD_MAX_CAS 32 > #define OSM_UMAD_MAX_PORTS_PER_CA 2 > > +#define OSM_VENDOR_SUPPORT_EVENTS > + > /* OpenIB gen2 doesn't support RMPP yet */ > > /****s* OpenSM: Vendor UMAD/osm_ca_info_t > @@ -179,6 +181,10 @@ typedef struct _osm_vendor > int umad_port_id; > void *receiver; > int issmfd; > + cl_thread_t events_thread; > + void * events_callback; > + void * sm_context; > + struct ibv_context * ibv_context; > } osm_vendor_t; > > #define OSM_BIND_INVALID_HANDLE 0 > Index: include/vendor/osm_vendor_api.h > =================================================================== > --- include/vendor/osm_vendor_api.h (revision 8998) > +++ include/vendor/osm_vendor_api.h (working copy) > @@ -526,6 +526,110 @@ osm_vendor_set_debug( > * SEE ALSO > *********/ > > +#ifdef OSM_VENDOR_SUPPORT_EVENTS > + > +#define OSM_EVENT_FATAL 1 > +#define OSM_EVENT_PORT_ERR 2 > + > +/****s* OpenSM Vendor API/osm_vend_events_callback_t > +* NAME > +* osm_vend_events_callback_t > +* > +* DESCRIPTION > +* Function prototype for the vendor events callback. > +* The vendor layer calls this function on driver events. > +* > +* SYNOPSIS > +*/ > +typedef void > +(*osm_vend_events_callback_t)( > + IN int events_mask, > + IN void * const context ); > +/* > +* PARAMETERS > +* events_mask > +* [in] The received event(s). > +* > +* context > +* [in] Context supplied as the "sm_context" argument in > +* the osm_vendor_unreg_events_cb call > +* > +* RETURN VALUES > +* None. > +* > +* NOTES > +* > +* SEE ALSO > +* osm_vendor_reg_events_cb osm_vendor_unreg_events_cb > +*********/ > + > +/****f* OpenSM Vendor API/osm_vendor_reg_events_cb > +* NAME > +* osm_vendor_reg_events_cb > +* > +* DESCRIPTION > +* Registers the events callback function and start the events > +* thread > +* > +* SYNOPSIS > +*/ > +int > +osm_vendor_reg_events_cb( > + IN osm_vendor_t * const p_vend, > + IN void * const sm_callback, > + IN void * const sm_context); > +/* > +* PARAMETERS > +* p_vend > +* [in] vendor handle. > +* > +* sm_callback > +* [in] Callback function that should be called when > +* the event is received. > +* > +* sm_context > +* [in] Context supplied as the "context" argument in > +* the subsequenct calls to the sm_callback function > +* > +* RETURN VALUE > +* IB_SUCCESS if OK. > +* > +* NOTES > +* > +* SEE ALSO > +* osm_vend_events_callback_t osm_vendor_unreg_events_cb > +*********/ > + > +/****f* OpenSM Vendor API/osm_vendor_unreg_events_cb > +* NAME > +* osm_vendor_unreg_events_cb > +* > +* DESCRIPTION > +* Un-Registers the events callback function and stops the events > +* thread > +* > +* SYNOPSIS > +*/ > +void > +osm_vendor_unreg_events_cb( > + IN osm_vendor_t * const p_vend); > +/* > +* PARAMETERS > +* p_vend > +* [in] vendor handle. > +* > +* > +* RETURN VALUE > +* None. 
> +* > +* NOTES > +* > +* SEE ALSO > +* osm_vend_events_callback_t osm_vendor_reg_events_cb > +*********/ > + > +#endif /* OSM_VENDOR_SUPPORT_EVENTS */ > + > END_C_DECLS > > #endif /* _OSM_VENDOR_API_H_ */ > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 8998) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -72,6 +72,7 @@ > #include > #include > #include > +#include > > /****s* OpenSM: Vendor AL/osm_umad_bind_info_t > * NAME > @@ -441,6 +442,91 @@ Exit: > > /********************************************************************** > **********************************************************************/ > +static void > +umad_events_thread( > + IN void * vend_context) > +{ > + int res = 0; > + osm_vendor_t * p_vend = (osm_vendor_t *) vend_context; > + struct ibv_async_event event; > + > + OSM_LOG_ENTER( p_vend->p_log, umad_events_thread ); > + > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Device %s, async event FD: %d\n", > + p_vend->umad_port.ca_name, p_vend->ibv_context->async_fd); > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Listening for events on device %s, port %d\n", > + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); > + > + while (1) { > + > + res = ibv_get_async_event(p_vend->ibv_context, &event); > + if (res) > + { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "umad_events_thread: ERR 5450: " > + "Failed getting async event (device %s, port %d)\n", > + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); > + goto Exit; > + } > + > + if (!p_vend->events_callback) > + { > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Events callback has been unregistered\n"); > + ibv_ack_async_event(&event); > + goto Exit; > + } > + /* > + * We're listening to events on the SM's port only > + */ > + if ( event.element.port_num == p_vend->umad_port.portnum ) > + { > + switch (event.event_type) > + { > + case IBV_EVENT_DEVICE_FATAL: > + osm_log(p_vend->p_log, OSM_LOG_INFO, > + "umad_events_thread: Received IBV_EVENT_DEVICE_FATAL\n"); > + ((osm_vend_events_callback_t) > + (p_vend->events_callback))(OSM_EVENT_FATAL, p_vend->sm_context); > + > + ibv_ack_async_event(&event); > + goto Exit; > + break; > + > + case IBV_EVENT_PORT_ERR: > + osm_log(p_vend->p_log, OSM_LOG_VERBOSE, > + "umad_events_thread: Received IBV_EVENT_PORT_ERR\n"); > + ((osm_vend_events_callback_t) > + (p_vend->events_callback))(OSM_EVENT_PORT_ERR, p_vend->sm_context); > + break; > + > + default: > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Received event #%d on port %d - Ignoring\n", > + event.event_type, event.element.port_num); > + } > + } > + else > + { > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Received event #%d on port %d - Ignoring\n", > + event.event_type, event.element.port_num); > + } > + > + ibv_ack_async_event(&event); > + } > + > + Exit: > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "umad_events_thread: Terminating thread\n"); > + OSM_LOG_EXIT(p_vend->p_log); > + return; > +} > + > +/********************************************************************** > + **********************************************************************/ > ib_api_status_t > osm_vendor_init( > IN osm_vendor_t* const p_vend, > @@ -456,6 +542,7 @@ osm_vendor_init( > p_vend->max_retries = OSM_DEFAULT_RETRY_COUNT; > cl_spinlock_construct( &p_vend->cb_lock ); > cl_spinlock_construct( &p_vend->match_tbl_lock ); > + 
cl_thread_construct( &p_vend->events_thread ); > p_vend->umad_port_id = -1; > p_vend->issmfd = -1; > > @@ -1217,4 +1304,114 @@ osm_vendor_set_debug( > umad_debug(level); > } > > +/********************************************************************** > + **********************************************************************/ > +int > +osm_vendor_reg_events_cb( > + IN osm_vendor_t * const p_vend, > + IN void * const sm_callback, > + IN void * const sm_context) > +{ > + ib_api_status_t status = IB_SUCCESS; > + struct ibv_device ** dev_list; > + struct ibv_device * device; > + > + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_reg_events_cb ); > + > + p_vend->events_callback = sm_callback; > + p_vend->sm_context = sm_context; > + > + dev_list = ibv_get_device_list(NULL); > + if (!dev_list || !(*dev_list)) { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5440: " > + "No IB devices found\n"); > + status = IB_ERROR; > + goto Exit; > + } > + > + if (!p_vend->umad_port.ca_name || !p_vend->umad_port.ca_name[0]) > + { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5441: " > + "Vendor initialization is not completed yet\n"); > + status = IB_ERROR; > + goto Exit; > + } > + > + osm_log(p_vend->p_log, OSM_LOG_DEBUG, > + "osm_vendor_reg_events_cb: Registering on device %s\n", > + p_vend->umad_port.ca_name); > + > + /* > + * find device whos name matches the SM's device > + */ > + for ( device = *dev_list; > + (device != NULL) && > + (strcmp(p_vend->umad_port.ca_name, ibv_get_device_name(device)) != 0); > + device += sizeof(struct ibv_device *) ) > + ; > + if (!device) > + { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5442: " > + "Device %s hasn't been found in the device list\n" > + ,p_vend->umad_port.ca_name); > + status = IB_ERROR; > + goto Exit; > + } > + > + p_vend->ibv_context = ibv_open_device(device); > + if (!p_vend->ibv_context) { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5443: " > + "Couldn't get context for %s\n", > + p_vend->umad_port.ca_name); > + status = IB_ERROR; > + goto Exit; > + } > + > + /* > + * Initiate the events thread > + */ > + if (cl_thread_init(&p_vend->events_thread, > + umad_events_thread, > + p_vend, > + "ibumad events thread") != CL_SUCCESS) { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_reg_events_cb: ERR 5444: " > + "Failed initiating event listening thread\n"); > + status = IB_ERROR; > + goto Exit; > + } > + > + Exit: > + if (status != IB_SUCCESS) > + { > + p_vend->events_callback = NULL; > + p_vend->sm_context = NULL; > + p_vend->ibv_context = NULL; > + p_vend->events_callback = NULL; > + } > + OSM_LOG_EXIT( p_vend->p_log ); > + return status; > +} > + > +/********************************************************************** > + **********************************************************************/ > +void > +osm_vendor_unreg_events_cb( > + IN osm_vendor_t * const p_vend) > +{ > + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_unreg_events_cb ); > + p_vend->events_callback = NULL; > + p_vend->sm_context = NULL; > + p_vend->ibv_context = NULL; > + p_vend->events_callback = NULL; > + OSM_LOG_EXIT( p_vend->p_log ); > +} > + > +/********************************************************************** > + **********************************************************************/ > + > #endif /* OSM_VENDOR_INTF_OPENIB */ > Index: libvendor/libosmvendor.map > =================================================================== > --- 
libvendor/libosmvendor.map (revision 8998) > +++ libvendor/libosmvendor.map (working copy) > @@ -1,4 +1,4 @@ > -OSMVENDOR_2.0 { > +OSMVENDOR_2.1 { > global: > umad_receiver; > osm_vendor_init; > @@ -23,5 +23,7 @@ OSMVENDOR_2.0 { > osmv_bind_sa; > osmv_query_sa; > osm_vendor_get_guid_ca_and_port; > + osm_vendor_reg_events_cb; > + osm_vendor_unreg_events_cb; > local: *; > }; > Index: opensm/osm_sm.c > =================================================================== > --- opensm/osm_sm.c (revision 8998) > +++ opensm/osm_sm.c (working copy) > @@ -313,6 +313,7 @@ osm_sm_init( > p_sm->p_mad_pool, > p_sm->p_vl15, > p_sm->p_vendor, > + &p_sm->state_mgr, > p_log, p_stats, p_lock, p_disp ); > if( status != IB_SUCCESS ) > goto Exit; > Index: opensm/osm_sm_mad_ctrl.c > =================================================================== > --- opensm/osm_sm_mad_ctrl.c (revision 8998) > +++ opensm/osm_sm_mad_ctrl.c (working copy) > @@ -59,6 +59,7 @@ > #include > #include > #include > +#include > > /****f* opensm: SM/__osm_sm_mad_ctrl_retire_trans_mad > * NAME > @@ -953,6 +954,7 @@ osm_sm_mad_ctrl_init( > IN osm_mad_pool_t* const p_mad_pool, > IN osm_vl15_t* const p_vl15, > IN osm_vendor_t* const p_vendor, > + IN struct _osm_state_mgr* const p_state_mgr, > IN osm_log_t* const p_log, > IN osm_stats_t* const p_stats, > IN cl_plock_t* const p_lock, > @@ -969,6 +971,7 @@ osm_sm_mad_ctrl_init( > p_ctrl->p_disp = p_disp; > p_ctrl->p_mad_pool = p_mad_pool; > p_ctrl->p_vendor = p_vendor; > + p_ctrl->p_state_mgr = p_state_mgr; > p_ctrl->p_stats = p_stats; > p_ctrl->p_lock = p_lock; > p_ctrl->p_vl15 = p_vl15; > @@ -995,6 +998,47 @@ osm_sm_mad_ctrl_init( > > /********************************************************************** > **********************************************************************/ > +void > +__osm_vend_events_callback( > + IN int events_mask, > + IN void * const context ) > +{ > + osm_sm_mad_ctrl_t * const p_ctrl = (osm_sm_mad_ctrl_t * const) context; > + > + OSM_LOG_ENTER(p_ctrl->p_log, __osm_vend_events_callback); > + > + if (events_mask & OSM_EVENT_FATAL) > + { > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > + "__osm_vend_events_callback: " > + "Events callback got OSM_EVENT_FATAL\n"); > + osm_log(p_ctrl->p_log, OSM_LOG_SYS, > + "Fatal HCA error - forcing OpenSM exit\n"); > + osm_exit_flag = 1; > + OSM_LOG_EXIT(p_ctrl->p_log); > + return; > + } > + > + if (events_mask & OSM_EVENT_PORT_ERR) > + { > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > + "__osm_vend_events_callback: " > + "Events callback got OSM_EVENT_PORT_ERR - forcing heavy sweep\n"); > + p_ctrl->p_subn->force_immediate_heavy_sweep = TRUE; > + osm_state_mgr_process((osm_state_mgr_t * const)p_ctrl->p_state_mgr, > + OSM_SIGNAL_SWEEP); > + OSM_LOG_EXIT(p_ctrl->p_log); > + return; > + } > + > + osm_log(p_ctrl->p_log, OSM_LOG_INFO, > + "__osm_vend_events_callback: " > + "Events callback got event mask of %d - No action taken\n"); > + OSM_LOG_EXIT(p_ctrl->p_log); > +} > + > +/********************************************************************** > + **********************************************************************/ > ib_api_status_t > osm_sm_mad_ctrl_bind( > IN osm_sm_mad_ctrl_t* const p_ctrl, > @@ -1044,6 +1088,17 @@ osm_sm_mad_ctrl_bind( > goto Exit; > } > > + if ( osm_vendor_reg_events_cb(p_ctrl->p_vendor, > + __osm_vend_events_callback, > + p_ctrl) ) > + { > + status = IB_ERROR; > + osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > + "osm_sm_mad_ctrl_bind: ERR 3120: " > + "Vendor failed to register for events\n" ); > + goto Exit; > + } > + > 
Exit: > OSM_LOG_EXIT( p_ctrl->p_log ); > return( status ); > Index: config/osmvsel.m4 > =================================================================== > --- config/osmvsel.m4 (revision 8998) > +++ config/osmvsel.m4 (working copy) > @@ -63,9 +63,9 @@ if test $with_osmv = "openib"; then > OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" > OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" > if test "x$with_umad_libs" = "x"; then > - OSMV_LDADD="-libumad" > + OSMV_LDADD="-libumad -libverbs" > else > - OSMV_LDADD="-L$with_umad_libs -libumad" > + OSMV_LDADD="-L$with_umad_libs -libumad -libverbs" > fi > > if test "x$with_umad_includes" != "x"; then > @@ -137,6 +137,8 @@ if test "$disable_libcheck" != "yes"; th > LDFLAGS="$LDFLAGS $OSMV_LDADD" > AC_CHECK_LIB(ibumad, umad_init, [], > AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibumad.])) > + AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], > + AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibverbs.])) > LD_FLAGS=$osmv_save_ldflags > elif test $with_osmv = "sim" ; then > LDFLAGS="$LDFLAGS -L$with_sim/lib" > From tziporet at mellanox.co.il Thu Aug 24 02:15:39 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 24 Aug 2006 12:15:39 +0300 Subject: [openib-general] OFED 1.1-rc2 is delayed for next Monday (Aug-21) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7777@mtlexch01.mtl.com> This is already done - you can use OFED 1.1-rc2 or 1.0.1 Tziporet -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Troy Telford Sent: Thursday, August 17, 2006 8:57 PM To: OPENIB Subject: Re: [openib-general] OFED 1.1-rc2 is delayed for next Monday (Aug-21) Quick Question: https://docs.mellanox.com/dm/ibg2/OFED_release_notes_Mellanox.txt states that "SLES 9 SP3. is planned for OFED rev 1.1." Is this still something I can look forward to, or has it been pushed back? Thanks, -- Troy Telford _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From johnt1johnt2 at gmail.com Thu Aug 24 03:29:14 2006 From: johnt1johnt2 at gmail.com (john t) Date: Thu, 24 Aug 2006 15:59:14 +0530 Subject: [openib-general] IPoIB Message-ID: Hi, Does IPoIB work across IB subnets. For example if there are 4 IB subnets and all the IB subnets are reachable (meaning there is one host that is in all the IB subnets), then is it possible to ping/ssh to hosts in other IB subnets using IPoIB (assuming there is no ethernet) Regards, John T. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Thu Aug 24 06:28:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 Aug 2006 16:28:48 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: References: Message-ID: <20060824132848.GI8192@mellanox.co.il> Quoting r. Yevgeny Kliteynik : > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 8998) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -72,6 +72,7 @@ > #include > #include > #include > +#include > > /****s* OpenSM: Vendor AL/osm_umad_bind_info_t > * NAME NAK. 
This means that the SM becomes dependent on the uverbs module. I don't think this is a good idea. Let's not go there - SM should depend just on the umad module and libc. In particular, SM should work even on embedded platforms where uverbs do not necessarily work. Further, hotplug events still do not seem to be handled, even with this patch. For port events, it seems sane that umad module could provide a way to listen for them. A recent patch to mthca converts fatal events to hotplug events, so fatal events can and should be handled as part of general hotplug support. -- MST From christian.guggenberger at rzg.mpg.de Thu Aug 24 07:07:04 2006 From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger) Date: Thu, 24 Aug 2006 16:07:04 +0200 Subject: [openib-general] openib-general] OFED 1.1-rc2 is delayed for next Monday (Aug-21) Message-ID: <20060824140704.GD13066@daltons.rzg.mpg.de> >This is already done - you can use OFED 1.1-rc2 or 1.0.1 > >Tziporet Hi again, regarding SLES9 SP3: is it also planned to support later kernels than 2.6.5-7.244 ? I tried to build against 2.6.5-7.276, but failed. I also filed a bug (#192) some time ago, but never got a response, unfortunately. cheers. - Christian -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5594 bytes Desc: not available URL: From eitan at mellanox.co.il Thu Aug 24 07:38:21 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 24 Aug 2006 17:38:21 +0300 Subject: [openib-general] [PATCH] osm: handle local events Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302C467D7@mtlexch01.mtl.com> Hi MST, Hotplug event of HCA FATAL is supported and this is the #1 issue we are troubled with. Regarding other OpenSM vendors' implementations (like the switch stack or gen1 stacks): this patch makes OpenSM agnostic to their support of events. If they do support it, OpenSM will register and behave according to the events; if they do not, then no harm is done. Regarding an ibumad implementation of propagating events: once it is supported, I will vote for rewriting the events feature to use that interface. But it does not exist now, and libibverbs does exist. If required, we could make OSM_VENDOR_SUPPORT_EVENTS depend on the availability of libibverbs. But today it is common practice to have that lib compiled in all distributions. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Michael S. Tsirkin > Sent: Thursday, August 24, 2006 4:29 PM > To: Yevgeny Kliteynik > Cc: OPENIB; Roland Dreier; Hal Rosenstock; Eitan Zahavi > Subject: Re: [PATCH] osm: handle local events > > Quoting r. Yevgeny Kliteynik : > > Index: libvendor/osm_vendor_ibumad.c > > > ================================================================ === > > --- libvendor/osm_vendor_ibumad.c (revision 8998) > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > @@ -72,6 +72,7 @@ > > #include > > #include > > #include > > +#include > > > > /****s* OpenSM: Vendor AL/osm_umad_bind_info_t > > * NAME > > NAK. > > This means that the SM becomes dependent on the uverbs module. I don't > think this is a good idea. Let's not go there - SM should depend just on the > umad module and libc. In particular, SM should work even on embedded > platforms where uverbs do not necessarily work. > > Further, hotplug events still do not seem to be handled, even with this patch.
> > For port events, it seems sane that umad module could provide a way to listen > for them. > > A recent patch to mthca converts fatal events to hotplug events, so fatal > events can and should be handled as part of general hotplug support. > > -- > MST From tziporet at mellanox.co.il Thu Aug 24 08:08:07 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 24 Aug 2006 18:08:07 +0300 Subject: [openib-general] openib-general] OFED 1.1-rc2 is delayed for next Monday (Aug-21) In-Reply-To: <20060824140704.GD13066@daltons.rzg.mpg.de> References: <20060824140704.GD13066@daltons.rzg.mpg.de> Message-ID: <44EDC0D7.6070704@mellanox.co.il> Christian Guggenberger wrote: > is it also planned to support later kernels than 2.6.5-7.244 ? I tried > to build against 2.6.5-7.276, but failed. I also failed a bug (#192) > some time ago, but never got a response, unfortunately. > > > To solve this problem in 1.1 you should apply the following change on both build_env.sh and the configure: Replace the check for kernel 2.6.5-7.244* with 2.6.5-7.* The configure script is located at: OFED-1.1/SOURCES/openib-1.1.tgz under directory openib-1.1/configure. (Need to open the tgz change it and tar again) The build_env.sh is located on the OFED-1.1 directory. Note that for OFED 1.1 we will replace the check (will be done in RC3) since we understand it's limiting us to be so specific. Tziporet From christian.guggenberger at rzg.mpg.de Thu Aug 24 08:20:20 2006 From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger) Date: Thu, 24 Aug 2006 17:20:20 +0200 Subject: [openib-general] openib-general] OFED 1.1-rc2 is delayed for next Monday (Aug-21) In-Reply-To: <44EDC0D7.6070704@mellanox.co.il> References: <20060824140704.GD13066@daltons.rzg.mpg.de> <44EDC0D7.6070704@mellanox.co.il> Message-ID: <20060824152020.GE13066@daltons.rzg.mpg.de> On Thu, Aug 24, 2006 at 06:08:07PM +0300, Tziporet Koren wrote: > Christian Guggenberger wrote: > >is it also planned to support later kernels than 2.6.5-7.244 ? I tried > >to build against 2.6.5-7.276, but failed. I also failed a bug (#192) > >some time ago, but never got a response, unfortunately. > > > > > > > > To solve this problem in 1.1 you should apply the following change on > both build_env.sh and the configure: > Replace the check for kernel 2.6.5-7.244* with 2.6.5-7.* > > The configure script is located at: OFED-1.1/SOURCES/openib-1.1.tgz > under directory openib-1.1/configure. (Need to open the tgz change it > and tar again) > I tried such things before already, but never got any step further. Compilation still aborts, and the cause for this is that the build system still does not know how to handle, e.g., 2.6.5-7.276-smp. (excerpt from the logs: ... patching file drivers/infiniband/hw/mthca/mthca_provider.c Hunk #1 succeeded at 1289 (offset 2 lines). Hunk #2 succeeded at 1314 (offset 2 lines). No Patches found for 2.6.5-7.276-smp kernel /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache ... ) cheers. - Christian -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5594 bytes Desc: not available URL: From mst at mellanox.co.il Thu Aug 24 08:23:19 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 24 Aug 2006 18:23:19 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302C467D7@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302C467D7@mtlexch01.mtl.com> Message-ID: <20060824152319.GA11239@mellanox.co.il> Quoting r. Eitan Zahavi : > Subject: RE: [PATCH] osm: handle local events > > Hi MST, > > Hotplug event of HCA FATAL is supported and this is the #1 issue we are troubled with. No I mean real hotplug. -- MST From eitan at mellanox.co.il Thu Aug 24 08:31:42 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 24 Aug 2006 18:31:42 +0300 Subject: [openib-general] [PATCH] osm: handle local events Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302C46814@mtlexch01.mtl.com> > > Hotplug event of HCA FATAL is supported and this is the #1 issue we are > troubled with. > > No I mean real hotplug. [EZ] And umad will provide it? > > -- > MST From mshefty at ichips.intel.com Thu Aug 24 09:18:50 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 09:18:50 -0700 Subject: [openib-general] librdmacm ABI issues with OFED 1.1 In-Reply-To: <20060824061554.GC9498@mellanox.co.il> References: <000001c6c73e$c6a798d0$20d4180a@amr.corp.intel.com> <20060824061554.GC9498@mellanox.co.il> Message-ID: <44EDD16A.5070006@ichips.intel.com> Michael S. Tsirkin wrote: > Maybe the librdmacm part should be merged to svn? > So librdmacm could try to read from misc, then from > /sys/class/infiniband/rdma_cm, and then assume latest. > It's good to have userspace code portable across distros ... I can go with that. - Sean From mshefty at ichips.intel.com Thu Aug 24 09:21:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 09:21:02 -0700 Subject: [openib-general] [PATCH 0/4] Dispatch communication related events to the IB CM In-Reply-To: <20060824054525.GA9571@mellanox.co.il> References: <44ECCFCD.5020002@ichips.intel.com> <20060824054525.GA9571@mellanox.co.il> Message-ID: <44EDD1EE.5020309@ichips.intel.com> Michael S. Tsirkin wrote: >>And even with these proposed changes, there's a race condition where the CM >>can timeout a connection after data is received over it, but before this event >>can be processed. > > > Hmm. And what happens then? The connection is aborted by the CM. The CM sends a REJ for the connection and bumps the QP into timewait. - Sean From rdreier at cisco.com Thu Aug 24 09:31:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 24 Aug 2006 09:31:34 -0700 Subject: [openib-general] drop mthca from svn? (was: Rollup patch for ipath and OFED) In-Reply-To: <44EC99F1.9060102@ichips.intel.com> (Sean Hefty's message of "Wed, 23 Aug 2006 11:09:53 -0700") References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> Message-ID: Sean> Why not remove your code from SVN? Along those lines, how would people feel if I removed the mthca kernel code from svn, and just maintained mthca in kernel.org git trees? I am getting heartily sick of double checkins for every mthca change... - R. From rdreier at cisco.com Thu Aug 24 09:43:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 24 Aug 2006 09:43:21 -0700 Subject: [openib-general] dropped packets In-Reply-To: <20060824070330.RVVR1364.rrcs-fep-10.hrndva.rr.com@BOBP> ( Robert Pearson's message of "Thu, 24 Aug 2006 02:03:16 -0500") References: <20060824070330.RVVR1364.rrcs-fep-10.hrndva.rr.com@BOBP> Message-ID: I have no idea what your bug is, but as for this part: > 1. 
Are there any race issues with ibv_get_cq_event? The example code > (ud_pingpong) seems to imply that the correct sequence is > > Start: > > Call ibv_get_cq_event > > Call ibv_ack_cq_event <- anywhere so long as it happens before destroy_cq > > Call ibv_req_notify_cq > > Call ibv_poll_cq <- just once not as usual until empty according to the example > > Goto start this pingpong example is somewhat of a special case, since we know in advance that no more than 1 send and 1 receive completion can ever be outstanding, so a single ibv_poll_cq suffices. (Although thinking about this now, there may be some theoretical races that could case problems) - R. From weiny2 at llnl.gov Thu Aug 24 09:57:54 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 24 Aug 2006 09:57:54 -0700 Subject: [openib-general] librdmacm ABI issues with OFED 1.1 In-Reply-To: <44EDD16A.5070006@ichips.intel.com> References: <000001c6c73e$c6a798d0$20d4180a@amr.corp.intel.com> <20060824061554.GC9498@mellanox.co.il> <44EDD16A.5070006@ichips.intel.com> Message-ID: <20060824095754.460f8f86.weiny2@llnl.gov> On Thu, 24 Aug 2006 09:18:50 -0700 "Sean Hefty" wrote: > Michael S. Tsirkin wrote: > > Maybe the librdmacm part should be merged to svn? > > So librdmacm could try to read from misc, then from > > /sys/class/infiniband/rdma_cm, and then assume latest. > > It's good to have userspace code portable across distros ... > > I can go with that. > > - Sean > Something like this? Ira Index: openib/src/userspace/librdmacm/src/cma.c =================================================================== --- openib/src/userspace/librdmacm/src/cma.c (revision 213) +++ openib/src/userspace/librdmacm/src/cma.c (revision 220) @@ -141,9 +141,13 @@ { char value[8]; - if (ibv_read_sysfs_file(ibv_get_sysfs_path(), + if ((ibv_read_sysfs_file(ibv_get_sysfs_path(), "class/misc/rdma_cm/abi_version", - value, sizeof value) < 0) { + value, sizeof value) < 0) + && + (ibv_read_sysfs_file(ibv_get_sysfs_path(), + "class/infiniband_ucma/abi_version", + value, sizeof value) < 0)) { /* * Older version of Linux do not have class/misc. To support * backports, assume the most recent version of the ABI. If From mshefty at ichips.intel.com Thu Aug 24 10:33:01 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 10:33:01 -0700 Subject: [openib-general] librdmacm ABI issues with OFED 1.1 In-Reply-To: <20060824095754.460f8f86.weiny2@llnl.gov> References: <000001c6c73e$c6a798d0$20d4180a@amr.corp.intel.com> <20060824061554.GC9498@mellanox.co.il> <44EDD16A.5070006@ichips.intel.com> <20060824095754.460f8f86.weiny2@llnl.gov> Message-ID: <44EDE2CD.6070608@ichips.intel.com> I committed this change to the librdmacm in svn 9105. It still requires a backport patch for the kernel code. - Sean From rjwalsh at pathscale.com Thu Aug 24 10:38:55 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Thu, 24 Aug 2006 10:38:55 -0700 Subject: [openib-general] drop mthca from svn? In-Reply-To: References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> Message-ID: <44EDE42F.5070209@pathscale.com> Roland Dreier wrote: > Sean> Why not remove your code from SVN? > > Along those lines, how would people feel if I removed the mthca kernel > code from svn, and just maintained mthca in kernel.org git trees? I > am getting heartily sick of double checkins for every mthca change... I think this is a fine idea. Both drivers should now use git as the golden source. 
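For reference on the ibv_get_cq_event sequence discussed in the dropped-packets thread above: the usual way to make the event-driven loop safe is to ack the event, re-arm notification, and only then drain the CQ until it is empty. The following is a minimal, generic sketch of that pattern; it is not code from ud_pingpong or from any patch in this thread, and process_wc() is a placeholder for application-specific handling.

/*
 * Minimal sketch of a completion-event loop over libibverbs, assuming a
 * single CQ bound to one completion channel.  process_wc() is a placeholder.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

static void process_wc(struct ibv_wc *wc)
{
    /* Application-specific completion handling would go here. */
    (void) wc;
}

static int cq_event_loop(struct ibv_comp_channel *channel)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;
    int n;

    for (;;) {
        /* Block until the CQ generates a completion event. */
        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
            return -1;

        /* The ack may happen any time before ibv_destroy_cq();
         * doing it here keeps the accounting simple. */
        ibv_ack_cq_events(ev_cq, 1);

        /* Re-arm notification before draining, so a completion that
         * arrives while we poll still produces a new event. */
        if (ibv_req_notify_cq(ev_cq, 0))
            return -1;

        /* Drain the CQ completely; polling only once risks leaving
         * completions behind until the next event. */
        while ((n = ibv_poll_cq(ev_cq, 1, &wc)) > 0) {
            if (wc.status != IBV_WC_SUCCESS)
                fprintf(stderr, "completion error %d\n", wc.status);
            process_wc(&wc);
        }
        if (n < 0)
            return -1;
    }
}

Re-arming before the drain covers the window alluded to above: a completion that arrives between the last poll and the next ibv_req_notify_cq() would otherwise sit in the CQ without ever generating an event.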
From robert.j.woodruff at intel.com Thu Aug 24 11:02:17 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 24 Aug 2006 11:02:17 -0700 Subject: [openib-general] [openfabrics-ewg] librdmacm ABI issues with OFED 1.1 Message-ID: Here is a patch that I used in my backport to 2.6.9 for RedHat EL4 - U3 svn 8841 openib fixups patch. It also applies to svn9006 and will likely work fine on the OFED 1.1 code if you replace the current patch that removes the creation of the abi_version with this one that creates it under /sys/class/infiniband_uma and modify the library to check either place (Ira's patch below), it should work. woody -----Original Message----- From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Ira Weiny Sent: Thursday, August 24, 2006 9:58 AM To: Sean Hefty Cc: mst at mellanox.co.il; openfabrics-ewg at openib.org; openib-general at openib.org Subject: Re: [openfabrics-ewg] [openib-general] librdmacm ABI issues with OFED 1.1 On Thu, 24 Aug 2006 09:18:50 -0700 "Sean Hefty" wrote: > Michael S. Tsirkin wrote: > > Maybe the librdmacm part should be merged to svn? > > So librdmacm could try to read from misc, then from > > /sys/class/infiniband/rdma_cm, and then assume latest. > > It's good to have userspace code portable across distros ... > > I can go with that. > > - Sean > Something like this? Ira Index: openib/src/userspace/librdmacm/src/cma.c =================================================================== --- openib/src/userspace/librdmacm/src/cma.c (revision 213) +++ openib/src/userspace/librdmacm/src/cma.c (revision 220) @@ -141,9 +141,13 @@ { char value[8]; - if (ibv_read_sysfs_file(ibv_get_sysfs_path(), + if ((ibv_read_sysfs_file(ibv_get_sysfs_path(), "class/misc/rdma_cm/abi_version", - value, sizeof value) < 0) { + value, sizeof value) < 0) + && + (ibv_read_sysfs_file(ibv_get_sysfs_path(), + "class/infiniband_ucma/abi_version", + value, sizeof value) < 0)) { /* * Older version of Linux do not have class/misc. To support * backports, assume the most recent version of the ABI. If _______________________________________________ openfabrics-ewg mailing list openfabrics-ewg at openib.org http://openib.org/mailman/listinfo/openfabrics-ewg -------------- next part -------------- A non-text attachment was scrubbed... Name: ucma_abi_version_backport_to_2.6.9.patch Type: application/octet-stream Size: 5876 bytes Desc: ucma_abi_version_backport_to_2.6.9.patch URL: From bos at pathscale.com Thu Aug 24 11:49:47 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 24 Aug 2006 11:49:47 -0700 Subject: [openib-general] drop mthca from svn? (was: Rollup patch for ipath and OFED) In-Reply-To: References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> Message-ID: <1156445387.17908.6.camel@chalcedony.pathscale.com> On Thu, 2006-08-24 at 09:31 -0700, Roland Dreier wrote: > Along those lines, how would people feel if I removed the mthca kernel > code from svn, and just maintained mthca in kernel.org git trees? +1 from me. We'll drop the ipath code, too. References: <44EC825E.2030709@ichips.intel.com> <20060823230812.GD13187@greglaptop.t-mobile.de> Message-ID: <20060824202227.GA2962@greglaptop.hotels-on-air.de> On Wed, Aug 23, 2006 at 04:46:52PM -0700, Roland Dreier wrote: > Yes, Mellanox documents that it is safe to rely on the last byte of an > RDMA being written last. OK, great. I'm fine with people using things which are supported, but then we need the big, blinking "Warning! 
This program is non-standard, and won't work with many of the devices supported by Open Fabrics!" sign. -- greg From suri at baymicrosystems.com Thu Aug 24 13:42:06 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 24 Aug 2006 16:42:06 -0400 Subject: [openib-general] Utilities for sending traffic with different SL In-Reply-To: <20060824152319.GA11239@mellanox.co.il> Message-ID: <200608242042.k7OKgBWG010194@mail.baymicrosystems.com> Folks: Is there a utility within the OFED 1.0 package which can be used for generating traffic on different SLs (akin to the Voltaire perf_main utility)? Many thanks, Suri From sean.hefty at intel.com Thu Aug 24 13:57:39 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 13:57:39 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <20060824202227.GA2962@greglaptop.hotels-on-air.de> Message-ID: <000201c6c7bf$ef6b37a0$ff0da8c0@amr.corp.intel.com> >OK, great. I'm fine with people using things which are supported, but >then we need the big, blinking "Warning! This program is non-standard, and >won't work with many of the devices supported by Open Fabrics!" sign. If an application were written to use Myrinet, would you consider it non-standard? The application is simply written to take advantage of specific hardware. It's up to the application to verify that the hardware that they're using provides the required features, or adjust accordingly, and publish those requirements to the end users. - Sean From robert.j.woodruff at intel.com Thu Aug 24 14:10:38 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 24 Aug 2006 14:10:38 -0700 Subject: [openib-general] basic IB doubt Message-ID: Greg wrote, >OK, great. I'm fine with people using things which are supported, but >then we need the big, blinking "Warning! This program is non-standard, and >won't work with many of the devices supported by Open Fabrics!" sign. >-- greg The other way to look at it is, the customer goes to the ISV and asks, what hardware should I buy, and the ISV says I support X version of MPI and vendor Y's hardware works with X version of MPI. The customer buys the software solution and will get the hardware that works with that software. So, if you want your hardware to work with basically almost any MPI today, since most of the MPIs assume this data placement ordering, then you will make your hardware so that it will guarantee this type of data delivery. If you would rather not, because some spec says you don't have to, then you are just limiting the amount of hardware that you will sell. my 2 cents, woody From greg.lindahl at qlogic.com Thu Aug 24 14:32:25 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 14:32:25 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <000201c6c7bf$ef6b37a0$ff0da8c0@amr.corp.intel.com> References: <20060824202227.GA2962@greglaptop.hotels-on-air.de> <000201c6c7bf$ef6b37a0$ff0da8c0@amr.corp.intel.com> Message-ID: <20060824213225.GI2962@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 01:57:39PM -0700, Sean Hefty wrote: > >OK, great. I'm fine with people using things which are supported, but > >then we need the big, blinking "Warning! This program is non-standard, and > >won't work with many of the devices supported by Open Fabrics!" sign. > > If an application were written to use Myrinet, would you consider it > non-standard? Er, this question is a bit existential for my taste. Myrinet has its own standards.
We're trying to create *inter-operable* hardware and software in this community. So we follow the IB standard. Myricom is doing their own thing, although of course they have software which obeys the Ethernet, VIA, DAPL, and other standards. And I expect that if they say they obey a standard, that they do. They're good people that way. > It's up to the application to verify that the hardware that they're > using provides the required features, or adjust accordingly, and > publish those requirements to the end users. If that was being done (and it isn't), it would still be bad for the ecosystem as a whole. But, basically, that's about the same as what I proposed, quoted above. -- g From greg.lindahl at qlogic.com Thu Aug 24 14:53:27 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 14:53:27 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <20060824215327.GJ2962@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 02:10:38PM -0700, Woodruff, Robert J wrote: > The other way to look at it is, the customer goes to the ISV and asks, > what hardware should I buy, and the ISV says I support X version of MPI > and vendor Y's hardware works with X version of MPI. I thought the goal of InfiniBand was to create an ecosystem where you didn't have to do this. I guess I missed something somewhere. Adding undocumented requirements to a standard isn't the way to entice more people into implementing or using it. I would challenge you to find a single ISV that would prefer a situation where some "infiniband" middleware requires things which aren't in the standard. > So, if you want your hardware to work with > basically almost any MPI today, since most of the MPIs assume this > data placement ordering, then you will make your hardware so that > it will guarantee this type of data delivery. I think there's some confusion here between practicality and theory. There's no question that any IB vendor would make that kind of decision, although it will be expensive and annoying for iWarp vendors to do so. They'll feel forced to do so. But this is bad for the community. -- greg From sean.hefty at intel.com Thu Aug 24 14:58:18 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 14:58:18 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <20060824213225.GI2962@greglaptop.hotels-on-air.de> Message-ID: <000301c6c7c8$688a7990$ff0da8c0@amr.corp.intel.com> >We're trying to create *inter-operable* hardware and >software in this community. So we follow the IB standard. Atomic operations and RDD are optional, yet still part of the IB "standard". An application that makes use of either of these isn't guaranteed to operate with all IB hardware. I'm not even sure that CAs are required to implement RDMA reads. >> It's up to the application to verify that the hardware that they're >> using provides the required features, or adjust accordingly, and >> publish those requirements to the end users. > >If that was being done (and it isn't), it would still be bad for the >ecosystem as a whole. Applications should drive the requirements. Some poll on memory today. A lot of existing hardware provides support for this by guaranteeing that the last byte will always be written last. This doesn't mean that data cannot be placed out of order, only that the last byte is deferred. Again, if a vendor wants to work with applications written this way, then this is a feature that should be provided. 
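As a rough sketch (not code from this thread), the two completion checks being compared look something like the following. The convention that the sender makes the final byte of the payload non-zero, and all of the names, are illustrative assumptions; only the verbs calls (ibv_poll_cq and the related types) are real libibverbs API.

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Style 1: poll the target memory itself.  Only safe on hardware that
 * guarantees the last byte of the RDMA write is placed last. */
static void wait_for_last_byte(volatile uint8_t *rdma_buf, size_t len)
{
	while (rdma_buf[len - 1] == 0)
		;	/* spin until the sender's flag byte lands */
}

/* Style 2: poll the CQ (the IB/iWARP-standard way).  Note the target of a
 * plain RDMA write sees no completion; this assumes the transfer is
 * signalled by a send or an RDMA write with immediate that consumes a
 * pre-posted receive. */
static void wait_for_cqe(struct ibv_cq *cq)
{
	struct ibv_wc wc;
	int n;

	do {
		n = ibv_poll_cq(cq, 1, &wc);	/* 0 means the CQ is empty */
	} while (n == 0);
	/* n < 0 is a polling error; otherwise wc.status reports the result */
}

The first style is what the MPIs discussed in this thread do; the second is what the specs actually guarantee everywhere.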
If a vendor doesn't care about working with those applications, or wants to require that the apps be rewritten, then this feature isn't important. But I do not see an issue with a vendor adding value beyond what's defined in the spec. - Sean From greg.lindahl at qlogic.com Thu Aug 24 15:07:18 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 15:07:18 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <000301c6c7c8$688a7990$ff0da8c0@amr.corp.intel.com> References: <20060824213225.GI2962@greglaptop.hotels-on-air.de> <000301c6c7c8$688a7990$ff0da8c0@amr.corp.intel.com> Message-ID: <20060824220718.GC3670@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 02:58:18PM -0700, Sean Hefty wrote: > >We're trying to create *inter-operable* hardware and > >software in this community. So we follow the IB standard. > > Atomic operations and RDD are optional, yet still part of the IB "standard". An > application that makes use of either of these isn't guaranteed to operate with > all IB hardware. But those are examples of things which are actually written down in the standard. The example we were talking about isn't. > But I do not see an issue with a vendor adding value beyond what's defined in > the spec. Neither do I. If you think so, you haven't understood my argument. -- greg From robert.j.woodruff at intel.com Thu Aug 24 15:13:33 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 24 Aug 2006 15:13:33 -0700 Subject: [openib-general] basic IB doubt Message-ID: Greg wrote, >On Thu, Aug 24, 2006 at 02:10:38PM -0700, Woodruff, Robert J wrote: >> The other way to look at it is, the customer goes to the ISV and asks, >> what hardware should I buy, and the ISV says I support X version of MPI >> and vendor Y's hardware works with X version of MPI. >I thought the goal of InfiniBand was to create an ecosystem where you >didn't have to do this. I guess I missed something somewhere. The customers still buy the hardware because it runs an application; they don't just go out and buy some cool InfiniBand hardware and then go see what they can use it for. >>Adding undocumented requirements to a standard isn't the way to entice >>more people into implementing or using it. Well, sometimes the applications drive the feature set, even though it is not part of the actual standard. Unfortunate, but a fact of life. >I would challenge you to find a single ISV that would prefer a >situation where some "infiniband" middleware requires things which >aren't in the standard. If the feature gives them a huge advantage in performance (and it does) and all of the hardware vendors that they deal with already implement it, then yes, they will force, by de facto standard, that all other newcomers implement it or face the fact that no one will buy their hardware. From greg.lindahl at qlogic.com Thu Aug 24 15:18:33 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 15:18:33 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <20060824221833.GD3670@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 03:13:33PM -0700, Woodruff, Robert J wrote: > If the feature gives them a huge advantage in performance (and it > does) and all of the hardware vendors that they deal with already > implement it, then yes, they will force, by de facto standard, that > all other newcomers implement it or face the fact that no one will > buy their hardware.
It seems like that is what is happening in this > case. In this case the feature reduces performance on one HCA and increases it on another. Which shows why it's a bad idea to pick features based on a single implementation. But you're still confusing practicality and theory. I can see why it makes practical sense for newcomers to implement this new, performance-reducing feature. But why is it theoretically good? And shouldn't it be added to the standard, before all the poor iWarp people discover the hard way that they need it? -- greg From robert.j.woodruff at intel.com Thu Aug 24 15:28:44 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 24 Aug 2006 15:28:44 -0700 Subject: [openib-general] basic IB doubt Message-ID: Greg wrote, >reducing feature. But why is it theoretically good? And shouldn't it >be added to the standard, before all the poor iWarp people discover >the hard way that they need it? >-- greg Yes, IMO, the iWarp folks and the IBTA should consider making this a requirement, but even if they do not the ISVs will still require it. That being said, if you can show the ISVs how they can implement their completion model faster using some other mechanism than what they do now, they would probably listen, as they are going to do what gives them the best performance, standard or not. From ftillier at silverstorm.com Thu Aug 24 15:31:38 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Thu, 24 Aug 2006 15:31:38 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <79ae2f320608241531n4b6c5010x88cdc25c474e6baf@mail.gmail.com> On 8/24/06, Woodruff, Robert J wrote: > If the feature gives them a huge advantage in performance (and it does) > and all of the hardware vendors that they deal with already implement > it, then yes, > they will force, by de facto standard, that all other newcomers implement > it > or face the fact that no one will buy their hardware. It seems like that > is > what is happening in this case. Actually, if a hardware implementation provided the same performance (in this case latency) by polling on a CQ as one where polling on memory was guaranteed to work, the customer may actually prefer the "standard" implementation. If the CQ entry could be updated faster than it is today, then polling the CQ would be a viable, not to mention IB and iWARP standard, solution. The problem is there's a considerable delay from when the last byte of data is written to when the CQE is written. My last measurement was 2us using a Mellanox PCI-X device. - Fab From greg.lindahl at qlogic.com Thu Aug 24 15:35:09 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 15:35:09 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <20060824223509.GA3927@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 03:28:44PM -0700, Woodruff, Robert J wrote: > Yes, IMO, the iWarp folks and the IBTA should consider making this a > requirement, but even if they do not the ISVs will still require it. Ahah. So with all this heat and light, we agree that it should be added to the standard. > That being said, if you can show the ISVs how they can implement > their completion model faster using some other mechanism than > what they do now, they would probably listen, as they are going to do > what gives them the best performance, standard or not. The other way to do it, faster on some HCAs, is to follow the standard. So, no showing needed.
If MPI implementations implemented this in addition to the other, non-standard way, and automagically picked the right one, we wouldn't be having this discussion. So, I think we're ending up in agreement. Feel free to disagree ;-) -- greg From sean.hefty at intel.com Thu Aug 24 15:37:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 15:37:21 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <20060824221833.GD3670@greglaptop.hotels-on-air.de> Message-ID: <000401c6c7cd$dcf5b0b0$ff0da8c0@amr.corp.intel.com> >But you're still confusing practicality and theory. I can see why it >makes practical sense for newcomers to implement this new, performance- >reducing feature. But why is it theoretically good? I'm missing the standard you're using to judge what's theoretically good and bad. Applications are written this way today. A vendor can either: * Support those apps by providing the feature. * Require that the apps be rewritten to use their hardware. Whether apps should have been written this way seems irrelevant. They are, and we should make decisions based on that, including extending the spec and/or implementation if needed. - Sean From sean.hefty at intel.com Thu Aug 24 15:43:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 15:43:38 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <79ae2f320608241531n4b6c5010x88cdc25c474e6baf@mail.gmail.com> Message-ID: <000501c6c7ce$bd6178f0$ff0da8c0@amr.corp.intel.com> >Actually, if a hardware implementation provided the same performance >(in this case latency) by polling on a CQ as one where polling on >memory was guaranteed to work, the customer may actually prefer the >"standard" implementation. Polling on a CQ involves a function call, synchronization to the CQ, and formatting a structure to return to the user. I don't see this ever being faster than polling memory. If the data is received in order, there's no additional delay between writing the data and polling on memory. A CQ entry requires a separate write, plus the overhead mentioned above. Even if the data arrived out of order, only the last byte needs to be deferred. Writing that byte shouldn't be any slower than writing a CQ entry. - Sean From greg.lindahl at qlogic.com Thu Aug 24 15:45:07 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 15:45:07 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <000401c6c7cd$dcf5b0b0$ff0da8c0@amr.corp.intel.com> References: <20060824221833.GD3670@greglaptop.hotels-on-air.de> <000401c6c7cd$dcf5b0b0$ff0da8c0@amr.corp.intel.com> Message-ID: <20060824224507.GC3927@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 03:37:21PM -0700, Sean Hefty wrote: > I'm missing the standard you're using to judge what's theoretically good and > bad. Having simpler programming to get good performance is a theoretical good. Extra hacks for specific hardware are theoretically bad, practically good only when they end up with much better performance. Silently non-standard software is bad by both accounts. > Applications are written this way today. A vendor can either: > > * Support those apps by providing the feature. > * Require that the apps be rewritten to use their hardware. That's what I was calling practical. It's clear what a hardware vendor will do in that case. > Whether apps should have been written this way seems irrelevant. They are, and > we should make decisions based on that, including extending the spec and/or > implementation if needed.
In this case we're talking about code which can easily be changed to follow the standard, in addition to having a hack mode that's faster on 1 particular hardware implementation. You seem to be implying that the applications are set in stone, and that their authors have no interest in making them standard-conformant. I don't think that's the case. If there is a standard extension which can provide better performance on 1 particular hardware implementation, let's add it to the standard. But let's also make the software standard-conformant on other hardware. -- greg From caitlinb at broadcom.com Thu Aug 24 15:49:06 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 24 Aug 2006 15:49:06 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <79ae2f320608241531n4b6c5010x88cdc25c474e6baf@mail.gmail.com> Message-ID: <54AD0F12E08D1541B826BE97C98F99F189EA9A@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On 8/24/06, Woodruff, Robert J wrote: >> If the feature gives them a huge advantage in performance (and it >> does) and all of the hardware vendors that they deal with already >> implement it, then yes, they will force, by defacto standard that all >> other newcomers implement it or face the fact that no one will buy >> their hardware. It seems like that is what is happening in this case. > > Actually, if a hardware implementation provided the same > performance (in this case latency) by polling on a CQ as one > where polling on memory was garanteed to work, the customer > may actually prefer the "standard" implementation. > Exactly. The correct solution is to rely on a CQE to signal a completion. That is why the standards are written the way they are. For iWARP there are network performance reasons why in-order memory writes will never be guaranteed. Now any time an *application* wants to write something that is vendor dependent then it is really up to that application developer to decide if the benefit is worthwhile. And in my opinion the "in order write" solution is not a feature, it is a work around to a slow CQ. So I certainly would not want to encourage that workaround, but applications are always free to do device or vendor specific workaround in their application code. It's their code. But I would hope that we would agree that no code that is part of the project should rely on this "feature". From greg.lindahl at qlogic.com Thu Aug 24 15:53:43 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 15:53:43 -0700 Subject: [openib-general] A critique of RDMA PUT/GET in HPC Message-ID: <20060824225343.GD3927@greglaptop.hotels-on-air.de> For those of you interested in this topic, there's an interesting article by Patrick Geoffrey in HPCWire entitled "A Critique of RDMA". http://www.hpcwire.com/hpc/815242.html (you might have to be a subscriber, but I'm sure Patrick would send you a copy if you ask.) It's basically a critique of why SEND/RECV is better for MPI implementations than PUT/GET. Even if you don't agree with him, it's good reading. For motivation, you might want to note that most of the SEND/RECV-based products mentioned achieve better MPI 0-byte latency than IB Verbs-based MPI implementations. While I don't agree with everything Patrick says, this does get back to my point that I've run into many people who assume that PUT/GET is always the right way to do things. And it isn't. 
-- greg From robert.j.woodruff at intel.com Thu Aug 24 15:53:37 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 24 Aug 2006 15:53:37 -0700 Subject: [openib-general] basic IB doubt Message-ID: Greg wrote, >In this case we're talking about code which can easily be changed to >follow the standard, in addition to having a hack mode that's faster >on 1 particular hardware implementation. >You seem to be implying that the applications are set in stone, and >that their authors have no interest in making them >standard-conformant. I don't think that's the case. If there is a >standard extension which can provide better performance on 1 particular >hardware implementation, let's add it to the standard. But let's >also make the software standard-conformant on other hardware. >-- greg If the overhead in polling the CQ rather than memory was not so high, they would have used it, but they found that it added > 2us to the latency and found they could get better performance if they polled memory, so that is what they did; and as long as the hardware (or at least the hardware they care about) supports it, they won't change it. If you can show them how they could get "better" (not just equal) performance using some other method, they would probably listen. From greg.lindahl at qlogic.com Thu Aug 24 15:57:12 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 15:57:12 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <000501c6c7ce$bd6178f0$ff0da8c0@amr.corp.intel.com> References: <79ae2f320608241531n4b6c5010x88cdc25c474e6baf@mail.gmail.com> <000501c6c7ce$bd6178f0$ff0da8c0@amr.corp.intel.com> Message-ID: <20060824225712.GE3927@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 03:43:38PM -0700, Sean Hefty wrote: > >Actually, if a hardware implementation provided the same performance > >(in this case latency) by polling on a CQ as one where polling on > >memory was guaranteed to work, the customer may actually prefer the > >"standard" implementation. > > Polling on a CQ involves a function call, synchronization to the CQ, and > formatting a structure to return to the user. I don't see this ever being > faster than polling memory. Why don't you measure it, then? For example, an iWarp implementation is going to be slowed down if it has to reorder segments to deliver the last byte last. This expense might be more than the function call. You guess not, but... You're also assuming that programs are only checking the last byte of the buffer. For all you know, Mellanox is delivering the whole buffer in ascending order, and the user is checking bytes in the middle, too. Which is a hazard of not-yet-specified standards extensions. -- greg From robert.j.woodruff at intel.com Thu Aug 24 16:00:44 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 24 Aug 2006 16:00:44 -0700 Subject: [openib-general] basic IB doubt Message-ID: Caitlin wrote, >For iWARP there are network performance reasons why in-order >memory writes will never be guaranteed. For iWarp, or any other RDMA over Ethernet protocol, the behavior is not to guarantee that all packets are written in order, just that the last byte of the last packet is written last. This can easily be implemented in an iWarp card or by the driver with minimal performance impact in most cases. So for example, if the last packet arrives before all the other packets have arrived, the iWarp card or driver does not place the data of the last packet until all the other packets have arrived.
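A rough sketch of the deferral Woody describes: hold back whichever segment carries the final byte until every byte before it has been placed, while placing all other segments as they arrive. The types and names are invented for illustration (this is not taken from any iWarp card or driver), and error handling is omitted.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct reassembly {
	uint8_t *dst;        /* registered target buffer */
	size_t   msg_len;    /* total length of the RDMA write */
	size_t   placed;     /* bytes placed so far, not counting a held tail */
	uint8_t *held_tail;  /* copy of the final segment, if it raced ahead */
	size_t   tail_off;
	size_t   tail_len;
};

/* Assumes segments never overlap or duplicate, so a byte count is enough
 * to tell when everything before the tail has been placed. */
static void place_segment(struct reassembly *r, size_t off,
			  const uint8_t *data, size_t len)
{
	int is_tail = (off + len == r->msg_len);

	if (is_tail && r->placed < off) {
		/* The final segment arrived early: park it instead of
		 * placing it, so the last byte is not written yet. */
		r->held_tail = malloc(len);
		memcpy(r->held_tail, data, len);
		r->tail_off = off;
		r->tail_len = len;
		return;
	}

	memcpy(r->dst + off, data, len);   /* middle segments may land out of order */
	r->placed += len;

	if (r->held_tail && r->placed == r->tail_off) {
		/* Everything before the tail is now in place; release it,
		 * so the last byte is still the last one written. */
		memcpy(r->dst + r->tail_off, r->held_tail, r->tail_len);
		r->placed += r->tail_len;
		free(r->held_tail);
		r->held_tail = NULL;
	}
}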
woody From greg.lindahl at qlogic.com Thu Aug 24 16:01:17 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 16:01:17 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: References: Message-ID: <20060824230117.GF3927@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 03:53:37PM -0700, Woodruff, Robert J wrote: > If the overhead in polling the CQ rather than memory was not so > high, they would have used it, but they found that it added > 2us to > the latency and found they could get better performance if they > polled memory, I keep on mentioning that measuring one instance of one implementation isn't necessarily a good way to evaluate a standard. Or the implementation. We already have one example where Mellanox improved their implementation in the DDR generation such that a performance workaround -- choosing a 1k MTU for RC connections -- was no longer needed. I was pleased when we all agreed to remove defaulting to the workaround -- it was clearly the right thing to do. -- g From sean.hefty at intel.com Thu Aug 24 16:13:22 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 16:13:22 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <20060824225712.GE3927@greglaptop.hotels-on-air.de> Message-ID: <000601c6c7d2$e4cdccf0$ff0da8c0@amr.corp.intel.com> >> Polling on a CQ involves a function call, synchronization to the CQ, and >> formatting a structure to return to the user. I don't see this ever being >> faster than polling memory. > >Why don't you measure it, then? Why? Reading a memory location directly will be faster than calling a function to read from a memory location. >For example, an iWarp implementation >is going to be slowed down if it has to reorder segments to deliver >the last byte last. This expense might be more than the function call. It only needs to defer the last byte. The claim being made is that the last byte will be delivered last. Yes, I can implement something non-performant in this mode, but that doesn't invalidate polling memory as a general solution. >You're also assuming that programs are only checking the last byte of >the buffer. The applications I care about are polling on the last byte. - Sean From greg.lindahl at qlogic.com Thu Aug 24 16:22:55 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 24 Aug 2006 16:22:55 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <000601c6c7d2$e4cdccf0$ff0da8c0@amr.corp.intel.com> References: <20060824225712.GE3927@greglaptop.hotels-on-air.de> <000601c6c7d2$e4cdccf0$ff0da8c0@amr.corp.intel.com> Message-ID: <20060824232255.GA4460@greglaptop.hotels-on-air.de> On Thu, Aug 24, 2006 at 04:13:22PM -0700, Sean Hefty wrote: > >Why don't you measure it, then? > > Why? Reading a memory location directly will be faster than calling a function > to read from a memory location. ... sigh. This is not true, there are quite obvious implementations where this is not true. We had one, actually, and changed it due to this issue. That was a practical choice. > >You're also assuming that programs are only checking the last byte of > >the buffer. > > The applications I care about are polling on the last byte. And tomorrow, the next app may depend on the behavior that all the bytes arrive in order. Slippery slope, you know. 
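For what it's worth, the measurement being argued about can be sketched like this: time the gap between the final byte of an RDMA write with immediate landing in the receive buffer and the matching CQE becoming visible. The setup is an assumption for illustration (a pre-posted receive, a sender that makes the final byte non-zero, the buf/len/cq parameters); only clock_gettime and ibv_poll_cq are real APIs.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <infiniband/verbs.h>

static double elapsed_us(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

/* Spin on the last byte, then spin on the CQ, and report the gap. */
static void time_byte_vs_cqe(volatile uint8_t *buf, size_t len, struct ibv_cq *cq)
{
	struct timespec t_byte, t_cqe;
	struct ibv_wc wc;

	while (buf[len - 1] == 0)
		;				/* data (last byte) visible */
	clock_gettime(CLOCK_MONOTONIC, &t_byte);

	while (ibv_poll_cq(cq, 1, &wc) == 0)
		;				/* completion visible */
	clock_gettime(CLOCK_MONOTONIC, &t_cqe);

	printf("last byte -> CQE: %.2f us (wc status %d)\n",
	       elapsed_us(t_byte, t_cqe), (int) wc.status);
}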
-- greg From caitlinb at broadcom.com Thu Aug 24 16:28:18 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 24 Aug 2006 16:28:18 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: Message-ID: <54AD0F12E08D1541B826BE97C98F99F189EAAC@NT-SJCA-0751.brcm.ad.broadcom.com> Woodruff, Robert J wrote: > Catlin wrote, > >> For iWARP there are network performance reasons why in-order memory >> writes will never be guaranteed. > > For iWarp, or any other RDMA over Ethernet protocol, the > behavior is not to guarantee all packets are written > in-order, just that the last byte of the last packet is > written last. This can easily be implemented in an iWarp card > or by the driver with minimal performance impact in most cases. > > So for example, if the last packet arrives before all the > other packets have arrived, the iWarp card or driver does not > place that data of the last packet until all the other > packets have arrived. > > woody Fascinating argument. The correct forum for that is the IETF RDDP Workgroup, not a Linux open source project. From sean.hefty at intel.com Thu Aug 24 17:16:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 17:16:46 -0700 Subject: [openib-general] [PATCH v2] libibsa: userspace SA query and multicast support In-Reply-To: <000901c6c582$ad09f890$8698070a@amr.corp.intel.com> Message-ID: <000801c6c7db$c035e2c0$ff0da8c0@amr.corp.intel.com> Changes from v1: Event channels are created using one of two access modes: default or raw. Raw access is intended for privileged users, and will result in opening the ib_usa_raw file. Default access is intended for most users, and will open ib_usa_default. This will restrict the user to PathRecord, MultiPathRecord, MCMemberRecord, and ServiceRecord queries, and joining multicast groups. Signed-off-by: Sean Hefty --- Index: libibsa/libibsa.spec.in =================================================================== --- libibsa/libibsa.spec.in (revision 0) +++ libibsa/libibsa.spec.in (revision 0) @@ -0,0 +1,68 @@ +# $Id: $ + +%define ver @VERSION@ +%define RELEASE 1 +%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} + +Summary: Userspace SA client. +Name: libibsa +Version: %ver +Release: %rel%{?dist} +License: GPL/BSD +Group: System Environment/Libraries +BuildRoot: %{_tmppath}/%{name}-%{version}-root +Source: http://openib.org/downloads/%{name}-%{version}.tar.gz +Url: http://openib.org/ + +%description +Along with the OpenIB kernel drivers, libibsa provides a userspace +SA client API. + +%package devel +Summary: Development files for the libibsa library +Group: System Environment/Libraries +Requires: %{name} = %{version}-%{release} + +%description devel +Development files for the libibsa library. + +%package utils +Summary: Utilities for the libibsa library +Group: System Environment/Base +Requires: %{name} = %{version}-%{release} + +%description utils +Utilities for the libibsa library. 
+ +%prep +%setup -q + +%build +%configure +make + +%install +make DESTDIR=${RPM_BUILD_ROOT} install +# remove unpackaged files from the buildroot +rm -f $RPM_BUILD_ROOT%{_libdir}/*.la +cd $RPM_BUILD_ROOT%{_libdir} +mv libibsa.so libibsa.so.%{ver} +ln -s libibsa.so.%{ver} libibsa.so + +%clean +rm -rf $RPM_BUILD_ROOT + +%files +%defattr(-,root,root) +%{_libdir}/libibsa*.so.* +%doc AUTHORS COPYING ChangeLog NEWS README + +%files devel +%defattr(-,root,root) +%{_libdir}/libibsa.so +%{_includedir}/infiniband/*.h + +%files utils +%defattr(-,root,root) +%{_bindir}/satest +%{_bindir}/mchammer Index: libibsa/include/infiniband/sa_net.h =================================================================== --- libibsa/include/infiniband/sa_net.h (revision 0) +++ libibsa/include/infiniband/sa_net.h (revision 0) @@ -0,0 +1,378 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ + +#if !defined(SA_NET_H) +#define SA_NET_H + +#include + +#include + +enum { + IBV_SA_METHOD_GET = 0x01, + IBV_SA_METHOD_SET = 0x02, + IBV_SA_METHOD_GET_RESP = 0x81, + IBV_SA_METHOD_SEND = 0x03, + IBV_SA_METHOD_TRAP = 0x05, + IBV_SA_METHOD_REPORT = 0x06, + IBV_SA_METHOD_REPORT_RESP = 0x86, + IBV_SA_METHOD_TRAP_REPRESS = 0x07, + IBV_SA_METHOD_GET_TABLE = 0x12, + IBV_SA_METHOD_GET_TABLE_RESP = 0x92, + IBV_SA_METHOD_DELETE = 0x15, + IBV_SA_METHOD_DELETE_RESP = 0x95, + IBV_SA_METHOD_GET_MULTI = 0x14, + IBV_SA_METHOD_GET_MULTI_RESP = 0x94, + IBV_SA_METHOD_GET_TRACE_TBL = 0x13 +}; + +enum { + IBV_SA_MAD_HEADER_SIZE = 56, + IBV_SA_MAD_DATA_SIZE = 200 +}; + +struct ibv_sa_mad { + /* common MAD header */ + uint8_t base_version; + uint8_t mgmt_class; + uint8_t class_version; + uint8_t r_method; + uint16_t status; + uint16_t class_specific; + uint64_t transaction_id; + uint16_t attribute_id; + uint16_t rsvd1; + uint32_t attribute_modifier; + /* RMPP header */ + uint8_t rmpp_version; + uint8_t rmpp_type; + uint8_t rmpp_resptime_flags; + uint8_t rmpp_status; + uint32_t rmpp_data1; + uint32_t rmpp_data2; + /* SA header */ + uint32_t sm_key1; /* define sm_key for 64-bit alignment */ + uint32_t sm_key2; + uint16_t attribute_offset; + uint16_t rsvd2; + uint64_t comp_mask; + uint8_t sa_data[IBV_SA_MAD_DATA_SIZE]; +}; + +enum { + IBV_SA_ATTR_CLASS_PORTINFO = __constant_cpu_to_be16(0x01), + IBV_SA_ATTR_NOTICE = __constant_cpu_to_be16(0x02), + IBV_SA_ATTR_INFORM_INFO = __constant_cpu_to_be16(0x03), + IBV_SA_ATTR_NODE_REC = __constant_cpu_to_be16(0x11), + IBV_SA_ATTR_PORT_INFO_REC = __constant_cpu_to_be16(0x12), + IBV_SA_ATTR_SL2VL_REC = __constant_cpu_to_be16(0x13), + IBV_SA_ATTR_SWITCH_REC = __constant_cpu_to_be16(0x14), + IBV_SA_ATTR_LINEAR_FDB_REC = __constant_cpu_to_be16(0x15), + IBV_SA_ATTR_RANDOM_FDB_REC = __constant_cpu_to_be16(0x16), + IBV_SA_ATTR_MCAST_FDB_REC = __constant_cpu_to_be16(0x17), + IBV_SA_ATTR_SM_INFO_REC = __constant_cpu_to_be16(0x18), + IBV_SA_ATTR_LINK_REC = __constant_cpu_to_be16(0x20), + IBV_SA_ATTR_GUID_INFO_REC = __constant_cpu_to_be16(0x30), + IBV_SA_ATTR_SERVICE_REC = __constant_cpu_to_be16(0x31), + IBV_SA_ATTR_PARTITION_REC = __constant_cpu_to_be16(0x33), + IBV_SA_ATTR_PATH_REC = __constant_cpu_to_be16(0x35), + IBV_SA_ATTR_VL_ARB_REC = __constant_cpu_to_be16(0x36), + IBV_SA_ATTR_MC_MEMBER_REC = __constant_cpu_to_be16(0x38), + IBV_SA_ATTR_TRACE_REC = __constant_cpu_to_be16(0x39), + IBV_SA_ATTR_MULTI_PATH_REC = __constant_cpu_to_be16(0x3a), + IBV_SA_ATTR_SERVICE_ASSOC_REC = __constant_cpu_to_be16(0x3b), + IBV_SA_ATTR_INFORM_INFO_REC = __constant_cpu_to_be16(0xf3) +}; + +/* Length of SA attributes on the wire */ +enum { + IBV_SA_ATTR_CLASS_PORTINFO_LEN = 72, + IBV_SA_ATTR_NOTICE_LEN = 80, + IBV_SA_ATTR_INFORM_INFO_LEN = 36, + IBV_SA_ATTR_NODE_REC_LEN = 108, + IBV_SA_ATTR_PORT_INFO_REC_LEN = 58, + IBV_SA_ATTR_SL2VL_REC_LEN = 16, + IBV_SA_ATTR_SWITCH_REC_LEN = 21, + IBV_SA_ATTR_LINEAR_FDB_REC_LEN = 72, + IBV_SA_ATTR_RANDOM_FDB_REC_LEN = 72, + IBV_SA_ATTR_MCAST_FDB_REC_LEN = 72, + IBV_SA_ATTR_SM_INFO_REC_LEN = 25, + IBV_SA_ATTR_LINK_REC_LEN = 6, + IBV_SA_ATTR_GUID_INFO_REC_LEN = 72, + IBV_SA_ATTR_SERVICE_REC_LEN = 176, + IBV_SA_ATTR_PARTITION_REC_LEN = 72, + IBV_SA_ATTR_PATH_REC_LEN = 64, + IBV_SA_ATTR_VL_ARB_REC_LEN = 72, + IBV_SA_ATTR_MC_MEMBER_REC_LEN = 52, + IBV_SA_ATTR_TRACE_REC_LEN = 46, + IBV_SA_ATTR_MULTI_PATH_REC_LEN = 56, + IBV_SA_ATTR_SERVICE_ASSOC_REC_LEN = 80, + IBV_SA_ATTR_INFORM_INFO_REC_LEN = 60 +}; + +#define IBV_SA_COMP_MASK(n) __constant_cpu_to_be64(1ull << 
n) + +struct ibv_sa_net_service_rec { + uint64_t service_id; + uint8_t service_gid[16]; + uint16_t service_pkey; + uint16_t rsvd; + uint32_t service_lease; + uint8_t service_key[16]; + uint8_t service_name[64]; + uint8_t service_data8[16]; + uint16_t service_data16[8]; + uint32_t service_data32[4]; + uint64_t service_data64[2]; +}; + +enum { + IBV_SA_SERVICE_REC_SERVICE_ID = IBV_SA_COMP_MASK(0), + IBV_SA_SERVICE_REC_SERVICE_GID = IBV_SA_COMP_MASK(1), + IBV_SA_SERVICE_REC_SERVICE_PKEY = IBV_SA_COMP_MASK(2), + /* reserved: 3 */ + IBV_SA_SERVICE_REC_SERVICE_LEASE = IBV_SA_COMP_MASK(4), + IBV_SA_SERVICE_REC_SERVICE_KEY = IBV_SA_COMP_MASK(5), + IBV_SA_SERVICE_REC_SERVICE_NAME = IBV_SA_COMP_MASK(6), + IBV_SA_SERVICE_REC_SERVICE_DATA8_0 = IBV_SA_COMP_MASK(7), + IBV_SA_SERVICE_REC_SERVICE_DATA8_1 = IBV_SA_COMP_MASK(8), + IBV_SA_SERVICE_REC_SERVICE_DATA8_2 = IBV_SA_COMP_MASK(9), + IBV_SA_SERVICE_REC_SERVICE_DATA8_3 = IBV_SA_COMP_MASK(10), + IBV_SA_SERVICE_REC_SERVICE_DATA8_4 = IBV_SA_COMP_MASK(11), + IBV_SA_SERVICE_REC_SERVICE_DATA8_5 = IBV_SA_COMP_MASK(12), + IBV_SA_SERVICE_REC_SERVICE_DATA8_6 = IBV_SA_COMP_MASK(13), + IBV_SA_SERVICE_REC_SERVICE_DATA8_7 = IBV_SA_COMP_MASK(14), + IBV_SA_SERVICE_REC_SERVICE_DATA8_8 = IBV_SA_COMP_MASK(15), + IBV_SA_SERVICE_REC_SERVICE_DATA8_9 = IBV_SA_COMP_MASK(16), + IBV_SA_SERVICE_REC_SERVICE_DATA8_10 = IBV_SA_COMP_MASK(17), + IBV_SA_SERVICE_REC_SERVICE_DATA8_11 = IBV_SA_COMP_MASK(18), + IBV_SA_SERVICE_REC_SERVICE_DATA8_12 = IBV_SA_COMP_MASK(19), + IBV_SA_SERVICE_REC_SERVICE_DATA8_13 = IBV_SA_COMP_MASK(20), + IBV_SA_SERVICE_REC_SERVICE_DATA8_14 = IBV_SA_COMP_MASK(21), + IBV_SA_SERVICE_REC_SERVICE_DATA8_15 = IBV_SA_COMP_MASK(22), + IBV_SA_SERVICE_REC_SERVICE_DATA16_0 = IBV_SA_COMP_MASK(23), + IBV_SA_SERVICE_REC_SERVICE_DATA16_1 = IBV_SA_COMP_MASK(24), + IBV_SA_SERVICE_REC_SERVICE_DATA16_2 = IBV_SA_COMP_MASK(25), + IBV_SA_SERVICE_REC_SERVICE_DATA16_3 = IBV_SA_COMP_MASK(26), + IBV_SA_SERVICE_REC_SERVICE_DATA16_4 = IBV_SA_COMP_MASK(27), + IBV_SA_SERVICE_REC_SERVICE_DATA16_5 = IBV_SA_COMP_MASK(28), + IBV_SA_SERVICE_REC_SERVICE_DATA16_6 = IBV_SA_COMP_MASK(29), + IBV_SA_SERVICE_REC_SERVICE_DATA16_7 = IBV_SA_COMP_MASK(30), + IBV_SA_SERVICE_REC_SERVICE_DATA32_0 = IBV_SA_COMP_MASK(31), + IBV_SA_SERVICE_REC_SERVICE_DATA32_1 = IBV_SA_COMP_MASK(32), + IBV_SA_SERVICE_REC_SERVICE_DATA32_2 = IBV_SA_COMP_MASK(33), + IBV_SA_SERVICE_REC_SERVICE_DATA32_3 = IBV_SA_COMP_MASK(34), + IBV_SA_SERVICE_REC_SERVICE_DATA64_0 = IBV_SA_COMP_MASK(35), + IBV_SA_SERVICE_REC_SERVICE_DATA64_1 = IBV_SA_COMP_MASK(36) +}; + +struct ibv_sa_net_path_rec { + uint32_t rsvd1; + uint32_t rsvd2; + uint8_t dgid[16]; + uint8_t sgid[16]; + uint16_t dlid; + uint16_t slid; + /* RawTraffic: 1:352, Rsvd: 3:353, FlowLabel: 20:356, HopLimit: 8:376 */ + uint32_t raw_flow_hop; + uint8_t tclass; + /* Reversible: 1:392, NumbPath: 7:393 */ + uint8_t reversible_numbpath; + uint16_t pkey; + /* Rsvd: 12:416, SL: 4:428 */ + uint16_t sl; + /* MtuSelector: 2:432, MTU: 6:434 */ + uint8_t mtu_info; + /* RateSelector: 2:440, Rate: 6:442 */ + uint8_t rate_info; + /* PacketLifeTimeSelector: 2:448, PacketLifeTime: 6:450 */ + uint8_t packetlifetime_info; + uint8_t preference; + uint8_t rsvd3[3]; +}; + +enum { + IBV_SA_PATH_REC_RAW_TRAFFIC_OFFSET = 352, + IBV_SA_PATH_REC_RAW_TRAFFIC_LENGTH = 1, + IBV_SA_PATH_REC_FLOW_LABEL_OFFSET = 356, + IBV_SA_PATH_REC_FLOW_LABEL_LENGTH = 20, + IBV_SA_PATH_REC_HOP_LIMIT_OFFSET = 376, + IBV_SA_PATH_REC_HOP_LIMIT_LENGTH = 8, + IBV_SA_PATH_REC_REVERSIBLE_OFFSET = 392, + IBV_SA_PATH_REC_REVERSIBLE_LENGTH = 
1, + IBV_SA_PATH_REC_NUMB_PATH_OFFSET = 393, + IBV_SA_PATH_REC_NUMB_PATH_LENGTH = 7, + IBV_SA_PATH_REC_SL_OFFSET = 428, + IBV_SA_PATH_REC_SL_LENGTH = 4, + IBV_SA_PATH_REC_MTU_SELECTOR_OFFSET = 324, + IBV_SA_PATH_REC_MTU_SELECTOR_LENGTH = 2, + IBV_SA_PATH_REC_MTU_OFFSET = 434, + IBV_SA_PATH_REC_MTU_LENGTH = 6, + IBV_SA_PATH_REC_RATE_SELECTOR_OFFSET = 440, + IBV_SA_PATH_REC_RATE_SELECTOR_LENGTH = 2, + IBV_SA_PATH_REC_RATE_OFFSET = 442, + IBV_SA_PATH_REC_RATE_LENGTH = 6, + IBV_SA_PATH_REC_PACKETLIFE_SELECTOR_OFFSET = 448, + IBV_SA_PATH_REC_PACKETLIFE_SELECTOR_LENGTH = 2, + IBV_SA_PATH_REC_PACKETLIFE_OFFSET = 450, + IBV_SA_PATH_REC_PACKETLIFE_LENGTH = 6 +}; + +enum { + /* reserved: 0 */ + /* reserved: 1 */ + IBV_SA_PATH_REC_DGID = IBV_SA_COMP_MASK(2), + IBV_SA_PATH_REC_SGID = IBV_SA_COMP_MASK(3), + IBV_SA_PATH_REC_DLID = IBV_SA_COMP_MASK(4), + IBV_SA_PATH_REC_SLID = IBV_SA_COMP_MASK(5), + IBV_SA_PATH_REC_RAW_TRAFFIC = IBV_SA_COMP_MASK(6), + /* reserved: 7 */ + IBV_SA_PATH_REC_FLOW_LABEL = IBV_SA_COMP_MASK(8), + IBV_SA_PATH_REC_HOP_LIMIT = IBV_SA_COMP_MASK(9), + IBV_SA_PATH_REC_TRAFFIC_CLASS = IBV_SA_COMP_MASK(10), + IBV_SA_PATH_REC_REVERSIBLE = IBV_SA_COMP_MASK(11), + IBV_SA_PATH_REC_NUMB_PATH = IBV_SA_COMP_MASK(12), + IBV_SA_PATH_REC_PKEY = IBV_SA_COMP_MASK(13), + /* reserved: 14 */ + IBV_SA_PATH_REC_SL = IBV_SA_COMP_MASK(15), + IBV_SA_PATH_REC_MTU_SELECTOR = IBV_SA_COMP_MASK(16), + IBV_SA_PATH_REC_MTU = IBV_SA_COMP_MASK(17), + IBV_SA_PATH_REC_RATE_SELECTOR = IBV_SA_COMP_MASK(18), + IBV_SA_PATH_REC_RATE = IBV_SA_COMP_MASK(19), + IBV_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR = IBV_SA_COMP_MASK(20), + IBV_SA_PATH_REC_PACKET_LIFE_TIME = IBV_SA_COMP_MASK(21), + IBV_SA_PATH_REC_PREFERENCE = IBV_SA_COMP_MASK(22) +}; + +struct ibv_sa_net_mcmember_rec { + uint8_t mgid[16]; + uint8_t port_gid[16]; + uint32_t qkey; + uint16_t mlid; + /* MtuSelector: 2:304, MTU: 6:306 */ + uint8_t mtu_info; + uint8_t tclass; + uint16_t pkey; + /* RateSelector: 2:336, Rate: 6:338 */ + uint8_t rate_info; + /* PacketLifeTimeSelector: 2:344, PacketLifeTime: 6:346 */ + uint8_t packetlifetime_info; + /* SL: 4:352, FlowLabel: 20:356, HopLimit: 8:376 */ + uint32_t sl_flow_hop; + /* Scope: 4:384, JoinState: 4:388 */ + uint8_t scope_join; + /* ProxyJoin: 1:392, rsvd: 7:393 */ + uint8_t proxy_join; + uint8_t rsvd[2]; +}; + +enum { + IBV_SA_MCMEMBER_REC_MTU_SELECTOR_OFFSET = 304, + IBV_SA_MCMEMBER_REC_MTU_SELECTOR_LENGTH = 2, + IBV_SA_MCMEMBER_REC_MTU_OFFSET = 306, + IBV_SA_MCMEMBER_REC_MTU_LENGTH = 6, + IBV_SA_MCMEMBER_REC_RATE_SELECTOR_OFFSET = 336, + IBV_SA_MCMEMBER_REC_RATE_SELECTOR_LENGTH = 2, + IBV_SA_MCMEMBER_REC_RATE_OFFSET = 338, + IBV_SA_MCMEMBER_REC_RATE_LENGTH = 6, + IBV_SA_MCMEMBER_REC_PACKETLIFE_SELECTOR_OFFSET = 344, + IBV_SA_MCMEMBER_REC_PACKETLIFE_SELECTOR_LENGTH = 2, + IBV_SA_MCMEMBER_REC_PACKETLIFE_OFFSET = 346, + IBV_SA_MCMEMBER_REC_PACKETLIFE_LENGTH = 6, + IBV_SA_MCMEMBER_REC_SL_OFFSET = 352, + IBV_SA_MCMEMBER_REC_SL_LENGTH = 4, + IBV_SA_MCMEMBER_REC_FLOW_LABEL_OFFSET = 356, + IBV_SA_MCMEMBER_REC_FLOW_LABEL_LENGTH = 20, + IBV_SA_MCMEMBER_REC_HOP_LIMIT_OFFSET = 376, + IBV_SA_MCMEMBER_REC_HOP_LIMIT_LENGTH = 8, + IBV_SA_MCMEMBER_REC_SCOPE_OFFSET = 384, + IBV_SA_MCMEMBER_REC_SCOPE_LENGTH = 4, + IBV_SA_MCMEMBER_REC_JOIN_STATE_OFFSET = 388, + IBV_SA_MCMEMBER_REC_JOIN_STATE_LENGTH = 4, + IBV_SA_MCMEMBER_REC_PROXY_JOIN_OFFSET = 392, + IBV_SA_MCMEMBER_REC_PROXY_JOIN_LENGTH = 1 +}; + +enum { + IBV_SA_MCMEMBER_REC_MGID = IBV_SA_COMP_MASK(0), + IBV_SA_MCMEMBER_REC_PORT_GID = IBV_SA_COMP_MASK(1), + IBV_SA_MCMEMBER_REC_QKEY = 
IBV_SA_COMP_MASK(2), + IBV_SA_MCMEMBER_REC_MLID = IBV_SA_COMP_MASK(3), + IBV_SA_MCMEMBER_REC_MTU_SELECTOR = IBV_SA_COMP_MASK(4), + IBV_SA_MCMEMBER_REC_MTU = IBV_SA_COMP_MASK(5), + IBV_SA_MCMEMBER_REC_TRAFFIC_CLASS = IBV_SA_COMP_MASK(6), + IBV_SA_MCMEMBER_REC_PKEY = IBV_SA_COMP_MASK(7), + IBV_SA_MCMEMBER_REC_RATE_SELECTOR = IBV_SA_COMP_MASK(8), + IBV_SA_MCMEMBER_REC_RATE = IBV_SA_COMP_MASK(9), + IBV_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR = IBV_SA_COMP_MASK(10), + IBV_SA_MCMEMBER_REC_PACKET_LIFE_TIME = IBV_SA_COMP_MASK(11), + IBV_SA_MCMEMBER_REC_SL = IBV_SA_COMP_MASK(12), + IBV_SA_MCMEMBER_REC_FLOW_LABEL = IBV_SA_COMP_MASK(13), + IBV_SA_MCMEMBER_REC_HOP_LIMIT = IBV_SA_COMP_MASK(14), + IBV_SA_MCMEMBER_REC_SCOPE = IBV_SA_COMP_MASK(15), + IBV_SA_MCMEMBER_REC_JOIN_STATE = IBV_SA_COMP_MASK(16), + IBV_SA_MCMEMBER_REC_PROXY_JOIN = IBV_SA_COMP_MASK(17) +}; + +/* + * ibv_sa_pack_attr - Copy an attribute from a host defined structure + * to a packed network structure + * ibv_sa_unpack_attr - Copy an attribute from a packed network structure + * to a host defined structure. + */ +void ibv_sa_unpack_path_rec(struct ibv_sa_path_rec *rec, + struct ibv_sa_net_path_rec *net_rec); +void ibv_sa_pack_mcmember_rec(struct ibv_sa_net_mcmember_rec *net_rec, + struct ibv_sa_mcmember_rec *rec); +void ibv_sa_unpack_mcmember_rec(struct ibv_sa_mcmember_rec *rec, + struct ibv_sa_net_mcmember_rec *net_rec); + +/** + * ibv_sa_get_field - Extract a bit field value from a structure. + * @data: Pointer to the start of the structure. + * @offset: Bit offset of field from start of structure. + * @size: Size of field, in bits. + * + * The structure must be in network-byte order. The returned value is in + * host-byte order. + */ +uint32_t ibv_sa_get_field(void *data, int offset, int size); + +/** + * ibv_sa_set_field - Set a bit field value in a structure. + * @data: Pointer to the start of the structure. + * @value: Value to assign to field. + * @offset: Bit offset of field from start of structure. + * @size: Size of field, in bits. + * + * The structure must be in network-byte order. The value to set is in + * host-byte order. + */ +void ibv_sa_set_field(void *data, uint32_t value, int offset, int size); + +#endif /* SA_NET_H */ Property changes on: libibsa/include/infiniband/sa_net.h ___________________________________________________________________ Name: svn:executable + * Index: libibsa/include/infiniband/sa_client_abi.h =================================================================== --- libibsa/include/infiniband/sa_client_abi.h (revision 0) +++ libibsa/include/infiniband/sa_client_abi.h (revision 0) @@ -0,0 +1,127 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef SA_CLIENT_ABI_H +#define SA_CLIENT_ABI_H + +#include + +/* + * This file must be kept in sync with the kernel's version of ib_usa.h + */ + +#define IB_USA_MIN_ABI_VERSION 1 +#define IB_USA_MAX_ABI_VERSION 1 + +#define IB_USA_EVENT_DATA 256 + +enum { + USA_CMD_SEND_MAD, + USA_CMD_GET_EVENT, + USA_CMD_GET_DATA, + USA_CMD_JOIN_MCAST, + USA_CMD_FREE_ID, + USA_CMD_GET_MCAST +}; + +enum { + USA_EVENT_MAD, + USA_EVENT_MCAST +}; + +struct usa_abi_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct usa_abi_send_mad { + __u64 response; /* unused - reserved */ + __u64 uid; + __u64 node_guid; + __u64 comp_mask; + __u64 attr; + __u8 port_num; + __u8 method; + __u16 attr_id; + __u32 timeout_ms; + __u32 retries; +}; + +struct usa_abi_join_mcast { + __u64 response; + __u64 uid; + __u64 node_guid; + __u64 comp_mask; + __u64 mcmember_rec; + __u8 port_num; +}; + +struct usa_abi_id_resp { + __u32 id; +}; + +struct usa_abi_free_resp { + __u32 events_reported; +}; + +struct usa_abi_free_id { + __u64 response; + __u32 id; +}; + +struct usa_abi_get_event { + __u64 response; +}; + +struct usa_abi_event_resp { + __u64 uid; + __u32 id; + __u32 event; + __u32 status; + __u32 data_len; + __u8 data[IB_USA_EVENT_DATA]; +}; + +struct usa_abi_get_data { + __u64 response; + __u32 id; +}; + +struct usa_abi_get_mcast { + __u64 response; + __u64 node_guid; + __u8 mgid[16]; + __u8 port_num; +}; + +#endif /* SA_CLIENT_ABI_H */ Property changes on: libibsa/include/infiniband/sa_client_abi.h ___________________________________________________________________ Name: svn:executable + * Index: libibsa/include/infiniband/sa_client.h =================================================================== --- libibsa/include/infiniband/sa_client.h (revision 0) +++ libibsa/include/infiniband/sa_client.h (revision 0) @@ -0,0 +1,204 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ + +#if !defined(SA_CLIENT_H) +#define SA_CLIENT_H + +#include +#include + +struct ibv_sa_event_channel { + int fd; +}; + +enum ibv_sa_event_type { + IBV_SA_EVENT_MAD, + IBV_SA_EVENT_MULTICAST +}; + +struct ibv_sa_event { + void *context; + enum ibv_sa_event_type event; + int status; + int attr_count; + int attr_size; + int attr_offset; + uint16_t attr_id; + void *attr; +}; + +enum ibv_sa_access_mode { + IBV_SA_ACCESS_RAW, + IBV_SA_ACCESS_DEFAULT +}; + +/** + * ibv_sa_create_event_channel - Open a channel used to report events. + * @mode: Access mode permitted when sending to the SA. + * + * Users will typically require only default access when interfacing with the + * SA. Default access supports basic SA queries, along with join operations. + * Raw access is often restricted to privileged users, but allows greater + * flexibility in the type of MADs that may be sent to the SA. + */ +struct ibv_sa_event_channel * +ibv_sa_create_event_channel(enum ibv_sa_access_mode mode); + +/** + * ibv_sa_destroy_event_channel - Close the event channel. + * @channel: The channel to destroy. + */ +void ibv_sa_destroy_event_channel(struct ibv_sa_event_channel *channel); + +/** + * ibv_sa_send_mad - Send a MAD to the SA. + * @channel: Event channel to report completion to. + * @device: Device to send over. + * @port_num: Port number to send over. + * @method: MAD method to use in the send. + * @attr: Reference to attribute in wire format to send in MAD. + * @attr_id: Attribute type identifier. + * @comp_mask: Component mask to send in MAD. + * @timeout_ms: Time to wait for response, if one is expected. + * @retries: Number of times to retry request. + * @context: User-defined context associated with request. + * + * Send a message to the SA. All values should be in network-byte order. + */ +int ibv_sa_send_mad(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + uint8_t method, void *attr, uint16_t attr_id, + uint64_t comp_mask, int timeout_ms, int retries, + void *context); + +/** + * ibv_sa_get_event - Retrieves the next pending event, if no event is + * pending waits for an event. + * @channel: Event channel to check for events. + * @event: Allocated information about the next event. + * Event should be freed using ibv_sa_ack_event() + */ +int ibv_sa_get_event(struct ibv_sa_event_channel *channel, + struct ibv_sa_event **event); + +/** + * ibv_sa_ack_event - Free an event. + * @event: Event to be released. + * + * All events which are allocated by ibv_sa_get_event() must be released, + * there should be a one-to-one correspondence between successful gets + * and acks. + */ +int ibv_sa_ack_event(struct ibv_sa_event *event); + +/** + * ibv_sa_attr_size - Return the length of an SA attribute on the wire. + * @attr_id: Attribute identifier, in network-byte order. + */ +int ibv_sa_attr_size(uint16_t attr_id); + +static inline void *ibv_sa_get_attr(struct ibv_sa_event *event, int index) +{ + return event->attr + event->attr_offset * index; +} + +/** + * ibv_sa_init_ah_from_path - Initialize address handle attributes. + * @device: Source device. + * @port_num: Source port number. + * @path_rec: Network defined path record. + * @ah_attr: Destination address handle attributes. + */ +int ibv_sa_init_ah_from_path(struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_path_rec *path_rec, + struct ibv_ah_attr *ah_attr); + +/** + * ibv_sa_init_ah_from_mcmember - Initialize address handle attributes. + * @device: Source device. + * @port_num: Source port number. 
+ * @mc_rec: Network defined multicast member record. + * @ah_attr: Destination address handle attributes. + */ +int ibv_sa_init_ah_from_mcmember(struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_mcmember_rec *mc_rec, + struct ibv_ah_attr *ah_attr); + +struct ibv_sa_multicast; + +/** + * ibv_sa_join_multicast - Initiates a join request to the specified multicast + * group. + * @channel: Event channel to report completion to. + * @device: Device to send over. + * @port_num: Port number to send over. + * @rec: SA multicast member record specifying group attributes. + * @comp_mask: Component mask to send in MAD. + * @context: User-defined context associated with join. + * @multicast: Reference to store multicast pointer. + * + * This call initiates a multicast join request with the SA for the specified + * multicast group. If the join operation is started successfully, it returns + * an ibv_sa_multicast structure that is used to track the multicast operation. + * Users must free this structure by calling ibv_sa_free_multicast, even if the + * join operation later fails. + */ +int ibv_sa_join_multicast(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_mcmember_rec *rec, + uint64_t comp_mask, void *context, + struct ibv_sa_multicast **multicast); + +/** + * ibv_sa_free_multicast - Frees the multicast tracking structure, and releases + * any reference on the multicast group. + * @multicast: Multicast tracking structure allocated by ibv_sa_join_multicast. + */ +int ibv_sa_free_multicast(struct ibv_sa_multicast *multicast); + +/** + * ibv_sa_get_mcmember_rec - Looks up a multicast member record by its MGID and + * returns it if found. + * @channel: Event channel to issue query on. + * @device: Device associated with record. + * @port_num: Port number of record. + * @mgid: optional MGID of multicast group. + * @rec: Location to copy SA multicast member record. + * + * If an MGID is specified, returns an existing multicast member record if + * one is found for the local port. If no MGID is specified, or the specified + * MGID is 0, returns a multicast member record filled in with default values + * that may be used to create a new multicast group. + */ +int ibv_sa_get_mcmember_rec(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + union ibv_gid *mgid, + struct ibv_sa_net_mcmember_rec *rec); + +#endif /* SA_CLIENT_H */ Property changes on: libibsa/include/infiniband/sa_client.h ___________________________________________________________________ Name: svn:executable + * Index: libibsa/AUTHORS =================================================================== --- libibsa/AUTHORS (revision 0) +++ libibsa/AUTHORS (revision 0) @@ -0,0 +1 @@ +Sean Hefty Index: libibsa/configure.in =================================================================== --- libibsa/configure.in (revision 0) +++ libibsa/configure.in (revision 0) @@ -0,0 +1,50 @@ +dnl Process this file with autoconf to produce a configure script. 
+ +AC_PREREQ(2.57) +AC_INIT(libibsa, 0.9.0, openib-general at openib.org) +AC_CONFIG_SRCDIR([src/sa_client.c]) +AC_CONFIG_AUX_DIR(config) +AM_CONFIG_HEADER(config.h) +AM_INIT_AUTOMAKE(libibsa, 0.9.0) +AC_DISABLE_STATIC +AM_PROG_LIBTOOL + +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + +dnl Checks for programs +AC_PROG_CC + +dnl Checks for typedefs, structures, and compiler characteristics. +AC_C_CONST +AC_CHECK_SIZEOF(long) + +dnl Checks for libraries +if test "$disable_libcheck" != "yes" +then +AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], + AC_MSG_ERROR([ibv_get_device_list() not found. libibsa requires libibverbs.])) +fi + +dnl Checks for header files. +if test "$disable_libcheck" != "yes" +then +AC_CHECK_HEADER(infiniband/verbs.h, [], + AC_MSG_ERROR([ not found. Is libibverbs installed?])) +fi +AC_HEADER_STDC + +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, + if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then + ac_cv_version_script=yes + else + ac_cv_version_script=no + fi) + +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") + +AC_CONFIG_FILES([Makefile libibsa.spec]) +AC_OUTPUT Index: libibsa/INSTALL =================================================================== Index: libibsa/src/libibsa.map =================================================================== --- libibsa/src/libibsa.map (revision 0) +++ libibsa/src/libibsa.map (revision 0) @@ -0,0 +1,21 @@ +IB_SA_1.0 { + global: + ibv_sa_create_event_channel; + ibv_sa_destroy_event_channel; + ibv_sa_send_mad; + ibv_sa_get_event; + ibv_sa_ack_event; + ibv_sa_attr_size; + ibv_sa_get_attr; + ibv_sa_init_ah_from_path; + ibv_sa_init_ah_from_mcmember; + ibv_sa_join_multicast; + ibv_sa_free_multicast; + ibv_sa_get_mcmember_rec; + ibv_sa_get_field; + ibv_sa_set_field; + ibv_sa_unpack_path_rec; + ibv_sa_pack_mcmember_rec; + ibv_sa_unpack_mcmember_rec; + local: *; +}; Index: libibsa/src/sa_client.c =================================================================== --- libibsa/src/sa_client.c (revision 0) +++ libibsa/src/sa_client.c (revision 0) @@ -0,0 +1,557 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: cm.c 3453 2005-09-15 21:43:21Z sean.hefty $ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include + +#define PFX "libibsa: " + +#define container_of(ptr, type, field) \ + ((type *) ((void *) ptr - offsetof(type, field))) + +struct sa_event_tracking { + uint32_t events_completed; + pthread_cond_t cond; + pthread_mutex_t mut; +}; + +struct sa_event { + struct ibv_sa_event event; + struct ibv_sa_event_channel *channel; + void *data; + struct sa_event_tracking *event_tracking; +}; + +struct ibv_sa_multicast { + struct ibv_sa_event_channel *channel; + void *context; + uint32_t id; + struct sa_event_tracking event_tracking; +}; + +static int abi_ver; +static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER; + +#define USA_CREATE_MSG_CMD_RESP(msg, cmd, resp, type, size) \ +do { \ + struct usa_abi_cmd_hdr *hdr; \ + \ + size = sizeof(*hdr) + sizeof(*cmd); \ + msg = alloca(size); \ + if (!msg) \ + return ENOMEM; \ + hdr = msg; \ + cmd = msg + sizeof(*hdr); \ + hdr->cmd = type; \ + hdr->in = sizeof(*cmd); \ + hdr->out = sizeof(*resp); \ + memset(cmd, 0, sizeof(*cmd)); \ + resp = alloca(sizeof(*resp)); \ + if (!resp) \ + return ENOMEM; \ + cmd->response = (uintptr_t)resp; \ +} while (0) + +#define USA_CREATE_MSG_CMD(msg, cmd, type, size) \ +do { \ + struct usa_abi_cmd_hdr *hdr; \ + \ + size = sizeof(*hdr) + sizeof(*cmd); \ + msg = alloca(size); \ + if (!msg) \ + return ENOMEM; \ + hdr = msg; \ + cmd = msg + sizeof(*hdr); \ + hdr->cmd = type; \ + hdr->in = sizeof(*cmd); \ + hdr->out = 0; \ + memset(cmd, 0, sizeof(*cmd)); \ +} while (0) + +static int check_abi_version(void) +{ + char value[8]; + + if (ibv_read_sysfs_file(ibv_get_sysfs_path(), + "class/misc/ib_usa_default/abi_version", + value, sizeof value) < 0) { + /* + * Older version of Linux do not have class/misc. To support + * backports, assume the most recent version of the ABI. If + * we're wrong, we'll simply fail later when calling the ABI. 
+ */ + abi_ver = IB_USA_MAX_ABI_VERSION; + fprintf(stderr, PFX "couldn't read ABI version, assuming: %d\n", + abi_ver); + return 0; + } + + abi_ver = strtol(value, NULL, 10); + if (abi_ver < IB_USA_MIN_ABI_VERSION || + abi_ver > IB_USA_MAX_ABI_VERSION) { + fprintf(stderr, PFX "kernel ABI version %d " + "doesn't match library version %d.\n", + abi_ver, IB_USA_MAX_ABI_VERSION); + return -1; + } + return 0; +} + +static int usa_init(void) +{ + int ret = 0; + + pthread_mutex_lock(&mut); + if (!abi_ver) + ret = check_abi_version(); + pthread_mutex_unlock(&mut); + + return ret; +} + +struct ibv_sa_event_channel * +ibv_sa_create_event_channel(enum ibv_sa_access_mode mode) +{ + struct ibv_sa_event_channel *channel; + + if (usa_init()) + return NULL; + + channel = malloc(sizeof *channel); + if (!channel) + return NULL; + + if (mode == IBV_SA_ACCESS_DEFAULT) + channel->fd = open("/dev/infiniband/ib_usa_default", O_RDWR); + else + channel->fd = open("/dev/infiniband/ib_usa_raw", O_RDWR); + if (channel->fd < 0) { + fprintf(stderr, PFX "unable to open /dev/infiniband/" + "ib_usa_default or ib_usa_raw\n"); + goto err; + } + return channel; +err: + free(channel); + return NULL; +} + +void ibv_sa_destroy_event_channel(struct ibv_sa_event_channel *channel) +{ + close(channel->fd); + free(channel); +} + +static int init_event_tracking(struct sa_event_tracking *event_tracking) +{ + pthread_mutex_init(&event_tracking->mut, NULL); + return pthread_cond_init(&event_tracking->cond, NULL); +} + +static void cleanup_event_tracking(struct sa_event_tracking *event_tracking) +{ + pthread_cond_destroy(&event_tracking->cond); + pthread_mutex_destroy(&event_tracking->mut); +} + +static void wait_for_events(struct sa_event_tracking *event_tracking, + int events_reported) +{ + pthread_mutex_lock(&event_tracking->mut); + while (event_tracking->events_completed < events_reported) + pthread_cond_wait(&event_tracking->cond, &event_tracking->mut); + pthread_mutex_unlock(&event_tracking->mut); +} + +static void complete_event(struct sa_event_tracking *event_tracking) +{ + pthread_mutex_lock(&event_tracking->mut); + event_tracking->events_completed++; + pthread_cond_signal(&event_tracking->cond); + pthread_mutex_unlock(&event_tracking->mut); +} + +int ibv_sa_send_mad(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + uint8_t method, void *attr, uint16_t attr_id, + uint64_t comp_mask, int timeout_ms, int retries, + void *context) +{ + struct usa_abi_send_mad *cmd; + void *msg; + int ret, size; + + USA_CREATE_MSG_CMD(msg, cmd, USA_CMD_SEND_MAD, size); + cmd->uid = (uintptr_t) context; + cmd->node_guid = ibv_get_device_guid(device->device); + cmd->comp_mask = comp_mask; + cmd->attr = (uintptr_t) attr; + cmd->port_num = port_num; + cmd->method = method; + cmd->attr_id = attr_id; + cmd->timeout_ms = timeout_ms; + cmd->retries = retries; + + ret = write(channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? 
ENODATA : ret; + + return 0; +} + +static void copy_event_attr(struct sa_event *evt, + struct usa_abi_event_resp *resp) +{ + struct ibv_sa_mad *mad; + int size; + + size = resp->data_len - IBV_SA_MAD_HEADER_SIZE; + if (size <= 0) + return; + + evt->data = malloc(size); + if (!evt->data) + return; + + mad = (struct ibv_sa_mad *) resp->data; + memcpy(evt->data, mad->sa_data, size); + evt->event.attr = evt->data; + evt->event.attr_id = mad->attribute_id; + evt->event.attr_size = ibv_sa_attr_size(mad->attribute_id); + evt->event.attr_offset = ntohs(mad->attribute_offset) * 8; + if (evt->event.attr_offset) + evt->event.attr_count = size / evt->event.attr_offset; +} + +static int get_event_attr(struct sa_event *evt, + struct usa_abi_event_resp *resp) +{ + struct ibv_sa_mad *mad; + struct usa_abi_get_data *cmd; + void *msg; + int ret, size; + + USA_CREATE_MSG_CMD(msg, cmd, USA_CMD_GET_DATA, size); + cmd->id = resp->id; + + evt->data = malloc(resp->data_len); + if (evt->data) { + cmd->response = (uintptr_t) evt->data; + ((struct usa_abi_cmd_hdr *) msg)->out = resp->data_len; + } + + ret = write(evt->channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? ENODATA : ret; + + mad = (struct ibv_sa_mad *) resp->data; + evt->event.attr = evt->data; + evt->event.attr_id = mad->attribute_id; + evt->event.attr_size = ibv_sa_attr_size(mad->attribute_id); + evt->event.attr_offset = ntohs(mad->attribute_offset) * 8; + if (evt->event.attr_offset) + evt->event.attr_count = (resp->data_len - + IBV_SA_MAD_HEADER_SIZE) / + evt->event.attr_offset; + return 0; +} + +static void process_mad_event(struct sa_event *evt, + struct usa_abi_event_resp *resp) +{ + evt->event.context = (void *) (uintptr_t) resp->uid; + if (resp->data_len <= IB_USA_EVENT_DATA) + copy_event_attr(evt, resp); + else + get_event_attr(evt, resp); +} + +static void process_mcast_event(struct sa_event *evt, + struct usa_abi_event_resp *resp) +{ + struct ibv_sa_multicast *multicast; + + multicast = (void *) (uintptr_t) resp->uid; + evt->event.context = multicast->context; + evt->event_tracking = &multicast->event_tracking; + multicast->id = resp->id; + + evt->data = malloc(IBV_SA_ATTR_MC_MEMBER_REC_LEN); + if (!evt->data) + return; + + memcpy(evt->data, resp->data, IBV_SA_ATTR_MC_MEMBER_REC_LEN); + evt->event.attr = evt->data; + evt->event.attr_id = IBV_SA_ATTR_MC_MEMBER_REC; + evt->event.attr_size = IBV_SA_ATTR_MC_MEMBER_REC_LEN; + evt->event.attr_offset = IBV_SA_ATTR_MC_MEMBER_REC_LEN; + evt->event.attr_count = 1; +} + +int ibv_sa_get_event(struct ibv_sa_event_channel *channel, + struct ibv_sa_event **event) +{ + struct usa_abi_get_event *cmd; + struct usa_abi_event_resp *resp; + struct sa_event *evt; + void *msg; + int ret, size; + + evt = malloc(sizeof *evt); + if (!evt) + return ENOMEM; + memset(evt, 0, sizeof *evt); + + USA_CREATE_MSG_CMD_RESP(msg, cmd, resp, USA_CMD_GET_EVENT, size); + ret = write(channel->fd, msg, size); + if (ret != size) { + ret = (ret > 0) ? 
ENODATA : ret; + goto err; + } + + evt->channel = channel; + evt->event.event = resp->event; + evt->event.status = resp->status; + + switch (resp->event) { + case USA_EVENT_MAD: + process_mad_event(evt, resp); + break; + case USA_EVENT_MCAST: + process_mcast_event(evt, resp); + break; + default: + break; + } + + *event = &evt->event; + return 0; +err: + free(evt); + return ret; +} + +int ibv_sa_ack_event(struct ibv_sa_event *event) +{ + struct sa_event *evt = container_of(event, struct sa_event, event); + + if (evt->data) + free(evt->data); + + if (evt->event_tracking) + complete_event(evt->event_tracking); + + free(event); + return 0; +} + +int ibv_sa_join_multicast(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_mcmember_rec *rec, + uint64_t comp_mask, void *context, + struct ibv_sa_multicast **multicast) +{ + struct usa_abi_join_mcast *cmd; + struct usa_abi_id_resp *resp; + struct ibv_sa_multicast *mcast; + void *msg; + int ret, size; + + mcast = malloc(sizeof *mcast); + if (!mcast) + return ENOMEM; + memset(mcast, 0, sizeof *mcast); + + mcast->channel = channel; + mcast->context = context; + ret = init_event_tracking(&mcast->event_tracking); + if (ret) + goto err; + + USA_CREATE_MSG_CMD_RESP(msg, cmd, resp, USA_CMD_JOIN_MCAST, size); + cmd->uid = (uintptr_t) mcast; + cmd->node_guid = ibv_get_device_guid(device->device); + cmd->comp_mask = comp_mask; + cmd->mcmember_rec = (uintptr_t) rec; + cmd->port_num = port_num; + + ret = write(channel->fd, msg, size); + if (ret != size) { + ret = (ret > 0) ? ENODATA : ret; + goto err; + } + + mcast->id = resp->id; + *multicast = mcast; + return 0; +err: + cleanup_event_tracking(&mcast->event_tracking); + free(mcast); + return ret; +} + +int ibv_sa_free_multicast(struct ibv_sa_multicast *multicast) +{ + struct usa_abi_free_id *cmd; + struct usa_abi_free_resp *resp; + void *msg; + int ret, size; + + USA_CREATE_MSG_CMD_RESP(msg, cmd, resp, USA_CMD_FREE_ID, size); + ret = write(multicast->channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? ENODATA : ret; + + wait_for_events(&multicast->event_tracking, resp->events_reported); + + cleanup_event_tracking(&multicast->event_tracking); + free(multicast); + return 0; +} + +int ibv_sa_get_mcmember_rec(struct ibv_sa_event_channel *channel, + struct ibv_context *device, uint8_t port_num, + union ibv_gid *mgid, + struct ibv_sa_net_mcmember_rec *rec) +{ + struct usa_abi_get_mcast *cmd; + void *msg; + int ret, size; + + USA_CREATE_MSG_CMD(msg, cmd, USA_CMD_GET_MCAST, size); + cmd->node_guid = ibv_get_device_guid(device->device); + cmd->port_num = port_num; + cmd->response = (uintptr_t) rec; + ((struct usa_abi_cmd_hdr *) msg)->out = sizeof *rec; + if (mgid) + memcpy(cmd->mgid, mgid->raw, sizeof *mgid); + + ret = write(channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? 
ENODATA : ret; + + return 0; +} + +static int get_gid_index(struct ibv_context *device, uint8_t port_num, + union ibv_gid *sgid) +{ + union ibv_gid gid; + int i, ret; + + for (i = 0, ret = 0; !ret; i++) { + ret = ibv_query_gid(device, port_num, i, &gid); + if (!ret && !memcmp(sgid, &gid, sizeof gid)) { + ret = i; + break; + } + } + return ret; +} + +int ibv_sa_init_ah_from_path(struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_path_rec *path_rec, + struct ibv_ah_attr *ah_attr) +{ + struct ibv_sa_path_rec rec; + int ret; + + ibv_sa_unpack_path_rec(&rec, path_rec); + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = ntohs(rec.dlid); + ah_attr->sl = rec.sl; + ah_attr->src_path_bits = ntohs(rec.slid) & 0x7F; + ah_attr->port_num = port_num; + + if (rec.hop_limit > 1) { + ah_attr->is_global = 1; + ah_attr->grh.dgid = rec.dgid; + ret = get_gid_index(device, port_num, &rec.sgid); + if (ret < 0) + return ret; + + ah_attr->grh.sgid_index = (uint8_t) ret; + ah_attr->grh.flow_label = ntohl(rec.flow_label); + ah_attr->grh.hop_limit = rec.hop_limit; + ah_attr->grh.traffic_class = rec.traffic_class; + } + return 0; +} + +int ibv_sa_init_ah_from_mcmember(struct ibv_context *device, uint8_t port_num, + struct ibv_sa_net_mcmember_rec *mc_rec, + struct ibv_ah_attr *ah_attr) +{ + struct ibv_sa_mcmember_rec rec; + int ret; + + ibv_sa_unpack_mcmember_rec(&rec, mc_rec); + + ret = get_gid_index(device, port_num, &rec.port_gid); + if (ret < 0) + return ret; + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = ntohs(rec.mlid); + ah_attr->sl = rec.sl; + ah_attr->port_num = port_num; + ah_attr->static_rate = rec.rate; + + ah_attr->is_global = 1; + ah_attr->grh.dgid = rec.mgid; + + ah_attr->grh.sgid_index = (uint8_t) ret; + ah_attr->grh.flow_label = ntohl(rec.flow_label); + ah_attr->grh.hop_limit = rec.hop_limit; + ah_attr->grh.traffic_class = rec.traffic_class; + return 0; +} Property changes on: libibsa/src/sa_client.c ___________________________________________________________________ Name: svn:executable + * Index: libibsa/src/sa_net.c =================================================================== --- libibsa/src/sa_net.c (revision 0) +++ libibsa/src/sa_net.c (revision 0) @@ -0,0 +1,265 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: cm.c 3453 2005-09-15 21:43:21Z sean.hefty $ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include + +#include + +int ibv_sa_attr_size(uint16_t attr_id) +{ + int size; + + switch (attr_id) { + case IBV_SA_ATTR_CLASS_PORTINFO: + size = IBV_SA_ATTR_CLASS_PORTINFO_LEN; + break; + case IBV_SA_ATTR_NOTICE: + size = IBV_SA_ATTR_NOTICE_LEN; + break; + case IBV_SA_ATTR_INFORM_INFO: + size = IBV_SA_ATTR_INFORM_INFO_LEN; + break; + case IBV_SA_ATTR_NODE_REC: + size = IBV_SA_ATTR_NODE_REC_LEN; + break; + case IBV_SA_ATTR_PORT_INFO_REC: + size = IBV_SA_ATTR_PORT_INFO_REC_LEN; + break; + case IBV_SA_ATTR_SL2VL_REC: + size = IBV_SA_ATTR_SL2VL_REC_LEN; + break; + case IBV_SA_ATTR_SWITCH_REC: + size = IBV_SA_ATTR_SWITCH_REC_LEN; + break; + case IBV_SA_ATTR_LINEAR_FDB_REC: + size = IBV_SA_ATTR_LINEAR_FDB_REC_LEN; + break; + case IBV_SA_ATTR_RANDOM_FDB_REC: + size = IBV_SA_ATTR_RANDOM_FDB_REC_LEN; + break; + case IBV_SA_ATTR_MCAST_FDB_REC: + size = IBV_SA_ATTR_MCAST_FDB_REC_LEN; + break; + case IBV_SA_ATTR_SM_INFO_REC: + size = IBV_SA_ATTR_SM_INFO_REC_LEN; + break; + case IBV_SA_ATTR_LINK_REC: + size = IBV_SA_ATTR_LINK_REC_LEN; + break; + case IBV_SA_ATTR_GUID_INFO_REC: + size = IBV_SA_ATTR_GUID_INFO_REC_LEN; + break; + case IBV_SA_ATTR_SERVICE_REC: + size = IBV_SA_ATTR_SERVICE_REC_LEN; + break; + case IBV_SA_ATTR_PARTITION_REC: + size = IBV_SA_ATTR_PARTITION_REC_LEN; + break; + case IBV_SA_ATTR_PATH_REC: + size = IBV_SA_ATTR_PATH_REC_LEN; + break; + case IBV_SA_ATTR_VL_ARB_REC: + size = IBV_SA_ATTR_VL_ARB_REC_LEN; + break; + case IBV_SA_ATTR_MC_MEMBER_REC: + size = IBV_SA_ATTR_MC_MEMBER_REC_LEN; + break; + case IBV_SA_ATTR_TRACE_REC: + size = IBV_SA_ATTR_TRACE_REC_LEN; + break; + case IBV_SA_ATTR_MULTI_PATH_REC: + size = IBV_SA_ATTR_MULTI_PATH_REC_LEN; + break; + case IBV_SA_ATTR_SERVICE_ASSOC_REC: + size = IBV_SA_ATTR_SERVICE_ASSOC_REC_LEN; + break; + case IBV_SA_ATTR_INFORM_INFO_REC: + size = IBV_SA_ATTR_INFORM_INFO_REC_LEN; + break; + default: + size = 0; + break; + } + return size; +} + +uint32_t ibv_sa_get_field(void *data, int offset, int size) +{ + uint32_t value, left_offset; + + left_offset = offset & 0x07; + if (size <= 8) { + value = ((uint8_t *) data)[offset / 8]; + value = ((value << left_offset) & 0xFF) >> (8 - size); + } else if (size <= 16) { + value = ntohs(((uint16_t *) data)[offset / 16]); + value = ((value << left_offset) & 0xFFFF) >> (16 - size); + } else { + value = ntohl(((uint32_t *) data)[offset / 32]); + value = (value << left_offset) >> (32 - size); + } + return value; +} + +void ibv_sa_set_field(void *data, uint32_t value, int offset, int size) +{ + uint32_t left_value, right_value; + uint32_t left_offset, right_offset; + uint32_t field_size; + + if (size <= 8) + field_size = 8; + else if (size <= 16) + field_size = 16; + else + field_size = 32; + + left_offset = offset & 0x07; + right_offset = field_size - left_offset - size; + + left_value = left_offset ? ibv_sa_get_field(data, offset - left_offset, + left_offset) : 0; + right_value = right_offset ? 
ibv_sa_get_field(data, offset + size, + right_offset) : 0; + + value = (left_value << (size + right_offset)) | + (value << right_offset) | right_value; + + if (field_size == 8) + ((uint8_t *) data)[offset / 8] = (uint8_t) value; + else if (field_size == 16) + ((uint16_t *) data)[offset / 16] = htons((uint16_t) value); + else + ((uint32_t *) data)[offset / 32] = htonl((uint32_t) value); +} + +void ibv_sa_unpack_path_rec(struct ibv_sa_path_rec *rec, + struct ibv_sa_net_path_rec *net_rec) +{ + memcpy(rec->dgid.raw, net_rec->dgid, sizeof net_rec->dgid); + memcpy(rec->sgid.raw, net_rec->sgid, sizeof net_rec->sgid); + rec->dlid = net_rec->dlid; + rec->slid = net_rec->slid; + + rec->raw_traffic = ibv_sa_get_field(net_rec, 352, 1); + rec->flow_label = htonl(ibv_sa_get_field(net_rec, 356, 20)); + rec->hop_limit = (uint8_t) ibv_sa_get_field(net_rec, 376, 8); + rec->traffic_class = net_rec->tclass; + + rec->reversible = htonl(ibv_sa_get_field(net_rec, 392, 1)); + rec->numb_path = (uint8_t) ibv_sa_get_field(net_rec, 393, 7); + rec->pkey = net_rec->pkey; + rec->sl = (uint8_t) ibv_sa_get_field(net_rec, 428, 4); + + rec->mtu_selector = (uint8_t) ibv_sa_get_field(net_rec, 432, 2); + rec->mtu = (uint8_t) ibv_sa_get_field(net_rec, 434, 6); + + rec->rate_selector = (uint8_t) ibv_sa_get_field(net_rec, 440, 2); + rec->rate = (uint8_t) ibv_sa_get_field(net_rec, 442, 6); + + rec->packet_life_time_selector = (uint8_t) ibv_sa_get_field(net_rec, + 448, 2); + rec->packet_life_time = (uint8_t) ibv_sa_get_field(net_rec, 450, 6); + + rec->preference = net_rec->preference; +} + +void ibv_sa_pack_mcmember_rec(struct ibv_sa_net_mcmember_rec *net_rec, + struct ibv_sa_mcmember_rec *rec) +{ + memcpy(net_rec->mgid, rec->mgid.raw, sizeof net_rec->mgid); + memcpy(net_rec->port_gid, rec->port_gid.raw, sizeof net_rec->port_gid); + net_rec->qkey = rec->qkey; + net_rec->mlid = rec->mlid; + + ibv_sa_set_field(net_rec, rec->mtu_selector, 304, 2); + ibv_sa_set_field(net_rec, rec->mtu, 306, 6); + + net_rec->tclass = rec->traffic_class; + net_rec->pkey = rec->pkey; + + ibv_sa_set_field(net_rec, rec->rate_selector, 336, 2); + ibv_sa_set_field(net_rec, rec->rate, 338, 6); + + ibv_sa_set_field(net_rec, rec->packet_life_time_selector, 344, 2); + ibv_sa_set_field(net_rec, rec->packet_life_time, 346, 6); + + ibv_sa_set_field(net_rec, rec->sl, 352, 4); + ibv_sa_set_field(net_rec, ntohl(rec->flow_label), 356, 20); + ibv_sa_set_field(net_rec, rec->hop_limit, 376, 8); + + ibv_sa_set_field(net_rec, rec->scope, 384, 4); + ibv_sa_set_field(net_rec, rec->join_state, 388, 4); + + ibv_sa_set_field(net_rec, rec->proxy_join, 392, 1); +} + +void ibv_sa_unpack_mcmember_rec(struct ibv_sa_mcmember_rec *rec, + struct ibv_sa_net_mcmember_rec *net_rec) +{ + memcpy(rec->mgid.raw, net_rec->mgid, sizeof rec->mgid); + memcpy(rec->port_gid.raw, net_rec->port_gid, sizeof rec->port_gid); + rec->qkey = net_rec->qkey; + rec->mlid = net_rec->mlid; + + rec->mtu_selector = (uint8_t) ibv_sa_get_field(net_rec, 304, 2); + rec->mtu = (uint8_t) ibv_sa_get_field(net_rec, 306, 6); + + rec->traffic_class = net_rec->tclass; + rec->pkey = net_rec->pkey; + + rec->rate_selector = (uint8_t) ibv_sa_get_field(net_rec, 336, 2); + rec->rate = (uint8_t) ibv_sa_get_field(net_rec, 338, 6); + + rec->packet_life_time_selector = (uint8_t) ibv_sa_get_field(net_rec, + 344, 2); + rec->packet_life_time = (uint8_t) ibv_sa_get_field(net_rec, 346, 6); + + rec->sl = (uint8_t) ibv_sa_get_field(net_rec, 352, 4); + rec->flow_label = htonl(ibv_sa_get_field(net_rec, 356, 20)); + rec->hop_limit = 
ibv_sa_get_field(net_rec, 376, 8); + + rec->scope = (uint8_t) ibv_sa_get_field(net_rec, 384, 4); + rec->join_state = (uint8_t) ibv_sa_get_field(net_rec, 388, 4); + + rec->proxy_join = ibv_sa_get_field(net_rec, 392, 1); +} Property changes on: libibsa/src/sa_net.c ___________________________________________________________________ Name: svn:executable + * Index: libibsa/ChangeLog =================================================================== Index: libibsa/COPYING =================================================================== --- libibsa/COPYING (revision 0) +++ libibsa/COPYING (revision 0) @@ -0,0 +1,378 @@ +This software is available to you under a choice of one of two +licenses. You may choose to be licensed under the terms of the the +OpenIB.org BSD license or the GNU General Public License (GPL) Version +2, both included below. + +Copyright (c) 2005 Intel Corporation. All rights reserved. + +================================================================== + + OpenIB.org BSD license + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + + * Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + * Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials provided + with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS +FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE +COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, +INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, +BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN +ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +================================================================== + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. 
Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. 
+ +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. 
(This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. 
+ +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. 
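For reference, the offset/length pairs passed to ibv_sa_get_field() and
ibv_sa_set_field() in sa_net.c above are absolute bit positions within the
SA attribute wire format.  A minimal sketch of how a caller can exercise
them, assuming a 64-byte PathRecord buffer and linking against the library
above (the extern declarations stand in for the library header, and the
small main() is purely illustrative, not part of the patch):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* provided by sa_net.c; declared extern here rather than guessing
     * at the installed header name */
    extern uint32_t ibv_sa_get_field(void *data, int offset, int size);
    extern void ibv_sa_set_field(void *data, uint32_t value, int offset,
                                 int size);

    int main(void)
    {
            uint8_t rec[64];        /* assumed PathRecord wire size */

            memset(rec, 0, sizeof rec);

            /* NumbPath: bit offset 393, 7 bits wide, i.e. the low 7 bits
             * of byte 393 / 8 = 49; the top bit of that byte is the
             * Reversible flag. */
            ibv_sa_set_field(rec, 3, 393, 7);

            /* get_field shifts left by 393 & 7 = 1 to drop Reversible,
             * then right by 8 - 7 = 1 to realign the value. */
            printf("NumbPath = %u (byte 49 = 0x%02x)\n",
                   (unsigned) ibv_sa_get_field(rec, 393, 7),
                   (unsigned) rec[49]);
            return 0;
    }

With the encoding in sa_net.c this should print NumbPath = 3 and byte 49 =
0x03, which is consistent with the 393/7 pair used by
ibv_sa_unpack_path_rec() and with the IBV_SA_PATH_REC_NUMB_PATH_OFFSET and
IBV_SA_PATH_REC_NUMB_PATH_LENGTH constants used by satest.c below.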
Index: libibsa/Makefile.am
===================================================================
--- libibsa/Makefile.am	(revision 0)
+++ libibsa/Makefile.am	(revision 0)
@@ -0,0 +1,40 @@
+# $Id: Makefile.am 3373 2005-09-12 16:34:20Z roland $
+INCLUDES = -I$(srcdir)/include
+
+AM_CFLAGS = -g -Wall -D_GNU_SOURCE
+
+ibsalibdir = $(libdir)
+
+ibsalib_LTLIBRARIES = src/libibsa.la
+
+src_libibsa_la_CFLAGS = -g -Wall -D_GNU_SOURCE
+
+if HAVE_LD_VERSION_SCRIPT
+    ibsa_version_script = -Wl,--version-script=$(srcdir)/src/libibsa.map
+else
+    ibsa_version_script =
+endif
+
+src_libibsa_la_SOURCES = src/sa_client.c src/sa_net.c
+src_libibsa_la_LDFLAGS = -avoid-version $(ibsa_version_script)
+
+bin_PROGRAMS = examples/satest examples/mchammer
+examples_satest_SOURCES = examples/satest.c
+examples_satest_LDADD = $(top_builddir)/src/libibsa.la
+examples_mchammer_SOURCES = examples/mchammer.c
+examples_mchammer_LDADD = $(top_builddir)/src/libibsa.la
+
+libibsaincludedir = $(includedir)/infiniband
+
+libibsainclude_HEADERS = include/infiniband/sa_client_abi.h \
+	include/infiniband/sa_client.h \
+	include/infiniband/sa_net.h
+
+EXTRA_DIST = include/infiniband/sa_client_abi.h \
+	include/infiniband/sa_client.h \
+	include/infiniband/sa_net.h \
+	src/libibsa.map \
+	libibsa.spec.in
+
+dist-hook: libibsa.spec
+	cp libibsa.spec $(distdir)
Index: libibsa/autogen.sh
===================================================================
--- libibsa/autogen.sh	(revision 0)
+++ libibsa/autogen.sh	(revision 0)
@@ -0,0 +1,8 @@
+#! /bin/sh
+
+set -x
+aclocal -I config
+libtoolize --force --copy
+autoheader
+automake --foreign --add-missing --copy
+autoconf

Property changes on: libibsa/autogen.sh
___________________________________________________________________
Name: svn:executable
   + *

Index: libibsa/NEWS
===================================================================
Index: libibsa/README
===================================================================
--- libibsa/README	(revision 0)
+++ libibsa/README	(revision 0)
@@ -0,0 +1,34 @@
+This README is for the userspace SA client library.
+
+Building
+
+To make this directory, run:
+./autogen.sh && ./configure && make && make install
+
+Typically the autogen and configure steps only need to be done the first
+time unless configure.in or Makefile.am changes.
+
+Libraries are installed by default at /usr/local/lib.
+
+Device files
+
+The userspace SA client uses two device files, exported as misc devices,
+regardless of the number of adapters or ports present.
+
+ib_usa_default allows basic SA query services and join operations.  It is
+intended to be safe for use by most applications.
+
+ib_usa_raw allows unfiltered sends to the SA and should be restricted to
+privileged users.
+
+To create appropriate character device files automatically with udev, a rule
+like
+
+  KERNEL="ib_usa_default", NAME="infiniband/%k", MODE="0666"
+
+can be used.  This will create the device node named
+
+  /dev/infiniband/ib_usa_default
+
+A similar rule should be added for ib_usa_raw if privileged sends to the SA
+will be allowed, typically with mode "0600".
Index: libibsa/examples/satest.c
===================================================================
--- libibsa/examples/satest.c	(revision 0)
+++ libibsa/examples/satest.c	(revision 0)
@@ -0,0 +1,225 @@
+/*
+ * Copyright (c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include + +#include +#include + +/* + * To execute: + * satest slid dlid + */ + +struct ibv_context *verbs; +struct ibv_sa_event_channel *channel; +uint16_t slid; +uint16_t dlid; + +static int init(void) +{ + struct ibv_device **dev_list; + int ret = 0; + + dev_list = ibv_get_device_list(NULL); + if (!dev_list[0]) + return -1; + + verbs = ibv_open_device(dev_list[0]); + ibv_free_device_list(dev_list); + if (!verbs) + return -1; + + channel = ibv_sa_create_event_channel(IBV_SA_ACCESS_DEFAULT); + if (!channel) { + printf("ibv_sa_create_event_channel failed\n"); + ibv_close_device(verbs); + ret = 1; + } + + return ret; +} + +static void cleanup(void) +{ + ibv_sa_destroy_event_channel(channel); + ibv_close_device(verbs); +} + +static int query_one_path(struct ibv_sa_net_path_rec *path_rec) +{ + struct ibv_sa_event *event; + int ret; + + path_rec->slid = slid; + path_rec->dlid = dlid; + ibv_sa_set_field(path_rec, 1, IBV_SA_PATH_REC_NUMB_PATH_OFFSET, + IBV_SA_PATH_REC_NUMB_PATH_LENGTH); + ret = ibv_sa_send_mad(channel, verbs, 1, IBV_SA_METHOD_GET, + path_rec, IBV_SA_ATTR_PATH_REC, + IBV_SA_PATH_REC_SLID | + IBV_SA_PATH_REC_DLID | + IBV_SA_PATH_REC_NUMB_PATH, 3000, 3, NULL); + if (ret) { + printf("query_one_path ibv_sa_send_mad failed: %d\n", ret); + return ret; + } + + ret = ibv_sa_get_event(channel, &event); + if (ret) { + printf("query_one_path ibv_sa_get_event failed: %d\n", ret); + return ret; + } + + if (event->status) { + printf("query_one_path: status = %d\n", event->status); + ret = event->status; + goto out; + } + + memcpy(path_rec, event->attr, event->attr_size); +out: + ibv_sa_ack_event(event); + return ret; +} + +static int query_many_paths(struct ibv_sa_event **event) +{ + struct ibv_sa_net_path_rec path_rec; + int ret; + + path_rec.slid = slid; + ret = ibv_sa_send_mad(channel, verbs, 1, IBV_SA_METHOD_GET_TABLE, + &path_rec, IBV_SA_ATTR_PATH_REC, + IBV_SA_PATH_REC_SLID, 3000, 3, NULL); + if (ret) { + printf("query_many_paths ibv_sa_send_mad failed: %d\n", ret); + return ret; + } + + ret = ibv_sa_get_event(channel, event); + if (ret) { + printf("query_many_paths ibv_sa_get_event failed: %d\n", ret); + return ret; + } + + if ((*event)->status) { + printf("query_many_paths: status = %d\n", 
(*event)->status); + ret = (*event)->status; + goto err; + } + + return 0; +err: + ibv_sa_ack_event(*event); + return ret; +} + +static int verify_paths(struct ibv_sa_net_path_rec *path_rec, + struct ibv_sa_event *event) +{ + struct ibv_sa_net_path_rec *rec; + int i, ret = -1; + + if (path_rec->slid != slid || path_rec->dlid != dlid) { + printf("path_rec slid or dlid does not match request\n"); + return -1; + } + + for (i = 0; i < event->attr_count; i++) { + rec = ibv_sa_get_attr(event, i); + + if (rec->slid != slid) { + printf("rec slid does not match request\n"); + return -1; + } + + if (path_rec->dlid == rec->dlid && + !memcmp(path_rec, rec, sizeof *rec)) + ret = 0; + } + + if (ret) + printf("path_rec not found in returned list\n"); + + return ret; +} + +static int run_path_query(void) +{ + struct ibv_sa_net_path_rec path_rec; + struct ibv_sa_event *event; + int ret; + + ret = query_one_path(&path_rec); + if (ret) + return ret; + + ret = query_many_paths(&event); + if (ret) + return ret; + + ret = verify_paths(&path_rec, event); + + ibv_sa_ack_event(event); + return ret; +} + +int main(int argc, char **argv) +{ + int ret; + + if (argc != 3) { + printf("usage: %s slid dlid\n", argv[0]); + exit(1); + } + + slid = htons((uint16_t) atoi(argv[1])); + dlid = htons((uint16_t) atoi(argv[2])); + + if (init()) + exit(1); + + ret = run_path_query(); + + printf("test complete\n"); + cleanup(); + printf("return status %d\n", ret); + return ret; +} Property changes on: libibsa/examples/satest.c ___________________________________________________________________ Name: svn:executable + * Index: libibsa/examples/mchammer.c =================================================================== --- libibsa/examples/mchammer.c (revision 0) +++ libibsa/examples/mchammer.c (revision 0) @@ -0,0 +1,391 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +/* + * To execute: + * mchammer {r | s} + */ + +struct ibv_context *verbs; +struct ibv_pd *pd; +struct ibv_cq *cq; +struct ibv_qp *qp; +struct ibv_mr *mr; +void *msgs; + +static int message_count = 10; +static int message_size = 100; +static int sender; + +static int post_recvs(void) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + + sge.length = message_size + sizeof(struct ibv_grh);; + sge.lkey = mr->lkey; + sge.addr = (uintptr_t) msgs; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int create_qp(void) +{ + struct ibv_qp_init_attr init_qp_attr; + struct ibv_qp_attr qp_attr; + int ret; + + memset(&init_qp_attr, 0, sizeof init_qp_attr); + init_qp_attr.cap.max_send_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_recv_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_send_sge = 1; + init_qp_attr.cap.max_recv_sge = 1; + init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.send_cq = cq; + init_qp_attr.recv_cq = cq; + qp = ibv_create_qp(pd, &init_qp_attr); + if (!qp) { + printf("unable to create QP\n"); + return -1; + } + + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.pkey_index = 0; + qp_attr.port_num = 1; + qp_attr.qkey = 0x01234567; + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | + IBV_QP_PORT | IBV_QP_QKEY); + if (ret) { + printf("failed to modify QP to INIT\n"); + goto err; + } + + qp_attr.qp_state = IBV_QPS_RTR; + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE); + if (ret) { + printf("failed to modify QP to RTR\n"); + goto err; + } + + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 0; + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_SQ_PSN); + if (ret) { + printf("failed to modify QP to RTS\n"); + goto err; + } + return 0; +err: + ibv_destroy_qp(qp); + return ret; +} + +static int create_messages(void) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + msgs = malloc(message_size + sizeof(struct ibv_grh)); + if (!msgs) { + printf("failed message allocation\n"); + return -1; + } + mr = ibv_reg_mr(pd, msgs, message_size + sizeof(struct ibv_grh), + IBV_ACCESS_LOCAL_WRITE); + if (!mr) { + printf("failed to reg MR\n"); + free(msgs); + return -1; + } + return 0; +} + +static void destroy_messages(void) +{ + if (!message_count) + return; + + ibv_dereg_mr(mr); + free(msgs); +} + +static int init(void) +{ + struct ibv_device **dev_list; + int ret; + + dev_list = ibv_get_device_list(NULL); + if (!dev_list[0]) + return -1; + + verbs = ibv_open_device(dev_list[0]); + ibv_free_device_list(dev_list); + if (!verbs) + return -1; + + pd = ibv_alloc_pd(verbs); + if (!pd) { + printf("unable to alloc PD\n"); + return -1; + } + + ret = create_messages(); + if (ret) { + printf("unable to create test messages\n"); + goto err1; + } + + cq = ibv_create_cq(verbs, message_count, NULL, NULL, 0); + if (!cq) { + printf("unable to create CQ\n"); + ret = -1; + goto err2; + } + + ret = create_qp(); + if (ret) { + printf("unable to create QP\n"); + goto err3; + } + return 0; + +err3: + ibv_destroy_cq(cq); +err2: + destroy_messages(); +err1: + ibv_dealloc_pd(pd); + return -1; +} + +static void cleanup(void) +{ + 
ibv_destroy_qp(qp); + ibv_destroy_cq(cq); + destroy_messages(); + ibv_dealloc_pd(pd); + ibv_close_device(verbs); +} + +static int send_msgs(struct ibv_ah *ah) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + struct ibv_sge sge; + int i, ret = 0; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND; + send_wr.send_flags = 0; + + send_wr.wr.ud.ah = ah; + send_wr.wr.ud.remote_qpn = 0xFFFFFF; + send_wr.wr.ud.remote_qkey = 0x01234567; + + sge.length = message_size; + sge.lkey = mr->lkey; + sge.addr = (uintptr_t) msgs; + + for (i = 0; i < message_count && !ret; i++) { + ret = ibv_post_send(qp, &send_wr, &bad_send_wr); + if (ret) + printf("failed to post sends: %d\n", ret); + } + return ret; +} + +static int poll_cq(void) +{ + struct ibv_wc wc[8]; + int done, ret; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(cq, 8, wc); + if (ret < 0) { + printf("failed polling CQ: %d\n", ret); + return ret; + } + } + + return 0; +} + +static int run(void) +{ + struct ibv_sa_event_channel *channel; + struct ibv_sa_multicast *mcast; + struct ibv_sa_event *event; + struct ibv_sa_net_mcmember_rec mc_rec, *rec; + struct ibv_ah_attr ah_attr; + struct ibv_ah *ah; + uint64_t comp_mask; + int ret; + + channel = ibv_sa_create_event_channel(IBV_SA_ACCESS_DEFAULT); + if (!channel) { + printf("ibv_sa_create_event_channel failed\n"); + return -1; + } + + ret = ibv_sa_get_mcmember_rec(channel, verbs, 1, NULL, &mc_rec); + if (ret) { + printf("ibv_sa_get_mcmember_rec failed\n"); + goto out1; + } + + printf("joining multicast group\n"); + mc_rec.mgid[0] = 0xFF; /* multicast GID */ + mc_rec.mgid[1] = 0x12; /* not permanent (7:4), link-local (3:0) */ + strcpy(&mc_rec.mgid[2], "7471"); /* our GID */ + mc_rec.qkey = htonl(0x01234567); + comp_mask = IBV_SA_MCMEMBER_REC_MGID | IBV_SA_MCMEMBER_REC_PORT_GID | + IBV_SA_MCMEMBER_REC_PKEY | IBV_SA_MCMEMBER_REC_JOIN_STATE | + IBV_SA_MCMEMBER_REC_QKEY | IBV_SA_MCMEMBER_REC_SL | + IBV_SA_MCMEMBER_REC_FLOW_LABEL | + IBV_SA_MCMEMBER_REC_TRAFFIC_CLASS; + ret = ibv_sa_join_multicast(channel, verbs, 1, &mc_rec, + comp_mask, NULL, &mcast); + if (ret) { + printf("ibv_sa_join_multicast failed\n"); + goto out1; + } + + ret = ibv_sa_get_event(channel, &event); + if (ret) { + printf("ibv_sa_get_event failed\n"); + goto out2; + } + + if (event->status) { + printf("join failed: %d\n", event->status); + ret = event->status; + goto out3; + } + + rec = (struct ibv_sa_net_mcmember_rec *) event->attr; + ibv_sa_init_ah_from_mcmember(verbs, 1, rec, &ah_attr); + ah = ibv_create_ah(pd, &ah_attr); + if (!ah) { + printf("ibv_create_ah failed\n"); + ret = -1; + goto out3; + } + + ret = ibv_attach_mcast(qp, (union ibv_gid *) rec->mgid, + htons(rec->mlid)); + if (ret) { + printf("ibv_attach_mcast failed\n"); + goto out4; + } + + /* + * Pause to give SM chance to configure switches. We don't want to + * handle reliability issue in this simple test program. 
+ */ + sleep(2); + + if (sender) { + printf("initiating data transfers\n"); + ret = send_msgs(ah); + sleep(1); + } else { + printf("receiving data transfers\n"); + ret = post_recvs(); + if (!ret) + ret = poll_cq(); + } + printf("data transfers complete\n"); + + ibv_detach_mcast(qp, (union ibv_gid *) rec->mgid, htons(rec->mlid)); +out4: + ibv_destroy_ah(ah); +out3: + ibv_sa_ack_event(event); +out2: + ibv_sa_free_multicast(mcast); +out1: + ibv_sa_destroy_event_channel(channel); + return ret; +} + +int main(int argc, char **argv) +{ + int ret; + + if (argc != 2) { + printf("usage: %s {r | s}\n", argv[0]); + exit(1); + } + + sender = (argv[1][0] == 's'); + if (init()) + exit(1); + + ret = run(); + + printf("test complete\n"); + cleanup(); + printf("return status %d\n", ret); + return ret; +} Property changes on: libibsa/examples/mchammer.c ___________________________________________________________________ Name: svn:executable + * From sean.hefty at intel.com Thu Aug 24 17:10:00 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 24 Aug 2006 17:10:00 -0700 Subject: [openib-general] [PATCH v2] ib_usa: support userspace SA queries and multicast In-Reply-To: <000801c6c581$a8381aa0$8698070a@amr.corp.intel.com> Message-ID: <000701c6c7da$ce4e3e30$ff0da8c0@amr.corp.intel.com> Changes from v1: The ib_usa module exports two files: ib_usa_default and ib_usa_raw. Use of the ib_usa_default restricts the user to sending PathRecord, MultiPathRecord, MCMemberRecord, and ServiceRecord queries, and joining / leaving multicast groups. Use of ib_usa_raw allows any MADs to be sent to the SA. An administrator can set control on these files in any appropriate way. Signed-off-by: Sean Hefty --- Index: include/rdma/ib_usa.h =================================================================== --- include/rdma/ib_usa.h (revision 0) +++ include/rdma/ib_usa.h (revision 0) @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef IB_USA_H +#define IB_USA_H + +#include +#include + +#define IB_USA_ABI_VERSION 1 + +#define IB_USA_EVENT_DATA 256 + +enum { + IB_USA_CMD_SEND_MAD, + IB_USA_CMD_GET_EVENT, + IB_USA_CMD_GET_DATA, + IB_USA_CMD_JOIN_MCAST, + IB_USA_CMD_FREE_ID, + IB_USA_CMD_GET_MCAST +}; + +enum { + IB_USA_EVENT_MAD, + IB_USA_EVENT_MCAST +}; + +struct ib_usa_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct ib_usa_send_mad { + __u64 response; /* unused - reserved */ + __u64 uid; + __u64 node_guid; + __u64 comp_mask; + __u64 attr; + __u8 port_num; + __u8 method; + __be16 attr_id; + __u32 timeout_ms; + __u32 retries; +}; + +struct ib_usa_join_mcast { + __u64 response; + __u64 uid; + __u64 node_guid; + __u64 comp_mask; + __u64 mcmember_rec; + __u8 port_num; +}; + +struct ib_usa_id_resp { + __u32 id; +}; + +struct ib_usa_free_resp { + __u32 events_reported; +}; + +struct ib_usa_free_id { + __u64 response; + __u32 id; +}; + +struct ib_usa_get_event { + __u64 response; +}; + +struct ib_usa_event_resp { + __u64 uid; + __u32 id; + __u32 event; + __u32 status; + __u32 data_len; + __u8 data[IB_USA_EVENT_DATA]; +}; + +struct ib_usa_get_data { + __u64 response; + __u32 id; +}; + +struct ib_usa_get_mcast { + __u64 response; + __u64 node_guid; + __u8 mgid[16]; + __u8 port_num; +}; + +#endif /* IB_USA_H */ Index: core/usa.c =================================================================== --- core/usa.c (revision 0) +++ core/usa.c (revision 0) @@ -0,0 +1,846 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("IB userspace SA query"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void usa_add_one(struct ib_device *device); +static void usa_remove_one(struct ib_device *device); + +static struct ib_client usa_client = { + .name = "ib_usa", + .add = usa_add_one, + .remove = usa_remove_one +}; + +struct usa_device { + struct list_head list; + struct ib_device *device; + struct completion comp; + atomic_t refcount; + int start_port; + int end_port; +}; + +struct usa_file { + struct mutex file_mutex; + struct file *filp; + struct ib_sa_client sa_client; + struct list_head event_list; + struct list_head data_list; + struct list_head mcast_list; + wait_queue_head_t poll_wait; + int event_id; +}; + +struct usa_event { + struct usa_file *file; + struct list_head list; + struct ib_usa_event_resp resp; + struct ib_mad_recv_wc *mad_recv_wc; +}; + +struct usa_multicast { + struct usa_event event; + struct list_head list; + struct ib_multicast *multicast; + int events_reported; +}; + +static DEFINE_MUTEX(usa_mutex); +static LIST_HEAD(dev_list); +static DEFINE_IDR(usa_idr); + +static struct usa_device *acquire_dev(__be64 guid, __u8 port_num) +{ + struct usa_device *dev; + + mutex_lock(&usa_mutex); + list_for_each_entry(dev, &dev_list, list) { + if (dev->device->node_guid == guid) { + if (port_num < dev->start_port || + port_num > dev->end_port) + break; + atomic_inc(&dev->refcount); + mutex_unlock(&usa_mutex); + return dev; + } + } + mutex_unlock(&usa_mutex); + return NULL; +} + +static void deref_dev(struct usa_device *dev) +{ + if (atomic_dec_and_test(&dev->refcount)) + complete(&dev->comp); +} + +static int insert_obj(void *obj, int *id) +{ + int ret; + + do { + ret = idr_pre_get(&usa_idr, GFP_KERNEL); + if (!ret) + break; + + mutex_lock(&usa_mutex); + ret = idr_get_new(&usa_idr, obj, id); + mutex_unlock(&usa_mutex); + } while (ret == -EAGAIN); + + return ret; +} + +static void remove_obj(int id) +{ + mutex_lock(&usa_mutex); + idr_remove(&usa_idr, id); + mutex_unlock(&usa_mutex); +} + +static void finish_event(struct usa_event *event) +{ + struct usa_multicast *mcast; + + switch (event->resp.event) { + case IB_USA_EVENT_MAD: + list_del(&event->list); + if (event->resp.data_len > IB_USA_EVENT_DATA) + list_add_tail(&event->list, &event->file->data_list); + else + kfree(event); + break; + case IB_USA_EVENT_MCAST: + list_del_init(&event->list); + mcast = container_of(event, struct usa_multicast, event); + mcast->events_reported++; + break; + default: + break; + } +} + +static ssize_t usa_get_event(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_usa_get_event cmd; + struct usa_event *event; + int ret = 0; + DEFINE_WAIT(wait); + + if (out_len < sizeof(struct ib_usa_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + mutex_lock(&file->file_mutex); + while (list_empty(&file->event_list)) { + if (file->filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + break; + } + + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + mutex_unlock(&file->file_mutex); + schedule(); + mutex_lock(&file->file_mutex); + finish_wait(&file->poll_wait, &wait); + } + + if (ret) + goto done; + + event = list_entry(file->event_list.next, struct usa_event, list); + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &event->resp, 
sizeof(event->resp))) { + ret = -EFAULT; + goto done; + } + + finish_event(event); +done: + mutex_unlock(&file->file_mutex); + return ret; +} + +static struct usa_event *get_event_data(struct usa_file *file, __u32 id) +{ + struct usa_event *event; + + mutex_lock(&file->file_mutex); + list_for_each_entry(event, &file->data_list, list) { + if (event->resp.id == id) { + list_del(&event->list); + mutex_unlock(&file->file_mutex); + return event; + } + } + mutex_unlock(&file->file_mutex); + return NULL; +} + +static int copy_event_data(struct usa_event *event, __u64 response) +{ + struct ib_sa_mad *mad; + struct ib_sa_iter *iter; + int attr_offset, ret = 0; + void *attr; + + mad = (struct ib_sa_mad *) event->mad_recv_wc->recv_buf.mad; + attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + + iter = ib_sa_iter_create(event->mad_recv_wc); + while ((attr = ib_sa_iter_next(iter))) { + if (copy_to_user((void __user *) (unsigned long) response, + attr, attr_offset)) { + ret = -EFAULT; + break; + } + response += attr_offset; + } + + ib_sa_iter_free(iter); + return ret; +} + +static ssize_t usa_get_data(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_usa_get_data cmd; + struct usa_event *event; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + event = get_event_data(file, cmd.id); + if (!event) + return -EINVAL; + + if (out_len < event->resp.data_len) { + ret = -ENOSPC; + goto out; + } + + ret = copy_event_data(event, cmd.response); +out: + ib_free_recv_mad(event->mad_recv_wc); + kfree(event); + return ret; +} + +static void usa_req_handler(int status, struct ib_mad_recv_wc *mad_recv_wc, + void *context) +{ + struct usa_event *event = context; + + if (mad_recv_wc) { + event->resp.data_len = mad_recv_wc->mad_len; + + if (event->resp.data_len <= IB_USA_EVENT_DATA) { + memcpy(event->resp.data, mad_recv_wc->recv_buf.mad, + event->resp.data_len); + ib_free_recv_mad(mad_recv_wc); + } else { + event->mad_recv_wc = mad_recv_wc; + memcpy(event->resp.data, mad_recv_wc->recv_buf.mad, + IB_USA_EVENT_DATA); + } + } + + event->resp.status = status; + + mutex_lock(&event->file->file_mutex); + list_add_tail(&event->list, &event->file->event_list); + wake_up_interruptible(&event->file->poll_wait); + mutex_unlock(&event->file->file_mutex); +} + +static int send_mad(struct usa_file *file, struct ib_usa_send_mad *cmd) +{ + struct usa_device *dev; + struct usa_event *event; + struct ib_sa_query *query; + int attr_size, ret; + + attr_size = ib_sa_attr_size(cmd->attr_id); + if (!attr_size) + return -EINVAL; + + dev = acquire_dev(cmd->node_guid, cmd->port_num); + if (!dev) + return -ENODEV; + + event = kzalloc(sizeof *event, GFP_KERNEL); + if (!event) { + ret = -ENOMEM; + goto deref; + } + + if (copy_from_user(event->resp.data, + (void __user *) (unsigned long) cmd->attr, + attr_size)) { + ret = -EFAULT; + goto free; + } + + event->file = file; + event->resp.event = IB_USA_EVENT_MAD; + event->resp.uid = cmd->uid; + + mutex_lock(&file->file_mutex); + event->resp.id = file->event_id++; + mutex_unlock(&file->file_mutex); + + ret = ib_sa_send_mad(&file->sa_client, dev->device, cmd->port_num, + cmd->method, event->resp.data, cmd->attr_id, + (ib_sa_comp_mask) cmd->comp_mask, + cmd->timeout_ms, cmd->retries, GFP_KERNEL, + usa_req_handler, event, &query); + if (ret < 0) + goto free; + + deref_dev(dev); + return 0; +free: + kfree(event); +deref: + deref_dev(dev); + return ret; +} + +static ssize_t usa_send_mad(struct usa_file *file, const char __user 
*inbuf, + int in_len, int out_len) +{ + struct ib_usa_send_mad cmd; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + return send_mad(file, &cmd); +} + +static ssize_t usa_query(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_usa_send_mad cmd; + uint16_t attr_id; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + attr_id = be16_to_cpu(cmd.attr_id); + + switch (cmd.method) { + case IB_MGMT_METHOD_GET: + case IB_SA_METHOD_GET_TABLE: + switch (attr_id) { + case IB_SA_ATTR_PATH_REC: + case IB_SA_ATTR_MC_MEMBER_REC: + case IB_SA_ATTR_SERVICE_REC: + break; + default: + return -EINVAL; + } + break; + case IB_SA_METHOD_GET_MULTI: + if (attr_id != IB_SA_ATTR_MULTI_PATH_REC) + return -EINVAL; + break; + default: + return -EINVAL; + } + + return send_mad(file, &cmd); +} + +/* + * We can get up to two events for a single multicast member. A second event + * only occurs if there's an error on an existing multicast membership. + * Report only the last event. + */ +static int multicast_handler(int status, struct ib_multicast *multicast) +{ + struct usa_multicast *mcast = multicast->context; + + if (!status) { + mcast->event.resp.data_len = IB_SA_ATTR_MC_MEMBER_REC_LEN; + ib_sa_pack_attr(mcast->event.resp.data, &multicast->rec, + IB_SA_ATTR_MC_MEMBER_REC); + } + + mutex_lock(&mcast->event.file->file_mutex); + mcast->event.resp.status = status; + + list_del(&mcast->event.list); + list_add_tail(&mcast->event.list, &mcast->event.file->event_list); + wake_up_interruptible(&mcast->event.file->poll_wait); + mutex_unlock(&mcast->event.file->file_mutex); + return 0; +} + +static ssize_t usa_join_mcast(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct usa_device *dev; + struct usa_multicast *mcast; + struct ib_usa_join_mcast cmd; + struct ib_usa_id_resp resp; + struct ib_sa_mcmember_rec rec; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + dev = acquire_dev(cmd.node_guid, cmd.port_num); + if (!dev) + return -ENODEV; + + mcast = kzalloc(sizeof *mcast, GFP_KERNEL); + if (!mcast) { + ret = -ENOMEM; + goto deref; + } + INIT_LIST_HEAD(&mcast->event.list); + mcast->event.file = file; + mcast->event.resp.event = IB_USA_EVENT_MCAST; + mcast->event.resp.uid = cmd.uid; + + ret = insert_obj(mcast, &mcast->event.resp.id); + if (ret) + goto free; + + resp.id = mcast->event.resp.id; + + mutex_lock(&file->file_mutex); + list_add_tail(&mcast->list, &file->mcast_list); + mutex_unlock(&file->file_mutex); + + if (copy_from_user(mcast->event.resp.data, + (void __user *) (unsigned long) cmd.mcmember_rec, + IB_SA_ATTR_MC_MEMBER_REC_LEN)) { + ret = -EFAULT; + goto remove; + } + + ib_sa_unpack_attr(&rec, mcast->event.resp.data, + IB_SA_ATTR_MC_MEMBER_REC); + mcast->multicast = ib_join_multicast(dev->device, cmd.port_num, &rec, + (ib_sa_comp_mask) cmd.comp_mask, + GFP_KERNEL, multicast_handler, + mcast); + if (IS_ERR(mcast->multicast)) { + ret = PTR_ERR(mcast->multicast); + goto remove; + } + + deref_dev(dev); + return 0; +remove: + mutex_lock(&file->file_mutex); + list_del(&mcast->list); + mutex_unlock(&file->file_mutex); + remove_obj(mcast->event.resp.id); +free: + kfree(mcast); +deref: + deref_dev(dev); + return ret; +} + +static ssize_t usa_free_id(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_usa_free_id cmd; + struct ib_usa_free_resp resp; + struct usa_multicast *mcast; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) 
+ return -EFAULT; + + mutex_lock(&usa_mutex); + mcast = idr_find(&usa_idr, cmd.id); + if (!mcast) + mcast = ERR_PTR(-ENOENT); + else if (mcast->event.file != file) + mcast = ERR_PTR(-EINVAL); + else + idr_remove(&usa_idr, mcast->event.resp.id); + mutex_unlock(&usa_mutex); + + if (IS_ERR(mcast)) + return PTR_ERR(mcast); + + ib_free_multicast(mcast->multicast); + mutex_lock(&file->file_mutex); + list_del(&mcast->list); + mutex_unlock(&file->file_mutex); + + resp.events_reported = mcast->events_reported; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; + + kfree(mcast); + return ret; +} + +static ssize_t usa_get_mcast(struct usa_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct usa_device *dev; + struct ib_usa_get_mcast cmd; + struct ib_sa_mcmember_rec rec; + u8 mcmember_rec[IB_SA_ATTR_MC_MEMBER_REC_LEN]; + int ret; + + if (out_len < sizeof(IB_SA_ATTR_MC_MEMBER_REC_LEN)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + dev = acquire_dev(cmd.node_guid, cmd.port_num); + if (!dev) + return -ENODEV; + + ret = ib_get_mcmember_rec(dev->device, cmd.port_num, + (union ib_gid *) cmd.mgid, &rec); + if (!ret) { + ib_sa_pack_attr(mcmember_rec, &rec, IB_SA_ATTR_MC_MEMBER_REC); + if (copy_to_user((void __user *) (unsigned long) cmd.response, + mcmember_rec, IB_SA_ATTR_MC_MEMBER_REC_LEN)) + ret = -EFAULT; + } + + deref_dev(dev); + return ret; +} + +static ssize_t (*usa_cmd_table[])(struct usa_file *file, + const char __user *inbuf, + int in_len, int out_len) = { + [IB_USA_CMD_SEND_MAD] = usa_query, /* Limited queries by default */ + [IB_USA_CMD_GET_EVENT] = usa_get_event, + [IB_USA_CMD_GET_DATA] = usa_get_data, + [IB_USA_CMD_JOIN_MCAST] = usa_join_mcast, + [IB_USA_CMD_FREE_ID] = usa_free_id, + [IB_USA_CMD_GET_MCAST] = usa_get_mcast, +}; + +static ssize_t usa_raw_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct usa_file *file = filp->private_data; + struct ib_usa_cmd_hdr hdr; + ssize_t ret; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(usa_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + if (hdr.cmd == IB_USA_CMD_SEND_MAD) + ret = usa_send_mad(file, buf + sizeof(hdr), hdr.in, hdr.out); + else + ret = usa_cmd_table[hdr.cmd](file, buf + sizeof(hdr), + hdr.in, hdr.out); + if (!ret) + ret = len; + + return ret; +} + +static ssize_t usa_default_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct usa_file *file = filp->private_data; + struct ib_usa_cmd_hdr hdr; + ssize_t ret; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(usa_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + ret = usa_cmd_table[hdr.cmd](file, buf + sizeof(hdr), hdr.in, hdr.out); + if (!ret) + ret = len; + + return ret; +} + +static unsigned int usa_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct usa_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + if (!list_empty(&file->event_list)) + mask = POLLIN | POLLRDNORM; + + return mask; +} + +static int usa_open(struct inode *inode, struct file *filp) +{ + struct usa_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + 
return -ENOMEM; + + ib_sa_register_client(&file->sa_client); + + INIT_LIST_HEAD(&file->event_list); + INIT_LIST_HEAD(&file->data_list); + INIT_LIST_HEAD(&file->mcast_list); + init_waitqueue_head(&file->poll_wait); + mutex_init(&file->file_mutex); + + filp->private_data = file; + file->filp = filp; + return 0; +} + +static void cleanup_events(struct list_head *list) +{ + struct usa_event *event; + + while (!list_empty(list)) { + event = list_entry(list->next, struct usa_event, list); + list_del(&event->list); + + if (event->mad_recv_wc) + ib_free_recv_mad(event->mad_recv_wc); + + kfree(event); + } +} + +static void cleanup_mcast(struct usa_file *file) +{ + struct usa_multicast *mcast; + + while (!list_empty(&file->mcast_list)) { + mcast = list_entry(file->mcast_list.next, + struct usa_multicast, list); + list_del(&mcast->list); + + remove_obj(mcast->event.resp.id); + + ib_free_multicast(mcast->multicast); + + /* + * Other members may still be generating events, so we need + * to lock the event list to avoid corrupting it. + */ + mutex_lock(&file->file_mutex); + list_del(&mcast->event.list); + mutex_unlock(&file->file_mutex); + + kfree(mcast); + } +} + +static int usa_close(struct inode *inode, struct file *filp) +{ + struct usa_file *file = filp->private_data; + + ib_sa_unregister_client(&file->sa_client); + cleanup_mcast(file); + + cleanup_events(&file->event_list); + cleanup_events(&file->data_list); + kfree(file); + return 0; +} + +static void usa_add_one(struct ib_device *device) +{ + struct usa_device *dev; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + dev = kmalloc(sizeof *dev, GFP_KERNEL); + if (!dev) + return; + + dev->device = device; + if (device->node_type == RDMA_NODE_IB_SWITCH) + dev->start_port = dev->end_port = 0; + else { + dev->start_port = 1; + dev->end_port = device->phys_port_cnt; + } + + init_completion(&dev->comp); + atomic_set(&dev->refcount, 1); + ib_set_client_data(device, &usa_client, dev); + + mutex_lock(&usa_mutex); + list_add_tail(&dev->list, &dev_list); + mutex_unlock(&usa_mutex); +} + +static void usa_remove_one(struct ib_device *device) +{ + struct usa_device *dev; + + dev = ib_get_client_data(device, &usa_client); + if (!dev) + return; + + mutex_lock(&usa_mutex); + list_del(&dev->list); + mutex_unlock(&usa_mutex); + + deref_dev(dev); + wait_for_completion(&dev->comp); + kfree(dev); +} + +static struct file_operations usa_raw_fops = { + .owner = THIS_MODULE, + .open = usa_open, + .release = usa_close, + .write = usa_raw_write, + .poll = usa_poll, +}; + +static struct miscdevice usa_raw_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "ib_usa_raw", + .fops = &usa_raw_fops, +}; + +static struct file_operations usa_default_fops = { + .owner = THIS_MODULE, + .open = usa_open, + .release = usa_close, + .write = usa_default_write, + .poll = usa_poll, +}; + +static struct miscdevice usa_default_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "ib_usa_default", + .fops = &usa_default_fops, +}; + +static ssize_t show_abi_version(struct class_device *class_dev, char *buf) +{ + return sprintf(buf, "%d\n", IB_USA_ABI_VERSION); +} +static CLASS_DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static int __init usa_init(void) +{ + int ret; + + ret = misc_register(&usa_raw_misc); + if (ret) + return ret; + + ret = misc_register(&usa_default_misc); + if (ret) + goto err1; + + ret = class_device_create_file(usa_default_misc.class, + &class_device_attr_abi_version); + if (ret) + goto err2; + + ret = 
ib_register_client(&usa_client); + if (ret) + goto err3; + return 0; + +err3: + class_device_remove_file(usa_default_misc.class, + &class_device_attr_abi_version); +err2: + misc_deregister(&usa_default_misc); +err1: + misc_deregister(&usa_raw_misc); + return ret; +} + +static void __exit usa_cleanup(void) +{ + ib_unregister_client(&usa_client); + class_device_remove_file(usa_default_misc.class, + &class_device_attr_abi_version); + misc_deregister(&usa_default_misc); + misc_deregister(&usa_raw_misc); + idr_destroy(&usa_idr); +} + +module_init(usa_init); +module_exit(usa_cleanup); Index: Kconfig =================================================================== --- Kconfig (revision 9096) +++ Kconfig (working copy) @@ -17,15 +17,15 @@ config INFINIBAND_USER_MAD need libibumad from . config INFINIBAND_USER_ACCESS - tristate "InfiniBand userspace access (verbs and CM)" + tristate "InfiniBand userspace access (verbs, CM, SA client)" depends on INFINIBAND ---help--- Userspace InfiniBand access support. This enables the - kernel side of userspace verbs and the userspace - communication manager (CM). This allows userspace processes - to set up connections and directly access InfiniBand + kernel side of userspace verbs, the userspace communication + manager (CM), and userspace SA client. This allows userspace + processes to set up connections and directly access InfiniBand hardware for fast-path operations. You will also need - libibverbs, libibcm and a hardware driver library from + libibverbs, libibcm, libibsa, and a hardware driver library from . config INFINIBAND_ADDR_TRANS Index: core/Makefile =================================================================== --- core/Makefile (revision 9096) +++ core/Makefile (working copy) @@ -7,7 +7,8 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o i ib_sa.o $(infiniband-y) \ findex.o ib_multicast.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o -obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o $(user_access-y) +obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_usa.o \ + $(user_access-y) findex-y := index.o @@ -39,3 +40,5 @@ ib_uverbs-y := uverbs_main.o uverbs_cm ib_ucm-y := ucm.o +ib_usa-y := usa.o + From Thomas.Talpey at netapp.com Thu Aug 24 19:03:07 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 24 Aug 2006 22:03:07 -0400 Subject: [openib-general] basic IB doubt In-Reply-To: References: <44EC825E.2030709@ichips.intel.com> <20060823230812.GD13187@greglaptop.t-mobile.de> Message-ID: <7.0.1.0.2.20060824215543.042604b0@netapp.com> At 07:46 PM 8/23/2006, Roland Dreier wrote: > Greg> Actually, that leads me to a question: does the vendor of > Greg> that adaptor say that this is actually safe? Just because > Greg> something behaves one way most of the time doesn't mean it > Greg> does it all of the time. So it it really smart to write > Greg> non-standard-conforming programs unless the vendor stands > Greg> behind that behavior? > >Yes, Mellanox documents that it is safe to rely on the last byte of an >RDMA being written last. How does an adapter guarantee that no bridges or other intervening devices reorder their writes, or for that matter flush them to memory at all!? Without signalling the host processor, that is. Isn't that what the dma_sync() API is all about? Tom. 
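For readers following the dma_sync() question above: on non-coherent platforms the Linux DMA mapping API is what makes device writes visible to the CPU, and it is driven from the completion path. Below is a minimal sketch of that pattern, assuming a driver that has DMA-mapped a receive buffer; the function name and parameters are hypothetical and not taken from any driver discussed in this thread.

/*
 * Sketch only: a driver's receive-completion path on a possibly
 * non-coherent platform.  Buffer ownership is passed to the CPU before
 * the upper layer reads the data, and back to the device before the
 * buffer is reposted.  (Hypothetical example, not code from this thread.)
 */
#include <linux/device.h>
#include <linux/dma-mapping.h>

static void example_recv_done(struct device *dev, dma_addr_t dma_addr,
                              size_t len)
{
        /* Make the device's writes visible to the CPU (cache invalidate). */
        dma_sync_single_for_cpu(dev, dma_addr, len, DMA_FROM_DEVICE);

        /* ... the upper layer may now safely read the len bytes just received ... */

        /* Hand the buffer back to the device before reposting it. */
        dma_sync_single_for_device(dev, dma_addr, len, DMA_FROM_DEVICE);
}

On cache-coherent systems these calls are cheap or no-ops; the question raised in this thread, whether bridges can reorder the device's writes on their way to memory, is separate from this CPU-cache synchronization.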
From zhushisongzhu at yahoo.com Thu Aug 24 22:01:17 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Thu, 24 Aug 2006 22:01:17 -0700 (PDT) Subject: [openib-general] about rdma and send/recv model In-Reply-To: Message-ID: <20060825050117.82683.qmail@web36904.mail.mud.yahoo.com> Although SDP is not working correctly now, it can easily support many existing applications based on Sockets API. Patrick say rdma is a connected model and has serious scalability problem. But Does send/recv model has or support SDP-alike API? zhu __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From sashak at voltaire.com Fri Aug 25 06:17:35 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 25 Aug 2006 16:17:35 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: rpc API which supports more than one ports. Message-ID: <20060825131734.3786.74359.stgit@sashak.voltaire.com> This provides RPC like API which may work with several ports. Signed-off-by: Sasha Khapyorsky --- libibmad/include/infiniband/mad.h | 9 +++ libibmad/src/libibmad.map | 4 + libibmad/src/register.c | 20 +++++-- libibmad/src/rpc.c | 106 +++++++++++++++++++++++++++++++++++-- libibumad/src/umad.c | 4 + 5 files changed, 130 insertions(+), 13 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 45ff572..bd8a80b 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -660,6 +660,7 @@ uint64_t mad_trid(void); int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); /* register.c */ +int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); int mad_register_client(int mgmt, uint8_t rmpp_version); int mad_register_server(int mgmt, uint8_t rmpp_version, uint32_t method_mask[4], uint32_t class_oui); @@ -704,6 +705,14 @@ void madrpc_lock(void); void madrpc_unlock(void); void madrpc_show_errors(int set); +void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, + int num_classes); +void mad_rpc_close_port(void *ibmad_port); +void * mad_rpc(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, + void *payload, void *rcvdata); +void * mad_rpc_rmpp(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, + ib_rmpp_hdr_t *rmpp, void *data); + /* smp.c */ uint8_t * smp_query(void *buf, ib_portid_t *id, uint attrid, uint mod, uint timeout); diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index bf81bd1..78b7ff0 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -62,6 +62,10 @@ IBMAD_1.0 { ib_resolve_self; ib_resolve_smlid; ibdebug; + mad_rpc_open_port; + mad_rpc_close_port; + mad_rpc; + mad_rpc_rmpp; madrpc; madrpc_def_timeout; madrpc_init; diff --git a/libibmad/src/register.c b/libibmad/src/register.c index 4f44625..52d6989 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -43,6 +43,7 @@ #include #include #include #include +#include #include #include "mad.h" @@ -118,7 +119,7 @@ mad_agent_class(int agent) } int -mad_register_client(int mgmt, uint8_t rmpp_version) +mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version) { int vers, agent; @@ -126,7 +127,7 @@ mad_register_client(int mgmt, uint8_t rm DEBUG("Unknown class %d mgmt_class", mgmt); return -1; } - if ((agent = umad_register(madrpc_portid(), mgmt, + if ((agent = umad_register(port_id, mgmt, vers, rmpp_version, 0)) < 0) { DEBUG("Can't register agent for class 
%d", mgmt); return -1; @@ -137,13 +138,22 @@ mad_register_client(int mgmt, uint8_t rm return -1; } - if (register_agent(agent, mgmt) < 0) - return -1; - return agent; } int +mad_register_client(int mgmt, uint8_t rmpp_version) +{ + int agent; + + agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version); + if (agent < 0) + return agent; + + return register_agent(agent, mgmt); +} + +int mad_register_server(int mgmt, uint8_t rmpp_version, uint32_t method_mask[4], uint32_t class_oui) { diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index b2d3e77..ac4f361 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -48,6 +48,13 @@ #include #include #include "mad.h" +#define MAX_CLASS 256 + +struct ibmad_port { + int port_id; /* file descriptor returned by umad_open() */ + int class_agents[MAX_CLASS]; /* class2agent mapper */ +}; + int ibdebug; static int mad_portid = -1; @@ -105,7 +112,8 @@ madrpc_portid(void) } static int -_do_madrpc(void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) +_do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, + int timeout) { uint32_t trid; /* only low 32 bits */ int retries; @@ -133,7 +141,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i } length = len; - if (umad_send(mad_portid, agentid, sndbuf, length, timeout, 0) < 0) { + if (umad_send(port_id, agentid, sndbuf, length, timeout, 0) < 0) { IBWARN("send failed; %m"); return -1; } @@ -141,7 +149,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i /* Use same timeout on receive side just in case */ /* send packet is lost somewhere. */ do { - if (umad_recv(mad_portid, rcvbuf, &length, timeout) < 0) { + if (umad_recv(port_id, rcvbuf, &length, timeout) < 0) { IBWARN("recv failed: %m"); return -1; } @@ -164,8 +172,10 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i } void * -madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) +mad_rpc(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, void *payload, + void *rcvdata) { + struct ibmad_port *p = port_id; int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; @@ -175,7 +185,8 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) return 0; - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, + p->class_agents[rpc->mgtclass], len, rpc->timeout)) < 0) return 0; @@ -198,8 +209,10 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport } void * -madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) +mad_rpc_rmpp(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, + ib_rmpp_hdr_t *rmpp, void *data) { + struct ibmad_port *p = port_id; int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; @@ -210,7 +223,8 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) return 0; - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, + p->class_agents[rpc->mgtclass], len, rpc->timeout)) < 0) return 0; @@ -249,6 +263,24 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * return data; } +void * +madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) +{ + struct ibmad_port port; + port.port_id = mad_portid; + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); + return mad_rpc(&port, rpc, dport, payload, rcvdata); +} + +void * +madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) +{ + struct ibmad_port port; + port.port_id = 
mad_portid; + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); + return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); +} + static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; void @@ -282,3 +314,63 @@ madrpc_init(char *dev_name, int dev_port IBPANIC("client_register for mgmt %d failed", mgmt); } } + +void * +mad_rpc_open_port(char *dev_name, int dev_port, + int *mgmt_classes, int num_classes) +{ + struct ibmad_port *p; + int port_id; + + if (umad_init() < 0) { + IBWARN("can't init UMAD library"); + errno = ENODEV; + return NULL; + } + + p = malloc(sizeof(*p)); + if (!p) { + errno = ENOMEM; + return NULL; + } + memset(p, 0, sizeof(*p)); + + if ((port_id = umad_open_port(dev_name, dev_port)) < 0) { + IBWARN("can't open UMAD port (%s:%d)", dev_name, dev_port); + if (!errno) + errno = EIO; + free(p); + return NULL; + } + + while (num_classes--) { + int rmpp_version = 0; + int mgmt = *mgmt_classes++; + int agent; + + if (mgmt == IB_SA_CLASS) + rmpp_version = 1; + if (mgmt < 0 || mgmt >= MAX_CLASS || + (agent = mad_register_port_client(port_id, mgmt, + rmpp_version)) < 0) { + IBWARN("client_register for mgmt %d failed", mgmt); + if(!errno) + errno = EINVAL; + umad_close_port(port_id); + free(p); + return NULL; + } + p->class_agents[mgmt] = agent; + } + + p->port_id = port_id; + return p; +} + +void +mad_rpc_close_port(void *port_id) +{ + struct ibmad_port *p = port_id; + umad_close_port(p->port_id); + free(p); +} diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index a99fb5a..cb9eef6 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -93,12 +93,14 @@ port_alloc(int portid, char *dev, int po if (portid < 0 || portid >= UMAD_MAX_PORTS) { IBWARN("bad umad portid %d", portid); + errno = EINVAL; return 0; } if (port->dev_name[0]) { IBWARN("umad port id %d is already allocated for %s %d", portid, port->dev_name, port->dev_port); + errno = EBUSY; return 0; } @@ -567,7 +569,7 @@ umad_open_port(char *ca_name, int portnu return -EINVAL; if (!(port = port_alloc(umad_id, ca_name, portnum))) - return -EINVAL; + return -errno; snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", UMAD_DEV_DIR , umad_id); From tbachman at annapmicro.com Fri Aug 25 07:00:50 2006 From: tbachman at annapmicro.com (Thomas Bachman) Date: Fri, 25 Aug 2006 10:00:50 -0400 Subject: [openib-general] basic IB doubt In-Reply-To: <20060824221833.GD3670@greglaptop.hotels-on-air.de> Message-ID: <000001c6c84e$e7cfd700$4e019e84@bullseye> >On Thu, Aug 24, 2006 at 03:13:33PM -0700, Woodruff, Robert J wrote: >> If the feature gives them a huge advantage in performance (and it >> does) and all of the hardware vendors that they deal with already >> implement it, then yes, they will force, by defacto standard that >> all other newcomers implement it or face the fact that no one will >> buy their hardware. It seems like that is what is happening in this >> case. > >In this case the feature reduces performance on one HCA and increases >it on another. Which shows why it's a bad idea to pick features based >on a single implementation. > > But you're still confusing practicality and theory. I can see why it's > pratical sense for newcomers to implement this new, performance- > reducing feature. But why is it theoretically good? And shouldn't it > be added to the standard, before all the poor iWarp people discover > the hard way that they need it? > > -- greg Not that I have any stance on this issue, but is this is the text in the spec that is being debated? 
(page 269, section 9.5, Transaction Ordering):

"An application shall not depend upon the order of data writes to memory
within a message. For example, if an application sets up data buffers that
overlap, for separate data segments within a message, it is not guaranteed
that the last sent data will always overwrite the earlier."

I'm assuming that the spec authors had a reason for putting this in there, so
maybe they could provide guidance here? Or was this only meant to apply to
SENDs, and not RDMA WRITEs?

Cheers,

-Thomas Bachman

From sashak at voltaire.com Fri Aug 25 07:17:04 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 25 Aug 2006 17:17:04 +0300
Subject: [openib-general] [PATCH] osm: handle local events
In-Reply-To: <20060824132848.GI8192@mellanox.co.il>
References: <20060824132848.GI8192@mellanox.co.il>
Message-ID: <20060825141704.GA3867@sashak.voltaire.com>

On 16:28 Thu 24 Aug , Michael S. Tsirkin wrote:
> Quoting r. Yevgeny Kliteynik :
> > Index: libvendor/osm_vendor_ibumad.c
> > ===================================================================
> > --- libvendor/osm_vendor_ibumad.c (revision 8998)
> > +++ libvendor/osm_vendor_ibumad.c (working copy)
> > @@ -72,6 +72,7 @@
> > #include
> > #include
> > #include
> > +#include
> >
> > /****s* OpenSM: Vendor AL/osm_umad_bind_info_t
> > * NAME
>
> NAK.
>
> This means that the SM becomes dependent on the uverbs module. I don't think
> this is a good idea. Let's not go there - SM should depend just on the umad
> module and libc.

Agreed on this point. I dislike this new libibverbs dependency too. I think
we need to work with umad.

So a more generic question: when some application performs a blocking read()
from /dev/umadN, should this read() be interrupted and return an error (with
an appropriate errno value) when the port state becomes DOWN? I think yes, it
should. Other opinions? Sean?

And if yes, then in OpenSM we will just need to check the errno value upon
umad_recv() failure.

Sasha

> In particular, SM should work even on embedded platforms where
> uverbs do not necessarily work.
>
> Further, hotplug events still do not seem to be handled, even with this patch.
>
> For port events, it seems sane that umad module could provide a way
> to listen for them.
>
> A recent patch to mthca converts fatal events to hotplug events, so fatal events
> can and should be handled as part of general hotplug support.
>
> --
> MST

From greg.lindahl at qlogic.com Fri Aug 25 07:58:10 2006
From: greg.lindahl at qlogic.com (Greg Lindahl)
Date: Fri, 25 Aug 2006 07:58:10 -0700
Subject: [openib-general] [PATCH] osm: handle local events
In-Reply-To: <20060825141704.GA3867@sashak.voltaire.com>
References: <20060824132848.GI8192@mellanox.co.il> <20060825141704.GA3867@sashak.voltaire.com>
Message-ID: <20060825145810.GD8380@greglaptop.hotels-on-air.de>

On Fri, Aug 25, 2006 at 05:17:04PM +0300, Sasha Khapyorsky wrote:
> So a more generic question: when some application performs a blocking read()
> from /dev/umadN, should this read() be interrupted and return an error (with
> an appropriate errno value) when the port state becomes DOWN?

If the SM gets a signal (alarm timeout) and the read() is interrupted with
errno=EINTR... presumably this is not the case you had in mind.
-- greg From jlentini at netapp.com Fri Aug 25 07:57:06 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 25 Aug 2006 10:57:06 -0400 (EDT) Subject: [openib-general] drop mthca from svn? (was: Rollup patch for ipath and OFED) In-Reply-To: <1156445387.17908.6.camel@chalcedony.pathscale.com> References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> Message-ID: On Thu, 24 Aug 2006, Bryan O'Sullivan wrote: > On Thu, 2006-08-24 at 09:31 -0700, Roland Dreier wrote: > > > Along those lines, how would people feel if I removed the mthca kernel > > code from svn, and just maintained mthca in kernel.org git trees? > > +1 from me. We'll drop the ipath code, too. I'm concerned about the licensing implications on moving the code. Most of the source code hosted on kernel.org is GPL-only (the sparse repository is the only one I know of that is not). If the code is moved, how can the OpenFabrics community be guaranteed that the entire software stack will remain under a dual BSD/GPL license? If the only goal is to use git, git can be setup on an OpenFabrics server. From tom at opengridcomputing.com Fri Aug 25 08:13:01 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 25 Aug 2006 10:13:01 -0500 Subject: [openib-general] A critique of RDMA PUT/GET in HPC In-Reply-To: <20060824225343.GD3927@greglaptop.hotels-on-air.de> References: <20060824225343.GD3927@greglaptop.hotels-on-air.de> Message-ID: <1156518781.25769.22.camel@trinity.ogc.int> On Thu, 2006-08-24 at 15:53 -0700, Greg Lindahl wrote: > For those of you interested in this topic, there's an interesting > article by Patrick Geoffrey in HPCWire entitled "A Critique of RDMA". > > http://www.hpcwire.com/hpc/815242.html > > (you might have to be a subscriber, but I'm sure Patrick would send > you a copy if you ask.) > > It's basically a critique of why SEND/RECV is better for MPI > implementations than PUT/GET. He does say this, but his analysis does not support this conclusion. His analysis revolves around MPI send/recv, not the MPI 2.0 get/put services. He makes the point (true in my opinion) that the MPI_RECV 64bit (tag,communicator) filter make MPI_RECV prickly to implement on IB/iWARP SEND/RECV and IB/iWARP RDMA. His data are drawn from observations of MPI applications that use MPI send/recv mapped to an RDMA transport. However, his conclusion covers a programming model (MPI get/put) that is not observed in the data. In other words, he doesn't compare the performance of an algorithm implemented using MPI send/recv vs. the same algorithm implemented using MPI get/put. He evaluates the performance of an algorithm implemented using MPI send/recv mapped to an RDMA transport and then says because this mapping has problems that the RDMA programming model is bad. That conclusion is not supported by his analysis or his data. A valid conclusion IMO is that "MPI send/recv can be most efficiently implemented over an unconnected reliable datagram protocol that supports 64bit tag matching at the data sink." And not coincidentally, Myricom has this ;-) I DO agree that it is interesting reading. :-), it's definitely got people fired up. My 2 cents. > > Even if you don't agree with him, it's good reading. For motivation, > you might want to note that most of the SEND/RECV-based products > mentioned achieve better MPI 0-byte latency than IB Verbs-based MPI > implementations. 
> > While I don't agree with everything Patrick says, this does get back > to my point that I've run into many people who assume that PUT/GET is > always the right way to do things. And it isn't. > > -- greg > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From greg.lindahl at qlogic.com Fri Aug 25 08:56:53 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri, 25 Aug 2006 08:56:53 -0700 Subject: [openib-general] A critique of RDMA PUT/GET in HPC In-Reply-To: <1156518781.25769.22.camel@trinity.ogc.int> References: <20060824225343.GD3927@greglaptop.hotels-on-air.de> <1156518781.25769.22.camel@trinity.ogc.int> Message-ID: <20060825155652.GG8380@greglaptop.hotels-on-air.de> On Fri, Aug 25, 2006 at 10:13:01AM -0500, Tom Tucker wrote: > He does say this, but his analysis does not support this conclusion. His > analysis revolves around MPI send/recv, not the MPI 2.0 get/put > services. Nobody uses MPI put/get anyway, so leaving out analyzing that doesn't change reality much. > A valid conclusion IMO is that "MPI send/recv can > be most efficiently implemented over an unconnected reliable datagram > protocol that supports 64bit tag matching at the data sink." And not > coincidentally, Myricom has this ;-) As do all of the non-VIA-family interconnects he mentions. Since "we" all landed on the same conclusion, you might think we're on to something. Or not. However, that's only part of the argument. Another part is that the buffer space needed to use RDMA put/get for all data links is huge. And there are some other interesting points. > I DO agree that it is interesting reading. :-), it's definitely got > people fired up. Heh. Glad you found it interesting. -- greg From rdreier at cisco.com Fri Aug 25 09:15:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 25 Aug 2006 09:15:44 -0700 Subject: [openib-general] drop mthca from svn? In-Reply-To: (James Lentini's message of "Fri, 25 Aug 2006 10:57:06 -0400 (EDT)") References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> Message-ID: James> I'm concerned about the licensing implications on moving James> the code. Most of the source code hosted on kernel.org is James> GPL-only (the sparse repository is the only one I know of James> that is not). I have no plans to change any of the copyright notices on the code. I see quite a few other dual licensed drivers in the Linux kernel tree so I don't see why the current status quo would be a problem. James> If the code is moved, how can the OpenFabrics community be James> guaranteed that the entire software stack will remain under James> a dual BSD/GPL license? You can't guarantee that someone won't come along and write some IB driver and get it merged upstream without a BSD license. So there's not much we can do anyway. - R. 
From rdreier at cisco.com Fri Aug 25 09:33:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 25 Aug 2006 09:33:31 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <7.0.1.0.2.20060824215543.042604b0@netapp.com> (Thomas Talpey's message of "Thu, 24 Aug 2006 22:03:07 -0400") References: <44EC825E.2030709@ichips.intel.com> <20060823230812.GD13187@greglaptop.t-mobile.de> <7.0.1.0.2.20060824215543.042604b0@netapp.com> Message-ID: Thomas> How does an adapter guarantee that no bridges or other Thomas> intervening devices reorder their writes, or for that Thomas> matter flush them to memory at all!? That's a good point. The HCA would have to do a read to flush the posted writes, and I'm sure it's not doing that (since it would add horrible latency for no good reason). I guess it's not safe to rely on ordering of RDMA writes after all. - R. From sean.hefty at intel.com Fri Aug 25 09:40:54 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 25 Aug 2006 09:40:54 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: Message-ID: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com> > Thomas> How does an adapter guarantee that no bridges or other > Thomas> intervening devices reorder their writes, or for that > Thomas> matter flush them to memory at all!? > >That's a good point. The HCA would have to do a read to flush the >posted writes, and I'm sure it's not doing that (since it would add >horrible latency for no good reason). > >I guess it's not safe to rely on ordering of RDMA writes after all. Couldn't the same point then be made that a CQ entry may come before the data has been posted? - Sean From caitlinb at broadcom.com Fri Aug 25 09:44:48 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 25 Aug 2006 09:44:48 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: Message-ID: <54AD0F12E08D1541B826BE97C98F99F189EB41@NT-SJCA-0751.brcm.ad.broadcom.com> Woodruff, Robert J wrote: > Catlin wrote, > >> For iWARP there are network performance reasons why in-order memory >> writes will never be guaranteed. > > For iWarp, or any other RDMA over Ethernet protocol, the > behavior is not to guarantee all packets are written > in-order, just that the last byte of the last packet is > written last. This can easily be implemented in an iWarp card > or by the driver with minimal performance impact in most cases. > > So for example, if the last packet arrives before all the > other packets have arrived, the iWarp card or driver does not > place that data of the last packet until all the other > packets have arrived. > > woody Another point, even if a vendor were to implement the firmware you suggest, how does the Data Source know that it is safe to use just RDMA Writes? The enabling firmware is in the Data Sink. Applications certainly do not want to have to validate the model of the RNIC that they are connected with. From caitlinb at broadcom.com Fri Aug 25 09:50:56 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 25 Aug 2006 09:50:56 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com> Message-ID: <54AD0F12E08D1541B826BE97C98F99F189EB44@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: >> Thomas> How does an adapter guarantee that no bridges or other >> Thomas> intervening devices reorder their writes, or for that >> Thomas> matter flush them to memory at all!? >> >> That's a good point. 
The HCA would have to do a read to flush the
>> posted writes, and I'm sure it's not doing that (since it would add
>> horrible latency for no good reason).
>>
>> I guess it's not safe to rely on ordering of RDMA writes after all.
>
> Couldn't the same point then be made that a CQ entry may come
> before the data has been posted?
>

That's why both specs (IBTA and RDMAC) are very explicit that all prior
messages are complete before the CQE is given to the user. It is up to the
RDMA Device and/or its driver to guarantee this by whatever means are
appropriate. An implementation that allows a CQE post to pass the data
placement that it is reporting on the PCI bus is in error.

The critical concept of the Work Completion is that it consolidates
guarantees and notifications. The implementation can do all sorts of strange
things that it thinks optimize *before* the work completion, but at the time
the work completion is delivered to the user everything is supposed to be as
expected.

From Thomas.Talpey at netapp.com Fri Aug 25 09:51:20 2006
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Fri, 25 Aug 2006 12:51:20 -0400
Subject: [openib-general] basic IB doubt
In-Reply-To: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com>
References: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com>
Message-ID: <7.0.1.0.2.20060825124752.0464b7c8@netapp.com>

At 12:40 PM 8/25/2006, Sean Hefty wrote:
>> Thomas> How does an adapter guarantee that no bridges or other
>> Thomas> intervening devices reorder their writes, or for that
>> Thomas> matter flush them to memory at all!?
>>
>>That's a good point. The HCA would have to do a read to flush the
>>posted writes, and I'm sure it's not doing that (since it would add
>>horrible latency for no good reason).
>>
>>I guess it's not safe to rely on ordering of RDMA writes after all.
>
>Couldn't the same point then be made that a CQ entry may come before the data
>has been posted?

When the CQ entry arrives, the context that polls it off the queue
must use the dma_sync_*() API to finalize any associated data
transactions (known by the upper layer).

This is basic, and it's the reason that a completion is so important.
The completion, in and of itself, isn't what drives the synchronization.
It's the transfer of control to the processor.

Tom.

From rdreier at cisco.com Fri Aug 25 09:53:27 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 25 Aug 2006 09:53:27 -0700
Subject: [openib-general] basic IB doubt
In-Reply-To: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com> (Sean Hefty's message of "Fri, 25 Aug 2006 09:40:54 -0700")
References: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com>
Message-ID:

Sean> Couldn't the same point then be made that a CQ entry may
Sean> come before the data has been posted?

That's true -- I guess I need to look at what ordering guarantees the
PCI spec makes to give a real answer.

 - R.

From robert.j.woodruff at intel.com Fri Aug 25 09:58:48 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 25 Aug 2006 09:58:48 -0700
Subject: [openib-general] basic IB doubt
Message-ID:

Catlin wrote,

>Another point, even if a vendor were to implement the firmware
>you suggest, how does the Data Source know that it is safe to
>use just RDMA Writes? The enabling firmware is in the Data Sink.

Huh, don't understand the question.

>Applications certainly do not want to have to validate the model
>of the RNIC that they are connected with.
If ISVs want to use an RNIC that does not support this technique, then they will have to implement their completion checking another way, which will be slower, so the hardware NICs that do not support this fast polling completion technique will be at a competitive disadvantage. Sometimes you can lead a horse to water, but you can't make then drink. From caitlinb at broadcom.com Fri Aug 25 10:12:11 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 25 Aug 2006 10:12:11 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: Message-ID: <54AD0F12E08D1541B826BE97C98F99F189EB4E@NT-SJCA-0751.brcm.ad.broadcom.com> Woodruff, Robert J wrote: > Catlin wrote, >> Another point, even if a vendor were to implement the firmware you >> suggest, how does the Data Source know that it is safe to use just >> RDMA Writes? The enabling firmware is in the Data Sink. > > Huh, don't understand the question. > >> Applications certainly do not want to have to validate the model of >> the RNIC that they are connected with. > > If ISVs want to use an RNIC that does not support this > technique, then they will have to implement their completion > checking another way, which will be slower, so the hardware > NICs that do not support this fast polling completion > technique will be at a competitive disadvantage. Sometimes > you can lead a horse to water, but you can't make then drink. The benefit of "last byte RDMA Write ordering" would be to sipmlify the logic of the remote peer doing the RDMA Writes. It does not benefit the application doing the receiving. The decision on whether or not to take the action that requires a clean completion at the data sink must be taken by the data source -- which has no method of knowing what vendor specific features the remote peer has. The whole point of using a standard protocol is to at least define the optional features in a vendor independent way. From dledford at redhat.com Fri Aug 25 10:29:54 2006 From: dledford at redhat.com (Doug Ledford) Date: Fri, 25 Aug 2006 17:29:54 +0000 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <44EA004F.2060608@mellanox.co.il> References: <44EA004F.2060608@mellanox.co.il> Message-ID: <1156526995.12257.107.camel@fc6.xsintricity.com> On Mon, 2006-08-21 at 21:49 +0300, Tziporet Koren wrote: > Hi, > > OFED 1.1-RC2 is avilable on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > File: OFED-1.1-rc2.tgz > Please report any issues in bugzilla http://openib.org/bugzilla/ > > Tziporet & Vlad > ------------------------------------------------------------------------------------- > > Release details: > ================ > > Build_id: > OFED-1.1-rc2 > > openib-1.1 (REV=9037) > # User space > https://openib.org/svn/gen2/branches/1.1/src/userspace > Git: > ref: refs/heads/ofed_1_1 > commit a13195d7ca0f047f479a58b2a81ff2b796eb8fa4 > > # MPI > mpi_osu-0.9.7-mlx2.2.0.tgz > openmpi-1.1-1.src.rpm > mpitests-2.0-0.src.rpm > > > OS support: > =========== > Novell: > - SLES 9.0 SP3* > - SLES10 (official release)* > Redhat: > - Redhat EL4 up3 > - Redhat EL4 up4* (not supported yet) > kernel.org: > - Kernel 2.6.17* > * Changed from 1.0 release > > Note: Redhat EL4 up2, Fedora C4 and SuSE Pro 10 were dropped from the list. > We keep the backport patches for these OSes and make sure OFED compile and > loaded properly but will not do full QA cycle. > > Systems: > ======== > * x86_64 > * x86 > * ia64 > * ppc64 Not supporting ppc is a problem to a certain extent. 
I can't speak for SuSE, but at least for Red Hat, ppc is the default and over rides ppc64. The ppc64 arch is less efficient than the ppc arch on ppc64 processors except when large memory footprints are involved. So, for things like opensm, ibv_*, etc. the ppc arch should actually be preferred, and the ppc64 arch libs should be present for those end user apps that need large memory access. That fact that dapl doesn't compile on ppc at all is problematic as well. In addition, what are you guys doing about the lack of asm/atomic.h (breaking udapl compiles on ppc64 and ia64) going forward? I'd look in the packages and see for myself but the svn update is taking forever due to those binary rpms packed into svn...ahh, it's finally done....ok, still broken. Without getting into an argument over the usage of that include, suffice it to say that the include file is gone and builds fails on fc6/rhel5beta. Since the code really only uses low level intrinsics as opposed to high level atomic ops, I made a ppc and ia64 intrinsics header for linux and added it to the dapl package itself to work around the issue. > > Main changes from OFED-1.1-rc1: > =============================== > 1. ipath driver: > - Compilation pass on all systems, except SLES9 SP3. > - See list of changes in the ipath driver at the end > 2. SDP: > - Fixed issue with 32 bit systems run out of low memory when opening hundreds of sockets. > - Added out of band and message peek support; telnet and ftp are now working > 3. SRP - a new srp_daemon was added - see explanation at the end > 4. IPoIB: High availability support using a daemon in user level. > Daemon is located under /userspace/ipoibtools/. See explanation at the end. > 5. Added Madeye utility > 6. Added verbs fork support. Should work from kernel 2.6.16 > 7. Fatal error support in mthca > 8. iSER support in install script for SLES 10 was fixed > 9. Diagnostic tools does not requires opensm installation. > For this the following changes were done to opensm RPM: > opensm-devel was removed > New packages were added: > libosmcomp > libosmcomp-devel > libosmvendor > libosmvendor-devel > libopensm > libopensm-devel Ugh. Each library does not need it's own package. Imagine what X would do to your RPM count otherwise. For grouped libraries like this, it is perfectly acceptable to do opensm, opensm-libs, opensm-devel (and that's in fact what I did for RHEL4 U4). Regardless though, make a decision and stick to it. Changing package names with each release == not good. > 10. bug fixes: > - SRP: Add local_ib_device/local_ib_port attributes to srp scsi_host > - mthca: fence bit supported; fixed deadlock in destroy qp > - ipoib: connectivity lost on sm lid change > - OSM: fix to work with Cisco stack > > > Limitations and known issues: > ============================= > 1. SDP: For Mellanox Sinai HCAs one must use latest FW version (1.1.000). > 2. SDP: Get peer name is not working properly > 3. SDP: Scalability issue when many connections are opened > 4. ipath driver does not compile on SLES9 SP3 > 5. RHEL4 up4 is not supported due to problems in the backport patches. You should be able to start by pulling the patches that are already applied out of the RHEL4 U4 kernel rpm, looking at which ones fix up the core kernel to provide what's needed instead of doing a thousand little backports all over the kernel tree, and axing any backport patches you had planned that would undo that. IOW, make use of the infrastructure provided in U4 instead of working around it. 
> 
> Missing features that should be completed for RC3:
> ==================================================
> 1. Core: Huge pages fix
> 2. IPoIB high availability does not support multicast groups
> 3. Support RHEL4 up4
> 
> Changes in the ipath driver:
> ============================
> * lock resource limit counters correctly
> * fix for crash on module unload, if cfgports < portcnt
> * fix handling of kpiobufs
> * drop requirement that PIO buffers be mmaped write-only
> * merge ipath_core and ib_ipath drivers
> * simplify layering code
> * simplify debugging code after ipath_core and ib_ipath merger
> * remove stale references to userspace SMA
> * More changes to support InfiniPath on PowerPC 970 systems.
> * add new minor device to allow sending of diag packets
> * do not allow use of CQ entries with invalid counts
> * account for attached QPs correctly
> * support new QLogic product naming scheme
> * add serial number to hardware freeze error message
> * be more strict about testing the modify QP verb
> * validate path_mig_state properly
> * put a limit on the number of QPs that can be created
> * handle sq_sig_all field correctly
> * allow SMA to be disabled
> * fix return value from ipath_poll
> * print warning if LID not acquired within one minute
> * allow direct control of Rx polarity inversion
> 
> srp_daemon explanation:
> =======================
> srp_daemon is a tool that identifies SRP targets in the fabric.
> 
> Each srp_daemon instance operates on one port.
> On boot it performs a full rescan of the fabric and then waits for srp_daemon events:
> - a join of a new target to the fabric
> - a change in the capabilities of a machine that becomes a target
> - an SA change
> - an expiration of a predefined timeout
> 
> When there is an SA change or a timeout expiration, srp_daemon performs a full rescan of the fabric.
> 
> For each target srp_daemon finds, it checks whether it is already connected to that port;
> if it is not connected, srp_daemon can either print the target details or connect to it.
> 
> Run srp_daemon -h for usage.
> 
> 
> IPoIB HA daemon:
> ================
> The IPoIB HA daemon can be configured in the /etc/infiniband/openib.conf file:
> 
> # Enable IPoIB High Availability daemon
> IPOIBHA_ENABLE=yes
> # PRIMARY_IPOIB_DEV=ib0
> # BACKUP_IPOIB_DEV=ib1
> 
> The default for PRIMARY_IPOIB_DEV is ib0 and for BACKUP_IPOIB_DEV is ib1.

Now that my svn update is complete, I'll review the 1.1rc2 spec and install
files and then send a separate email to the list about the various things I
had to change in the 1.0 release to meet packaging guidelines relevant to the
Fedora/Red Hat package review process that still apply to 1.1.

-- 
Doug Ledford 
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

From swise at opengridcomputing.com Fri Aug 25 10:36:24 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 25 Aug 2006 12:36:24 -0500
Subject: [openib-general] basic IB doubt
In-Reply-To: 
References: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com>
Message-ID: <1156527384.25674.7.camel@stevo-desktop>

On Fri, 2006-08-25 at 09:53 -0700, Roland Dreier wrote:
> Sean> Couldn't the same point then be made that a CQ entry may
> Sean> come before the data has been posted?
> 
> That's true -- I guess I need to look at what ordering guarantees the
> PCI spec makes to give a real answer.
> 

I believe bus bridges between devices and memory _must_ ensure write
ordering. Otherwise nothing works, right?

From jgunthorpe at obsidianresearch.com Fri Aug 25 10:35:20 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 25 Aug 2006 11:35:20 -0600
Subject: [openib-general] basic IB doubt
In-Reply-To: 
References: <44EC825E.2030709@ichips.intel.com> <20060823230812.GD13187@greglaptop.t-mobile.de> <7.0.1.0.2.20060824215543.042604b0@netapp.com>
Message-ID: <20060825173520.GA1624@obsidianresearch.com>

On Fri, Aug 25, 2006 at 09:33:31AM -0700, Roland Dreier wrote:

> Thomas> How does an adapter guarantee that no bridges or other
> Thomas> intervening devices reorder their writes, or for that
> Thomas> matter flush them to memory at all!?
> 
> That's a good point. The HCA would have to do a read to flush the
> posted writes, and I'm sure it's not doing that (since it would add
> horrible latency for no good reason).

PCI (-X and -E) have strict transaction ordering rules: writes may not be
re-ordered, and two ordered writes to the same address have defined
semantics.

One thing that is absolutely assured in a PCI system is that if write B
follows write A and the CPU observes B's data, then all of A must be
visible to the CPU. What I don't recall being assured is whether all the
data within a single transaction must become visible to the CPU in some
defined order..

This is why CQ's don't have a problem: seeing the new CQ entry, or the
MSI, is enough to ensure everything is visible to the CPU.

So, in the worst case, all a HCA would have to do is put the last dword
in a separate transaction..

Jason

From tom at opengridcomputing.com Fri Aug 25 10:45:26 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Fri, 25 Aug 2006 12:45:26 -0500
Subject: [openib-general] basic IB doubt
In-Reply-To: <7.0.1.0.2.20060825124752.0464b7c8@netapp.com>
References: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com> <7.0.1.0.2.20060825124752.0464b7c8@netapp.com>
Message-ID: <1156527926.25769.39.camel@trinity.ogc.int>

On Fri, 2006-08-25 at 12:51 -0400, Talpey, Thomas wrote:
> At 12:40 PM 8/25/2006, Sean Hefty wrote:
> >> Thomas> How does an adapter guarantee that no bridges or other
> >> Thomas> intervening devices reorder their writes, or for that
> >> Thomas> matter flush them to memory at all!?
> >> 
> >>That's a good point. The HCA would have to do a read to flush the
> >>posted writes, and I'm sure it's not doing that (since it would add
> >>horrible latency for no good reason).
> >> 
> >>I guess it's not safe to rely on ordering of RDMA writes after all.
> >
> >Couldn't the same point then be made that a CQ entry may come before the data
> >has been posted?
> 
> When the CQ entry arrives, the context that polls it off the queue
> must use the dma_sync_*() api to finalize any associated data
> transactions (known by the upper layer).
> 
> This is basic, and it's the reason that a completion is so important.
> The completion, in and of itself, isn't what drives the synchronization.
> It's the transfer of control to the processor.

This is a giant rat hole. On a coherent cache architecture, the CQE write
posted to the bus following the write of the last byte of data will NOT be
seen by the processor prior to the last byte of data. That is, write
ordering is preserved in bridges. The dma_sync_* API has to do with
processor cache, not transaction ordering.
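Just so we're all arguing about the same thing, the pattern in question is
the old last-byte poll, roughly the sketch below. This is purely an
illustration, not code from any shipping ULP: it assumes the receive buffer
was zeroed before the transfer was posted and that the payload is defined so
its final byte is always non-zero, neither of which the verbs spec gives you
for free.

/* Illustrative only.  'buf' is the target of the peer's RDMA Write,
 * 'len' its length in bytes. */
#include <stddef.h>

static void wait_for_rdma_write(volatile unsigned char *buf, size_t len)
{
	/* Spin until the last byte of the payload shows up.  The whole
	 * debate is whether seeing this byte implies the rest of the
	 * buffer is visible too. */
	while (buf[len - 1] == 0)
		;	/* could yield or pause here */
}

The receiving side also has to know that the sender will actually write that
byte last, which only works by private agreement between the peers. Anyway,
back to dma_sync_*: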
In fact, per this argument, at the time you called dma_sync_* the processor
may not have seen the reordered transaction yet, so what would it be syncing?

Write ordering and read ordering/fence is preserved in intervening bridges.
What you DON'T know is whether or not a write (which was posted and may be
sitting in a bridge FIFO) has been flushed and/or propagated to memory at
the time you submit the next write and/or interrupt the host. If you submit
a READ following the write, however, per the PCI bus ordering rules you know
that the data is in the target.

Unless, of course, I'm wrong ... :-)

> 
> Tom.

From sashak at voltaire.com Fri Aug 25 11:17:59 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 25 Aug 2006 21:17:59 +0300
Subject: [openib-general] [PATCH] osm: handle local events
In-Reply-To: <20060825145810.GD8380@greglaptop.hotels-on-air.de>
References: <20060824132848.GI8192@mellanox.co.il> <20060825141704.GA3867@sashak.voltaire.com> <20060825145810.GD8380@greglaptop.hotels-on-air.de>
Message-ID: <20060825181759.GA4239@sashak.voltaire.com>

On 07:58 Fri 25 Aug, Greg Lindahl wrote:
> On Fri, Aug 25, 2006 at 05:17:04PM +0300, Sasha Khapyorsky wrote:
> 
> > So the more generic question: some application performs a blocking read()
> > from /dev/umadN; should this read() be interrupted and return an error
> > (with an appropriate errno value) when the port state becomes DOWN?
> 
> If the SM gets a signal (alarm timeout) and the read() is interrupted
> with errno=EINTR... presumably this is not the case you had in mind.

Right, not this case; I'm not talking about signals. By "interrupted" I
meant that read() returns an error.

Sasha

> 
> -- greg

From bos at pathscale.com Fri Aug 25 11:24:42 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Fri, 25 Aug 2006 11:24:42 -0700
Subject: [openib-general] [PATCH 17 of 23] IB/ipath - validate path_mig_state properly
In-Reply-To: 
Message-ID: <98402de144a44d858937.1156530282@eng-12.pathscale.com>

Signed-off-by: Bryan O'Sullivan 

diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Fri Aug 25 11:19:45 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Fri Aug 25 11:19:45 2006 -0700
@@ -491,7 +491,8 @@ int ipath_modify_qp(struct ib_qp *ibqp,
 		goto inval;
 
 	if (attr_mask & IB_QP_PATH_MIG_STATE)
-		if (attr->path_mig_state != IB_MIG_MIGRATED)
+		if (attr->path_mig_state != IB_MIG_MIGRATED &&
+		    attr->path_mig_state != IB_MIG_REARM)
 			goto inval;
 
 	switch (new_state) {

From bos at pathscale.com Fri Aug 25 11:24:40 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Fri, 25 Aug 2006 11:24:40 -0700
Subject: [openib-general] [PATCH 15 of 23] IB/ipath - add serial number to hardware freeze error message
In-Reply-To: 
Message-ID: <10a37abd2abd9711e9e5.1156530280@eng-12.pathscale.com>

Also added the word "Hardware" after "Fatal" to make it more obvious that
it's hardware, not software.
Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c Fri Aug 25 11:19:45 2006 -0700 @@ -461,8 +461,9 @@ static void ipath_ht_handle_hwerrors(str * times. */ if (dd->ipath_flags & IPATH_INITTED) { - ipath_dev_err(dd, "Fatal Error (freeze " - "mode), no longer usable\n"); + ipath_dev_err(dd, "Fatal Hardware Error (freeze " + "mode), no longer usable, SN %.16s\n", + dd->ipath_serial); isfatal = 1; } *dd->ipath_statusp &= ~IPATH_STATUS_IB_READY; diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c Fri Aug 25 11:19:45 2006 -0700 @@ -363,8 +363,9 @@ static void ipath_pe_handle_hwerrors(str * and we get here multiple times */ if (dd->ipath_flags & IPATH_INITTED) { - ipath_dev_err(dd, "Fatal Error (freeze " - "mode), no longer usable\n"); + ipath_dev_err(dd, "Fatal Hardware Error (freeze " + "mode), no longer usable, SN %.16s\n", + dd->ipath_serial); isfatal = 1; } /* From bos at pathscale.com Fri Aug 25 11:24:45 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:45 -0700 Subject: [openib-general] [PATCH 20 of 23] IB/ipath - allow SMA to be disabled In-Reply-To: Message-ID: <278073a561698a354124.1156530285@eng-12.pathscale.com> This is useful for testing purposes. Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 @@ -107,6 +107,10 @@ module_param_named(max_srq_wrs, ib_ipath uint, S_IWUSR | S_IRUGO); MODULE_PARM_DESC(max_srq_wrs, "Maximum number of SRQ WRs support"); +static unsigned int ib_ipath_disable_sma; +module_param_named(disable_sma, ib_ipath_disable_sma, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(ib_ipath_disable_sma, "Disable the SMA"); + const int ib_ipath_state_ops[IB_QPS_ERR + 1] = { [IB_QPS_RESET] = 0, [IB_QPS_INIT] = IPATH_POST_RECV_OK, @@ -354,6 +358,9 @@ static void ipath_qp_rcv(struct ipath_ib switch (qp->ibqp.qp_type) { case IB_QPT_SMI: case IB_QPT_GSI: + if (ib_ipath_disable_sma) + break; + /* FALLTHROUGH */ case IB_QPT_UD: ipath_ud_rcv(dev, hdr, has_grh, data, tlen, qp); break; From bos at pathscale.com Fri Aug 25 11:24:41 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:41 -0700 Subject: [openib-general] [PATCH 16 of 23] IB/ipath - be more strict about testing the modify QP verb In-Reply-To: Message-ID: <24ecb7ac41f857a67660.1156530281@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 @@ -455,11 +455,16 @@ int ipath_modify_qp(struct ib_qp *ibqp, attr_mask)) goto inval; - if (attr_mask & IB_QP_AV) + if (attr_mask & IB_QP_AV) { if (attr->ah_attr.dlid == 0 || attr->ah_attr.dlid >= IPATH_MULTICAST_LID_BASE) goto inval; + if ((attr->ah_attr.ah_flags & IB_AH_GRH) && + (attr->ah_attr.grh.sgid_index > 1)) + goto inval; + } + if (attr_mask & IB_QP_PKEY_INDEX) 
if (attr->pkey_index >= ipath_get_npkeys(dev->dd)) goto inval; @@ -468,6 +473,27 @@ int ipath_modify_qp(struct ib_qp *ibqp, if (attr->min_rnr_timer > 31) goto inval; + if (attr_mask & IB_QP_PORT) + if (attr->port_num == 0 || + attr->port_num > ibqp->device->phys_port_cnt) + goto inval; + + if (attr_mask & IB_QP_PATH_MTU) + if (attr->path_mtu > IB_MTU_4096) + goto inval; + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) + if (attr->max_dest_rd_atomic > 1) + goto inval; + + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) + if (attr->max_rd_atomic > 1) + goto inval; + + if (attr_mask & IB_QP_PATH_MIG_STATE) + if (attr->path_mig_state != IB_MIG_MIGRATED) + goto inval; + switch (new_state) { case IB_QPS_RESET: ipath_reset_qp(qp); @@ -517,6 +543,9 @@ int ipath_modify_qp(struct ib_qp *ibqp, if (attr_mask & IB_QP_MIN_RNR_TIMER) qp->r_min_rnr_timer = attr->min_rnr_timer; + + if (attr_mask & IB_QP_TIMEOUT) + qp->timeout = attr->timeout; if (attr_mask & IB_QP_QKEY) qp->qkey = attr->qkey; @@ -564,7 +593,7 @@ int ipath_query_qp(struct ib_qp *ibqp, s attr->max_dest_rd_atomic = 1; attr->min_rnr_timer = qp->r_min_rnr_timer; attr->port_num = 1; - attr->timeout = 0; + attr->timeout = qp->timeout; attr->retry_cnt = qp->s_retry_cnt; attr->rnr_retry = qp->s_rnr_retry; attr->alt_port_num = 0; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 -0700 @@ -371,6 +371,7 @@ struct ipath_qp { u8 s_retry; /* requester retry counter */ u8 s_rnr_retry; /* requester RNR retry counter */ u8 s_pkey_index; /* PKEY index to use */ + u8 timeout; /* Timeout for this QP */ enum ib_mtu path_mtu; u32 remote_qpn; u32 qkey; /* QKEY for this QP (for UD or RD) */ From bos at pathscale.com Fri Aug 25 11:24:46 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:46 -0700 Subject: [openib-general] [PATCH 21 of 23] IB/ipath - fix return value from ipath_poll In-Reply-To: Message-ID: <1f9c75c844a96aa8f1e3.1156530286@eng-12.pathscale.com> This stops the generic poll code from waiting for a timeout. 
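To make the effect concrete: a consumer doing something like the fragment
below (purely an illustration of the calling pattern, not code from our
library; the fd and the one-second timeout are invented) used to sleep for
the full timeout even when a packet had already arrived, because ipath_poll()
always returned 0. With this patch it returns POLLIN | POLLRDNORM as soon as
there is receive data pending, so poll() comes back immediately.

/* illustrative userspace fragment */
#include <poll.h>

int wait_for_packet(int port_fd)
{
	struct pollfd pfd;

	pfd.fd = port_fd;	/* fd open on the ipath port device */
	pfd.events = POLLIN;

	/* previously this could block the full second regardless;
	 * now it returns as soon as receive data is pending */
	return poll(&pfd, 1, 1000);
}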
Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:45 2006 -0700 @@ -1150,6 +1150,7 @@ static unsigned int ipath_poll(struct fi struct ipath_portdata *pd; u32 head, tail; int bit; + unsigned pollflag = 0; struct ipath_devdata *dd; pd = port_fp(fp); @@ -1186,9 +1187,12 @@ static unsigned int ipath_poll(struct fi clear_bit(IPATH_PORT_WAITING_RCV, &pd->port_flag); pd->port_rcvwait_to++; } + else + pollflag = POLLIN | POLLRDNORM; } else { /* it's already happened; don't do wait_event overhead */ + pollflag = POLLIN | POLLRDNORM; pd->port_rcvnowait++; } @@ -1196,7 +1200,7 @@ static unsigned int ipath_poll(struct fi ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, dd->ipath_rcvctrl); - return 0; + return pollflag; } static int try_alloc_port(struct ipath_devdata *dd, int port, From bos at pathscale.com Fri Aug 25 11:24:44 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:44 -0700 Subject: [openib-general] [PATCH 19 of 23] IB/ipath - handle sq_sig_all field correctly In-Reply-To: Message-ID: <263d5f544bb43586b9b8.1156530284@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 @@ -606,9 +606,10 @@ int ipath_query_qp(struct ib_qp *ibqp, s init_attr->recv_cq = qp->ibqp.recv_cq; init_attr->srq = qp->ibqp.srq; init_attr->cap = attr->cap; - init_attr->sq_sig_type = - (qp->s_flags & (1 << IPATH_S_SIGNAL_REQ_WR)) - ? IB_SIGNAL_REQ_WR : 0; + if (qp->s_flags & (1 << IPATH_S_SIGNAL_REQ_WR)) + init_attr->sq_sig_type = IB_SIGNAL_REQ_WR; + else + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; init_attr->qp_type = qp->ibqp.qp_type; init_attr->port_num = 1; return 0; @@ -776,8 +777,10 @@ struct ib_qp *ipath_create_qp(struct ib_ qp->s_wq = swq; qp->s_size = init_attr->cap.max_send_wr + 1; qp->s_max_sge = init_attr->cap.max_send_sge; - qp->s_flags = init_attr->sq_sig_type == IB_SIGNAL_REQ_WR ? - 1 << IPATH_S_SIGNAL_REQ_WR : 0; + if (init_attr->sq_sig_type == IB_SIGNAL_REQ_WR) + qp->s_flags = 1 << IPATH_S_SIGNAL_REQ_WR; + else + qp->s_flags = 0; dev = to_idev(ibpd->device); err = ipath_alloc_qpn(&dev->qp_table, qp, init_attr->qp_type); From bos at pathscale.com Fri Aug 25 11:24:47 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:47 -0700 Subject: [openib-general] [PATCH 22 of 23] IB/ipath - print warning if LID not acquired within one minute In-Reply-To: Message-ID: <1a41dc627c5a1bc2f7e9.1156530287@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:46 2006 -0700 @@ -114,6 +114,13 @@ static int __devinit ipath_init_one(stru #define PCI_DEVICE_ID_INFINIPATH_HT 0xd #define PCI_DEVICE_ID_INFINIPATH_PE800 0x10 +/* + * Number of seconds before we complain about not getting a LID + * assignment. 
+ */ + +#define LID_TIMEOUT 60 + static const struct pci_device_id ipath_pci_tbl[] = { { PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_INFINIPATH_HT) }, { PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_INFINIPATH_PE800) }, @@ -129,6 +136,29 @@ static struct pci_driver ipath_driver = .id_table = ipath_pci_tbl, }; + +static void check_link_status(void *data) +{ + struct ipath_devdata *dd = data; + + /* + * If we're in the NOCABLE state, try again in another minute. + */ + + if (dd->ipath_flags & IPATH_STATUS_IB_NOCABLE) { + schedule_delayed_work(&dd->link_task, HZ * LID_TIMEOUT); + return; + } + + /* + * If we don't have a LID, let the user know and don't bother + * checking again. + */ + + if (dd->ipath_lid == 0) + dev_info(&dd->pcidev->dev, + "We don't have a LID yet (no subnet manager?)"); +} static inline void read_bars(struct ipath_devdata *dd, struct pci_dev *dev, u32 *bar0, u32 *bar1) @@ -196,6 +226,8 @@ static struct ipath_devdata *ipath_alloc dd->pcidev = pdev; pci_set_drvdata(pdev, dd); + + INIT_WORK(&dd->link_task, check_link_status, dd); list_add(&dd->ipath_list, &ipath_dev_list); @@ -509,6 +541,9 @@ static int __devinit ipath_init_one(stru ipath_diag_add(dd); ipath_register_ib_device(dd); + /* Check that we have a LID in LID_TIMEOUT seconds. */ + schedule_delayed_work(&dd->link_task, HZ * LID_TIMEOUT); + goto bail; bail_iounmap: @@ -536,6 +571,9 @@ static void __devexit ipath_remove_one(s return; dd = pci_get_drvdata(pdev); + + cancel_delayed_work(&dd->link_task); + ipath_unregister_ib_device(dd->verbs_dev); ipath_diag_remove(dd); ipath_user_remove(dd); @@ -1644,6 +1682,8 @@ int ipath_set_lid(struct ipath_devdata * dd->ipath_lid = arg; dd->ipath_lmc = lmc; + dev_info(&dd->pcidev->dev, "We got a lid: %u\n", arg); + return 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:46 2006 -0700 @@ -508,6 +508,9 @@ struct ipath_devdata { u32 ipath_lli_counter; /* local link integrity errors */ u32 ipath_lli_errors; + + /* Link status check work */ + struct work_struct link_task; }; extern struct list_head ipath_dev_list; From bos at pathscale.com Fri Aug 25 11:24:25 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:25 -0700 Subject: [openib-general] [PATCH 0 of 23] IB/ipath - updates for 2.6.19 Message-ID: Hi, Roland - This is a series of patches to bring the ipath driver up to date for 2.6.19. The patches apply on top of Ralph's mmap patch that you accepted yesterday. Please apply. 
Thanks, Message-ID: <02d9f3ef2291dea0f4d8.1156530283@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 @@ -833,9 +833,21 @@ struct ib_qp *ipath_create_qp(struct ib_ } } + spin_lock(&dev->n_qps_lock); + if (dev->n_qps_allocated == ib_ipath_max_qps) { + spin_unlock(&dev->n_qps_lock); + ret = ERR_PTR(-ENOMEM); + goto bail_ip; + } + + dev->n_qps_allocated++; + spin_unlock(&dev->n_qps_lock); + ret = &qp->ibqp; goto bail; +bail_ip: + kfree(qp->ip); bail_rwq: vfree(qp->r_rq.wq); bail_qp: @@ -864,6 +876,9 @@ int ipath_destroy_qp(struct ib_qp *ibqp) spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; spin_unlock_irqrestore(&qp->s_lock, flags); + spin_lock(&dev->n_qps_lock); + dev->n_qps_allocated--; + spin_unlock(&dev->n_qps_lock); /* Stop the sending tasklet. */ tasklet_kill(&qp->s_task); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 @@ -72,6 +72,10 @@ module_param_named(max_qp_wrs, ib_ipath_ module_param_named(max_qp_wrs, ib_ipath_max_qp_wrs, uint, S_IWUSR | S_IRUGO); MODULE_PARM_DESC(max_qp_wrs, "Maximum number of QP WRs to support"); + +unsigned int ib_ipath_max_qps = 16384; +module_param_named(max_qps, ib_ipath_max_qps, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_qps, "Maximum number of QPs to support"); unsigned int ib_ipath_max_sges = 0x60; module_param_named(max_sges, ib_ipath_max_sges, uint, S_IWUSR | S_IRUGO); @@ -958,7 +962,7 @@ static int ipath_query_device(struct ib_ props->sys_image_guid = dev->sys_image_guid; props->max_mr_size = ~0ull; - props->max_qp = dev->qp_table.max; + props->max_qp = ib_ipath_max_qps; props->max_qp_wr = ib_ipath_max_qp_wrs; props->max_sge = ib_ipath_max_sges; props->max_cq = ib_ipath_max_cqs; @@ -1420,6 +1424,7 @@ int ipath_register_ib_device(struct ipat spin_lock_init(&idev->n_pds_lock); spin_lock_init(&idev->n_ahs_lock); spin_lock_init(&idev->n_cqs_lock); + spin_lock_init(&idev->n_qps_lock); spin_lock_init(&idev->n_srqs_lock); spin_lock_init(&idev->n_mcast_grps_lock); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 -0700 @@ -482,6 +482,8 @@ struct ipath_ibdev { spinlock_t n_ahs_lock; u32 n_cqs_allocated; /* number of CQs allocated for device */ spinlock_t n_cqs_lock; + u32 n_qps_allocated; /* number of QPs allocated for device */ + spinlock_t n_qps_lock; u32 n_srqs_allocated; /* number of SRQs allocated for device */ spinlock_t n_srqs_lock; u32 n_mcast_grps_allocated; /* number of mcast groups allocated */ @@ -792,6 +794,8 @@ extern unsigned int ib_ipath_max_cqs; extern unsigned int ib_ipath_max_qp_wrs; +extern unsigned int ib_ipath_max_qps; + extern unsigned int ib_ipath_max_sges; extern unsigned int ib_ipath_max_mcast_grps; From bos at pathscale.com Fri Aug 25 11:24:48 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:48 -0700 Subject: [openib-general] [PATCH 23 of 23] IB/ipath - control receive polarity inversion In-Reply-To: Message-ID: 
<7a03a7b18dcfe1afeeb1.1156530288@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:46 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:46 2006 -0700 @@ -2156,5 +2156,22 @@ bail: return ret; } +int ipath_set_rx_pol_inv(struct ipath_devdata *dd, u8 new_pol_inv) +{ + u64 val; + if ( new_pol_inv > INFINIPATH_XGXS_RX_POL_MASK ) { + return -1; + } + if ( dd->ipath_rx_pol_inv != new_pol_inv ) { + dd->ipath_rx_pol_inv = new_pol_inv; + val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_xgxsconfig); + val &= ~(INFINIPATH_XGXS_RX_POL_MASK << + INFINIPATH_XGXS_RX_POL_SHIFT); + val |= ((u64)dd->ipath_rx_pol_inv) << + INFINIPATH_XGXS_RX_POL_SHIFT; + ipath_write_kreg(dd, dd->ipath_kregs->kr_xgxsconfig, val); + } + return 0; +} module_init(infinipath_init); module_exit(infinipath_cleanup); diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c Fri Aug 25 11:19:46 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c Fri Aug 25 11:19:46 2006 -0700 @@ -1290,6 +1290,15 @@ static int ipath_ht_bringup_serdes(struc val &= ~INFINIPATH_XGXS_RESET; change = 1; } + if (((val >> INFINIPATH_XGXS_RX_POL_SHIFT) & + INFINIPATH_XGXS_RX_POL_MASK) != dd->ipath_rx_pol_inv ) { + /* need to compensate for Tx inversion in partner */ + val &= ~(INFINIPATH_XGXS_RX_POL_MASK << + INFINIPATH_XGXS_RX_POL_SHIFT); + val |= dd->ipath_rx_pol_inv << + INFINIPATH_XGXS_RX_POL_SHIFT; + change = 1; + } if (change) ipath_write_kreg(dd, dd->ipath_kregs->kr_xgxsconfig, val); diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c Fri Aug 25 11:19:46 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c Fri Aug 25 11:19:46 2006 -0700 @@ -654,6 +654,15 @@ static int ipath_pe_bringup_serdes(struc val &= ~INFINIPATH_XGXS_RESET; change = 1; } + if (((val >> INFINIPATH_XGXS_RX_POL_SHIFT) & + INFINIPATH_XGXS_RX_POL_MASK) != dd->ipath_rx_pol_inv ) { + /* need to compensate for Tx inversion in partner */ + val &= ~(INFINIPATH_XGXS_RX_POL_MASK << + INFINIPATH_XGXS_RX_POL_SHIFT); + val |= dd->ipath_rx_pol_inv << + INFINIPATH_XGXS_RX_POL_SHIFT; + change = 1; + } if (change) ipath_write_kreg(dd, dd->ipath_kregs->kr_xgxsconfig, val); diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:46 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:46 2006 -0700 @@ -503,6 +503,8 @@ struct ipath_devdata { u8 ipath_pci_cacheline; /* LID mask control */ u8 ipath_lmc; + /* Rx Polarity inversion (compensate for ~tx on partner) */ + u8 ipath_rx_pol_inv; /* local link integrity counter */ u32 ipath_lli_counter; @@ -570,6 +572,7 @@ int ipath_set_linkstate(struct ipath_dev int ipath_set_linkstate(struct ipath_devdata *, u8); int ipath_set_mtu(struct ipath_devdata *, u16); int ipath_set_lid(struct ipath_devdata *, u32, u8); +int ipath_set_rx_pol_inv(struct ipath_devdata *dd, u8 new_pol_inv); /* for use in system calls, where we want to know device type, etc. 
*/ #define port_fp(fp) ((struct ipath_portdata *) (fp)->private_data) diff --git a/drivers/infiniband/hw/ipath/ipath_registers.h b/drivers/infiniband/hw/ipath/ipath_registers.h --- a/drivers/infiniband/hw/ipath/ipath_registers.h Fri Aug 25 11:19:46 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_registers.h Fri Aug 25 11:19:46 2006 -0700 @@ -282,6 +282,8 @@ #define INFINIPATH_XGXS_RESET 0x7ULL #define INFINIPATH_XGXS_MDIOADDR_MASK 0xfULL #define INFINIPATH_XGXS_MDIOADDR_SHIFT 4 +#define INFINIPATH_XGXS_RX_POL_SHIFT 19 +#define INFINIPATH_XGXS_RX_POL_MASK 0xfULL #define INFINIPATH_RT_ADDR_MASK 0xFFFFFFFFFFULL /* 40 bits valid */ diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Fri Aug 25 11:19:46 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Fri Aug 25 11:19:46 2006 -0700 @@ -561,6 +561,33 @@ bail: return ret; } +static ssize_t store_rx_pol_inv(struct device *dev, + struct device_attribute *attr, + const char *buf, + size_t count) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + int ret, r; + u16 val; + + ret = ipath_parse_ushort(buf, &val); + if (ret < 0) + goto invalid; + + r = ipath_set_rx_pol_inv(dd, val); + if (r < 0) { + ret = r; + goto bail; + } + + goto bail; +invalid: + ipath_dev_err(dd, "attempt to set invalid Rx Polarity invert\n"); +bail: + return ret; +} + + static DRIVER_ATTR(num_units, S_IRUGO, show_num_units, NULL); static DRIVER_ATTR(version, S_IRUGO, show_version, NULL); @@ -587,6 +614,7 @@ static DEVICE_ATTR(status_str, S_IRUGO, static DEVICE_ATTR(status_str, S_IRUGO, show_status_str, NULL); static DEVICE_ATTR(boardversion, S_IRUGO, show_boardversion, NULL); static DEVICE_ATTR(unit, S_IRUGO, show_unit, NULL); +static DEVICE_ATTR(rx_pol_inv, S_IWUSR, NULL, store_rx_pol_inv); static struct attribute *dev_attributes[] = { &dev_attr_guid.attr, @@ -601,6 +629,7 @@ static struct attribute *dev_attributes[ &dev_attr_boardversion.attr, &dev_attr_unit.attr, &dev_attr_enabled.attr, + &dev_attr_rx_pol_inv.attr, NULL }; From bos at pathscale.com Fri Aug 25 11:24:38 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:38 -0700 Subject: [openib-general] [PATCH 13 of 23] IB/ipath - account for attached QPs correctly In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c --- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c Fri Aug 25 11:19:45 2006 -0700 @@ -217,6 +217,8 @@ static int ipath_mcast_add(struct ipath_ dev->n_mcast_grps_allocated++; spin_unlock(&dev->n_mcast_grps_lock); + mcast->n_attached++; + list_add_tail_rcu(&mqp->list, &mcast->qp_list); atomic_inc(&mcast->refcount); From bos at pathscale.com Fri Aug 25 11:24:37 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:37 -0700 Subject: [openib-general] [PATCH 12 of 23] IB/ipath - do not allow use of CQ entries with invalid counts In-Reply-To: Message-ID: <94b773e8d36a655ea540.1156530277@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c --- a/drivers/infiniband/hw/ipath/ipath_cq.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_cq.c Fri Aug 25 11:19:45 2006 -0700 @@ -172,7 +172,7 @@ struct ib_cq *ipath_create_cq(struct ib_ struct ipath_cq_wc 
*wc; struct ib_cq *ret; - if (entries > ib_ipath_max_cqes) { + if (entries < 1 || entries > ib_ipath_max_cqes) { ret = ERR_PTR(-EINVAL); goto done; } @@ -324,6 +324,11 @@ int ipath_resize_cq(struct ib_cq *ibcq, u32 head, tail, n; int ret; + if (cqe < 1 || cqe > ib_ipath_max_cqes) { + ret = -EINVAL; + goto bail; + } + /* * Need to use vmalloc() if we want to support large #s of entries. */ From bos at pathscale.com Fri Aug 25 11:24:28 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:28 -0700 Subject: [openib-general] [PATCH 3 of 23] IB/ipath - fix for crash on module unload, if cfgports < portcnt In-Reply-To: Message-ID: Allocate enough pointers for all possible ports, to avoid problems in cleanup/unload. Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Fri Aug 25 11:19:44 2006 -0700 @@ -240,7 +240,11 @@ static int init_chip_first(struct ipath_ "only supports %u\n", ipath_cfgports, dd->ipath_portcnt); } - dd->ipath_pd = kzalloc(sizeof(*dd->ipath_pd) * dd->ipath_cfgports, + /* + * Allocate full portcnt array, rather than just cfgports, because + * cleanup iterates across all possible ports. + */ + dd->ipath_pd = kzalloc(sizeof(*dd->ipath_pd) * dd->ipath_portcnt, GFP_KERNEL); if (!dd->ipath_pd) { From bos at pathscale.com Fri Aug 25 11:24:29 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:29 -0700 Subject: [openib-general] [PATCH 4 of 23] IB/ipath - fix handling of kpiobufs In-Reply-To: Message-ID: <599649f41f050c5a267c.1156530269@eng-12.pathscale.com> Change comment: no longer imply that user can set ipath_kpiobufs to zero. Actually set ipath_kpiobufs from parameter. Previously only altered per-device ipath_lastport_piobuf, which was over-written in chip init. 
Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Fri Aug 25 11:19:44 2006 -0700 @@ -691,7 +691,7 @@ int ipath_init_chip(struct ipath_devdata dd->ipath_pioavregs = ALIGN(val, sizeof(u64) * BITS_PER_BYTE / 2) / (sizeof(u64) * BITS_PER_BYTE / 2); if (ipath_kpiobufs == 0) { - /* not set by user, or set explictly to default */ + /* not set by user (this is default) */ if ((dd->ipath_piobcnt2k + dd->ipath_piobcnt4k) > 128) kpiobufs = 32; else @@ -950,6 +950,7 @@ static int ipath_set_kpiobufs(const char dd->ipath_piobcnt2k + dd->ipath_piobcnt4k - val; } + ipath_kpiobufs = val; ret = 0; bail: spin_unlock_irqrestore(&ipath_devs_lock, flags); From bos at pathscale.com Fri Aug 25 11:24:35 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:35 -0700 Subject: [openib-general] [PATCH 10 of 23] IB/ipath - trivial cleanups In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 @@ -528,7 +528,6 @@ void ipath_cdev_cleanup(struct cdev **cd int ipath_diag_add(struct ipath_devdata *); void ipath_diag_remove(struct ipath_devdata *); -void ipath_diag_bringup_link(struct ipath_devdata *); extern wait_queue_head_t ipath_state_wait; From bos at pathscale.com Fri Aug 25 11:24:30 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:30 -0700 Subject: [openib-general] [PATCH 5 of 23] IB/ipath - drop requirement that PIO buffers be mmaped write-only In-Reply-To: Message-ID: Some userlands try to mmap these pages read-write, so accommodate them. Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:44 2006 -0700 @@ -992,15 +992,10 @@ static int mmap_piobufs(struct vm_area_s pgprot_val(vma->vm_page_prot) &= ~_PAGE_GUARDED; #endif - if (vma->vm_flags & VM_READ) { - dev_info(&dd->pcidev->dev, - "Can't map piobufs as readable (flags=%lx)\n", - vma->vm_flags); - ret = -EPERM; - goto bail; - } - - /* don't allow them to later change to readable with mprotect */ + /* + * don't allow them to later change to readable with mprotect (for when + * not initially mapped readable, as is normally the case) + */ vma->vm_flags &= ~VM_MAYREAD; vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; From bos at pathscale.com Fri Aug 25 11:24:26 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:26 -0700 Subject: [openib-general] [PATCH 1 of 23] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: Message-ID: <44809b730ac95b39b672.1156530266@eng-12.pathscale.com> Ordering of writethrough store buffers needs to be forced, and we need to use ifdef to get writethrough behavior to InfiniPath buffers, because there is no generic way to specify that at this time (similar to code in char/drm/drm_vm.c and block/z2ram.c). 
Signed-off-by: John Gregor Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:44 2006 -0700 @@ -20,6 +20,7 @@ ipath_core-y := \ ipath_user_pages.o ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o ib_ipath-y := \ ipath_cq.o \ diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:44 2006 -0700 @@ -440,7 +440,13 @@ static int __devinit ipath_init_one(stru } dd->ipath_pcirev = rev; +#if defined(__powerpc__) + /* There isn't a generic way to specify writethrough mappings */ + dd->ipath_kregbase = __ioremap(addr, len, + (_PAGE_NO_CACHE|_PAGE_WRITETHRU)); +#else dd->ipath_kregbase = ioremap_nocache(addr, len); +#endif if (!dd->ipath_kregbase) { ipath_dbg("Unable to map io addr %llx to kvirt, failing\n", diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:44 2006 -0700 @@ -985,6 +985,13 @@ static int mmap_piobufs(struct vm_area_s * write combining behavior we want on the PIO buffers! */ +#if defined(__powerpc__) + /* There isn't a generic way to specify writethrough mappings */ + pgprot_val(vma->vm_page_prot) |= _PAGE_NO_CACHE; + pgprot_val(vma->vm_page_prot) |= _PAGE_WRITETHRU; + pgprot_val(vma->vm_page_prot) &= ~_PAGE_GUARDED; +#endif + if (vma->vm_flags & VM_READ) { dev_info(&dd->pcidev->dev, "Can't map piobufs as readable (flags=%lx)\n", diff --git a/drivers/infiniband/hw/ipath/ipath_wc_ppc64.c b/drivers/infiniband/hw/ipath/ipath_wc_ppc64.c new file mode 100644 --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_wc_ppc64.c Fri Aug 25 11:19:44 2006 -0700 @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +/* + * This file is conditionally built on PowerPC only. Otherwise weak symbol + * versions of the functions exported from here are used. + */ + +#include "ipath_kernel.h" + +/** + * ipath_unordered_wc - indicate whether write combining is ordered + * + * PowerPC systems (at least those in the 970 processor family) + * write partially filled store buffers in address order, but will write + * completely filled store buffers in "random" order, and therefore must + * have serialization for correctness with current InfiniPath chips. + * + */ +int ipath_unordered_wc(void) +{ + return 1; +} From bos at pathscale.com Fri Aug 25 11:24:27 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:27 -0700 Subject: [openib-general] [PATCH 2 of 23] IB/ipath - lock resource limit counters correctly In-Reply-To: Message-ID: <4326e8fdb03d3ca806c6.1156530267@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:44 2006 -0700 @@ -776,18 +776,22 @@ static struct ib_pd *ipath_alloc_pd(stru * we allow allocations of more than we report for this value. */ - if (dev->n_pds_allocated == ib_ipath_max_pds) { - ret = ERR_PTR(-ENOMEM); - goto bail; - } - pd = kmalloc(sizeof *pd, GFP_KERNEL); if (!pd) { ret = ERR_PTR(-ENOMEM); goto bail; } + spin_lock(&dev->n_pds_lock); + if (dev->n_pds_allocated == ib_ipath_max_pds) { + spin_unlock(&dev->n_pds_lock); + kfree(pd); + ret = ERR_PTR(-ENOMEM); + goto bail; + } + dev->n_pds_allocated++; + spin_unlock(&dev->n_pds_lock); /* ib_alloc_pd() will initialize pd->ibpd. */ pd->user = udata != NULL; @@ -803,7 +807,9 @@ static int ipath_dealloc_pd(struct ib_pd struct ipath_pd *pd = to_ipd(ibpd); struct ipath_ibdev *dev = to_idev(ibpd->device); + spin_lock(&dev->n_pds_lock); dev->n_pds_allocated--; + spin_unlock(&dev->n_pds_lock); kfree(pd); @@ -824,11 +830,6 @@ static struct ib_ah *ipath_create_ah(str struct ib_ah *ret; struct ipath_ibdev *dev = to_idev(pd->device); - if (dev->n_ahs_allocated == ib_ipath_max_ahs) { - ret = ERR_PTR(-ENOMEM); - goto bail; - } - /* A multicast address requires a GRH (see ch. 8.4.1). */ if (ah_attr->dlid >= IPATH_MULTICAST_LID_BASE && ah_attr->dlid != IPATH_PERMISSIVE_LID && @@ -854,7 +855,16 @@ static struct ib_ah *ipath_create_ah(str goto bail; } + spin_lock(&dev->n_ahs_lock); + if (dev->n_ahs_allocated == ib_ipath_max_ahs) { + spin_unlock(&dev->n_ahs_lock); + kfree(ah); + ret = ERR_PTR(-ENOMEM); + goto bail; + } + dev->n_ahs_allocated++; + spin_unlock(&dev->n_ahs_lock); /* ib_create_ah() will initialize ah->ibah. */ ah->attr = *ah_attr; @@ -876,7 +886,9 @@ static int ipath_destroy_ah(struct ib_ah struct ipath_ibdev *dev = to_idev(ibah->device); struct ipath_ah *ah = to_iah(ibah); + spin_lock(&dev->n_ahs_lock); dev->n_ahs_allocated--; + spin_unlock(&dev->n_ahs_lock); kfree(ah); @@ -963,6 +975,12 @@ static void *ipath_register_ib_device(in dev = &idev->ibdev; /* Only need to initialize non-zero fields. 
*/ + spin_lock_init(&idev->n_pds_lock); + spin_lock_init(&idev->n_ahs_lock); + spin_lock_init(&idev->n_cqs_lock); + spin_lock_init(&idev->n_srqs_lock); + spin_lock_init(&idev->n_mcast_grps_lock); + spin_lock_init(&idev->qp_table.lock); spin_lock_init(&idev->lk_table.lock); idev->sm_lid = __constant_be16_to_cpu(IB_LID_PERMISSIVE); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:44 2006 -0700 @@ -434,11 +434,18 @@ struct ipath_ibdev { __be64 sys_image_guid; /* in network order */ __be64 gid_prefix; /* in network order */ __be64 mkey; + u32 n_pds_allocated; /* number of PDs allocated for device */ + spinlock_t n_pds_lock; u32 n_ahs_allocated; /* number of AHs allocated for device */ + spinlock_t n_ahs_lock; u32 n_cqs_allocated; /* number of CQs allocated for device */ + spinlock_t n_cqs_lock; u32 n_srqs_allocated; /* number of SRQs allocated for device */ + spinlock_t n_srqs_lock; u32 n_mcast_grps_allocated; /* number of mcast groups allocated */ + spinlock_t n_mcast_grps_lock; + u64 ipath_sword; /* total dwords sent (sample result) */ u64 ipath_rword; /* total dwords received (sample result) */ u64 ipath_spkts; /* total packets sent (sample result) */ diff --git a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c --- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c Fri Aug 25 11:19:44 2006 -0700 @@ -207,12 +207,15 @@ static int ipath_mcast_add(struct ipath_ goto bail; } + spin_lock(&dev->n_mcast_grps_lock); if (dev->n_mcast_grps_allocated == ib_ipath_max_mcast_grps) { + spin_unlock(&dev->n_mcast_grps_lock); ret = ENOMEM; goto bail; } dev->n_mcast_grps_allocated++; + spin_unlock(&dev->n_mcast_grps_lock); list_add_tail_rcu(&mqp->list, &mcast->qp_list); @@ -343,7 +346,9 @@ int ipath_multicast_detach(struct ib_qp atomic_dec(&mcast->refcount); wait_event(mcast->wait, !atomic_read(&mcast->refcount)); ipath_mcast_free(mcast); + spin_lock(&dev->n_mcast_grps_lock); dev->n_mcast_grps_allocated--; + spin_unlock(&dev->n_mcast_grps_lock); } ret = 0; From bos at pathscale.com Fri Aug 25 11:24:36 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:36 -0700 Subject: [openib-general] [PATCH 11 of 23] IB/ipath - add new minor device to allow sending of diag packets In-Reply-To: Message-ID: <8743e6ee09c51e799f0f.1156530276@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_common.h b/drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Fri Aug 25 11:19:45 2006 -0700 @@ -461,6 +461,13 @@ struct __ipath_sendpkt { __u32 sps_cnt; /* number of entries to use in sps_iov */ /* array of iov's describing packet. TEMPORARY */ struct ipath_iovec sps_iov[4]; +}; + +/* Passed into diag data special file's ->write method. 
*/ +struct ipath_diag_pkt { + __u32 unit; + __u64 data; + __u32 len; }; /* diff --git a/drivers/infiniband/hw/ipath/ipath_diag.c b/drivers/infiniband/hw/ipath/ipath_diag.c --- a/drivers/infiniband/hw/ipath/ipath_diag.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_diag.c Fri Aug 25 11:19:45 2006 -0700 @@ -41,6 +41,7 @@ * through the /sys/bus/pci resource mmap interface. */ +#include #include #include @@ -273,6 +274,158 @@ bail: return ret; } +static ssize_t ipath_diagpkt_write(struct file *fp, + const char __user *data, + size_t count, loff_t *off); + +static struct file_operations diagpkt_file_ops = { + .owner = THIS_MODULE, + .write = ipath_diagpkt_write, +}; + +static struct cdev *diagpkt_cdev; +static struct class_device *diagpkt_class_dev; + +int __init ipath_diagpkt_add(void) +{ + return ipath_cdev_init(IPATH_DIAGPKT_MINOR, + "ipath_diagpkt", &diagpkt_file_ops, + &diagpkt_cdev, &diagpkt_class_dev); +} + +void __exit ipath_diagpkt_remove(void) +{ + ipath_cdev_cleanup(&diagpkt_cdev, &diagpkt_class_dev); +} + +/** + * ipath_diagpkt_write - write an IB packet + * @fp: the diag data device file pointer + * @data: ipath_diag_pkt structure saying where to get the packet + * @count: size of data to write + * @off: unused by this code + */ +static ssize_t ipath_diagpkt_write(struct file *fp, + const char __user *data, + size_t count, loff_t *off) +{ + u32 __iomem *piobuf; + u32 plen, clen, pbufn; + struct ipath_diag_pkt dp; + u32 *tmpbuf = NULL; + struct ipath_devdata *dd; + ssize_t ret = 0; + u64 val; + + if (count < sizeof(dp)) { + ret = -EINVAL; + goto bail; + } + + if (copy_from_user(&dp, data, sizeof(dp))) { + ret = -EFAULT; + goto bail; + } + + /* send count must be an exact number of dwords */ + if (dp.len & 3) { + ret = -EINVAL; + goto bail; + } + + clen = dp.len >> 2; + + dd = ipath_lookup(dp.unit); + if (!dd || !(dd->ipath_flags & IPATH_PRESENT) || + !dd->ipath_kregbase) { + ipath_cdbg(VERBOSE, "illegal unit %u for diag data send\n", + dp.unit); + ret = -ENODEV; + goto bail; + } + + if (ipath_diag_inuse && !diag_set_link && + !(dd->ipath_flags & IPATH_LINKACTIVE)) { + diag_set_link = 1; + ipath_cdbg(VERBOSE, "Trying to set to set link active for " + "diag pkt\n"); + ipath_set_linkstate(dd, IPATH_IB_LINKARM); + ipath_set_linkstate(dd, IPATH_IB_LINKACTIVE); + } + + if (!(dd->ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. 
*/ + ipath_cdbg(VERBOSE, "unit %u not usable\n", dd->ipath_unit); + ret = -ENODEV; + goto bail; + } + val = dd->ipath_lastibcstat & IPATH_IBSTATE_MASK; + if (val != IPATH_IBSTATE_INIT && val != IPATH_IBSTATE_ARM && + val != IPATH_IBSTATE_ACTIVE) { + ipath_cdbg(VERBOSE, "unit %u not ready (state %llx)\n", + dd->ipath_unit, (unsigned long long) val); + ret = -EINVAL; + goto bail; + } + + /* need total length before first word written */ + /* +1 word is for the qword padding */ + plen = sizeof(u32) + dp.len; + + if ((plen + 4) > dd->ipath_ibmaxlen) { + ipath_dbg("Pkt len 0x%x > ibmaxlen %x\n", + plen - 4, dd->ipath_ibmaxlen); + ret = -EINVAL; + goto bail; /* before writing pbc */ + } + tmpbuf = vmalloc(plen); + if (!tmpbuf) { + dev_info(&dd->pcidev->dev, "Unable to allocate tmp buffer, " + "failing\n"); + ret = -ENOMEM; + goto bail; + } + + if (copy_from_user(tmpbuf, + (const void __user *) (unsigned long) dp.data, + dp.len)) { + ret = -EFAULT; + goto bail; + } + + piobuf = ipath_getpiobuf(dd, &pbufn); + if (!piobuf) { + ipath_cdbg(VERBOSE, "No PIO buffers avail unit for %u\n", + dd->ipath_unit); + ret = -EBUSY; + goto bail; + } + + plen >>= 2; /* in dwords */ + + if (ipath_debug & __IPATH_PKTDBG) + ipath_cdbg(VERBOSE, "unit %u 0x%x+1w pio%d\n", + dd->ipath_unit, plen - 1, pbufn); + + /* we have to flush after the PBC for correctness on some cpus + * or WC buffer can be written out of order */ + writeq(plen, piobuf); + ipath_flush_wc(); + /* copy all by the trigger word, then flush, so it's written + * to chip before trigger word, then write trigger word, then + * flush again, so packet is sent. */ + __iowrite32_copy(piobuf + 2, tmpbuf, clen - 1); + ipath_flush_wc(); + __raw_writel(tmpbuf[clen - 1], piobuf + clen + 1); + ipath_flush_wc(); + + ret = sizeof(dp); + +bail: + vfree(tmpbuf); + return ret; +} + static int ipath_diag_release(struct inode *in, struct file *fp) { mutex_lock(&ipath_mutex); diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 @@ -1881,7 +1881,17 @@ static int __init infinipath_init(void) goto bail_group; } + ret = ipath_diagpkt_add(); + if (ret < 0) { + printk(KERN_ERR IPATH_DRV_NAME ": Unable to create " + "diag data device: error %d\n", -ret); + goto bail_ipathfs; + } + goto bail; + +bail_ipathfs: + ipath_exit_ipathfs(); bail_group: ipath_driver_remove_group(&ipath_driver.driver); @@ -1993,6 +2003,8 @@ static void __exit infinipath_cleanup(vo struct ipath_devdata *dd, *tmp; unsigned long flags; + ipath_diagpkt_remove(); + ipath_exit_ipathfs(); ipath_driver_remove_group(&ipath_driver.driver); diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 @@ -789,6 +789,9 @@ void ipath_device_remove_group(struct de void ipath_device_remove_group(struct device *, struct ipath_devdata *); int ipath_expose_reset(struct device *); +int ipath_diagpkt_add(void); +void ipath_diagpkt_remove(void); + int ipath_init_ipathfs(void); void ipath_exit_ipathfs(void); int ipathfs_add_device(struct ipath_devdata *); @@ -813,6 +816,7 @@ extern struct mutex ipath_mutex; #define IPATH_DRV_NAME "ib_ipath" #define IPATH_MAJOR 233 #define IPATH_USER_MINOR_BASE 0 +#define IPATH_DIAGPKT_MINOR 127 
#define IPATH_DIAG_MINOR_BASE 129 #define IPATH_NMINORS 255 From bos at pathscale.com Fri Aug 25 11:24:33 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:33 -0700 Subject: [openib-general] [PATCH 8 of 23] IB/ipath - simplify debugging code after ipath_core and ib_ipath merger In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 @@ -58,7 +58,7 @@ const char *ipath_get_unit_name(int unit * The size has to be longer than this string, so we can append * board/chip information to it in the init code. */ -const char ipath_core_version[] = IPATH_IDSTR "\n"; +const char ib_ipath_version[] = IPATH_IDSTR "\n"; static struct idr unit_table; DEFINE_SPINLOCK(ipath_devs_lock); @@ -1847,7 +1847,7 @@ static int __init infinipath_init(void) { int ret; - ipath_dbg(KERN_INFO DRIVER_LOAD_MSG "%s", ipath_core_version); + ipath_dbg(KERN_INFO DRIVER_LOAD_MSG "%s", ib_ipath_version); /* * These must be called before the driver is registered with diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 @@ -785,7 +785,7 @@ static inline u32 ipath_read_creg32(cons struct device_driver; -extern const char ipath_core_version[]; +extern const char ib_ipath_version[]; int ipath_driver_create_group(struct device_driver *); void ipath_driver_remove_group(struct device_driver *); @@ -815,7 +815,7 @@ const char *ipath_get_unit_name(int unit extern struct mutex ipath_mutex; -#define IPATH_DRV_NAME "ipath_core" +#define IPATH_DRV_NAME "ib_ipath" #define IPATH_MAJOR 233 #define IPATH_USER_MINOR_BASE 0 #define IPATH_SMA_MINOR 128 diff --git a/drivers/infiniband/hw/ipath/ipath_keys.c b/drivers/infiniband/hw/ipath/ipath_keys.c --- a/drivers/infiniband/hw/ipath/ipath_keys.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_keys.c Fri Aug 25 11:19:45 2006 -0700 @@ -34,6 +34,7 @@ #include #include "ipath_verbs.h" +#include "ipath_kernel.h" /** * ipath_alloc_lkey - allocate an lkey @@ -60,7 +61,7 @@ int ipath_alloc_lkey(struct ipath_lkey_t r = (r + 1) & (rkt->max - 1); if (r == n) { spin_unlock_irqrestore(&rkt->lock, flags); - _VERBS_INFO("LKEY table full\n"); + ipath_dbg(KERN_INFO "LKEY table full\n"); ret = 0; goto bail; } diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 @@ -274,7 +274,7 @@ void ipath_free_all_qps(struct ipath_qp_ free_qpn(qpt, qp->ibqp.qp_num); if (!atomic_dec_and_test(&qp->refcount) || !ipath_destroy_qp(&qp->ibqp)) - _VERBS_INFO("QP memory leak!\n"); + ipath_dbg(KERN_INFO "QP memory leak!\n"); qp = nqp; } } @@ -362,8 +362,8 @@ void ipath_error_qp(struct ipath_qp *qp) struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ib_wc wc; - _VERBS_INFO("QP%d/%d in error state\n", - qp->ibqp.qp_num, qp->remote_qpn); + ipath_dbg(KERN_INFO "QP%d/%d in error state\n", + qp->ibqp.qp_num, qp->remote_qpn); spin_lock(&dev->pending_lock); /* XXX What if its already removed by the timeout code? 
*/ @@ -945,8 +945,8 @@ void ipath_sqerror_qp(struct ipath_qp *q struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); - _VERBS_INFO("Send queue error on QP%d/%d: err: %d\n", - qp->ibqp.qp_num, qp->remote_qpn, wc->status); + ipath_dbg(KERN_INFO "Send queue error on QP%d/%d: err: %d\n", + qp->ibqp.qp_num, qp->remote_qpn, wc->status); spin_lock(&dev->pending_lock); /* XXX What if its already removed by the timeout code? */ diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Fri Aug 25 11:19:45 2006 -0700 @@ -75,7 +75,7 @@ static ssize_t show_version(struct devic static ssize_t show_version(struct device_driver *dev, char *buf) { /* The string printed here is already newline-terminated. */ - return scnprintf(buf, PAGE_SIZE, "%s", ipath_core_version); + return scnprintf(buf, PAGE_SIZE, "%s", ib_ipath_version); } static ssize_t show_num_units(struct device_driver *dev, char *buf) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 @@ -49,10 +49,6 @@ module_param_named(lkey_table_size, ib_i S_IRUGO); MODULE_PARM_DESC(lkey_table_size, "LKEY table size in bits (2^n, 1 <= n <= 23)"); - -unsigned int ib_ipath_debug; /* debug mask */ -module_param_named(debug, ib_ipath_debug, uint, S_IWUSR | S_IRUGO); -MODULE_PARM_DESC(debug, "Verbs debug mask"); static unsigned int ib_ipath_max_pds = 0xFFFF; module_param_named(max_pds, ib_ipath_max_pds, uint, S_IWUSR | S_IRUGO); @@ -1598,8 +1594,7 @@ err_lk: kfree(idev->qp_table.table); err_qp: ib_dealloc_device(dev); - _VERBS_ERROR("ib_ipath%d cannot register verbs (%d)!\n", - dd->ipath_unit, -ret); + ipath_dev_err(dd, "cannot register verbs: %d!\n", -ret); idev = NULL; bail: @@ -1618,17 +1613,13 @@ void ipath_unregister_ib_device(struct i if (!list_empty(&dev->pending[0]) || !list_empty(&dev->pending[1]) || !list_empty(&dev->pending[2])) - _VERBS_ERROR("ipath%d pending list not empty!\n", - dev->ib_unit); + ipath_dev_err(dev->dd, "pending list not empty!\n"); if (!list_empty(&dev->piowait)) - _VERBS_ERROR("ipath%d piowait list not empty!\n", - dev->ib_unit); + ipath_dev_err(dev->dd, "piowait list not empty!\n"); if (!list_empty(&dev->rnrwait)) - _VERBS_ERROR("ipath%d rnrwait list not empty!\n", - dev->ib_unit); + ipath_dev_err(dev->dd, "rnrwait list not empty!\n"); if (!ipath_mcast_tree_empty()) - _VERBS_ERROR("ipath%d multicast table memory leak!\n", - dev->ib_unit); + ipath_dev_err(dev->dd, "multicast table memory leak!\n"); /* * Note that ipath_unregister_ib_device() can be called before all * the QPs are destroyed! 
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 -0700 @@ -42,7 +42,6 @@ #include #include "ipath_layer.h" -#include "verbs_debug.h" #define QPN_MAX (1 << 24) #define QPNMAP_ENTRIES (QPN_MAX / PAGE_SIZE / BITS_PER_BYTE) diff --git a/drivers/infiniband/hw/ipath/verbs_debug.h b/drivers/infiniband/hw/ipath/verbs_debug.h deleted file mode 100644 --- a/drivers/infiniband/hw/ipath/verbs_debug.h Fri Aug 25 11:19:45 2006 -0700 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,108 +0,0 @@ -/* - * Copyright (c) 2006 QLogic, Inc. All rights reserved. - * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - */ - -#ifndef _VERBS_DEBUG_H -#define _VERBS_DEBUG_H - -/* - * This file contains tracing code for the ib_ipath kernel module. - */ -#ifndef _VERBS_DEBUGGING /* tracing enabled or not */ -#define _VERBS_DEBUGGING 1 -#endif - -extern unsigned ib_ipath_debug; - -#define _VERBS_ERROR(fmt,...) \ - do { \ - printk(KERN_ERR "%s: " fmt, "ib_ipath", ##__VA_ARGS__); \ - } while(0) - -#define _VERBS_UNIT_ERROR(unit,fmt,...) \ - do { \ - printk(KERN_ERR "%s: " fmt, "ib_ipath", ##__VA_ARGS__); \ - } while(0) - -#if _VERBS_DEBUGGING - -/* - * Mask values for debugging. The scheme allows us to compile out any - * of the debug tracing stuff, and if compiled in, to enable or - * disable dynamically. - * This can be set at modprobe time also: - * modprobe ib_path ib_ipath_debug=3 - */ - -#define __VERBS_INFO 0x1 /* generic low verbosity stuff */ -#define __VERBS_DBG 0x2 /* generic debug */ -#define __VERBS_VDBG 0x4 /* verbose debug */ -#define __VERBS_SMADBG 0x8000 /* sma packet debug */ - -#define _VERBS_INFO(fmt,...) \ - do { \ - if (unlikely(ib_ipath_debug&__VERBS_INFO)) \ - printk(KERN_INFO "%s: " fmt,"ib_ipath", \ - ##__VA_ARGS__); \ - } while(0) - -#define _VERBS_DBG(fmt,...) \ - do { \ - if (unlikely(ib_ipath_debug&__VERBS_DBG)) \ - printk(KERN_DEBUG "%s: " fmt, __func__, \ - ##__VA_ARGS__); \ - } while(0) - -#define _VERBS_VDBG(fmt,...) 
\ - do { \ - if (unlikely(ib_ipath_debug&__VERBS_VDBG)) \ - printk(KERN_DEBUG "%s: " fmt, __func__, \ - ##__VA_ARGS__); \ - } while(0) - -#define _VERBS_SMADBG(fmt,...) \ - do { \ - if (unlikely(ib_ipath_debug&__VERBS_SMADBG)) \ - printk(KERN_DEBUG "%s: " fmt, __func__, \ - ##__VA_ARGS__); \ - } while(0) - -#else /* ! _VERBS_DEBUGGING */ - -#define _VERBS_INFO(fmt,...) -#define _VERBS_DBG(fmt,...) -#define _VERBS_VDBG(fmt,...) -#define _VERBS_SMADBG(fmt,...) - -#endif /* _VERBS_DEBUGGING */ - -#endif /* _VERBS_DEBUG_H */ From bos at pathscale.com Fri Aug 25 11:24:34 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:34 -0700 Subject: [openib-general] [PATCH 9 of 23] IB/ipath - remove stale references to userspace SMA In-Reply-To: Message-ID: When we first submitted a userspace subnet management agent, it was rejected, so we left it out of the final driver submission. This patch removes a number of vestigial references to it. Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_common.h b/drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Fri Aug 25 11:19:45 2006 -0700 @@ -106,9 +106,9 @@ struct infinipath_stats { __u64 sps_ether_spkts; /* number of "ethernet" packets received by driver */ __u64 sps_ether_rpkts; - /* number of SMA packets sent by driver */ + /* number of SMA packets sent by driver. Obsolete. */ __u64 sps_sma_spkts; - /* number of SMA packets received by driver */ + /* number of SMA packets received by driver. Obsolete. */ __u64 sps_sma_rpkts; /* number of times all ports rcvhdrq was full and packet dropped */ __u64 sps_hdrqfull; @@ -138,7 +138,7 @@ struct infinipath_stats { __u64 sps_pageunlocks; /* * Number of packets dropped in kernel other than errors (ether - * packets if ipath not configured, sma/mad, etc.) + * packets if ipath not configured, etc.) */ __u64 sps_krdrops; /* pad for future growth */ @@ -153,8 +153,6 @@ struct infinipath_stats { #define IPATH_STATUS_DISABLED 0x2 /* hardware disabled */ /* Device has been disabled via admin request */ #define IPATH_STATUS_ADMIN_DISABLED 0x4 -#define IPATH_STATUS_OIB_SMA 0x8 /* ipath_mad kernel SMA running */ -#define IPATH_STATUS_SMA 0x10 /* user SMA running */ /* Chip has been found and initted */ #define IPATH_STATUS_CHIP_PRESENT 0x20 /* IB link is at ACTIVE, usable for data traffic */ @@ -463,14 +461,6 @@ struct __ipath_sendpkt { __u32 sps_cnt; /* number of entries to use in sps_iov */ /* array of iov's describing packet. TEMPORARY */ struct ipath_iovec sps_iov[4]; -}; - -/* Passed into SMA special file's ->read and ->write methods. 
*/ -struct ipath_sma_pkt -{ - __u32 unit; /* unit on which to send packet */ - __u64 data; /* address of payload in userspace */ - __u32 len; /* length of payload */ }; /* diff --git a/drivers/infiniband/hw/ipath/ipath_debug.h b/drivers/infiniband/hw/ipath/ipath_debug.h --- a/drivers/infiniband/hw/ipath/ipath_debug.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_debug.h Fri Aug 25 11:19:45 2006 -0700 @@ -60,7 +60,6 @@ #define __IPATH_USER_SEND 0x1000 /* use user mode send */ #define __IPATH_KERNEL_SEND 0x2000 /* use kernel mode send */ #define __IPATH_EPKTDBG 0x4000 /* print ethernet packet data */ -#define __IPATH_SMADBG 0x8000 /* sma packet debug */ #define __IPATH_IPATHDBG 0x10000 /* Ethernet (IPATH) gen debug */ #define __IPATH_IPATHWARN 0x20000 /* Ethernet (IPATH) warnings */ #define __IPATH_IPATHERR 0x40000 /* Ethernet (IPATH) errors */ @@ -84,7 +83,6 @@ /* print mmap/nopage stuff, not using VDBG any more */ #define __IPATH_MMDBG 0x0 #define __IPATH_EPKTDBG 0x0 /* print ethernet packet data */ -#define __IPATH_SMADBG 0x0 /* process startup (init)/exit messages */ #define __IPATH_IPATHDBG 0x0 /* Ethernet (IPATH) table dump on */ #define __IPATH_IPATHWARN 0x0 /* Ethernet (IPATH) warnings on */ #define __IPATH_IPATHERR 0x0 /* Ethernet (IPATH) errors on */ diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 @@ -64,7 +64,7 @@ DEFINE_SPINLOCK(ipath_devs_lock); DEFINE_SPINLOCK(ipath_devs_lock); LIST_HEAD(ipath_dev_list); -wait_queue_head_t ipath_sma_state_wait; +wait_queue_head_t ipath_state_wait; unsigned ipath_debug = __IPATH_INFO; @@ -618,15 +618,16 @@ static int ipath_wait_linkstate(struct i static int ipath_wait_linkstate(struct ipath_devdata *dd, u32 state, int msecs) { - dd->ipath_sma_state_wanted = state; - wait_event_interruptible_timeout(ipath_sma_state_wait, + dd->ipath_state_wanted = state; + wait_event_interruptible_timeout(ipath_state_wait, (dd->ipath_flags & state), msecs_to_jiffies(msecs)); - dd->ipath_sma_state_wanted = 0; + dd->ipath_state_wanted = 0; if (!(dd->ipath_flags & state)) { u64 val; - ipath_cdbg(SMA, "Didn't reach linkstate %s within %u ms\n", + ipath_cdbg(VERBOSE, "Didn't reach linkstate %s within %u" + " ms\n", /* test INIT ahead of DOWN, both can be set */ (state & IPATH_LINKINIT) ? "INIT" : ((state & IPATH_LINKDOWN) ? "DOWN" : @@ -1155,7 +1156,7 @@ int ipath_setrcvhdrsize(struct ipath_dev * * do appropriate marking as busy, etc. * returns buffer number if one found (>=0), negative number is error. 
- * Used by ipath_sma_send_pkt and ipath_layer_send + * Used by ipath_layer_send */ u32 __iomem *ipath_getpiobuf(struct ipath_devdata *dd, u32 * pbufnum) { @@ -1448,7 +1449,7 @@ static void ipath_set_ib_lstate(struct i int linkcmd = (which >> INFINIPATH_IBCC_LINKCMD_SHIFT) & INFINIPATH_IBCC_LINKCMD_MASK; - ipath_cdbg(SMA, "Trying to move unit %u to %s, current ltstate " + ipath_cdbg(VERBOSE, "Trying to move unit %u to %s, current ltstate " "is %s\n", dd->ipath_unit, what[linkcmd], ipath_ibcstatus_str[ @@ -1457,7 +1458,7 @@ static void ipath_set_ib_lstate(struct i INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK]); /* flush all queued sends when going to DOWN or INIT, to be sure that - * they don't block SMA and other MAD packets */ + * they don't block MAD packets */ if (!linkcmd || linkcmd == INFINIPATH_IBCC_LINKCMD_INIT) { ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, INFINIPATH_S_ABORT); diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:45 2006 -0700 @@ -1816,7 +1816,7 @@ int ipath_user_add(struct ipath_devdata if (ret < 0) { ipath_dev_err(dd, "Could not create wildcard " "minor: error %d\n", -ret); - goto bail_sma; + goto bail_user; } atomic_set(&user_setup, 1); @@ -1832,7 +1832,7 @@ int ipath_user_add(struct ipath_devdata goto bail; -bail_sma: +bail_user: user_cleanup(); bail: return ret; diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c --- a/drivers/infiniband/hw/ipath/ipath_fs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_fs.c Fri Aug 25 11:19:45 2006 -0700 @@ -191,8 +191,8 @@ static ssize_t atomic_port_info_read(str portinfo[4] = (dd->ipath_lid << 16); /* - * Notimpl yet SMLID (should we store this in the driver, in case - * SMA dies?) CapabilityMask is 0, we don't support any of these + * Notimpl yet SMLID. + * CapabilityMask is 0, we don't support any of these * DiagCode is 0; we don't store any diag info for now Notimpl yet * M_KeyLeasePeriod (we don't support M_Key) */ diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Fri Aug 25 11:19:45 2006 -0700 @@ -53,8 +53,8 @@ MODULE_PARM_DESC(cfgports, "Set max numb MODULE_PARM_DESC(cfgports, "Set max number of ports to use"); /* - * Number of buffers reserved for driver (layered drivers and SMA - * send). Reserved at end of buffer list. Initialized based on + * Number of buffers reserved for driver (verbs and layered drivers.) + * Reserved at end of buffer list. Initialized based on * number of PIO buffers if not set via module interface. * The problem with this is that it's global, but we'll use different * numbers for different chip types. So the default value is not @@ -80,7 +80,7 @@ MODULE_PARM_DESC(kpiobufs, "Set number o * * Allocate the eager TID buffers and program them into infinipath. * We use the network layer alloc_skb() allocator to allocate the - * memory, and either use the buffers as is for things like SMA + * memory, and either use the buffers as is for things like verbs * packets, or pass the buffers up to the ipath layered driver and * thence the network layer, replacing them as we do so (see * ipath_rcv_layer()). 
@@ -450,9 +450,9 @@ static void enable_chip(struct ipath_dev u32 val; int i; - if (!reinit) { - init_waitqueue_head(&ipath_sma_state_wait); - } + if (!reinit) + init_waitqueue_head(&ipath_state_wait); + ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, dd->ipath_rcvctrl); diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Fri Aug 25 11:19:45 2006 -0700 @@ -201,7 +201,7 @@ static void handle_e_ibstatuschanged(str ib_linkstate(lstate)); } else - ipath_cdbg(SMA, "Unit %u link state %s, last " + ipath_cdbg(VERBOSE, "Unit %u link state %s, last " "was %s\n", dd->ipath_unit, ib_linkstate(lstate), ib_linkstate((unsigned) @@ -213,7 +213,7 @@ static void handle_e_ibstatuschanged(str if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM || lstate == IPATH_IBSTATE_ACTIVE) - ipath_cdbg(SMA, "Unit %u link state down" + ipath_cdbg(VERBOSE, "Unit %u link state down" " (state 0x%x), from %s\n", dd->ipath_unit, (u32)val & IPATH_IBSTATE_MASK, @@ -269,7 +269,7 @@ static void handle_e_ibstatuschanged(str INFINIPATH_IBCS_LINKSTATE_MASK) == INFINIPATH_IBCS_L_STATE_ACTIVE) /* if from up to down be more vocal */ - ipath_cdbg(SMA, + ipath_cdbg(VERBOSE, "Unit %u link now down (%s)\n", dd->ipath_unit, ipath_ibcstatus_str[ltstate]); @@ -596,11 +596,11 @@ static int handle_errors(struct ipath_de if (!noprint && *msg) ipath_dev_err(dd, "%s error\n", msg); - if (dd->ipath_sma_state_wanted & dd->ipath_flags) { - ipath_cdbg(VERBOSE, "sma wanted state %x, iflags now %x, " - "waking\n", dd->ipath_sma_state_wanted, + if (dd->ipath_state_wanted & dd->ipath_flags) { + ipath_cdbg(VERBOSE, "driver wanted state %x, iflags now %x, " + "waking\n", dd->ipath_state_wanted, dd->ipath_flags); - wake_up_interruptible(&ipath_sma_state_wait); + wake_up_interruptible(&ipath_state_wait); } return chkerrpkts; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 @@ -245,8 +245,8 @@ struct ipath_devdata { u32 ipath_pioavregs; /* IPATH_POLL, etc. */ u32 ipath_flags; - /* ipath_flags sma is waiting for */ - u32 ipath_sma_state_wanted; + /* ipath_flags driver is waiting for */ + u32 ipath_state_wanted; /* last buffer for user use, first buf for kernel use is this * index. 
*/ u32 ipath_lastport_piobuf; @@ -306,10 +306,6 @@ struct ipath_devdata { u32 ipath_pcibar0; /* so we can rewrite it after a chip reset */ u32 ipath_pcibar1; - /* sequential tries for SMA send and no bufs */ - u32 ipath_nosma_bufs; - /* duration (seconds) ipath_nosma_bufs set */ - u32 ipath_nosma_secs; /* HT/PCI Vendor ID (here for NodeInfo) */ u16 ipath_vendorid; @@ -534,7 +530,7 @@ void ipath_diag_remove(struct ipath_devd void ipath_diag_remove(struct ipath_devdata *); void ipath_diag_bringup_link(struct ipath_devdata *); -extern wait_queue_head_t ipath_sma_state_wait; +extern wait_queue_head_t ipath_state_wait; int ipath_user_add(struct ipath_devdata *dd); void ipath_user_remove(struct ipath_devdata *dd); @@ -818,7 +814,6 @@ extern struct mutex ipath_mutex; #define IPATH_DRV_NAME "ib_ipath" #define IPATH_MAJOR 233 #define IPATH_USER_MINOR_BASE 0 -#define IPATH_SMA_MINOR 128 #define IPATH_DIAG_MINOR_BASE 129 #define IPATH_NMINORS 255 diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Fri Aug 25 11:19:45 2006 -0700 @@ -161,9 +161,6 @@ int ipath_layer_register(void *(*l_add)( if (dd->ipath_layer.l_arg) continue; - - if (!(*dd->ipath_statusp & IPATH_STATUS_SMA)) - *dd->ipath_statusp |= IPATH_STATUS_OIB_SMA; spin_unlock_irqrestore(&ipath_devs_lock, flags); dd->ipath_layer.l_arg = l_add(dd->ipath_unit, dd); diff --git a/drivers/infiniband/hw/ipath/ipath_layer.h b/drivers/infiniband/hw/ipath/ipath_layer.h --- a/drivers/infiniband/hw/ipath/ipath_layer.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.h Fri Aug 25 11:19:45 2006 -0700 @@ -66,9 +66,6 @@ int ipath_layer_set_piointbufavail_int(s #define IPATH_LAYER_INT_SEND_CONTINUE 0x10 #define IPATH_LAYER_INT_BCAST 0x40 -/* _verbs_layer.l_flags */ -#define IPATH_VERBS_KERNEL_SMA 0x1 - extern unsigned ipath_debug; /* debugging bit mask */ #endif /* _IPATH_LAYER_H */ diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 @@ -645,33 +645,6 @@ __be32 ipath_compute_aeth(struct ipath_q } /** - * set_verbs_flags - set the verbs layer flags - * @dd: the infinipath device - * @flags: the flags to set - */ -static int set_verbs_flags(struct ipath_devdata *dd, unsigned flags) -{ - struct ipath_devdata *ss; - unsigned long lflags; - - spin_lock_irqsave(&ipath_devs_lock, lflags); - - list_for_each_entry(ss, &ipath_dev_list, ipath_list) { - if (!(ss->ipath_flags & IPATH_INITTED)) - continue; - if ((flags & IPATH_VERBS_KERNEL_SMA) && - !(*ss->ipath_statusp & IPATH_STATUS_SMA)) - *ss->ipath_statusp |= IPATH_STATUS_OIB_SMA; - else - *ss->ipath_statusp &= ~IPATH_STATUS_OIB_SMA; - } - - spin_unlock_irqrestore(&ipath_devs_lock, lflags); - - return 0; -} - -/** * ipath_create_qp - create a queue pair for a device * @ibpd: the protection domain who's device we create the queue pair for * @init_attr: the attributes of the queue pair @@ -784,10 +757,6 @@ struct ib_qp *ipath_create_qp(struct ib_ } qp->ip = NULL; ipath_reset_qp(qp); - - /* Tell the core driver that the kernel SMA is present. 
*/ - if (init_attr->qp_type == IB_QPT_SMI) - set_verbs_flags(dev->dd, IPATH_VERBS_KERNEL_SMA); break; default: @@ -862,10 +831,6 @@ int ipath_destroy_qp(struct ib_qp *ibqp) struct ipath_ibdev *dev = to_idev(ibqp->device); unsigned long flags; - /* Tell the core driver that the kernel SMA is gone. */ - if (qp->ibqp.qp_type == IB_QPT_SMI) - set_verbs_flags(dev->dd, 0); - spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; spin_unlock_irqrestore(&qp->s_lock, flags); diff --git a/drivers/infiniband/hw/ipath/ipath_stats.c b/drivers/infiniband/hw/ipath/ipath_stats.c --- a/drivers/infiniband/hw/ipath/ipath_stats.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_stats.c Fri Aug 25 11:19:45 2006 -0700 @@ -271,33 +271,6 @@ void ipath_get_faststats(unsigned long o } } - if (dd->ipath_nosma_bufs) { - dd->ipath_nosma_secs += 5; - if (dd->ipath_nosma_secs >= 30) { - ipath_cdbg(SMA, "No SMA bufs avail %u seconds; " - "cancelling pending sends\n", - dd->ipath_nosma_secs); - /* - * issue an abort as well, in case we have a packet - * stuck in launch fifo. This could corrupt an - * outgoing user packet in the worst case, - * but this is a pretty catastrophic, anyway. - */ - ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - INFINIPATH_S_ABORT); - ipath_disarm_piobufs(dd, dd->ipath_lastport_piobuf, - dd->ipath_piobcnt2k + - dd->ipath_piobcnt4k - - dd->ipath_lastport_piobuf); - /* start again, if necessary */ - dd->ipath_nosma_secs = 0; - } else - ipath_cdbg(SMA, "No SMA bufs avail %u tries, " - "after %u seconds\n", - dd->ipath_nosma_bufs, - dd->ipath_nosma_secs); - } - done: mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5); } diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Fri Aug 25 11:19:45 2006 -0700 @@ -107,8 +107,8 @@ static const char *ipath_status_str[] = "Initted", "Disabled", "Admin_Disabled", - "OIB_SMA", - "SMA", + "", /* This used to be the old "OIB_SMA" status. */ + "", /* This used to be the old "SMA" status. */ "Present", "IB_link_up", "IB_configured", diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 @@ -1573,7 +1573,7 @@ int ipath_register_ib_device(struct ipat dev->mmap = ipath_mmap; snprintf(dev->node_desc, sizeof(dev->node_desc), - IPATH_IDSTR " %s kernel_SMA", system_utsname.nodename); + IPATH_IDSTR " %s", system_utsname.nodename); ret = ib_register_device(dev); if (ret) From bos at pathscale.com Fri Aug 25 11:24:39 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:39 -0700 Subject: [openib-general] [PATCH 14 of 23] IB/ipath - support new QLogic product naming scheme In-Reply-To: Message-ID: This patch only renames files, fixes product names, and updates comments. 
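For reference, a summary of the renames visible in the diff below (not an exhaustive list):

  ipath_ht400.c               -> ipath_iba6110.c   (HyperTransport chip)
  ipath_pe800.c               -> ipath_iba6120.c   (PCI Express chip)
  ipath_init_ht400_funcs()    -> ipath_init_iba6110_funcs()
  ipath_init_pe800_funcs()    -> ipath_init_iba6120_funcs()
  InfiniPath_HT-460 / HT-465  -> InfiniPath_QHT7040 / QHT7140
  InfiniPath_PE-880/-850/-860 -> InfiniPath_QLE7140 / QMI7140 / QEM7140

plus a QLE7140-Bringup name for the old PE-800-Bringup board and a new InfiniPath_QMH7140 board entry.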
Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:45 2006 -0700 @@ -10,7 +10,8 @@ ib_ipath-y := \ ipath_eeprom.o \ ipath_file_ops.o \ ipath_fs.o \ - ipath_ht400.o \ + ipath_iba6110.o \ + ipath_iba6120.o \ ipath_init_chip.o \ ipath_intr.o \ ipath_keys.o \ @@ -18,7 +19,6 @@ ib_ipath-y := \ ipath_mad.o \ ipath_mmap.o \ ipath_mr.o \ - ipath_pe800.o \ ipath_qp.o \ ipath_rc.o \ ipath_ruc.o \ diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 @@ -401,10 +401,10 @@ static int __devinit ipath_init_one(stru /* setup the chip-specific functions, as early as possible. */ switch (ent->device) { case PCI_DEVICE_ID_INFINIPATH_HT: - ipath_init_ht400_funcs(dd); + ipath_init_iba6110_funcs(dd); break; case PCI_DEVICE_ID_INFINIPATH_PE800: - ipath_init_pe800_funcs(dd); + ipath_init_iba6120_funcs(dd); break; default: ipath_dev_err(dd, "Found unknown QLogic deviceid 0x%x, " @@ -969,7 +969,8 @@ reloop: */ if (l == hdrqtail || (i && !(i&0xf))) { u64 lval; - if (l == hdrqtail) /* PE-800 interrupt only on last */ + if (l == hdrqtail) + /* request IBA6120 interrupt only on last */ lval = dd->ipath_rhdrhead_intr_off | l; else lval = l; @@ -983,7 +984,7 @@ reloop: } if (!dd->ipath_rhdrhead_intr_off && !reloop) { - /* HT-400 workaround; we can have a race clearing chip + /* IBA6110 workaround; we can have a race clearing chip * interrupt with another interrupt about to be delivered, * and can clear it before it is delivered on the GPIO * workaround. By doing the extra check here for the diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:45 2006 -0700 @@ -1110,7 +1110,7 @@ static int ipath_mmap(struct file *fp, s ret = mmap_rcvegrbufs(vma, pd); else if (pgaddr == (u64) pd->port_rcvhdrq_phys) { /* - * The rcvhdrq itself; readonly except on HT-400 (so have + * The rcvhdrq itself; readonly except on HT (so have * to allow writable mapping), multiple pages, contiguous * from an i/o perspective. */ @@ -1298,14 +1298,14 @@ static int find_best_unit(struct file *f * This code is present to allow a knowledgeable person to * specify the layout of processes to processors before opening * this driver, and then we'll assign the process to the "closest" - * HT-400 to that processor (we assume reasonable connectivity, + * InfiniPath chip to that processor (we assume reasonable connectivity, * for now). This code assumes that if affinity has been set * before this point, that at most one cpu is set; for now this * is reasonable. I check for both cpus_empty() and cpus_full(), * in case some kernel variant sets none of the bits when no * affinity is set. 2.6.11 and 12 kernels have all present * cpus set. Some day we'll have to fix it up further to handle - * a cpu subset. This algorithm fails for two HT-400's connected + * a cpu subset. This algorithm fails for two HT chips connected * in tunnel fashion. Eventually this needs real topology * information. There may be some issues with dual core numbering * as well. 
This needs more work prior to release. diff --git a/drivers/infiniband/hw/ipath/ipath_ht400.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c rename from drivers/infiniband/hw/ipath/ipath_ht400.c rename to drivers/infiniband/hw/ipath/ipath_iba6110.c --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c Fri Aug 25 11:19:45 2006 -0700 @@ -33,7 +33,7 @@ /* * This file contains all of the code that is specific to the InfiniPath - * HT-400 chip. + * HT chip. */ #include @@ -43,7 +43,7 @@ #include "ipath_registers.h" /* - * This lists the InfiniPath HT400 registers, in the actual chip layout. + * This lists the InfiniPath registers, in the actual chip layout. * This structure should never be directly accessed. * * The names are in InterCap form because they're taken straight from @@ -537,7 +537,7 @@ static void ipath_ht_handle_hwerrors(str if (hwerrs & INFINIPATH_HWE_HTCMISCERR7) strlcat(msg, "[HT core Misc7]", msgl); if (hwerrs & INFINIPATH_HWE_MEMBISTFAILED) { - strlcat(msg, "[Memory BIST test failed, HT-400 unusable]", + strlcat(msg, "[Memory BIST test failed, InfiniPath hardware unusable]", msgl); /* ignore from now on, so disable until driver reloaded */ dd->ipath_hwerrmask &= ~INFINIPATH_HWE_MEMBISTFAILED; @@ -553,7 +553,7 @@ static void ipath_ht_handle_hwerrors(str if (hwerrs & _IPATH_PLL_FAIL) { snprintf(bitsmsg, sizeof bitsmsg, - "[PLL failed (%llx), HT-400 unusable]", + "[PLL failed (%llx), InfiniPath hardware unusable]", (unsigned long long) (hwerrs & _IPATH_PLL_FAIL)); strlcat(msg, bitsmsg, msgl); /* ignore from now on, so disable until driver reloaded */ @@ -610,18 +610,18 @@ static int ipath_ht_boardname(struct ipa break; case 5: /* - * HT-460 original production board; two production levels, with + * original production board; two production levels, with * different serial number ranges. See ipath_ht_early_init() for * case where we enable IPATH_GPIO_INTR for later serial # range. 
*/ - n = "InfiniPath_HT-460"; + n = "InfiniPath_QHT7040"; break; case 6: n = "OEM_Board_3"; break; case 7: - /* HT-460 small form factor production board */ - n = "InfiniPath_HT-465"; + /* small form factor production board */ + n = "InfiniPath_QHT7140"; break; case 8: n = "LS/X-1"; @@ -633,7 +633,7 @@ static int ipath_ht_boardname(struct ipa n = "OEM_Board_2"; break; case 11: - n = "InfiniPath_HT-470"; + n = "InfiniPath_HT-470"; /* obsoleted */ break; case 12: n = "OEM_Board_4"; @@ -641,7 +641,7 @@ static int ipath_ht_boardname(struct ipa default: /* don't know, just print the number */ ipath_dev_err(dd, "Don't yet know about board " "with ID %u\n", boardrev); - snprintf(name, namelen, "Unknown_InfiniPath_HT-4xx_%u", + snprintf(name, namelen, "Unknown_InfiniPath_QHT7xxx_%u", boardrev); break; } @@ -650,11 +650,10 @@ static int ipath_ht_boardname(struct ipa if (dd->ipath_majrev != 3 || (dd->ipath_minrev < 2 || dd->ipath_minrev > 3)) { /* - * This version of the driver only supports the HT-400 - * Rev 3.2 + * This version of the driver only supports Rev 3.2 and 3.3 */ ipath_dev_err(dd, - "Unsupported HT-400 revision %u.%u!\n", + "Unsupported InfiniPath hardware revision %u.%u!\n", dd->ipath_majrev, dd->ipath_minrev); ret = 1; goto bail; @@ -738,7 +737,7 @@ static void ipath_check_htlink(struct ip static int ipath_setup_ht_reset(struct ipath_devdata *dd) { - ipath_dbg("No reset possible for HT-400\n"); + ipath_dbg("No reset possible for this InfiniPath hardware\n"); return 0; } @@ -925,7 +924,7 @@ static int set_int_handler(struct ipath_ /* * kernels with CONFIG_PCI_MSI set the vector in the irq field of - * struct pci_device, so we use that to program the HT-400 internal + * struct pci_device, so we use that to program the internal * interrupt register (not config space) with that value. The BIOS * must still have done the basic MSI setup. */ @@ -1013,7 +1012,7 @@ bail: * @dd: the infinipath device * * Called during driver unload. - * This is currently a nop for the HT-400, not for all chips + * This is currently a nop for the HT chip, not for all chips */ static void ipath_setup_ht_cleanup(struct ipath_devdata *dd) { @@ -1470,7 +1469,7 @@ static int ipath_ht_early_init(struct ip dd->ipath_rcvhdrsize = IPATH_DFLT_RCVHDRSIZE; /* - * For HT-400, we allocate a somewhat overly large eager buffer, + * For HT, we allocate a somewhat overly large eager buffer, * such that we can guarantee that we can receive the largest * packet that we can send out. To truly support a 4KB MTU, * we need to bump this to a large value. To date, other than @@ -1531,7 +1530,7 @@ static int ipath_ht_early_init(struct ip if(dd->ipath_boardrev == 5 && dd->ipath_serial[0] == '1' && dd->ipath_serial[1] == '2' && dd->ipath_serial[2] == '8') { /* - * Later production HT-460 has same changes as HT-465, so + * Later production QHT7040 has same changes as QHT7140, so * can use GPIO interrupts. They have serial #'s starting * with 128, rather than 112. */ @@ -1560,13 +1559,13 @@ static int ipath_ht_get_base_info(struct } /** - * ipath_init_ht400_funcs - set up the chip-specific function pointers + * ipath_init_iba6110_funcs - set up the chip-specific function pointers * @dd: the infinipath device * * This is global, and is called directly at init to set up the * chip-specific function pointers for later use. 
*/ -void ipath_init_ht400_funcs(struct ipath_devdata *dd) +void ipath_init_iba6110_funcs(struct ipath_devdata *dd) { dd->ipath_f_intrsetup = ipath_ht_intconfig; dd->ipath_f_bus = ipath_setup_ht_config; diff --git a/drivers/infiniband/hw/ipath/ipath_pe800.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c rename from drivers/infiniband/hw/ipath/ipath_pe800.c rename to drivers/infiniband/hw/ipath/ipath_iba6120.c --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c Fri Aug 25 11:19:45 2006 -0700 @@ -32,7 +32,7 @@ */ /* * This file contains all of the code that is specific to the - * InfiniPath PE-800 chip. + * InfiniPath PCIe chip. */ #include @@ -45,9 +45,9 @@ /* * This file contains all the chip-specific register information and - * access functions for the QLogic InfiniPath PE800, the PCI-Express chip. - * - * This lists the InfiniPath PE800 registers, in the actual chip layout. + * access functions for the QLogic InfiniPath PCI-Express chip. + * + * This lists the InfiniPath registers, in the actual chip layout. * This structure should never be directly accessed. */ struct _infinipath_do_not_use_kernel_regs { @@ -213,7 +213,6 @@ static const struct ipath_kregs ipath_pe .kr_rcvhdraddr = IPATH_KREG_OFFSET(RcvHdrAddr0), .kr_rcvhdrtailaddr = IPATH_KREG_OFFSET(RcvHdrTailAddr0), - /* This group is pe-800-specific; and used only in this file */ /* The rcvpktled register controls one of the debug port signals, so * a packet activity LED can be connected to it. */ .kr_rcvpktledcnt = IPATH_KREG_OFFSET(RcvPktLEDCnt), @@ -388,7 +387,7 @@ static void ipath_pe_handle_hwerrors(str *msg = '\0'; if (hwerrs & INFINIPATH_HWE_MEMBISTFAILED) { - strlcat(msg, "[Memory BIST test failed, PE-800 unusable]", + strlcat(msg, "[Memory BIST test failed, InfiniPath hardware unusable]", msgl); /* ignore from now on, so disable until driver reloaded */ *dd->ipath_statusp |= IPATH_STATUS_HWERROR; @@ -433,7 +432,7 @@ static void ipath_pe_handle_hwerrors(str if (hwerrs & _IPATH_PLL_FAIL) { snprintf(bitsmsg, sizeof bitsmsg, - "[PLL failed (%llx), PE-800 unusable]", + "[PLL failed (%llx), InfiniPath hardware unusable]", (unsigned long long) hwerrs & _IPATH_PLL_FAIL); strlcat(msg, bitsmsg, msgl); /* ignore from now on, so disable until driver reloaded */ @@ -511,22 +510,25 @@ static int ipath_pe_boardname(struct ipa n = "InfiniPath_Emulation"; break; case 1: - n = "InfiniPath_PE-800-Bringup"; + n = "InfiniPath_QLE7140-Bringup"; break; case 2: - n = "InfiniPath_PE-880"; + n = "InfiniPath_QLE7140"; break; case 3: - n = "InfiniPath_PE-850"; + n = "InfiniPath_QMI7140"; break; case 4: - n = "InfiniPath_PE-860"; + n = "InfiniPath_QEM7140"; + break; + case 5: + n = "InfiniPath_QMH7140"; break; default: ipath_dev_err(dd, "Don't yet know about board with ID %u\n", boardrev); - snprintf(name, namelen, "Unknown_InfiniPath_PE-8xx_%u", + snprintf(name, namelen, "Unknown_InfiniPath_PCIe_%u", boardrev); break; } @@ -534,7 +536,7 @@ static int ipath_pe_boardname(struct ipa snprintf(name, namelen, "%s", n); if (dd->ipath_majrev != 4 || !dd->ipath_minrev || dd->ipath_minrev>2) { - ipath_dev_err(dd, "Unsupported PE-800 revision %u.%u!\n", + ipath_dev_err(dd, "Unsupported InfiniPath hardware revision %u.%u!\n", dd->ipath_majrev, dd->ipath_minrev); ret = 1; } else @@ -705,7 +707,7 @@ static void ipath_pe_quiet_serdes(struct ipath_write_kreg(dd, dd->ipath_kregs->kr_serdesconfig0, val); } -/* this is not yet needed on the PE800, so just return 0. 
*/ +/* this is not yet needed on this chip, so just return 0. */ static int ipath_pe_intconfig(struct ipath_devdata *dd) { return 0; @@ -759,8 +761,8 @@ static void ipath_setup_pe_setextled(str * * This is called during driver unload. * We do the pci_disable_msi here, not in generic code, because it - * isn't used for the HT-400. If we do end up needing pci_enable_msi - * at some point in the future for HT-400, we'll move the call back + * isn't used for the HT chips. If we do end up needing pci_enable_msi + * at some point in the future for HT, we'll move the call back * into the main init_one code. */ static void ipath_setup_pe_cleanup(struct ipath_devdata *dd) @@ -780,10 +782,10 @@ static void ipath_setup_pe_cleanup(struc * late in 2.6.16). * All that can be done is to edit the kernel source to remove the quirk * check until that is fixed. - * We do not need to call enable_msi() for our HyperTransport chip (HT-400), - * even those it uses MSI, and we want to avoid the quirk warning, so - * So we call enable_msi only for the PE-800. If we do end up needing - * pci_enable_msi at some point in the future for HT-400, we'll move the + * We do not need to call enable_msi() for our HyperTransport chip, + * even though it uses MSI, and we want to avoid the quirk warning, so + * So we call enable_msi only for PCIe. If we do end up needing + * pci_enable_msi at some point in the future for HT, we'll move the * call back into the main init_one code. * We save the msi lo and hi values, so we can restore them after * chip reset (the kernel PCI infrastructure doesn't yet handle that @@ -971,8 +973,7 @@ static int ipath_setup_pe_reset(struct i int ret; /* Use ERROR so it shows up in logs, etc. */ - ipath_dev_err(dd, "Resetting PE-800 unit %u\n", - dd->ipath_unit); + ipath_dev_err(dd, "Resetting InfiniPath unit %u\n", dd->ipath_unit); /* keep chip from being accessed in a few places */ dd->ipath_flags &= ~(IPATH_INITTED|IPATH_PRESENT); val = dd->ipath_control | INFINIPATH_C_RESET; @@ -1078,7 +1079,7 @@ static void ipath_pe_put_tid(struct ipat * @port: the port * * clear all TID entries for a port, expected and eager. - * Used from ipath_close(). On PE800, TIDs are only 32 bits, + * Used from ipath_close(). On this chip, TIDs are only 32 bits, * not 64, but they are still on 64 bit boundaries, so tidbase * is declared as u64 * for the pointer math, even though we write 32 bits */ @@ -1148,9 +1149,9 @@ static int ipath_pe_early_init(struct ip dd->ipath_flags |= IPATH_4BYTE_TID; /* - * For openib, we need to be able to handle an IB header of 96 bytes - * or 24 dwords. HT-400 has arbitrary sized receive buffers, so we - * made them the same size as the PIO buffers. The PE-800 does not + * For openfabrics, we need to be able to handle an IB header of + * 24 dwords. HT chip has arbitrary sized receive buffers, so we + * made them the same size as the PIO buffers. This chip does not * handle arbitrary size buffers, so we need the header large enough * to handle largest IB header, but still have room for a 2KB MTU * standard IB packet. @@ -1158,11 +1159,10 @@ static int ipath_pe_early_init(struct ip dd->ipath_rcvhdrentsize = 24; dd->ipath_rcvhdrsize = IPATH_DFLT_RCVHDRSIZE; - /* For HT-400, we allocate a somewhat overly large eager buffer, - * such that we can guarantee that we can receive the largest packet - * that we can send out. To truly support a 4KB MTU, we need to - * bump this to a larger value. We'll do this when I get around to - * testing 4KB sends on the PE-800, which I have not yet done. 
+ /* + * To truly support a 4KB MTU (for usermode), we need to + * bump this to a larger value. For now, we use them for + * the kernel only. */ dd->ipath_rcvegrbufsize = 2048; /* @@ -1175,9 +1175,9 @@ static int ipath_pe_early_init(struct ip dd->ipath_init_ibmaxlen = dd->ipath_ibmaxlen; /* - * For PE-800, we can request a receive interrupt for 1 or + * We can request a receive interrupt for 1 or * more packets from current offset. For now, we set this - * up for a single packet, to match the HT-400 behavior. + * up for a single packet. */ dd->ipath_rhdrhead_intr_off = 1ULL<<32; @@ -1216,13 +1216,13 @@ static int ipath_pe_get_base_info(struct } /** - * ipath_init_pe800_funcs - set up the chip-specific function pointers + * ipath_init_iba6120_funcs - set up the chip-specific function pointers * @dd: the infinipath device * * This is global, and is called directly at init to set up the * chip-specific function pointers for later use. */ -void ipath_init_pe800_funcs(struct ipath_devdata *dd) +void ipath_init_iba6120_funcs(struct ipath_devdata *dd) { dd->ipath_f_intrsetup = ipath_pe_intconfig; dd->ipath_f_bus = ipath_setup_pe_config; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 @@ -236,7 +236,7 @@ struct ipath_devdata { u64 ipath_tidtemplate; /* value to write to free TIDs */ u64 ipath_tidinvalid; - /* PE-800 rcv interrupt setup */ + /* IBA6120 rcv interrupt setup */ u64 ipath_rhdrhead_intr_off; /* size of memory at ipath_kregbase */ @@ -621,10 +621,8 @@ int ipath_waitfor_mdio_cmdready(struct i int ipath_waitfor_mdio_cmdready(struct ipath_devdata *); int ipath_waitfor_complete(struct ipath_devdata *, ipath_kreg, u64, u64 *); u32 __iomem *ipath_getpiobuf(struct ipath_devdata *, u32 *); -/* init PE-800-specific func */ -void ipath_init_pe800_funcs(struct ipath_devdata *); -/* init HT-400-specific func */ -void ipath_init_ht400_funcs(struct ipath_devdata *); +void ipath_init_iba6120_funcs(struct ipath_devdata *); +void ipath_init_iba6110_funcs(struct ipath_devdata *); void ipath_get_eeprom_info(struct ipath_devdata *); u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg); diff --git a/drivers/infiniband/hw/ipath/ipath_registers.h b/drivers/infiniband/hw/ipath/ipath_registers.h --- a/drivers/infiniband/hw/ipath/ipath_registers.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_registers.h Fri Aug 25 11:19:45 2006 -0700 @@ -36,8 +36,7 @@ /* * This file should only be included by kernel source, and by the diags. It - * defines the registers, and their contents, for the InfiniPath HT-400 - * chip. + * defines the registers, and their contents, for InfiniPath chips. */ /* @@ -286,7 +285,7 @@ #define INFINIPATH_RT_ADDR_MASK 0xFFFFFFFFFFULL /* 40 bits valid */ -/* TID entries (memory), HT400-only */ +/* TID entries (memory), HT-only */ #define INFINIPATH_RT_VALID 0x8000000000000000ULL #define INFINIPATH_RT_ADDR_SHIFT 0 #define INFINIPATH_RT_BUFSIZE_MASK 0x3FFF From bos at pathscale.com Fri Aug 25 11:24:31 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:31 -0700 Subject: [openib-general] [PATCH 6 of 23] IB/ipath - merge ipath_core and ib_ipath drivers In-Reply-To: Message-ID: There is little point in keeping the two drivers separate, so we are merging them. 
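In rough outline, the merge removes the indirection that let ipath_core run without the verbs module: instead of ib_ipath registering a set of callbacks with ipath_core at load time, the single merged module calls the verbs entry points directly. A condensed sketch of the interface change, pieced together from the prototypes in the diff below (simplified and declarations-only, not the literal driver code):

#include <linux/types.h>	/* u32 */

struct ipath_devdata;		/* low-level per-chip state */
struct ipath_ibdev;		/* verbs-layer device state */

/* Before: ib_ipath handed ipath_core a set of callbacks, and ipath_core
 * dispatched through the saved pointers, e.g.
 * __ipath_verbs_rcv() -> verbs_rcv(dd->verbs_layer.l_arg, rc, ebuf, tlen). */
int ipath_verbs_register(void *(*l_add)(int, struct ipath_devdata *),
			 void (*l_remove)(void *arg),
			 int (*l_piobufavail)(void *arg),
			 void (*l_rcv)(void *arg, void *rhdr,
				       void *data, u32 tlen),
			 void (*l_timer_cb)(void *arg));

/* After: the merged ib_ipath module registers the IB device itself
 * (ipath_register_ib_device() / ipath_unregister_ib_device()) and calls
 * the verbs receive, PIO-buffer-available and timer hooks directly: */
void ipath_ib_rcv(struct ipath_ibdev *dev, void *rhdr, void *data, u32 tlen);
int ipath_ib_piobufavail(struct ipath_ibdev *dev);
void ipath_ib_timer(struct ipath_ibdev *dev);

Since there is no longer a separate module to export to, the EXPORT_SYMBOL_GPL() wrappers on the layer accessors can also go away, as the diff shows.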
Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile --- a/drivers/infiniband/Makefile Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/Makefile Fri Aug 25 11:19:45 2006 -0700 @@ -1,6 +1,6 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ -obj-$(CONFIG_IPATH_CORE) += hw/ipath/ +obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ diff --git a/drivers/infiniband/hw/ipath/Kconfig b/drivers/infiniband/hw/ipath/Kconfig --- a/drivers/infiniband/hw/ipath/Kconfig Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Kconfig Fri Aug 25 11:19:45 2006 -0700 @@ -1,16 +1,9 @@ config IPATH_CORE -config IPATH_CORE +config INFINIBAND_IPATH tristate "QLogic InfiniPath Driver" - depends on 64BIT && PCI_MSI && NET + depends on PCI_MSI && 64BIT && INFINIBAND ---help--- - This is a low-level driver for QLogic InfiniPath host channel - adapters (HCAs) based on the HT-400 and PE-800 chips. - -config INFINIBAND_IPATH - tristate "QLogic InfiniPath Verbs Driver" - depends on IPATH_CORE && INFINIBAND - ---help--- - This is a driver that provides InfiniBand verbs support for - QLogic InfiniPath host channel adapters (HCAs). This - allows these devices to be used with both kernel upper level - protocols such as IP-over-InfiniBand as well as with userspace - applications (in conjunction with InfiniBand userspace access). + This is a driver for QLogic InfiniPath host channel adapters, + including InfiniBand verbs support. This driver allows these + devices to be used with both kernel upper level protocols such + as IP-over-InfiniBand as well as with userspace applications + (in conjunction with InfiniBand userspace access). 
diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:45 2006 -0700 @@ -1,10 +1,10 @@ EXTRA_CFLAGS += -DIPATH_IDSTR='"QLogic k EXTRA_CFLAGS += -DIPATH_IDSTR='"QLogic kernel.org driver"' \ -DIPATH_KERN_TYPE=0 -obj-$(CONFIG_IPATH_CORE) += ipath_core.o obj-$(CONFIG_INFINIBAND_IPATH) += ib_ipath.o -ipath_core-y := \ +ib_ipath-y := \ + ipath_cq.o \ ipath_diag.o \ ipath_driver.o \ ipath_eeprom.o \ @@ -13,26 +13,23 @@ ipath_core-y := \ ipath_ht400.o \ ipath_init_chip.o \ ipath_intr.o \ + ipath_keys.o \ ipath_layer.o \ - ipath_pe800.o \ - ipath_stats.o \ - ipath_sysfs.o \ - ipath_user_pages.o - -ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o -ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o - -ib_ipath-y := \ - ipath_cq.o \ - ipath_keys.o \ ipath_mad.o \ ipath_mmap.o \ ipath_mr.o \ + ipath_pe800.o \ ipath_qp.o \ ipath_rc.o \ ipath_ruc.o \ ipath_srq.o \ + ipath_stats.o \ + ipath_sysfs.o \ ipath_uc.o \ ipath_ud.o \ - ipath_verbs.o \ - ipath_verbs_mcast.o + ipath_user_pages.o \ + ipath_verbs_mcast.o \ + ipath_verbs.o + +ib_ipath-$(CONFIG_X86_64) += ipath_wc_x86_64.o +ib_ipath-$(CONFIG_PPC64) += ipath_wc_ppc64.o diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 @@ -40,6 +40,7 @@ #include "ipath_kernel.h" #include "ipath_layer.h" +#include "ipath_verbs.h" #include "ipath_common.h" static void ipath_update_pio_bufs(struct ipath_devdata *); @@ -50,8 +51,6 @@ const char *ipath_get_unit_name(int unit snprintf(iname, sizeof iname, "infinipath%u", unit); return iname; } - -EXPORT_SYMBOL_GPL(ipath_get_unit_name); #define DRIVER_LOAD_MSG "QLogic " IPATH_DRV_NAME " loaded: " #define PFX IPATH_DRV_NAME ": " @@ -510,6 +509,7 @@ static int __devinit ipath_init_one(stru ipath_user_add(dd); ipath_diag_add(dd); ipath_layer_add(dd); + ipath_register_ib_device(dd); goto bail; @@ -538,6 +538,7 @@ static void __devexit ipath_remove_one(s return; dd = pci_get_drvdata(pdev); + ipath_unregister_ib_device(dd->verbs_dev); ipath_layer_remove(dd); ipath_diag_remove(dd); ipath_user_remove(dd); @@ -978,12 +979,8 @@ reloop: if (unlikely(eflags)) ipath_rcv_hdrerr(dd, eflags, l, etail, rc); else if (etype == RCVHQ_RCV_TYPE_NON_KD) { - int ret = __ipath_verbs_rcv(dd, rc + 1, - ebuf, tlen); - if (ret == -ENODEV) - ipath_cdbg(VERBOSE, - "received IB packet, " - "not SMA (QP=%x)\n", qp); + ipath_ib_rcv(dd->verbs_dev, rc + 1, ebuf, + tlen); if (dd->ipath_lli_counter) dd->ipath_lli_counter--; diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Fri Aug 25 11:19:45 2006 -0700 @@ -35,6 +35,7 @@ #include "ipath_kernel.h" #include "ipath_layer.h" +#include "ipath_verbs.h" #include "ipath_common.h" /* These are all rcv-related errors which we want to count for stats */ @@ -712,7 +713,7 @@ static void handle_layer_pioavail(struct if (ret > 0) goto set; - ret = __ipath_verbs_piobufavail(dd); + ret = ipath_ib_piobufavail(dd->verbs_dev); if (ret > 0) goto set; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h 
Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 @@ -132,12 +132,6 @@ struct _ipath_layer { void *l_arg; }; -/* Verbs layer interface */ -struct _verbs_layer { - void *l_arg; - struct timer_list l_timer; -}; - struct ipath_devdata { struct list_head ipath_list; @@ -198,7 +192,8 @@ struct ipath_devdata { void (*ipath_f_setextled)(struct ipath_devdata *, u64, u64); /* fill out chip-specific fields */ int (*ipath_f_get_base_info)(struct ipath_portdata *, void *); - struct _verbs_layer verbs_layer; + struct ipath_ibdev *verbs_dev; + struct timer_list verbs_timer; /* total dwords sent (summed from counter) */ u64 ipath_sword; /* total dwords rcvd (summed from counter) */ @@ -529,8 +524,6 @@ extern int __ipath_layer_rcv(struct ipat extern int __ipath_layer_rcv(struct ipath_devdata *, void *, struct sk_buff *); extern int __ipath_layer_rcv_lid(struct ipath_devdata *, void *); -extern int __ipath_verbs_piobufavail(struct ipath_devdata *); -extern int __ipath_verbs_rcv(struct ipath_devdata *, void *, void *, u32); void ipath_layer_add(struct ipath_devdata *); void ipath_layer_remove(struct ipath_devdata *); diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Fri Aug 25 11:19:45 2006 -0700 @@ -42,26 +42,20 @@ #include "ipath_kernel.h" #include "ipath_layer.h" +#include "ipath_verbs.h" #include "ipath_common.h" /* Acquire before ipath_devs_lock. */ static DEFINE_MUTEX(ipath_layer_mutex); - -static int ipath_verbs_registered; u16 ipath_layer_rcv_opcode; static int (*layer_intr)(void *, u32); static int (*layer_rcv)(void *, void *, struct sk_buff *); static int (*layer_rcv_lid)(void *, void *); -static int (*verbs_piobufavail)(void *); -static void (*verbs_rcv)(void *, void *, void *, u32); static void *(*layer_add_one)(int, struct ipath_devdata *); static void (*layer_remove_one)(void *); -static void *(*verbs_add_one)(int, struct ipath_devdata *); -static void (*verbs_remove_one)(void *); -static void (*verbs_timer_cb)(void *); int __ipath_layer_intr(struct ipath_devdata *dd, u32 arg) { @@ -103,29 +97,6 @@ int __ipath_layer_rcv_lid(struct ipath_d if (dd->ipath_layer.l_arg && layer_rcv_lid) ret = layer_rcv_lid(dd->ipath_layer.l_arg, hdr); - - return ret; -} - -int __ipath_verbs_piobufavail(struct ipath_devdata *dd) -{ - int ret = -ENODEV; - - if (dd->verbs_layer.l_arg && verbs_piobufavail) - ret = verbs_piobufavail(dd->verbs_layer.l_arg); - - return ret; -} - -int __ipath_verbs_rcv(struct ipath_devdata *dd, void *rc, void *ebuf, - u32 tlen) -{ - int ret = -ENODEV; - - if (dd->verbs_layer.l_arg && verbs_rcv) { - verbs_rcv(dd->verbs_layer.l_arg, rc, ebuf, tlen); - ret = 0; - } return ret; } @@ -211,8 +182,6 @@ bail: bail: return ret; } - -EXPORT_SYMBOL_GPL(ipath_layer_set_linkstate); /** * ipath_layer_set_mtu - set the MTU @@ -298,8 +267,6 @@ bail: return ret; } -EXPORT_SYMBOL_GPL(ipath_layer_set_mtu); - int ipath_set_lid(struct ipath_devdata *dd, u32 arg, u8 lmc) { dd->ipath_lid = arg; @@ -315,8 +282,6 @@ int ipath_set_lid(struct ipath_devdata * return 0; } -EXPORT_SYMBOL_GPL(ipath_set_lid); - int ipath_layer_set_guid(struct ipath_devdata *dd, __be64 guid) { /* XXX - need to inform anyone who cares this just happened. 
*/ @@ -324,84 +289,55 @@ int ipath_layer_set_guid(struct ipath_de return 0; } -EXPORT_SYMBOL_GPL(ipath_layer_set_guid); - __be64 ipath_layer_get_guid(struct ipath_devdata *dd) { return dd->ipath_guid; } -EXPORT_SYMBOL_GPL(ipath_layer_get_guid); - -u32 ipath_layer_get_nguid(struct ipath_devdata *dd) -{ - return dd->ipath_nguid; -} - -EXPORT_SYMBOL_GPL(ipath_layer_get_nguid); - u32 ipath_layer_get_majrev(struct ipath_devdata *dd) { return dd->ipath_majrev; } -EXPORT_SYMBOL_GPL(ipath_layer_get_majrev); - u32 ipath_layer_get_minrev(struct ipath_devdata *dd) { return dd->ipath_minrev; } -EXPORT_SYMBOL_GPL(ipath_layer_get_minrev); - u32 ipath_layer_get_pcirev(struct ipath_devdata *dd) { return dd->ipath_pcirev; } -EXPORT_SYMBOL_GPL(ipath_layer_get_pcirev); - u32 ipath_layer_get_flags(struct ipath_devdata *dd) { return dd->ipath_flags; } -EXPORT_SYMBOL_GPL(ipath_layer_get_flags); - struct device *ipath_layer_get_device(struct ipath_devdata *dd) { return &dd->pcidev->dev; } -EXPORT_SYMBOL_GPL(ipath_layer_get_device); - u16 ipath_layer_get_deviceid(struct ipath_devdata *dd) { return dd->ipath_deviceid; } -EXPORT_SYMBOL_GPL(ipath_layer_get_deviceid); - u32 ipath_layer_get_vendorid(struct ipath_devdata *dd) { return dd->ipath_vendorid; } -EXPORT_SYMBOL_GPL(ipath_layer_get_vendorid); - u64 ipath_layer_get_lastibcstat(struct ipath_devdata *dd) { return dd->ipath_lastibcstat; } -EXPORT_SYMBOL_GPL(ipath_layer_get_lastibcstat); - u32 ipath_layer_get_ibmtu(struct ipath_devdata *dd) { return dd->ipath_ibmtu; } - -EXPORT_SYMBOL_GPL(ipath_layer_get_ibmtu); void ipath_layer_add(struct ipath_devdata *dd) { @@ -411,10 +347,6 @@ void ipath_layer_add(struct ipath_devdat dd->ipath_layer.l_arg = layer_add_one(dd->ipath_unit, dd); - if (verbs_add_one) - dd->verbs_layer.l_arg = - verbs_add_one(dd->ipath_unit, dd); - mutex_unlock(&ipath_layer_mutex); } @@ -425,11 +357,6 @@ void ipath_layer_remove(struct ipath_dev if (dd->ipath_layer.l_arg && layer_remove_one) { layer_remove_one(dd->ipath_layer.l_arg); dd->ipath_layer.l_arg = NULL; - } - - if (dd->verbs_layer.l_arg && verbs_remove_one) { - verbs_remove_one(dd->verbs_layer.l_arg); - dd->verbs_layer.l_arg = NULL; } mutex_unlock(&ipath_layer_mutex); @@ -521,94 +448,9 @@ static void __ipath_verbs_timer(unsigned ipath_kreceive(dd); /* Handle verbs layer timeouts. 
*/ - if (dd->verbs_layer.l_arg && verbs_timer_cb) - verbs_timer_cb(dd->verbs_layer.l_arg); - - mod_timer(&dd->verbs_layer.l_timer, jiffies + 1); -} - -/** - * ipath_verbs_register - verbs layer registration - * @l_piobufavail: callback for when PIO buffers become available - * @l_rcv: callback for receiving a packet - * @l_timer_cb: timer callback - * @ipath_devdata: device data structure is put here - */ -int ipath_verbs_register(void *(*l_add)(int, struct ipath_devdata *), - void (*l_remove)(void *arg), - int (*l_piobufavail) (void *arg), - void (*l_rcv) (void *arg, void *rhdr, - void *data, u32 tlen), - void (*l_timer_cb) (void *arg)) -{ - struct ipath_devdata *dd, *tmp; - unsigned long flags; - - mutex_lock(&ipath_layer_mutex); - - verbs_add_one = l_add; - verbs_remove_one = l_remove; - verbs_piobufavail = l_piobufavail; - verbs_rcv = l_rcv; - verbs_timer_cb = l_timer_cb; - - spin_lock_irqsave(&ipath_devs_lock, flags); - - list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) { - if (!(dd->ipath_flags & IPATH_INITTED)) - continue; - - if (dd->verbs_layer.l_arg) - continue; - - spin_unlock_irqrestore(&ipath_devs_lock, flags); - dd->verbs_layer.l_arg = l_add(dd->ipath_unit, dd); - spin_lock_irqsave(&ipath_devs_lock, flags); - } - - spin_unlock_irqrestore(&ipath_devs_lock, flags); - mutex_unlock(&ipath_layer_mutex); - - ipath_verbs_registered = 1; - - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_verbs_register); - -void ipath_verbs_unregister(void) -{ - struct ipath_devdata *dd, *tmp; - unsigned long flags; - - mutex_lock(&ipath_layer_mutex); - spin_lock_irqsave(&ipath_devs_lock, flags); - - list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) { - *dd->ipath_statusp &= ~IPATH_STATUS_OIB_SMA; - - if (dd->verbs_layer.l_arg && verbs_remove_one) { - spin_unlock_irqrestore(&ipath_devs_lock, flags); - verbs_remove_one(dd->verbs_layer.l_arg); - spin_lock_irqsave(&ipath_devs_lock, flags); - dd->verbs_layer.l_arg = NULL; - } - } - - spin_unlock_irqrestore(&ipath_devs_lock, flags); - - verbs_add_one = NULL; - verbs_remove_one = NULL; - verbs_piobufavail = NULL; - verbs_rcv = NULL; - verbs_timer_cb = NULL; - - ipath_verbs_registered = 0; - - mutex_unlock(&ipath_layer_mutex); -} - -EXPORT_SYMBOL_GPL(ipath_verbs_unregister); + ipath_ib_timer(dd->verbs_dev); + mod_timer(&dd->verbs_timer, jiffies + 1); +} int ipath_layer_open(struct ipath_devdata *dd, u32 * pktmax) { @@ -702,8 +544,6 @@ u32 ipath_layer_get_cr_errpkey(struct ip { return ipath_read_creg32(dd, dd->ipath_cregs->cr_errpkey); } - -EXPORT_SYMBOL_GPL(ipath_layer_get_cr_errpkey); static void update_sge(struct ipath_sge_state *ss, u32 length) { @@ -981,8 +821,6 @@ bail: return ret; } -EXPORT_SYMBOL_GPL(ipath_verbs_send); - int ipath_layer_snapshot_counters(struct ipath_devdata *dd, u64 *swords, u64 *rwords, u64 *spkts, u64 *rpkts, u64 *xmit_wait) @@ -1006,8 +844,6 @@ bail: bail: return ret; } - -EXPORT_SYMBOL_GPL(ipath_layer_snapshot_counters); /** * ipath_layer_get_counters - get various chip counters @@ -1069,8 +905,6 @@ bail: return ret; } -EXPORT_SYMBOL_GPL(ipath_layer_get_counters); - int ipath_layer_want_buffer(struct ipath_devdata *dd) { set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); @@ -1079,8 +913,6 @@ int ipath_layer_want_buffer(struct ipath return 0; } - -EXPORT_SYMBOL_GPL(ipath_layer_want_buffer); int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr) { @@ -1174,16 +1006,14 @@ int ipath_layer_enable_timer(struct ipat (u64) (1 << 2)); } - init_timer(&dd->verbs_layer.l_timer); - 
dd->verbs_layer.l_timer.function = __ipath_verbs_timer; - dd->verbs_layer.l_timer.data = (unsigned long)dd; - dd->verbs_layer.l_timer.expires = jiffies + 1; - add_timer(&dd->verbs_layer.l_timer); - - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_layer_enable_timer); + init_timer(&dd->verbs_timer); + dd->verbs_timer.function = __ipath_verbs_timer; + dd->verbs_timer.data = (unsigned long)dd; + dd->verbs_timer.expires = jiffies + 1; + add_timer(&dd->verbs_timer); + + return 0; +} int ipath_layer_disable_timer(struct ipath_devdata *dd) { @@ -1191,12 +1021,10 @@ int ipath_layer_disable_timer(struct ipa if (dd->ipath_flags & IPATH_GPIO_INTR) ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, 0); - del_timer_sync(&dd->verbs_layer.l_timer); - - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_layer_disable_timer); + del_timer_sync(&dd->verbs_timer); + + return 0; +} /** * ipath_layer_set_verbs_flags - set the verbs layer flags @@ -1225,8 +1053,6 @@ int ipath_layer_set_verbs_flags(struct i return 0; } -EXPORT_SYMBOL_GPL(ipath_layer_set_verbs_flags); - /** * ipath_layer_get_npkeys - return the size of the PKEY table for port 0 * @dd: the infinipath device @@ -1235,8 +1061,6 @@ unsigned ipath_layer_get_npkeys(struct i { return ARRAY_SIZE(dd->ipath_pd[0]->port_pkeys); } - -EXPORT_SYMBOL_GPL(ipath_layer_get_npkeys); /** * ipath_layer_get_pkey - return the indexed PKEY from the port 0 PKEY table @@ -1255,8 +1079,6 @@ unsigned ipath_layer_get_pkey(struct ipa return ret; } -EXPORT_SYMBOL_GPL(ipath_layer_get_pkey); - /** * ipath_layer_get_pkeys - return the PKEY table for port 0 * @dd: the infinipath device @@ -1270,8 +1092,6 @@ int ipath_layer_get_pkeys(struct ipath_d return 0; } - -EXPORT_SYMBOL_GPL(ipath_layer_get_pkeys); /** * rm_pkey - decrecment the reference count for the given PKEY @@ -1419,8 +1239,6 @@ int ipath_layer_set_pkeys(struct ipath_d return 0; } -EXPORT_SYMBOL_GPL(ipath_layer_set_pkeys); - /** * ipath_layer_get_linkdowndefaultstate - get the default linkdown state * @dd: the infinipath device @@ -1431,8 +1249,6 @@ int ipath_layer_get_linkdowndefaultstate { return !!(dd->ipath_ibcctrl & INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE); } - -EXPORT_SYMBOL_GPL(ipath_layer_get_linkdowndefaultstate); /** * ipath_layer_set_linkdowndefaultstate - set the default linkdown state @@ -1453,16 +1269,12 @@ int ipath_layer_set_linkdowndefaultstate return 0; } -EXPORT_SYMBOL_GPL(ipath_layer_set_linkdowndefaultstate); - int ipath_layer_get_phyerrthreshold(struct ipath_devdata *dd) { return (dd->ipath_ibcctrl >> INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; } - -EXPORT_SYMBOL_GPL(ipath_layer_get_phyerrthreshold); /** * ipath_layer_set_phyerrthreshold - set the physical error threshold @@ -1489,16 +1301,12 @@ int ipath_layer_set_phyerrthreshold(stru return 0; } -EXPORT_SYMBOL_GPL(ipath_layer_set_phyerrthreshold); - int ipath_layer_get_overrunthreshold(struct ipath_devdata *dd) { return (dd->ipath_ibcctrl >> INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) & INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK; } - -EXPORT_SYMBOL_GPL(ipath_layer_get_overrunthreshold); /** * ipath_layer_set_overrunthreshold - set the overrun threshold @@ -1525,17 +1333,13 @@ int ipath_layer_set_overrunthreshold(str return 0; } -EXPORT_SYMBOL_GPL(ipath_layer_set_overrunthreshold); - int ipath_layer_get_boardname(struct ipath_devdata *dd, char *name, size_t namelen) { return dd->ipath_f_get_boardname(dd, name, namelen); } -EXPORT_SYMBOL_GPL(ipath_layer_get_boardname); u32 ipath_layer_get_rcvhdrentsize(struct ipath_devdata *dd) { return 
dd->ipath_rcvhdrentsize; } -EXPORT_SYMBOL_GPL(ipath_layer_get_rcvhdrentsize); diff --git a/drivers/infiniband/hw/ipath/ipath_layer.h b/drivers/infiniband/hw/ipath/ipath_layer.h --- a/drivers/infiniband/hw/ipath/ipath_layer.h Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.h Fri Aug 25 11:19:45 2006 -0700 @@ -114,14 +114,7 @@ int ipath_layer_register(void *(*l_add)( struct sk_buff *), u16 rcv_opcode, int (*l_rcv_lid)(void *, void *)); -int ipath_verbs_register(void *(*l_add)(int, struct ipath_devdata *), - void (*l_remove)(void *arg), - int (*l_piobufavail)(void *arg), - void (*l_rcv)(void *arg, void *rhdr, - void *data, u32 tlen), - void (*l_timer_cb)(void *arg)); void ipath_layer_unregister(void); -void ipath_verbs_unregister(void); int ipath_layer_open(struct ipath_devdata *, u32 * pktmax); u16 ipath_layer_get_lid(struct ipath_devdata *dd); int ipath_layer_get_mac(struct ipath_devdata *dd, u8 *); @@ -145,7 +138,6 @@ int ipath_layer_want_buffer(struct ipath int ipath_layer_want_buffer(struct ipath_devdata *dd); int ipath_layer_set_guid(struct ipath_devdata *, __be64 guid); __be64 ipath_layer_get_guid(struct ipath_devdata *); -u32 ipath_layer_get_nguid(struct ipath_devdata *); u32 ipath_layer_get_majrev(struct ipath_devdata *); u32 ipath_layer_get_minrev(struct ipath_devdata *); u32 ipath_layer_get_pcirev(struct ipath_devdata *); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 @@ -368,7 +368,7 @@ static void ipath_qp_rcv(struct ipath_ib } /** - * ipath_ib_rcv - process and incoming packet + * ipath_ib_rcv - process an incoming packet * @arg: the device pointer * @rhdr: the header of the packet * @data: the packet data @@ -377,9 +377,9 @@ static void ipath_qp_rcv(struct ipath_ib * This is called from ipath_kreceive() to process an incoming packet at * interrupt level. Tlen is the length of the header + data + CRC in bytes. */ -static void ipath_ib_rcv(void *arg, void *rhdr, void *data, u32 tlen) -{ - struct ipath_ibdev *dev = (struct ipath_ibdev *) arg; +void ipath_ib_rcv(struct ipath_ibdev *dev, void *rhdr, void *data, + u32 tlen) +{ struct ipath_ib_header *hdr = rhdr; struct ipath_other_headers *ohdr; struct ipath_qp *qp; @@ -468,9 +468,8 @@ bail:; * This is called from ipath_do_rcv_timer() at interrupt level to check for * QPs which need retransmits and to collect performance numbers. */ -static void ipath_ib_timer(void *arg) -{ - struct ipath_ibdev *dev = (struct ipath_ibdev *) arg; +void ipath_ib_timer(struct ipath_ibdev *dev) +{ struct ipath_qp *resend = NULL; struct list_head *last; struct ipath_qp *qp; @@ -564,9 +563,8 @@ static void ipath_ib_timer(void *arg) * QPs waiting for buffers (for now, just do a tasklet_hi_schedule and * return zero). */ -static int ipath_ib_piobufavail(void *arg) -{ - struct ipath_ibdev *dev = (struct ipath_ibdev *) arg; +int ipath_ib_piobufavail(struct ipath_ibdev *dev) +{ struct ipath_qp *qp; unsigned long flags; @@ -957,11 +955,10 @@ static int ipath_verbs_register_sysfs(st /** * ipath_register_ib_device - register our device with the infiniband core - * @unit: the device number to register * @dd: the device data structure * Return the allocated ipath_ibdev pointer or NULL on error. 
*/ -static void *ipath_register_ib_device(int unit, struct ipath_devdata *dd) +int ipath_register_ib_device(struct ipath_devdata *dd) { struct ipath_layer_counters cntrs; struct ipath_ibdev *idev; @@ -969,8 +966,10 @@ static void *ipath_register_ib_device(in int ret; idev = (struct ipath_ibdev *)ib_alloc_device(sizeof *idev); - if (idev == NULL) - goto bail; + if (idev == NULL) { + ret = -ENOMEM; + goto bail; + } dev = &idev->ibdev; @@ -1047,7 +1046,7 @@ static void *ipath_register_ib_device(in if (!sys_image_guid) sys_image_guid = ipath_layer_get_guid(dd); idev->sys_image_guid = sys_image_guid; - idev->ib_unit = unit; + idev->ib_unit = dd->ipath_unit; idev->dd = dd; strlcpy(dev->name, "ipath%d", IB_DEVICE_NAME_MAX); @@ -1153,16 +1152,16 @@ err_qp: err_qp: ib_dealloc_device(dev); _VERBS_ERROR("ib_ipath%d cannot register verbs (%d)!\n", - unit, -ret); + dd->ipath_unit, -ret); idev = NULL; bail: - return idev; -} - -static void ipath_unregister_ib_device(void *arg) -{ - struct ipath_ibdev *dev = (struct ipath_ibdev *) arg; + dd->verbs_dev = idev; + return ret; +} + +void ipath_unregister_ib_device(struct ipath_ibdev *dev) +{ struct ib_device *ibdev = &dev->ibdev; ipath_layer_disable_timer(dev->dd); @@ -1193,19 +1192,6 @@ static void ipath_unregister_ib_device(v ib_dealloc_device(ibdev); } -static int __init ipath_verbs_init(void) -{ - return ipath_verbs_register(ipath_register_ib_device, - ipath_unregister_ib_device, - ipath_ib_piobufavail, ipath_ib_rcv, - ipath_ib_timer); -} - -static void __exit ipath_verbs_cleanup(void) -{ - ipath_verbs_unregister(); -} - static ssize_t show_rev(struct class_device *cdev, char *buf) { struct ipath_ibdev *dev = @@ -1297,6 +1283,3 @@ bail: bail: return ret; } - -module_init(ipath_verbs_init); -module_exit(ipath_verbs_cleanup); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 -0700 @@ -711,6 +711,16 @@ int ipath_make_uc_req(struct ipath_qp *q int ipath_make_uc_req(struct ipath_qp *qp, struct ipath_other_headers *ohdr, u32 pmtu, u32 *bth0p, u32 *bth2p); +int ipath_register_ib_device(struct ipath_devdata *); + +void ipath_unregister_ib_device(struct ipath_ibdev *); + +void ipath_ib_rcv(struct ipath_ibdev *, void *, void *, u32); + +int ipath_ib_piobufavail(struct ipath_ibdev *); + +void ipath_ib_timer(struct ipath_ibdev *); + extern const enum ib_wc_opcode ib_ipath_wc_opcode[]; extern const u8 ipath_cvt_physportstate[]; From bos at pathscale.com Fri Aug 25 11:24:32 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 11:24:32 -0700 Subject: [openib-general] [PATCH 7 of 23] IB/ipath - simplify layering code In-Reply-To: Message-ID: <6016a3c7c50a03598523.1156530272@eng-12.pathscale.com> A lot of ipath layer code was only called in one place. Now that the ipath_core and ib_ipath drivers are merged, it's more sensible to simply inline the simple stuff that the layer code was doing. 
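To make the shape of this change concrete, here is a minimal, self-contained sketch (not the driver code itself) of the pattern the series applies: a trivial exported accessor plus a registered callback table is replaced by direct field access and a direct function call. The simplified struct layouts and the small main() below are illustrative assumptions; only the idea -- dropping the ipath_layer indirection now that ipath_core and ib_ipath are one module -- comes from the patch.

/*
 * Illustrative sketch only -- simplified stand-ins for the real
 * ipath structures, showing the refactor pattern of this series.
 */
#include <stdio.h>
#include <stdint.h>

struct ipath_ibdev { int dummy; };

struct ipath_devdata {
	uint64_t ipath_guid;            /* node GUID, read-only after init */
	struct ipath_ibdev *verbs_dev;  /* verbs device, now owned directly */
};

/* --- Before: an exported one-line wrapper plus a callback pointer --- */

static void (*verbs_timer_cb)(void *arg);   /* filled in by a register call */

static uint64_t ipath_layer_get_guid(struct ipath_devdata *dd)
{
	return dd->ipath_guid;          /* only ever called in one place */
}

static void old_timer_tick(struct ipath_devdata *dd)
{
	if (verbs_timer_cb)
		verbs_timer_cb(dd->verbs_dev);  /* indirect call through the layer */
}

/* --- After: call the verbs code directly, read the field directly --- */

static void ipath_ib_timer(struct ipath_ibdev *dev)
{
	(void)dev;                      /* verbs-layer timeout handling lives here */
}

static void new_timer_tick(struct ipath_devdata *dd)
{
	ipath_ib_timer(dd->verbs_dev);  /* no registration table, no NULL check */
}

int main(void)
{
	struct ipath_ibdev ibdev = { 0 };
	struct ipath_devdata dd = { .ipath_guid = 0x0011223344556677ULL,
				    .verbs_dev = &ibdev };

	/* Before: wrapper call; after: plain field access. */
	printf("guid via wrapper: %llx\n",
	       (unsigned long long)ipath_layer_get_guid(&dd));
	printf("guid directly:    %llx\n",
	       (unsigned long long)dd.ipath_guid);

	old_timer_tick(&dd);   /* does nothing: no callback was registered */
	new_timer_tick(&dd);   /* always reaches the verbs code */
	return 0;
}

Either form ends up as a direct load once the wrapper is inlined, but dropping the registration table also removes the module-ordering and NULL-callback corner cases that ipath_verbs_register()/ipath_verbs_unregister() had to handle in the code deleted above.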
Signed-off-by: Bryan O'Sullivan diff --git a/drivers/infiniband/hw/ipath/ipath_diag.c b/drivers/infiniband/hw/ipath/ipath_diag.c --- a/drivers/infiniband/hw/ipath/ipath_diag.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_diag.c Fri Aug 25 11:19:45 2006 -0700 @@ -45,7 +45,6 @@ #include #include "ipath_kernel.h" -#include "ipath_layer.h" #include "ipath_common.h" int ipath_diag_inuse; diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Fri Aug 25 11:19:45 2006 -0700 @@ -39,7 +39,6 @@ #include #include "ipath_kernel.h" -#include "ipath_layer.h" #include "ipath_verbs.h" #include "ipath_common.h" @@ -508,7 +507,6 @@ static int __devinit ipath_init_one(stru ipathfs_add_device(dd); ipath_user_add(dd); ipath_diag_add(dd); - ipath_layer_add(dd); ipath_register_ib_device(dd); goto bail; @@ -539,7 +537,6 @@ static void __devexit ipath_remove_one(s dd = pci_get_drvdata(pdev); ipath_unregister_ib_device(dd->verbs_dev); - ipath_layer_remove(dd); ipath_diag_remove(dd); ipath_user_remove(dd); ipathfs_remove_device(dd); @@ -614,11 +611,12 @@ void ipath_disarm_piobufs(struct ipath_d * * wait up to msecs milliseconds for IB link state change to occur for * now, take the easy polling route. Currently used only by - * ipath_layer_set_linkstate. Returns 0 if state reached, otherwise + * ipath_set_linkstate. Returns 0 if state reached, otherwise * -ETIMEDOUT state can have multiple states set, for any of several * transitions. */ -int ipath_wait_linkstate(struct ipath_devdata *dd, u32 state, int msecs) +static int ipath_wait_linkstate(struct ipath_devdata *dd, u32 state, + int msecs) { dd->ipath_sma_state_wanted = state; wait_event_interruptible_timeout(ipath_sma_state_wait, @@ -814,58 +812,6 @@ bail: return skb; } -/** - * ipath_rcv_layer - receive a packet for the layered (ethernet) driver - * @dd: the infinipath device - * @etail: the sk_buff number - * @tlen: the total packet length - * @hdr: the ethernet header - * - * Separate routine for better overall optimization - */ -static void ipath_rcv_layer(struct ipath_devdata *dd, u32 etail, - u32 tlen, struct ether_header *hdr) -{ - u32 elen; - u8 pad, *bthbytes; - struct sk_buff *skb, *nskb; - - if (dd->ipath_port0_skbs && - hdr->sub_opcode == IPATH_ITH4X_OPCODE_ENCAP) { - /* - * Allocate a new sk_buff to replace the one we give - * to the network stack. 
- */ - nskb = ipath_alloc_skb(dd, GFP_ATOMIC); - if (!nskb) { - /* count OK packets that we drop */ - ipath_stats.sps_krdrops++; - return; - } - - bthbytes = (u8 *) hdr->bth; - pad = (bthbytes[1] >> 4) & 3; - /* +CRC32 */ - elen = tlen - (sizeof(*hdr) + pad + sizeof(u32)); - - skb = dd->ipath_port0_skbs[etail]; - dd->ipath_port0_skbs[etail] = nskb; - skb_put(skb, elen); - - dd->ipath_f_put_tid(dd, etail + (u64 __iomem *) - ((char __iomem *) dd->ipath_kregbase - + dd->ipath_rcvegrbase), 0, - virt_to_phys(nskb->data)); - - __ipath_layer_rcv(dd, hdr, skb); - - /* another ether packet received */ - ipath_stats.sps_ether_rpkts++; - } - else if (hdr->sub_opcode == IPATH_ITH4X_OPCODE_LID_ARP) - __ipath_layer_rcv_lid(dd, hdr); -} - static void ipath_rcv_hdrerr(struct ipath_devdata *dd, u32 eflags, u32 l, @@ -979,22 +925,17 @@ reloop: if (unlikely(eflags)) ipath_rcv_hdrerr(dd, eflags, l, etail, rc); else if (etype == RCVHQ_RCV_TYPE_NON_KD) { - ipath_ib_rcv(dd->verbs_dev, rc + 1, ebuf, - tlen); - if (dd->ipath_lli_counter) - dd->ipath_lli_counter--; - - } else if (etype == RCVHQ_RCV_TYPE_EAGER) { - if (qp == IPATH_KD_QP && - bthbytes[0] == ipath_layer_rcv_opcode && - ebuf) - ipath_rcv_layer(dd, etail, tlen, - (struct ether_header *)hdr); - else - ipath_cdbg(PKT, "typ %x, opcode %x (eager, " - "qp=%x), len %x; ignored\n", - etype, bthbytes[0], qp, tlen); - } + ipath_ib_rcv(dd->verbs_dev, rc + 1, ebuf, tlen); + if (dd->ipath_lli_counter) + dd->ipath_lli_counter--; + ipath_cdbg(PKT, "typ %x, opcode %x (eager, " + "qp=%x), len %x; ignored\n", + etype, bthbytes[0], qp, tlen); + } + else if (etype == RCVHQ_RCV_TYPE_EAGER) + ipath_cdbg(PKT, "typ %x, opcode %x (eager, " + "qp=%x), len %x; ignored\n", + etype, bthbytes[0], qp, tlen); else if (etype == RCVHQ_RCV_TYPE_EXPECTED) ipath_dbg("Bug: Expected TID, opcode %x; ignored\n", be32_to_cpu(hdr->bth[0]) & 0xff); @@ -1320,13 +1261,6 @@ rescan: goto bail; } - if (updated) - /* - * ran out of bufs, now some (at least this one we just - * got) are now available, so tell the layered driver. - */ - __ipath_layer_intr(dd, IPATH_LAYER_INT_SEND_CONTINUE); - /* * set next starting place. 
Since it's just an optimization, * it doesn't matter who wins on this, so no locking @@ -1503,7 +1437,7 @@ int ipath_waitfor_mdio_cmdready(struct i return ret; } -void ipath_set_ib_lstate(struct ipath_devdata *dd, int which) +static void ipath_set_ib_lstate(struct ipath_devdata *dd, int which) { static const char *what[4] = { [0] = "DOWN", @@ -1537,6 +1471,180 @@ void ipath_set_ib_lstate(struct ipath_de dd->ipath_ibcctrl | which); } +int ipath_set_linkstate(struct ipath_devdata *dd, u8 newstate) +{ + u32 lstate; + int ret; + + switch (newstate) { + case IPATH_IB_LINKDOWN: + ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_POLL << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); + /* don't wait */ + ret = 0; + goto bail; + + case IPATH_IB_LINKDOWN_SLEEP: + ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_SLEEP << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); + /* don't wait */ + ret = 0; + goto bail; + + case IPATH_IB_LINKDOWN_DISABLE: + ipath_set_ib_lstate(dd, + INFINIPATH_IBCC_LINKINITCMD_DISABLE << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); + /* don't wait */ + ret = 0; + goto bail; + + case IPATH_IB_LINKINIT: + if (dd->ipath_flags & IPATH_LINKINIT) { + ret = 0; + goto bail; + } + ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKCMD_INIT << + INFINIPATH_IBCC_LINKCMD_SHIFT); + lstate = IPATH_LINKINIT; + break; + + case IPATH_IB_LINKARM: + if (dd->ipath_flags & IPATH_LINKARMED) { + ret = 0; + goto bail; + } + if (!(dd->ipath_flags & + (IPATH_LINKINIT | IPATH_LINKACTIVE))) { + ret = -EINVAL; + goto bail; + } + ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKCMD_ARMED << + INFINIPATH_IBCC_LINKCMD_SHIFT); + /* + * Since the port can transition to ACTIVE by receiving + * a non VL 15 packet, wait for either state. + */ + lstate = IPATH_LINKARMED | IPATH_LINKACTIVE; + break; + + case IPATH_IB_LINKACTIVE: + if (dd->ipath_flags & IPATH_LINKACTIVE) { + ret = 0; + goto bail; + } + if (!(dd->ipath_flags & IPATH_LINKARMED)) { + ret = -EINVAL; + goto bail; + } + ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKCMD_ACTIVE << + INFINIPATH_IBCC_LINKCMD_SHIFT); + lstate = IPATH_LINKACTIVE; + break; + + default: + ipath_dbg("Invalid linkstate 0x%x requested\n", newstate); + ret = -EINVAL; + goto bail; + } + ret = ipath_wait_linkstate(dd, lstate, 2000); + +bail: + return ret; +} + +/** + * ipath_set_mtu - set the MTU + * @dd: the infinipath device + * @arg: the new MTU + * + * we can handle "any" incoming size, the issue here is whether we + * need to restrict our outgoing size. For now, we don't do any + * sanity checking on this, and we don't deal with what happens to + * programs that are already running when the size changes. + * NOTE: changing the MTU will usually cause the IBC to go back to + * link initialize (IPATH_IBSTATE_INIT) state... + */ +int ipath_set_mtu(struct ipath_devdata *dd, u16 arg) +{ + u32 piosize; + int changed = 0; + int ret; + + /* + * mtu is IB data payload max. It's the largest power of 2 less + * than piosize (or even larger, since it only really controls the + * largest we can receive; we can send the max of the mtu and + * piosize). We check that it's one of the valid IB sizes. 
+ */ + if (arg != 256 && arg != 512 && arg != 1024 && arg != 2048 && + arg != 4096) { + ipath_dbg("Trying to set invalid mtu %u, failing\n", arg); + ret = -EINVAL; + goto bail; + } + if (dd->ipath_ibmtu == arg) { + ret = 0; /* same as current */ + goto bail; + } + + piosize = dd->ipath_ibmaxlen; + dd->ipath_ibmtu = arg; + + if (arg >= (piosize - IPATH_PIO_MAXIBHDR)) { + /* Only if it's not the initial value (or reset to it) */ + if (piosize != dd->ipath_init_ibmaxlen) { + dd->ipath_ibmaxlen = piosize; + changed = 1; + } + } else if ((arg + IPATH_PIO_MAXIBHDR) != dd->ipath_ibmaxlen) { + piosize = arg + IPATH_PIO_MAXIBHDR; + ipath_cdbg(VERBOSE, "ibmaxlen was 0x%x, setting to 0x%x " + "(mtu 0x%x)\n", dd->ipath_ibmaxlen, piosize, + arg); + dd->ipath_ibmaxlen = piosize; + changed = 1; + } + + if (changed) { + /* + * set the IBC maxpktlength to the size of our pio + * buffers in words + */ + u64 ibc = dd->ipath_ibcctrl; + ibc &= ~(INFINIPATH_IBCC_MAXPKTLEN_MASK << + INFINIPATH_IBCC_MAXPKTLEN_SHIFT); + + piosize = piosize - 2 * sizeof(u32); /* ignore pbc */ + dd->ipath_ibmaxlen = piosize; + piosize /= sizeof(u32); /* in words */ + /* + * for ICRC, which we only send in diag test pkt mode, and + * we don't need to worry about that for mtu + */ + piosize += 1; + + ibc |= piosize << INFINIPATH_IBCC_MAXPKTLEN_SHIFT; + dd->ipath_ibcctrl = ibc; + ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, + dd->ipath_ibcctrl); + dd->ipath_f_tidtemplate(dd); + } + + ret = 0; + +bail: + return ret; +} + +int ipath_set_lid(struct ipath_devdata *dd, u32 arg, u8 lmc) +{ + dd->ipath_lid = arg; + dd->ipath_lmc = lmc; + + return 0; +} + /** * ipath_read_kreg64_port - read a device's per-port 64-bit kernel register * @dd: the infinipath device @@ -1639,13 +1747,6 @@ void ipath_shutdown_device(struct ipath_ ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_DISABLE << INFINIPATH_IBCC_LINKINITCMD_SHIFT); - - /* - * we are shutting down, so tell the layered driver. We don't do - * this on just a link state change, much like ethernet, a cable - * unplug, etc. doesn't change driver state - */ - ipath_layer_intr(dd, IPATH_LAYER_INT_IF_DOWN); /* disable IBC */ dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Fri Aug 25 11:19:45 2006 -0700 @@ -39,7 +39,6 @@ #include #include "ipath_kernel.h" -#include "ipath_layer.h" #include "ipath_common.h" static int ipath_open(struct inode *, struct file *); diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Fri Aug 25 11:19:45 2006 -0700 @@ -34,7 +34,6 @@ #include #include "ipath_kernel.h" -#include "ipath_layer.h" #include "ipath_verbs.h" #include "ipath_common.h" @@ -290,8 +289,6 @@ static void handle_e_ibstatuschanged(str *dd->ipath_statusp |= IPATH_STATUS_IB_READY | IPATH_STATUS_IB_CONF; dd->ipath_f_setextled(dd, lstate, ltstate); - - __ipath_layer_intr(dd, IPATH_LAYER_INT_IF_UP); } else if ((val & IPATH_IBSTATE_MASK) == IPATH_IBSTATE_INIT) { /* * set INIT and DOWN. 
Down is checked by most of the other @@ -709,10 +706,6 @@ static void handle_layer_pioavail(struct { int ret; - ret = __ipath_layer_intr(dd, IPATH_LAYER_INT_SEND_CONTINUE); - if (ret > 0) - goto set; - ret = ipath_ib_piobufavail(dd->verbs_dev); if (ret > 0) goto set; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Fri Aug 25 11:19:45 2006 -0700 @@ -518,16 +518,6 @@ extern spinlock_t ipath_devs_lock; extern spinlock_t ipath_devs_lock; extern struct ipath_devdata *ipath_lookup(int unit); -extern u16 ipath_layer_rcv_opcode; -extern int __ipath_layer_intr(struct ipath_devdata *, u32); -extern int ipath_layer_intr(struct ipath_devdata *, u32); -extern int __ipath_layer_rcv(struct ipath_devdata *, void *, - struct sk_buff *); -extern int __ipath_layer_rcv_lid(struct ipath_devdata *, void *); - -void ipath_layer_add(struct ipath_devdata *); -void ipath_layer_remove(struct ipath_devdata *); - int ipath_init_chip(struct ipath_devdata *, int); int ipath_enable_wc(struct ipath_devdata *dd); void ipath_disable_wc(struct ipath_devdata *dd); @@ -575,12 +565,13 @@ void ipath_free_pddata(struct ipath_devd int ipath_parse_ushort(const char *str, unsigned short *valp); -int ipath_wait_linkstate(struct ipath_devdata *, u32, int); -void ipath_set_ib_lstate(struct ipath_devdata *, int); void ipath_kreceive(struct ipath_devdata *); int ipath_setrcvhdrsize(struct ipath_devdata *, unsigned); int ipath_reset_device(int); void ipath_get_faststats(unsigned long); +int ipath_set_linkstate(struct ipath_devdata *, u8); +int ipath_set_mtu(struct ipath_devdata *, u16); +int ipath_set_lid(struct ipath_devdata *, u32, u8); /* for use in system calls, where we want to know device type, etc. */ #define port_fp(fp) ((struct ipath_portdata *) (fp)->private_data) diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Fri Aug 25 11:19:45 2006 -0700 @@ -101,242 +101,14 @@ int __ipath_layer_rcv_lid(struct ipath_d return ret; } -int ipath_layer_set_linkstate(struct ipath_devdata *dd, u8 newstate) -{ - u32 lstate; - int ret; - - switch (newstate) { - case IPATH_IB_LINKDOWN: - ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_POLL << - INFINIPATH_IBCC_LINKINITCMD_SHIFT); - /* don't wait */ - ret = 0; - goto bail; - - case IPATH_IB_LINKDOWN_SLEEP: - ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_SLEEP << - INFINIPATH_IBCC_LINKINITCMD_SHIFT); - /* don't wait */ - ret = 0; - goto bail; - - case IPATH_IB_LINKDOWN_DISABLE: - ipath_set_ib_lstate(dd, - INFINIPATH_IBCC_LINKINITCMD_DISABLE << - INFINIPATH_IBCC_LINKINITCMD_SHIFT); - /* don't wait */ - ret = 0; - goto bail; - - case IPATH_IB_LINKINIT: - if (dd->ipath_flags & IPATH_LINKINIT) { - ret = 0; - goto bail; - } - ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKCMD_INIT << - INFINIPATH_IBCC_LINKCMD_SHIFT); - lstate = IPATH_LINKINIT; - break; - - case IPATH_IB_LINKARM: - if (dd->ipath_flags & IPATH_LINKARMED) { - ret = 0; - goto bail; - } - if (!(dd->ipath_flags & - (IPATH_LINKINIT | IPATH_LINKACTIVE))) { - ret = -EINVAL; - goto bail; - } - ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKCMD_ARMED << - INFINIPATH_IBCC_LINKCMD_SHIFT); - /* - * Since the port can transition to ACTIVE by receiving - * a non VL 15 packet, wait for either state. 
- */ - lstate = IPATH_LINKARMED | IPATH_LINKACTIVE; - break; - - case IPATH_IB_LINKACTIVE: - if (dd->ipath_flags & IPATH_LINKACTIVE) { - ret = 0; - goto bail; - } - if (!(dd->ipath_flags & IPATH_LINKARMED)) { - ret = -EINVAL; - goto bail; - } - ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKCMD_ACTIVE << - INFINIPATH_IBCC_LINKCMD_SHIFT); - lstate = IPATH_LINKACTIVE; - break; - - default: - ipath_dbg("Invalid linkstate 0x%x requested\n", newstate); - ret = -EINVAL; - goto bail; - } - ret = ipath_wait_linkstate(dd, lstate, 2000); - -bail: - return ret; -} - -/** - * ipath_layer_set_mtu - set the MTU - * @dd: the infinipath device - * @arg: the new MTU - * - * we can handle "any" incoming size, the issue here is whether we - * need to restrict our outgoing size. For now, we don't do any - * sanity checking on this, and we don't deal with what happens to - * programs that are already running when the size changes. - * NOTE: changing the MTU will usually cause the IBC to go back to - * link initialize (IPATH_IBSTATE_INIT) state... - */ -int ipath_layer_set_mtu(struct ipath_devdata *dd, u16 arg) -{ - u32 piosize; - int changed = 0; - int ret; - - /* - * mtu is IB data payload max. It's the largest power of 2 less - * than piosize (or even larger, since it only really controls the - * largest we can receive; we can send the max of the mtu and - * piosize). We check that it's one of the valid IB sizes. - */ - if (arg != 256 && arg != 512 && arg != 1024 && arg != 2048 && - arg != 4096) { - ipath_dbg("Trying to set invalid mtu %u, failing\n", arg); - ret = -EINVAL; - goto bail; - } - if (dd->ipath_ibmtu == arg) { - ret = 0; /* same as current */ - goto bail; - } - - piosize = dd->ipath_ibmaxlen; - dd->ipath_ibmtu = arg; - - if (arg >= (piosize - IPATH_PIO_MAXIBHDR)) { - /* Only if it's not the initial value (or reset to it) */ - if (piosize != dd->ipath_init_ibmaxlen) { - dd->ipath_ibmaxlen = piosize; - changed = 1; - } - } else if ((arg + IPATH_PIO_MAXIBHDR) != dd->ipath_ibmaxlen) { - piosize = arg + IPATH_PIO_MAXIBHDR; - ipath_cdbg(VERBOSE, "ibmaxlen was 0x%x, setting to 0x%x " - "(mtu 0x%x)\n", dd->ipath_ibmaxlen, piosize, - arg); - dd->ipath_ibmaxlen = piosize; - changed = 1; - } - - if (changed) { - /* - * set the IBC maxpktlength to the size of our pio - * buffers in words - */ - u64 ibc = dd->ipath_ibcctrl; - ibc &= ~(INFINIPATH_IBCC_MAXPKTLEN_MASK << - INFINIPATH_IBCC_MAXPKTLEN_SHIFT); - - piosize = piosize - 2 * sizeof(u32); /* ignore pbc */ - dd->ipath_ibmaxlen = piosize; - piosize /= sizeof(u32); /* in words */ - /* - * for ICRC, which we only send in diag test pkt mode, and - * we don't need to worry about that for mtu - */ - piosize += 1; - - ibc |= piosize << INFINIPATH_IBCC_MAXPKTLEN_SHIFT; - dd->ipath_ibcctrl = ibc; - ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, - dd->ipath_ibcctrl); - dd->ipath_f_tidtemplate(dd); - } - - ret = 0; - -bail: - return ret; -} - -int ipath_set_lid(struct ipath_devdata *dd, u32 arg, u8 lmc) -{ - dd->ipath_lid = arg; - dd->ipath_lmc = lmc; - +void ipath_layer_lid_changed(struct ipath_devdata *dd) +{ mutex_lock(&ipath_layer_mutex); if (dd->ipath_layer.l_arg && layer_intr) layer_intr(dd->ipath_layer.l_arg, IPATH_LAYER_INT_LID); mutex_unlock(&ipath_layer_mutex); - - return 0; -} - -int ipath_layer_set_guid(struct ipath_devdata *dd, __be64 guid) -{ - /* XXX - need to inform anyone who cares this just happened. 
*/ - dd->ipath_guid = guid; - return 0; -} - -__be64 ipath_layer_get_guid(struct ipath_devdata *dd) -{ - return dd->ipath_guid; -} - -u32 ipath_layer_get_majrev(struct ipath_devdata *dd) -{ - return dd->ipath_majrev; -} - -u32 ipath_layer_get_minrev(struct ipath_devdata *dd) -{ - return dd->ipath_minrev; -} - -u32 ipath_layer_get_pcirev(struct ipath_devdata *dd) -{ - return dd->ipath_pcirev; -} - -u32 ipath_layer_get_flags(struct ipath_devdata *dd) -{ - return dd->ipath_flags; -} - -struct device *ipath_layer_get_device(struct ipath_devdata *dd) -{ - return &dd->pcidev->dev; -} - -u16 ipath_layer_get_deviceid(struct ipath_devdata *dd) -{ - return dd->ipath_deviceid; -} - -u32 ipath_layer_get_vendorid(struct ipath_devdata *dd) -{ - return dd->ipath_vendorid; -} - -u64 ipath_layer_get_lastibcstat(struct ipath_devdata *dd) -{ - return dd->ipath_lastibcstat; -} - -u32 ipath_layer_get_ibmtu(struct ipath_devdata *dd) -{ - return dd->ipath_ibmtu; } void ipath_layer_add(struct ipath_devdata *dd) @@ -435,22 +207,6 @@ void ipath_layer_unregister(void) } EXPORT_SYMBOL_GPL(ipath_layer_unregister); - -static void __ipath_verbs_timer(unsigned long arg) -{ - struct ipath_devdata *dd = (struct ipath_devdata *) arg; - - /* - * If port 0 receive packet interrupts are not available, or - * can be missed, poll the receive queue - */ - if (dd->ipath_flags & IPATH_POLL_RX_INTR) - ipath_kreceive(dd); - - /* Handle verbs layer timeouts. */ - ipath_ib_timer(dd->verbs_dev); - mod_timer(&dd->verbs_timer, jiffies + 1); -} int ipath_layer_open(struct ipath_devdata *dd, u32 * pktmax) { @@ -539,380 +295,6 @@ u16 ipath_layer_get_bcast(struct ipath_d } EXPORT_SYMBOL_GPL(ipath_layer_get_bcast); - -u32 ipath_layer_get_cr_errpkey(struct ipath_devdata *dd) -{ - return ipath_read_creg32(dd, dd->ipath_cregs->cr_errpkey); -} - -static void update_sge(struct ipath_sge_state *ss, u32 length) -{ - struct ipath_sge *sge = &ss->sge; - - sge->vaddr += length; - sge->length -= length; - sge->sge_length -= length; - if (sge->sge_length == 0) { - if (--ss->num_sge) - *sge = *ss->sg_list++; - } else if (sge->length == 0 && sge->mr != NULL) { - if (++sge->n >= IPATH_SEGSZ) { - if (++sge->m >= sge->mr->mapsz) - return; - sge->n = 0; - } - sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; - sge->length = sge->mr->map[sge->m]->segs[sge->n].length; - } -} - -#ifdef __LITTLE_ENDIAN -static inline u32 get_upper_bits(u32 data, u32 shift) -{ - return data >> shift; -} - -static inline u32 set_upper_bits(u32 data, u32 shift) -{ - return data << shift; -} - -static inline u32 clear_upper_bytes(u32 data, u32 n, u32 off) -{ - data <<= ((sizeof(u32) - n) * BITS_PER_BYTE); - data >>= ((sizeof(u32) - n - off) * BITS_PER_BYTE); - return data; -} -#else -static inline u32 get_upper_bits(u32 data, u32 shift) -{ - return data << shift; -} - -static inline u32 set_upper_bits(u32 data, u32 shift) -{ - return data >> shift; -} - -static inline u32 clear_upper_bytes(u32 data, u32 n, u32 off) -{ - data >>= ((sizeof(u32) - n) * BITS_PER_BYTE); - data <<= ((sizeof(u32) - n - off) * BITS_PER_BYTE); - return data; -} -#endif - -static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, - u32 length) -{ - u32 extra = 0; - u32 data = 0; - u32 last; - - while (1) { - u32 len = ss->sge.length; - u32 off; - - BUG_ON(len == 0); - if (len > length) - len = length; - if (len > ss->sge.sge_length) - len = ss->sge.sge_length; - /* If the source address is not aligned, try to align it. 
*/ - off = (unsigned long)ss->sge.vaddr & (sizeof(u32) - 1); - if (off) { - u32 *addr = (u32 *)((unsigned long)ss->sge.vaddr & - ~(sizeof(u32) - 1)); - u32 v = get_upper_bits(*addr, off * BITS_PER_BYTE); - u32 y; - - y = sizeof(u32) - off; - if (len > y) - len = y; - if (len + extra >= sizeof(u32)) { - data |= set_upper_bits(v, extra * - BITS_PER_BYTE); - len = sizeof(u32) - extra; - if (len == length) { - last = data; - break; - } - __raw_writel(data, piobuf); - piobuf++; - extra = 0; - data = 0; - } else { - /* Clear unused upper bytes */ - data |= clear_upper_bytes(v, len, extra); - if (len == length) { - last = data; - break; - } - extra += len; - } - } else if (extra) { - /* Source address is aligned. */ - u32 *addr = (u32 *) ss->sge.vaddr; - int shift = extra * BITS_PER_BYTE; - int ushift = 32 - shift; - u32 l = len; - - while (l >= sizeof(u32)) { - u32 v = *addr; - - data |= set_upper_bits(v, shift); - __raw_writel(data, piobuf); - data = get_upper_bits(v, ushift); - piobuf++; - addr++; - l -= sizeof(u32); - } - /* - * We still have 'extra' number of bytes leftover. - */ - if (l) { - u32 v = *addr; - - if (l + extra >= sizeof(u32)) { - data |= set_upper_bits(v, shift); - len -= l + extra - sizeof(u32); - if (len == length) { - last = data; - break; - } - __raw_writel(data, piobuf); - piobuf++; - extra = 0; - data = 0; - } else { - /* Clear unused upper bytes */ - data |= clear_upper_bytes(v, l, - extra); - if (len == length) { - last = data; - break; - } - extra += l; - } - } else if (len == length) { - last = data; - break; - } - } else if (len == length) { - u32 w; - - /* - * Need to round up for the last dword in the - * packet. - */ - w = (len + 3) >> 2; - __iowrite32_copy(piobuf, ss->sge.vaddr, w - 1); - piobuf += w - 1; - last = ((u32 *) ss->sge.vaddr)[w - 1]; - break; - } else { - u32 w = len >> 2; - - __iowrite32_copy(piobuf, ss->sge.vaddr, w); - piobuf += w; - - extra = len & (sizeof(u32) - 1); - if (extra) { - u32 v = ((u32 *) ss->sge.vaddr)[w]; - - /* Clear unused upper bytes */ - data = clear_upper_bytes(v, extra, 0); - } - } - update_sge(ss, len); - length -= len; - } - /* Update address before sending packet. */ - update_sge(ss, length); - /* must flush early everything before trigger word */ - ipath_flush_wc(); - __raw_writel(last, piobuf); - /* be sure trigger word is written */ - ipath_flush_wc(); -} - -/** - * ipath_verbs_send - send a packet from the verbs layer - * @dd: the infinipath device - * @hdrwords: the number of words in the header - * @hdr: the packet header - * @len: the length of the packet in bytes - * @ss: the SGE to send - * - * This is like ipath_sma_send_pkt() in that we need to be able to send - * packets after the chip is initialized (MADs) but also like - * ipath_layer_send_hdr() since its used by the verbs layer. - */ -int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, - u32 *hdr, u32 len, struct ipath_sge_state *ss) -{ - u32 __iomem *piobuf; - u32 plen; - int ret; - - /* +1 is for the qword padding of pbc */ - plen = hdrwords + ((len + 3) >> 2) + 1; - if (unlikely((plen << 2) > dd->ipath_ibmaxlen)) { - ipath_dbg("packet len 0x%x too long, failing\n", plen); - ret = -EINVAL; - goto bail; - } - - /* Get a PIO buffer to use. */ - piobuf = ipath_getpiobuf(dd, NULL); - if (unlikely(piobuf == NULL)) { - ret = -EBUSY; - goto bail; - } - - /* - * Write len to control qword, no flags. - * We have to flush after the PBC for correctness on some cpus - * or WC buffer can be written out of order. 
- */ - writeq(plen, piobuf); - ipath_flush_wc(); - piobuf += 2; - if (len == 0) { - /* - * If there is just the header portion, must flush before - * writing last word of header for correctness, and after - * the last header word (trigger word). - */ - __iowrite32_copy(piobuf, hdr, hdrwords - 1); - ipath_flush_wc(); - __raw_writel(hdr[hdrwords - 1], piobuf + hdrwords - 1); - ipath_flush_wc(); - ret = 0; - goto bail; - } - - __iowrite32_copy(piobuf, hdr, hdrwords); - piobuf += hdrwords; - - /* The common case is aligned and contained in one segment. */ - if (likely(ss->num_sge == 1 && len <= ss->sge.length && - !((unsigned long)ss->sge.vaddr & (sizeof(u32) - 1)))) { - u32 w; - u32 *addr = (u32 *) ss->sge.vaddr; - - /* Update address before sending packet. */ - update_sge(ss, len); - /* Need to round up for the last dword in the packet. */ - w = (len + 3) >> 2; - __iowrite32_copy(piobuf, addr, w - 1); - /* must flush early everything before trigger word */ - ipath_flush_wc(); - __raw_writel(addr[w - 1], piobuf + w - 1); - /* be sure trigger word is written */ - ipath_flush_wc(); - ret = 0; - goto bail; - } - copy_io(piobuf, ss, len); - ret = 0; - -bail: - return ret; -} - -int ipath_layer_snapshot_counters(struct ipath_devdata *dd, u64 *swords, - u64 *rwords, u64 *spkts, u64 *rpkts, - u64 *xmit_wait) -{ - int ret; - - if (!(dd->ipath_flags & IPATH_INITTED)) { - /* no hardware, freeze, etc. */ - ipath_dbg("unit %u not usable\n", dd->ipath_unit); - ret = -EINVAL; - goto bail; - } - *swords = ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordsendcnt); - *rwords = ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordrcvcnt); - *spkts = ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktsendcnt); - *rpkts = ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktrcvcnt); - *xmit_wait = ipath_snap_cntr(dd, dd->ipath_cregs->cr_sendstallcnt); - - ret = 0; - -bail: - return ret; -} - -/** - * ipath_layer_get_counters - get various chip counters - * @dd: the infinipath device - * @cntrs: counters are placed here - * - * Return the counters needed by recv_pma_get_portcounters(). - */ -int ipath_layer_get_counters(struct ipath_devdata *dd, - struct ipath_layer_counters *cntrs) -{ - int ret; - - if (!(dd->ipath_flags & IPATH_INITTED)) { - /* no hardware, freeze, etc. */ - ipath_dbg("unit %u not usable\n", dd->ipath_unit); - ret = -EINVAL; - goto bail; - } - cntrs->symbol_error_counter = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_ibsymbolerrcnt); - cntrs->link_error_recovery_counter = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkerrrecovcnt); - /* - * The link downed counter counts when the other side downs the - * connection. We add in the number of times we downed the link - * due to local link integrity errors to compensate. 
- */ - cntrs->link_downed_counter = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkdowncnt); - cntrs->port_rcv_errors = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_rxdroppktcnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvovflcnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_portovflcnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_err_rlencnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_invalidrlencnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_erricrccnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_errvcrccnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_errlpcrccnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_badformatcnt); - cntrs->port_rcv_remphys_errors = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvebpcnt); - cntrs->port_xmit_discards = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_unsupvlcnt); - cntrs->port_xmit_data = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordsendcnt); - cntrs->port_rcv_data = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordrcvcnt); - cntrs->port_xmit_packets = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktsendcnt); - cntrs->port_rcv_packets = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktrcvcnt); - cntrs->local_link_integrity_errors = dd->ipath_lli_errors; - cntrs->excessive_buffer_overrun_errors = 0; /* XXX */ - - ret = 0; - -bail: - return ret; -} - -int ipath_layer_want_buffer(struct ipath_devdata *dd) -{ - set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); - ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - dd->ipath_sendctrl); - - return 0; -} int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr) { @@ -985,361 +367,3 @@ int ipath_layer_set_piointbufavail_int(s } EXPORT_SYMBOL_GPL(ipath_layer_set_piointbufavail_int); - -int ipath_layer_enable_timer(struct ipath_devdata *dd) -{ - /* - * HT-400 has a design flaw where the chip and kernel idea - * of the tail register don't always agree, and therefore we won't - * get an interrupt on the next packet received. - * If the board supports per packet receive interrupts, use it. - * Otherwise, the timer function periodically checks for packets - * to cover this case. - * Either way, the timer is needed for verbs layer related - * processing. 
- */ - if (dd->ipath_flags & IPATH_GPIO_INTR) { - ipath_write_kreg(dd, dd->ipath_kregs->kr_debugportselect, - 0x2074076542310ULL); - /* Enable GPIO bit 2 interrupt */ - ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, - (u64) (1 << 2)); - } - - init_timer(&dd->verbs_timer); - dd->verbs_timer.function = __ipath_verbs_timer; - dd->verbs_timer.data = (unsigned long)dd; - dd->verbs_timer.expires = jiffies + 1; - add_timer(&dd->verbs_timer); - - return 0; -} - -int ipath_layer_disable_timer(struct ipath_devdata *dd) -{ - /* Disable GPIO bit 2 interrupt */ - if (dd->ipath_flags & IPATH_GPIO_INTR) - ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, 0); - - del_timer_sync(&dd->verbs_timer); - - return 0; -} - -/** - * ipath_layer_set_verbs_flags - set the verbs layer flags - * @dd: the infinipath device - * @flags: the flags to set - */ -int ipath_layer_set_verbs_flags(struct ipath_devdata *dd, unsigned flags) -{ - struct ipath_devdata *ss; - unsigned long lflags; - - spin_lock_irqsave(&ipath_devs_lock, lflags); - - list_for_each_entry(ss, &ipath_dev_list, ipath_list) { - if (!(ss->ipath_flags & IPATH_INITTED)) - continue; - if ((flags & IPATH_VERBS_KERNEL_SMA) && - !(*ss->ipath_statusp & IPATH_STATUS_SMA)) - *ss->ipath_statusp |= IPATH_STATUS_OIB_SMA; - else - *ss->ipath_statusp &= ~IPATH_STATUS_OIB_SMA; - } - - spin_unlock_irqrestore(&ipath_devs_lock, lflags); - - return 0; -} - -/** - * ipath_layer_get_npkeys - return the size of the PKEY table for port 0 - * @dd: the infinipath device - */ -unsigned ipath_layer_get_npkeys(struct ipath_devdata *dd) -{ - return ARRAY_SIZE(dd->ipath_pd[0]->port_pkeys); -} - -/** - * ipath_layer_get_pkey - return the indexed PKEY from the port 0 PKEY table - * @dd: the infinipath device - * @index: the PKEY index - */ -unsigned ipath_layer_get_pkey(struct ipath_devdata *dd, unsigned index) -{ - unsigned ret; - - if (index >= ARRAY_SIZE(dd->ipath_pd[0]->port_pkeys)) - ret = 0; - else - ret = dd->ipath_pd[0]->port_pkeys[index]; - - return ret; -} - -/** - * ipath_layer_get_pkeys - return the PKEY table for port 0 - * @dd: the infinipath device - * @pkeys: the pkey table is placed here - */ -int ipath_layer_get_pkeys(struct ipath_devdata *dd, u16 * pkeys) -{ - struct ipath_portdata *pd = dd->ipath_pd[0]; - - memcpy(pkeys, pd->port_pkeys, sizeof(pd->port_pkeys)); - - return 0; -} - -/** - * rm_pkey - decrecment the reference count for the given PKEY - * @dd: the infinipath device - * @key: the PKEY index - * - * Return true if this was the last reference and the hardware table entry - * needs to be changed. - */ -static int rm_pkey(struct ipath_devdata *dd, u16 key) -{ - int i; - int ret; - - for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { - if (dd->ipath_pkeys[i] != key) - continue; - if (atomic_dec_and_test(&dd->ipath_pkeyrefs[i])) { - dd->ipath_pkeys[i] = 0; - ret = 1; - goto bail; - } - break; - } - - ret = 0; - -bail: - return ret; -} - -/** - * add_pkey - add the given PKEY to the hardware table - * @dd: the infinipath device - * @key: the PKEY - * - * Return an error code if unable to add the entry, zero if no change, - * or 1 if the hardware PKEY register needs to be updated. - */ -static int add_pkey(struct ipath_devdata *dd, u16 key) -{ - int i; - u16 lkey = key & 0x7FFF; - int any = 0; - int ret; - - if (lkey == 0x7FFF) { - ret = 0; - goto bail; - } - - /* Look for an empty slot or a matching PKEY. 
*/ - for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { - if (!dd->ipath_pkeys[i]) { - any++; - continue; - } - /* If it matches exactly, try to increment the ref count */ - if (dd->ipath_pkeys[i] == key) { - if (atomic_inc_return(&dd->ipath_pkeyrefs[i]) > 1) { - ret = 0; - goto bail; - } - /* Lost the race. Look for an empty slot below. */ - atomic_dec(&dd->ipath_pkeyrefs[i]); - any++; - } - /* - * It makes no sense to have both the limited and unlimited - * PKEY set at the same time since the unlimited one will - * disable the limited one. - */ - if ((dd->ipath_pkeys[i] & 0x7FFF) == lkey) { - ret = -EEXIST; - goto bail; - } - } - if (!any) { - ret = -EBUSY; - goto bail; - } - for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { - if (!dd->ipath_pkeys[i] && - atomic_inc_return(&dd->ipath_pkeyrefs[i]) == 1) { - /* for ipathstats, etc. */ - ipath_stats.sps_pkeys[i] = lkey; - dd->ipath_pkeys[i] = key; - ret = 1; - goto bail; - } - } - ret = -EBUSY; - -bail: - return ret; -} - -/** - * ipath_layer_set_pkeys - set the PKEY table for port 0 - * @dd: the infinipath device - * @pkeys: the PKEY table - */ -int ipath_layer_set_pkeys(struct ipath_devdata *dd, u16 * pkeys) -{ - struct ipath_portdata *pd; - int i; - int changed = 0; - - pd = dd->ipath_pd[0]; - - for (i = 0; i < ARRAY_SIZE(pd->port_pkeys); i++) { - u16 key = pkeys[i]; - u16 okey = pd->port_pkeys[i]; - - if (key == okey) - continue; - /* - * The value of this PKEY table entry is changing. - * Remove the old entry in the hardware's array of PKEYs. - */ - if (okey & 0x7FFF) - changed |= rm_pkey(dd, okey); - if (key & 0x7FFF) { - int ret = add_pkey(dd, key); - - if (ret < 0) - key = 0; - else - changed |= ret; - } - pd->port_pkeys[i] = key; - } - if (changed) { - u64 pkey; - - pkey = (u64) dd->ipath_pkeys[0] | - ((u64) dd->ipath_pkeys[1] << 16) | - ((u64) dd->ipath_pkeys[2] << 32) | - ((u64) dd->ipath_pkeys[3] << 48); - ipath_cdbg(VERBOSE, "p0 new pkey reg %llx\n", - (unsigned long long) pkey); - ipath_write_kreg(dd, dd->ipath_kregs->kr_partitionkey, - pkey); - } - return 0; -} - -/** - * ipath_layer_get_linkdowndefaultstate - get the default linkdown state - * @dd: the infinipath device - * - * Returns zero if the default is POLL, 1 if the default is SLEEP. - */ -int ipath_layer_get_linkdowndefaultstate(struct ipath_devdata *dd) -{ - return !!(dd->ipath_ibcctrl & INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE); -} - -/** - * ipath_layer_set_linkdowndefaultstate - set the default linkdown state - * @dd: the infinipath device - * @sleep: the new state - * - * Note that this will only take effect when the link state changes. - */ -int ipath_layer_set_linkdowndefaultstate(struct ipath_devdata *dd, - int sleep) -{ - if (sleep) - dd->ipath_ibcctrl |= INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE; - else - dd->ipath_ibcctrl &= ~INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE; - ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, - dd->ipath_ibcctrl); - return 0; -} - -int ipath_layer_get_phyerrthreshold(struct ipath_devdata *dd) -{ - return (dd->ipath_ibcctrl >> - INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & - INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; -} - -/** - * ipath_layer_set_phyerrthreshold - set the physical error threshold - * @dd: the infinipath device - * @n: the new threshold - * - * Note that this will only take effect when the link state changes. 
- */ -int ipath_layer_set_phyerrthreshold(struct ipath_devdata *dd, unsigned n) -{ - unsigned v; - - v = (dd->ipath_ibcctrl >> INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & - INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; - if (v != n) { - dd->ipath_ibcctrl &= - ~(INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK << - INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT); - dd->ipath_ibcctrl |= - (u64) n << INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT; - ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, - dd->ipath_ibcctrl); - } - return 0; -} - -int ipath_layer_get_overrunthreshold(struct ipath_devdata *dd) -{ - return (dd->ipath_ibcctrl >> - INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) & - INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK; -} - -/** - * ipath_layer_set_overrunthreshold - set the overrun threshold - * @dd: the infinipath device - * @n: the new threshold - * - * Note that this will only take effect when the link state changes. - */ -int ipath_layer_set_overrunthreshold(struct ipath_devdata *dd, unsigned n) -{ - unsigned v; - - v = (dd->ipath_ibcctrl >> INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) & - INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK; - if (v != n) { - dd->ipath_ibcctrl &= - ~(INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK << - INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT); - dd->ipath_ibcctrl |= - (u64) n << INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT; - ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, - dd->ipath_ibcctrl); - } - return 0; -} - -int ipath_layer_get_boardname(struct ipath_devdata *dd, char *name, - size_t namelen) -{ - return dd->ipath_f_get_boardname(dd, name, namelen); -} - -u32 ipath_layer_get_rcvhdrentsize(struct ipath_devdata *dd) -{ - return dd->ipath_rcvhdrentsize; -} diff --git a/drivers/infiniband/hw/ipath/ipath_layer.h b/drivers/infiniband/hw/ipath/ipath_layer.h --- a/drivers/infiniband/hw/ipath/ipath_layer.h Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.h Fri Aug 25 11:19:45 2006 -0700 @@ -40,72 +40,8 @@ */ struct sk_buff; -struct ipath_sge_state; struct ipath_devdata; struct ether_header; - -struct ipath_layer_counters { - u64 symbol_error_counter; - u64 link_error_recovery_counter; - u64 link_downed_counter; - u64 port_rcv_errors; - u64 port_rcv_remphys_errors; - u64 port_xmit_discards; - u64 port_xmit_data; - u64 port_rcv_data; - u64 port_xmit_packets; - u64 port_rcv_packets; - u32 local_link_integrity_errors; - u32 excessive_buffer_overrun_errors; -}; - -/* - * A segment is a linear region of low physical memory. - * XXX Maybe we should use phys addr here and kmap()/kunmap(). - * Used by the verbs layer. - */ -struct ipath_seg { - void *vaddr; - size_t length; -}; - -/* The number of ipath_segs that fit in a page. */ -#define IPATH_SEGSZ (PAGE_SIZE / sizeof (struct ipath_seg)) - -struct ipath_segarray { - struct ipath_seg segs[IPATH_SEGSZ]; -}; - -struct ipath_mregion { - u64 user_base; /* User's address for this region */ - u64 iova; /* IB start address of this region */ - size_t length; - u32 lkey; - u32 offset; /* offset (bytes) to start of region */ - int access_flags; - u32 max_segs; /* number of ipath_segs in all the arrays */ - u32 mapsz; /* size of the map array */ - struct ipath_segarray *map[0]; /* the segments */ -}; - -/* - * These keep track of the copy progress within a memory region. - * Used by the verbs layer. 
- */ -struct ipath_sge { - struct ipath_mregion *mr; - void *vaddr; /* current pointer into the segment */ - u32 sge_length; /* length of the SGE */ - u32 length; /* remaining length of the segment */ - u16 m; /* current index: mr->map[m] */ - u16 n; /* current index: mr->map[m]->segs[n] */ -}; - -struct ipath_sge_state { - struct ipath_sge *sg_list; /* next SGE to be used if any */ - struct ipath_sge sge; /* progress state for the current SGE */ - u8 num_sge; -}; int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *), void (*l_remove)(void *), @@ -119,49 +55,9 @@ u16 ipath_layer_get_lid(struct ipath_dev u16 ipath_layer_get_lid(struct ipath_devdata *dd); int ipath_layer_get_mac(struct ipath_devdata *dd, u8 *); u16 ipath_layer_get_bcast(struct ipath_devdata *dd); -u32 ipath_layer_get_cr_errpkey(struct ipath_devdata *dd); -int ipath_layer_set_linkstate(struct ipath_devdata *dd, u8 state); -int ipath_layer_set_mtu(struct ipath_devdata *, u16); -int ipath_set_lid(struct ipath_devdata *, u32, u8); int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr); -int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, - u32 * hdr, u32 len, struct ipath_sge_state *ss); int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd); -int ipath_layer_get_boardname(struct ipath_devdata *dd, char *name, - size_t namelen); -int ipath_layer_snapshot_counters(struct ipath_devdata *dd, u64 *swords, - u64 *rwords, u64 *spkts, u64 *rpkts, - u64 *xmit_wait); -int ipath_layer_get_counters(struct ipath_devdata *dd, - struct ipath_layer_counters *cntrs); -int ipath_layer_want_buffer(struct ipath_devdata *dd); -int ipath_layer_set_guid(struct ipath_devdata *, __be64 guid); -__be64 ipath_layer_get_guid(struct ipath_devdata *); -u32 ipath_layer_get_majrev(struct ipath_devdata *); -u32 ipath_layer_get_minrev(struct ipath_devdata *); -u32 ipath_layer_get_pcirev(struct ipath_devdata *); -u32 ipath_layer_get_flags(struct ipath_devdata *dd); -struct device *ipath_layer_get_device(struct ipath_devdata *dd); -u16 ipath_layer_get_deviceid(struct ipath_devdata *dd); -u32 ipath_layer_get_vendorid(struct ipath_devdata *); -u64 ipath_layer_get_lastibcstat(struct ipath_devdata *dd); -u32 ipath_layer_get_ibmtu(struct ipath_devdata *dd); -int ipath_layer_enable_timer(struct ipath_devdata *dd); -int ipath_layer_disable_timer(struct ipath_devdata *dd); -int ipath_layer_set_verbs_flags(struct ipath_devdata *dd, unsigned flags); -unsigned ipath_layer_get_npkeys(struct ipath_devdata *dd); -unsigned ipath_layer_get_pkey(struct ipath_devdata *dd, unsigned index); -int ipath_layer_get_pkeys(struct ipath_devdata *dd, u16 *pkeys); -int ipath_layer_set_pkeys(struct ipath_devdata *dd, u16 *pkeys); -int ipath_layer_get_linkdowndefaultstate(struct ipath_devdata *dd); -int ipath_layer_set_linkdowndefaultstate(struct ipath_devdata *dd, - int sleep); -int ipath_layer_get_phyerrthreshold(struct ipath_devdata *dd); -int ipath_layer_set_phyerrthreshold(struct ipath_devdata *dd, unsigned n); -int ipath_layer_get_overrunthreshold(struct ipath_devdata *dd); -int ipath_layer_set_overrunthreshold(struct ipath_devdata *dd, unsigned n); -u32 ipath_layer_get_rcvhdrentsize(struct ipath_devdata *dd); /* ipath_ether interrupt values */ #define IPATH_LAYER_INT_IF_UP 0x2 diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Fri Aug 25 11:19:45 2006 -0700 @@ -101,15 
+101,15 @@ static int recv_subn_get_nodeinfo(struct nip->num_ports = ibdev->phys_port_cnt; /* This is already in network order */ nip->sys_guid = to_idev(ibdev)->sys_image_guid; - nip->node_guid = ipath_layer_get_guid(dd); + nip->node_guid = dd->ipath_guid; nip->port_guid = nip->sys_guid; - nip->partition_cap = cpu_to_be16(ipath_layer_get_npkeys(dd)); - nip->device_id = cpu_to_be16(ipath_layer_get_deviceid(dd)); - majrev = ipath_layer_get_majrev(dd); - minrev = ipath_layer_get_minrev(dd); + nip->partition_cap = cpu_to_be16(ipath_get_npkeys(dd)); + nip->device_id = cpu_to_be16(dd->ipath_deviceid); + majrev = dd->ipath_majrev; + minrev = dd->ipath_minrev; nip->revision = cpu_to_be32((majrev << 16) | minrev); nip->local_port_num = port; - vendor = ipath_layer_get_vendorid(dd); + vendor = dd->ipath_vendorid; nip->vendor_id[0] = 0; nip->vendor_id[1] = vendor >> 8; nip->vendor_id[2] = vendor; @@ -133,11 +133,87 @@ static int recv_subn_get_guidinfo(struct */ if (startgx == 0) /* The first is a copy of the read-only HW GUID. */ - *p = ipath_layer_get_guid(to_idev(ibdev)->dd); + *p = to_idev(ibdev)->dd->ipath_guid; else smp->status |= IB_SMP_INVALID_FIELD; return reply(smp); +} + + +static int get_overrunthreshold(struct ipath_devdata *dd) +{ + return (dd->ipath_ibcctrl >> + INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK; +} + +/** + * set_overrunthreshold - set the overrun threshold + * @dd: the infinipath device + * @n: the new threshold + * + * Note that this will only take effect when the link state changes. + */ +static int set_overrunthreshold(struct ipath_devdata *dd, unsigned n) +{ + unsigned v; + + v = (dd->ipath_ibcctrl >> INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK; + if (v != n) { + dd->ipath_ibcctrl &= + ~(INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK << + INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT); + dd->ipath_ibcctrl |= + (u64) n << INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT; + ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, + dd->ipath_ibcctrl); + } + return 0; +} + +static int get_phyerrthreshold(struct ipath_devdata *dd) +{ + return (dd->ipath_ibcctrl >> + INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; +} + +/** + * set_phyerrthreshold - set the physical error threshold + * @dd: the infinipath device + * @n: the new threshold + * + * Note that this will only take effect when the link state changes. + */ +static int set_phyerrthreshold(struct ipath_devdata *dd, unsigned n) +{ + unsigned v; + + v = (dd->ipath_ibcctrl >> INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; + if (v != n) { + dd->ipath_ibcctrl &= + ~(INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK << + INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT); + dd->ipath_ibcctrl |= + (u64) n << INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT; + ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, + dd->ipath_ibcctrl); + } + return 0; +} + +/** + * get_linkdowndefaultstate - get the default linkdown state + * @dd: the infinipath device + * + * Returns zero if the default is POLL, 1 if the default is SLEEP. + */ +static int get_linkdowndefaultstate(struct ipath_devdata *dd) +{ + return !!(dd->ipath_ibcctrl & INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE); } static int recv_subn_get_portinfo(struct ib_smp *smp, @@ -166,7 +242,7 @@ static int recv_subn_get_portinfo(struct (dev->mkeyprot_resv_lmc >> 6) == 0) pip->mkey = dev->mkey; pip->gid_prefix = dev->gid_prefix; - lid = ipath_layer_get_lid(dev->dd); + lid = dev->dd->ipath_lid; pip->lid = lid ? 
cpu_to_be16(lid) : IB_LID_PERMISSIVE; pip->sm_lid = cpu_to_be16(dev->sm_lid); pip->cap_mask = cpu_to_be32(dev->port_cap_flags); @@ -177,14 +253,14 @@ static int recv_subn_get_portinfo(struct pip->link_width_supported = 3; /* 1x or 4x */ pip->link_width_active = 2; /* 4x */ pip->linkspeed_portstate = 0x10; /* 2.5Gbps */ - ibcstat = ipath_layer_get_lastibcstat(dev->dd); + ibcstat = dev->dd->ipath_lastibcstat; pip->linkspeed_portstate |= ((ibcstat >> 4) & 0x3) + 1; pip->portphysstate_linkdown = (ipath_cvt_physportstate[ibcstat & 0xf] << 4) | - (ipath_layer_get_linkdowndefaultstate(dev->dd) ? 1 : 2); + (get_linkdowndefaultstate(dev->dd) ? 1 : 2); pip->mkeyprot_resv_lmc = dev->mkeyprot_resv_lmc; pip->linkspeedactive_enabled = 0x11; /* 2.5Gbps, 2.5Gbps */ - switch (ipath_layer_get_ibmtu(dev->dd)) { + switch (dev->dd->ipath_ibmtu) { case 4096: mtu = IB_MTU_4096; break; @@ -217,7 +293,7 @@ static int recv_subn_get_portinfo(struct pip->mkey_violations = cpu_to_be16(dev->mkey_violations); /* P_KeyViolations are counted by hardware. */ pip->pkey_violations = - cpu_to_be16((ipath_layer_get_cr_errpkey(dev->dd) - + cpu_to_be16((ipath_get_cr_errpkey(dev->dd) - dev->z_pkey_violations) & 0xFFFF); pip->qkey_violations = cpu_to_be16(dev->qkey_violations); /* Only the hardware GUID is supported for now */ @@ -226,8 +302,8 @@ static int recv_subn_get_portinfo(struct /* 32.768 usec. response time (guessing) */ pip->resv_resptimevalue = 3; pip->localphyerrors_overrunerrors = - (ipath_layer_get_phyerrthreshold(dev->dd) << 4) | - ipath_layer_get_overrunthreshold(dev->dd); + (get_phyerrthreshold(dev->dd) << 4) | + get_overrunthreshold(dev->dd); /* pip->max_credit_hint; */ /* pip->link_roundtrip_latency[3]; */ @@ -235,6 +311,20 @@ static int recv_subn_get_portinfo(struct bail: return ret; +} + +/** + * get_pkeys - return the PKEY table for port 0 + * @dd: the infinipath device + * @pkeys: the pkey table is placed here + */ +static int get_pkeys(struct ipath_devdata *dd, u16 * pkeys) +{ + struct ipath_portdata *pd = dd->ipath_pd[0]; + + memcpy(pkeys, pd->port_pkeys, sizeof(pd->port_pkeys)); + + return 0; } static int recv_subn_get_pkeytable(struct ib_smp *smp, @@ -249,9 +339,9 @@ static int recv_subn_get_pkeytable(struc memset(smp->data, 0, sizeof(smp->data)); if (startpx == 0) { struct ipath_ibdev *dev = to_idev(ibdev); - unsigned i, n = ipath_layer_get_npkeys(dev->dd); - - ipath_layer_get_pkeys(dev->dd, p); + unsigned i, n = ipath_get_npkeys(dev->dd); + + get_pkeys(dev->dd, p); for (i = 0; i < n; i++) q[i] = cpu_to_be16(p[i]); @@ -266,6 +356,24 @@ static int recv_subn_set_guidinfo(struct { /* The only GUID we support is the first read-only entry. */ return recv_subn_get_guidinfo(smp, ibdev); +} + +/** + * set_linkdowndefaultstate - set the default linkdown state + * @dd: the infinipath device + * @sleep: the new state + * + * Note that this will only take effect when the link state changes. 
+ */ +static int set_linkdowndefaultstate(struct ipath_devdata *dd, int sleep) +{ + if (sleep) + dd->ipath_ibcctrl |= INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE; + else + dd->ipath_ibcctrl &= ~INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE; + ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, + dd->ipath_ibcctrl); + return 0; } /** @@ -290,7 +398,7 @@ static int recv_subn_set_portinfo(struct u8 state; u16 lstate; u32 mtu; - int ret; + int ret, ore; if (be32_to_cpu(smp->attr_mod) > ibdev->phys_port_cnt) goto err; @@ -304,7 +412,7 @@ static int recv_subn_set_portinfo(struct dev->mkey_lease_period = be16_to_cpu(pip->mkey_lease_period); lid = be16_to_cpu(pip->lid); - if (lid != ipath_layer_get_lid(dev->dd)) { + if (lid != dev->dd->ipath_lid) { /* Must be a valid unicast LID address. */ if (lid == 0 || lid >= IPATH_MULTICAST_LID_BASE) goto err; @@ -342,11 +450,11 @@ static int recv_subn_set_portinfo(struct case 0: /* NOP */ break; case 1: /* SLEEP */ - if (ipath_layer_set_linkdowndefaultstate(dev->dd, 1)) + if (set_linkdowndefaultstate(dev->dd, 1)) goto err; break; case 2: /* POLL */ - if (ipath_layer_set_linkdowndefaultstate(dev->dd, 0)) + if (set_linkdowndefaultstate(dev->dd, 0)) goto err; break; default: @@ -376,7 +484,7 @@ static int recv_subn_set_portinfo(struct /* XXX We have already partially updated our state! */ goto err; } - ipath_layer_set_mtu(dev->dd, mtu); + ipath_set_mtu(dev->dd, mtu); dev->sm_sl = pip->neighbormtu_mastersmsl & 0xF; @@ -392,20 +500,16 @@ static int recv_subn_set_portinfo(struct * later. */ if (pip->pkey_violations == 0) - dev->z_pkey_violations = - ipath_layer_get_cr_errpkey(dev->dd); + dev->z_pkey_violations = ipath_get_cr_errpkey(dev->dd); if (pip->qkey_violations == 0) dev->qkey_violations = 0; - if (ipath_layer_set_phyerrthreshold( - dev->dd, - (pip->localphyerrors_overrunerrors >> 4) & 0xF)) + ore = pip->localphyerrors_overrunerrors; + if (set_phyerrthreshold(dev->dd, (ore >> 4) & 0xF)) goto err; - if (ipath_layer_set_overrunthreshold( - dev->dd, - (pip->localphyerrors_overrunerrors & 0xF))) + if (set_overrunthreshold(dev->dd, (ore & 0xF))) goto err; dev->subnet_timeout = pip->clientrereg_resv_subnetto & 0x1F; @@ -423,7 +527,7 @@ static int recv_subn_set_portinfo(struct * is down or is being set to down. 
*/ state = pip->linkspeed_portstate & 0xF; - flags = ipath_layer_get_flags(dev->dd); + flags = dev->dd->ipath_flags; lstate = (pip->portphysstate_linkdown >> 4) & 0xF; if (lstate && !(state == IB_PORT_DOWN || state == IB_PORT_NOP)) goto err; @@ -439,7 +543,7 @@ static int recv_subn_set_portinfo(struct /* FALLTHROUGH */ case IB_PORT_DOWN: if (lstate == 0) - if (ipath_layer_get_linkdowndefaultstate(dev->dd)) + if (get_linkdowndefaultstate(dev->dd)) lstate = IPATH_IB_LINKDOWN_SLEEP; else lstate = IPATH_IB_LINKDOWN; @@ -451,7 +555,7 @@ static int recv_subn_set_portinfo(struct lstate = IPATH_IB_LINKDOWN_DISABLE; else goto err; - ipath_layer_set_linkstate(dev->dd, lstate); + ipath_set_linkstate(dev->dd, lstate); if (flags & IPATH_LINKACTIVE) { event.event = IB_EVENT_PORT_ERR; ib_dispatch_event(&event); @@ -460,7 +564,7 @@ static int recv_subn_set_portinfo(struct case IB_PORT_ARMED: if (!(flags & (IPATH_LINKINIT | IPATH_LINKACTIVE))) break; - ipath_layer_set_linkstate(dev->dd, IPATH_IB_LINKARM); + ipath_set_linkstate(dev->dd, IPATH_IB_LINKARM); if (flags & IPATH_LINKACTIVE) { event.event = IB_EVENT_PORT_ERR; ib_dispatch_event(&event); @@ -469,7 +573,7 @@ static int recv_subn_set_portinfo(struct case IB_PORT_ACTIVE: if (!(flags & IPATH_LINKARMED)) break; - ipath_layer_set_linkstate(dev->dd, IPATH_IB_LINKACTIVE); + ipath_set_linkstate(dev->dd, IPATH_IB_LINKACTIVE); event.event = IB_EVENT_PORT_ACTIVE; ib_dispatch_event(&event); break; @@ -491,6 +595,152 @@ err: done: return ret; +} + +/** + * rm_pkey - decrecment the reference count for the given PKEY + * @dd: the infinipath device + * @key: the PKEY index + * + * Return true if this was the last reference and the hardware table entry + * needs to be changed. + */ +static int rm_pkey(struct ipath_devdata *dd, u16 key) +{ + int i; + int ret; + + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (dd->ipath_pkeys[i] != key) + continue; + if (atomic_dec_and_test(&dd->ipath_pkeyrefs[i])) { + dd->ipath_pkeys[i] = 0; + ret = 1; + goto bail; + } + break; + } + + ret = 0; + +bail: + return ret; +} + +/** + * add_pkey - add the given PKEY to the hardware table + * @dd: the infinipath device + * @key: the PKEY + * + * Return an error code if unable to add the entry, zero if no change, + * or 1 if the hardware PKEY register needs to be updated. + */ +static int add_pkey(struct ipath_devdata *dd, u16 key) +{ + int i; + u16 lkey = key & 0x7FFF; + int any = 0; + int ret; + + if (lkey == 0x7FFF) { + ret = 0; + goto bail; + } + + /* Look for an empty slot or a matching PKEY. */ + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i]) { + any++; + continue; + } + /* If it matches exactly, try to increment the ref count */ + if (dd->ipath_pkeys[i] == key) { + if (atomic_inc_return(&dd->ipath_pkeyrefs[i]) > 1) { + ret = 0; + goto bail; + } + /* Lost the race. Look for an empty slot below. */ + atomic_dec(&dd->ipath_pkeyrefs[i]); + any++; + } + /* + * It makes no sense to have both the limited and unlimited + * PKEY set at the same time since the unlimited one will + * disable the limited one. + */ + if ((dd->ipath_pkeys[i] & 0x7FFF) == lkey) { + ret = -EEXIST; + goto bail; + } + } + if (!any) { + ret = -EBUSY; + goto bail; + } + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i] && + atomic_inc_return(&dd->ipath_pkeyrefs[i]) == 1) { + /* for ipathstats, etc. 
*/ + ipath_stats.sps_pkeys[i] = lkey; + dd->ipath_pkeys[i] = key; + ret = 1; + goto bail; + } + } + ret = -EBUSY; + +bail: + return ret; +} + +/** + * set_pkeys - set the PKEY table for port 0 + * @dd: the infinipath device + * @pkeys: the PKEY table + */ +static int set_pkeys(struct ipath_devdata *dd, u16 *pkeys) +{ + struct ipath_portdata *pd; + int i; + int changed = 0; + + pd = dd->ipath_pd[0]; + + for (i = 0; i < ARRAY_SIZE(pd->port_pkeys); i++) { + u16 key = pkeys[i]; + u16 okey = pd->port_pkeys[i]; + + if (key == okey) + continue; + /* + * The value of this PKEY table entry is changing. + * Remove the old entry in the hardware's array of PKEYs. + */ + if (okey & 0x7FFF) + changed |= rm_pkey(dd, okey); + if (key & 0x7FFF) { + int ret = add_pkey(dd, key); + + if (ret < 0) + key = 0; + else + changed |= ret; + } + pd->port_pkeys[i] = key; + } + if (changed) { + u64 pkey; + + pkey = (u64) dd->ipath_pkeys[0] | + ((u64) dd->ipath_pkeys[1] << 16) | + ((u64) dd->ipath_pkeys[2] << 32) | + ((u64) dd->ipath_pkeys[3] << 48); + ipath_cdbg(VERBOSE, "p0 new pkey reg %llx\n", + (unsigned long long) pkey); + ipath_write_kreg(dd, dd->ipath_kregs->kr_partitionkey, + pkey); + } + return 0; } static int recv_subn_set_pkeytable(struct ib_smp *smp, @@ -500,13 +750,12 @@ static int recv_subn_set_pkeytable(struc __be16 *p = (__be16 *) smp->data; u16 *q = (u16 *) smp->data; struct ipath_ibdev *dev = to_idev(ibdev); - unsigned i, n = ipath_layer_get_npkeys(dev->dd); + unsigned i, n = ipath_get_npkeys(dev->dd); for (i = 0; i < n; i++) q[i] = be16_to_cpu(p[i]); - if (startpx != 0 || - ipath_layer_set_pkeys(dev->dd, q) != 0) + if (startpx != 0 || set_pkeys(dev->dd, q) != 0) smp->status |= IB_SMP_INVALID_FIELD; return recv_subn_get_pkeytable(smp, ibdev); @@ -844,10 +1093,10 @@ static int recv_pma_get_portcounters(str struct ib_pma_portcounters *p = (struct ib_pma_portcounters *) pmp->data; struct ipath_ibdev *dev = to_idev(ibdev); - struct ipath_layer_counters cntrs; + struct ipath_verbs_counters cntrs; u8 port_select = p->port_select; - ipath_layer_get_counters(dev->dd, &cntrs); + ipath_get_counters(dev->dd, &cntrs); /* Adjust counters for any resets done. */ cntrs.symbol_error_counter -= dev->z_symbol_error_counter; @@ -944,8 +1193,8 @@ static int recv_pma_get_portcounters_ext u64 swords, rwords, spkts, rpkts, xwait; u8 port_select = p->port_select; - ipath_layer_snapshot_counters(dev->dd, &swords, &rwords, &spkts, - &rpkts, &xwait); + ipath_snapshot_counters(dev->dd, &swords, &rwords, &spkts, + &rpkts, &xwait); /* Adjust counters for any resets done. */ swords -= dev->z_port_xmit_data; @@ -978,13 +1227,13 @@ static int recv_pma_set_portcounters(str struct ib_pma_portcounters *p = (struct ib_pma_portcounters *) pmp->data; struct ipath_ibdev *dev = to_idev(ibdev); - struct ipath_layer_counters cntrs; + struct ipath_verbs_counters cntrs; /* * Since the HW doesn't support clearing counters, we save the * current count and subtract it from future responses. 
*/ - ipath_layer_get_counters(dev->dd, &cntrs); + ipath_get_counters(dev->dd, &cntrs); if (p->counter_select & IB_PMA_SEL_SYMBOL_ERROR) dev->z_symbol_error_counter = cntrs.symbol_error_counter; @@ -1041,8 +1290,8 @@ static int recv_pma_set_portcounters_ext struct ipath_ibdev *dev = to_idev(ibdev); u64 swords, rwords, spkts, rpkts, xwait; - ipath_layer_snapshot_counters(dev->dd, &swords, &rwords, &spkts, - &rpkts, &xwait); + ipath_snapshot_counters(dev->dd, &swords, &rwords, &spkts, + &rpkts, &xwait); if (p->counter_select & IB_PMA_SELX_PORT_XMIT_DATA) dev->z_port_xmit_data = swords; diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c --- a/drivers/infiniband/hw/ipath/ipath_mr.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_mr.c Fri Aug 25 11:19:45 2006 -0700 @@ -35,6 +35,18 @@ #include #include "ipath_verbs.h" + +/* Fast memory region */ +struct ipath_fmr { + struct ib_fmr ibfmr; + u8 page_shift; + struct ipath_mregion mr; /* must be last */ +}; + +static inline struct ipath_fmr *to_ifmr(struct ib_fmr *ibfmr) +{ + return container_of(ibfmr, struct ipath_fmr, ibfmr); +} /** * ipath_get_dma_mr - get a DMA memory region diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Fri Aug 25 11:19:45 2006 -0700 @@ -461,7 +461,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, goto inval; if (attr_mask & IB_QP_PKEY_INDEX) - if (attr->pkey_index >= ipath_layer_get_npkeys(dev->dd)) + if (attr->pkey_index >= ipath_get_npkeys(dev->dd)) goto inval; if (attr_mask & IB_QP_MIN_RNR_TIMER) @@ -645,6 +645,33 @@ __be32 ipath_compute_aeth(struct ipath_q } /** + * set_verbs_flags - set the verbs layer flags + * @dd: the infinipath device + * @flags: the flags to set + */ +static int set_verbs_flags(struct ipath_devdata *dd, unsigned flags) +{ + struct ipath_devdata *ss; + unsigned long lflags; + + spin_lock_irqsave(&ipath_devs_lock, lflags); + + list_for_each_entry(ss, &ipath_dev_list, ipath_list) { + if (!(ss->ipath_flags & IPATH_INITTED)) + continue; + if ((flags & IPATH_VERBS_KERNEL_SMA) && + !(*ss->ipath_statusp & IPATH_STATUS_SMA)) + *ss->ipath_statusp |= IPATH_STATUS_OIB_SMA; + else + *ss->ipath_statusp &= ~IPATH_STATUS_OIB_SMA; + } + + spin_unlock_irqrestore(&ipath_devs_lock, lflags); + + return 0; +} + +/** * ipath_create_qp - create a queue pair for a device * @ibpd: the protection domain who's device we create the queue pair for * @init_attr: the attributes of the queue pair @@ -760,8 +787,7 @@ struct ib_qp *ipath_create_qp(struct ib_ /* Tell the core driver that the kernel SMA is present. */ if (init_attr->qp_type == IB_QPT_SMI) - ipath_layer_set_verbs_flags(dev->dd, - IPATH_VERBS_KERNEL_SMA); + set_verbs_flags(dev->dd, IPATH_VERBS_KERNEL_SMA); break; default: @@ -838,7 +864,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp) /* Tell the core driver that the kernel SMA is gone. 
*/ if (qp->ibqp.qp_type == IB_QPT_SMI) - ipath_layer_set_verbs_flags(dev->dd, 0); + set_verbs_flags(dev->dd, 0); spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Fri Aug 25 11:19:45 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ipath_common.h" +#include "ipath_kernel.h" /* cut down ridiculously long IB macro names */ #define OP(x) IB_OPCODE_RC_##x @@ -540,7 +540,7 @@ static void send_rc_ack(struct ipath_qp lrh0 = IPATH_LRH_GRH; } /* read pkey_index w/o lock (its atomic) */ - bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); + bth0 = ipath_get_pkey(dev->dd, qp->s_pkey_index); if (qp->r_nak_state) ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPATH_MSN_MASK) | (qp->r_nak_state << @@ -557,7 +557,7 @@ static void send_rc_ack(struct ipath_qp hdr.lrh[0] = cpu_to_be16(lrh0); hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); hdr.lrh[2] = cpu_to_be16(hwords + SIZE_OF_CRC); - hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); + hdr.lrh[3] = cpu_to_be16(dev->dd->ipath_lid); ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); ohdr->bth[2] = cpu_to_be32(qp->r_ack_psn & IPATH_PSN_MASK); @@ -1323,8 +1323,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de * the eager header buffer size to 56 bytes so the last 4 * bytes of the BTH header (PSN) is in the data buffer. */ - header_in_data = - ipath_layer_get_rcvhdrentsize(dev->dd) == 16; + header_in_data = dev->dd->ipath_rcvhdrentsize == 16; if (header_in_data) { psn = be32_to_cpu(((__be32 *) data)[0]); data += sizeof(__be32); diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Fri Aug 25 11:19:45 2006 -0700 @@ -470,6 +470,15 @@ done: wake_up(&qp->wait); } +static int want_buffer(struct ipath_devdata *dd) +{ + set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + dd->ipath_sendctrl); + + return 0; +} + /** * ipath_no_bufs_available - tell the layer driver we need buffers * @qp: the QP that caused the problem @@ -486,7 +495,7 @@ void ipath_no_bufs_available(struct ipat list_add_tail(&qp->piowait, &dev->piowait); spin_unlock_irqrestore(&dev->pending_lock, flags); /* - * Note that as soon as ipath_layer_want_buffer() is called and + * Note that as soon as want_buffer() is called and * possibly before it returns, ipath_ib_piobufavail() * could be called. If we are still in the tasklet function, * tasklet_hi_schedule() will not call us until the next time @@ -496,7 +505,7 @@ void ipath_no_bufs_available(struct ipat */ clear_bit(IPATH_S_BUSY, &qp->s_flags); tasklet_unlock(&qp->s_task); - ipath_layer_want_buffer(dev->dd); + want_buffer(dev->dd); dev->n_piowait++; } @@ -611,7 +620,7 @@ u32 ipath_make_grh(struct ipath_ibdev *d hdr->hop_limit = grh->hop_limit; /* The SGID is 32-bit aligned. */ hdr->sgid.global.subnet_prefix = dev->gid_prefix; - hdr->sgid.global.interface_id = ipath_layer_get_guid(dev->dd); + hdr->sgid.global.interface_id = dev->dd->ipath_guid; hdr->dgid = grh->dgid; /* GRH header size in 32-bit words. 
*/ @@ -643,8 +652,7 @@ void ipath_do_ruc_send(unsigned long dat if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) goto bail; - if (unlikely(qp->remote_ah_attr.dlid == - ipath_layer_get_lid(dev->dd))) { + if (unlikely(qp->remote_ah_attr.dlid == dev->dd->ipath_lid)) { ipath_ruc_loopback(qp); goto clear; } @@ -711,8 +719,8 @@ again: qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + nwords + SIZE_OF_CRC); - qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); - bth0 |= ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); + qp->s_hdr.lrh[3] = cpu_to_be16(dev->dd->ipath_lid); + bth0 |= ipath_get_pkey(dev->dd, qp->s_pkey_index); bth0 |= extra_bytes << 20; ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Fri Aug 25 11:19:45 2006 -0700 @@ -35,7 +35,6 @@ #include #include "ipath_kernel.h" -#include "ipath_layer.h" #include "ipath_common.h" /** @@ -227,7 +226,6 @@ static ssize_t store_mlid(struct device unit = dd->ipath_unit; dd->ipath_mlid = mlid; - ipath_layer_intr(dd, IPATH_LAYER_INT_BCAST); goto bail; invalid: @@ -467,7 +465,7 @@ static ssize_t store_link_state(struct d if (ret < 0) goto invalid; - r = ipath_layer_set_linkstate(dd, state); + r = ipath_set_linkstate(dd, state); if (r < 0) { ret = r; goto bail; @@ -502,7 +500,7 @@ static ssize_t store_mtu(struct device * if (ret < 0) goto invalid; - r = ipath_layer_set_mtu(dd, mtu); + r = ipath_set_mtu(dd, mtu); if (r < 0) ret = r; diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c --- a/drivers/infiniband/hw/ipath/ipath_uc.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_uc.c Fri Aug 25 11:19:45 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ipath_common.h" +#include "ipath_kernel.h" /* cut down ridiculously long IB macro names */ #define OP(x) IB_OPCODE_UC_##x @@ -261,8 +261,7 @@ void ipath_uc_rcv(struct ipath_ibdev *de * size to 56 bytes so the last 4 bytes of * the BTH header (PSN) is in the data buffer. */ - header_in_data = - ipath_layer_get_rcvhdrentsize(dev->dd) == 16; + header_in_data = dev->dd->ipath_rcvhdrentsize == 16; if (header_in_data) { psn = be32_to_cpu(((__be32 *) data)[0]); data += sizeof(__be32); diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Fri Aug 25 11:19:45 2006 -0700 @@ -353,7 +353,7 @@ int ipath_post_ud_send(struct ipath_qp * ss.num_sge++; } /* Check for invalid packet size. */ - if (len > ipath_layer_get_ibmtu(dev->dd)) { + if (len > dev->dd->ipath_ibmtu) { ret = -EINVAL; goto bail; } @@ -375,7 +375,7 @@ int ipath_post_ud_send(struct ipath_qp * dev->n_unicast_xmit++; lid = ah_attr->dlid & ~((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); - if (unlikely(lid == ipath_layer_get_lid(dev->dd))) { + if (unlikely(lid == dev->dd->ipath_lid)) { /* * Pass in an uninitialized ib_wc to save stack * space. 
@@ -404,7 +404,7 @@ int ipath_post_ud_send(struct ipath_qp * qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; qp->s_hdr.u.l.grh.sgid.global.interface_id = - ipath_layer_get_guid(dev->dd); + dev->dd->ipath_guid; qp->s_hdr.u.l.grh.dgid = ah_attr->grh.dgid; /* * Don't worry about sending to locally attached multicast @@ -434,7 +434,7 @@ int ipath_post_ud_send(struct ipath_qp * qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); qp->s_hdr.lrh[1] = cpu_to_be16(ah_attr->dlid); /* DEST LID */ qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); - lid = ipath_layer_get_lid(dev->dd); + lid = dev->dd->ipath_lid; if (lid) { lid |= ah_attr->src_path_bits & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); @@ -445,7 +445,7 @@ int ipath_post_ud_send(struct ipath_qp * bth0 |= 1 << 23; bth0 |= extra_bytes << 20; bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPATH_DEFAULT_P_KEY : - ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); + ipath_get_pkey(dev->dd, qp->s_pkey_index); ohdr->bth[0] = cpu_to_be32(bth0); /* * Use the multicast QP if the destination LID is a multicast LID. @@ -531,8 +531,7 @@ void ipath_ud_rcv(struct ipath_ibdev *de * the eager header buffer size to 56 bytes so the last 12 * bytes of the IB header is in the data buffer. */ - header_in_data = - ipath_layer_get_rcvhdrentsize(dev->dd) == 16; + header_in_data = dev->dd->ipath_rcvhdrentsize == 16; if (header_in_data) { qkey = be32_to_cpu(((__be32 *) data)[1]); src_qp = be32_to_cpu(((__be32 *) data)[2]); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Aug 25 11:19:45 2006 -0700 @@ -33,14 +33,12 @@ #include #include +#include #include #include "ipath_kernel.h" #include "ipath_verbs.h" #include "ipath_common.h" - -/* Not static, because we don't want the compiler removing it */ -const char ipath_verbs_version[] = "ipath_verbs " IPATH_IDSTR; static unsigned int ib_ipath_qp_table_size = 251; module_param_named(qp_table_size, ib_ipath_qp_table_size, uint, S_IRUGO); @@ -108,10 +106,6 @@ module_param_named(max_srq_wrs, ib_ipath module_param_named(max_srq_wrs, ib_ipath_max_srq_wrs, uint, S_IWUSR | S_IRUGO); MODULE_PARM_DESC(max_srq_wrs, "Maximum number of SRQ WRs support"); - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("QLogic "); -MODULE_DESCRIPTION("QLogic InfiniPath driver"); const int ib_ipath_state_ops[IB_QPS_ERR + 1] = { [IB_QPS_RESET] = 0, @@ -124,6 +118,16 @@ const int ib_ipath_state_ops[IB_QPS_ERR [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, [IB_QPS_ERR] = 0, }; + +struct ipath_ucontext { + struct ib_ucontext ibucontext; +}; + +static inline struct ipath_ucontext *to_iucontext(struct ib_ucontext + *ibucontext) +{ + return container_of(ibucontext, struct ipath_ucontext, ibucontext); +} /* * Translate ib_wr_opcode into ib_wc_opcode. 
@@ -400,7 +404,7 @@ void ipath_ib_rcv(struct ipath_ibdev *de lid = be16_to_cpu(hdr->lrh[1]); if (lid < IPATH_MULTICAST_LID_BASE) { lid &= ~((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); - if (unlikely(lid != ipath_layer_get_lid(dev->dd))) { + if (unlikely(lid != dev->dd->ipath_lid)) { dev->rcv_errors++; goto bail; } @@ -511,19 +515,19 @@ void ipath_ib_timer(struct ipath_ibdev * if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_STARTED && --dev->pma_sample_start == 0) { dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_RUNNING; - ipath_layer_snapshot_counters(dev->dd, &dev->ipath_sword, - &dev->ipath_rword, - &dev->ipath_spkts, - &dev->ipath_rpkts, - &dev->ipath_xmit_wait); + ipath_snapshot_counters(dev->dd, &dev->ipath_sword, + &dev->ipath_rword, + &dev->ipath_spkts, + &dev->ipath_rpkts, + &dev->ipath_xmit_wait); } if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_RUNNING) { if (dev->pma_sample_interval == 0) { u64 ta, tb, tc, td, te; dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_DONE; - ipath_layer_snapshot_counters(dev->dd, &ta, &tb, - &tc, &td, &te); + ipath_snapshot_counters(dev->dd, &ta, &tb, + &tc, &td, &te); dev->ipath_sword = ta - dev->ipath_sword; dev->ipath_rword = tb - dev->ipath_rword; @@ -551,6 +555,362 @@ void ipath_ib_timer(struct ipath_ibdev * if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); } +} + +static void update_sge(struct ipath_sge_state *ss, u32 length) +{ + struct ipath_sge *sge = &ss->sge; + + sge->vaddr += length; + sge->length -= length; + sge->sge_length -= length; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + return; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } +} + +#ifdef __LITTLE_ENDIAN +static inline u32 get_upper_bits(u32 data, u32 shift) +{ + return data >> shift; +} + +static inline u32 set_upper_bits(u32 data, u32 shift) +{ + return data << shift; +} + +static inline u32 clear_upper_bytes(u32 data, u32 n, u32 off) +{ + data <<= ((sizeof(u32) - n) * BITS_PER_BYTE); + data >>= ((sizeof(u32) - n - off) * BITS_PER_BYTE); + return data; +} +#else +static inline u32 get_upper_bits(u32 data, u32 shift) +{ + return data << shift; +} + +static inline u32 set_upper_bits(u32 data, u32 shift) +{ + return data >> shift; +} + +static inline u32 clear_upper_bytes(u32 data, u32 n, u32 off) +{ + data >>= ((sizeof(u32) - n) * BITS_PER_BYTE); + data <<= ((sizeof(u32) - n - off) * BITS_PER_BYTE); + return data; +} +#endif + +static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, + u32 length) +{ + u32 extra = 0; + u32 data = 0; + u32 last; + + while (1) { + u32 len = ss->sge.length; + u32 off; + + BUG_ON(len == 0); + if (len > length) + len = length; + if (len > ss->sge.sge_length) + len = ss->sge.sge_length; + /* If the source address is not aligned, try to align it. 
*/ + off = (unsigned long)ss->sge.vaddr & (sizeof(u32) - 1); + if (off) { + u32 *addr = (u32 *)((unsigned long)ss->sge.vaddr & + ~(sizeof(u32) - 1)); + u32 v = get_upper_bits(*addr, off * BITS_PER_BYTE); + u32 y; + + y = sizeof(u32) - off; + if (len > y) + len = y; + if (len + extra >= sizeof(u32)) { + data |= set_upper_bits(v, extra * + BITS_PER_BYTE); + len = sizeof(u32) - extra; + if (len == length) { + last = data; + break; + } + __raw_writel(data, piobuf); + piobuf++; + extra = 0; + data = 0; + } else { + /* Clear unused upper bytes */ + data |= clear_upper_bytes(v, len, extra); + if (len == length) { + last = data; + break; + } + extra += len; + } + } else if (extra) { + /* Source address is aligned. */ + u32 *addr = (u32 *) ss->sge.vaddr; + int shift = extra * BITS_PER_BYTE; + int ushift = 32 - shift; + u32 l = len; + + while (l >= sizeof(u32)) { + u32 v = *addr; + + data |= set_upper_bits(v, shift); + __raw_writel(data, piobuf); + data = get_upper_bits(v, ushift); + piobuf++; + addr++; + l -= sizeof(u32); + } + /* + * We still have 'extra' number of bytes leftover. + */ + if (l) { + u32 v = *addr; + + if (l + extra >= sizeof(u32)) { + data |= set_upper_bits(v, shift); + len -= l + extra - sizeof(u32); + if (len == length) { + last = data; + break; + } + __raw_writel(data, piobuf); + piobuf++; + extra = 0; + data = 0; + } else { + /* Clear unused upper bytes */ + data |= clear_upper_bytes(v, l, + extra); + if (len == length) { + last = data; + break; + } + extra += l; + } + } else if (len == length) { + last = data; + break; + } + } else if (len == length) { + u32 w; + + /* + * Need to round up for the last dword in the + * packet. + */ + w = (len + 3) >> 2; + __iowrite32_copy(piobuf, ss->sge.vaddr, w - 1); + piobuf += w - 1; + last = ((u32 *) ss->sge.vaddr)[w - 1]; + break; + } else { + u32 w = len >> 2; + + __iowrite32_copy(piobuf, ss->sge.vaddr, w); + piobuf += w; + + extra = len & (sizeof(u32) - 1); + if (extra) { + u32 v = ((u32 *) ss->sge.vaddr)[w]; + + /* Clear unused upper bytes */ + data = clear_upper_bytes(v, extra, 0); + } + } + update_sge(ss, len); + length -= len; + } + /* Update address before sending packet. */ + update_sge(ss, length); + /* must flush early everything before trigger word */ + ipath_flush_wc(); + __raw_writel(last, piobuf); + /* be sure trigger word is written */ + ipath_flush_wc(); +} + +/** + * ipath_verbs_send - send a packet + * @dd: the infinipath device + * @hdrwords: the number of words in the header + * @hdr: the packet header + * @len: the length of the packet in bytes + * @ss: the SGE to send + */ +int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, + u32 *hdr, u32 len, struct ipath_sge_state *ss) +{ + u32 __iomem *piobuf; + u32 plen; + int ret; + + /* +1 is for the qword padding of pbc */ + plen = hdrwords + ((len + 3) >> 2) + 1; + if (unlikely((plen << 2) > dd->ipath_ibmaxlen)) { + ipath_dbg("packet len 0x%x too long, failing\n", plen); + ret = -EINVAL; + goto bail; + } + + /* Get a PIO buffer to use. */ + piobuf = ipath_getpiobuf(dd, NULL); + if (unlikely(piobuf == NULL)) { + ret = -EBUSY; + goto bail; + } + + /* + * Write len to control qword, no flags. + * We have to flush after the PBC for correctness on some cpus + * or WC buffer can be written out of order. + */ + writeq(plen, piobuf); + ipath_flush_wc(); + piobuf += 2; + if (len == 0) { + /* + * If there is just the header portion, must flush before + * writing last word of header for correctness, and after + * the last header word (trigger word). 
+ */ + __iowrite32_copy(piobuf, hdr, hdrwords - 1); + ipath_flush_wc(); + __raw_writel(hdr[hdrwords - 1], piobuf + hdrwords - 1); + ipath_flush_wc(); + ret = 0; + goto bail; + } + + __iowrite32_copy(piobuf, hdr, hdrwords); + piobuf += hdrwords; + + /* The common case is aligned and contained in one segment. */ + if (likely(ss->num_sge == 1 && len <= ss->sge.length && + !((unsigned long)ss->sge.vaddr & (sizeof(u32) - 1)))) { + u32 w; + u32 *addr = (u32 *) ss->sge.vaddr; + + /* Update address before sending packet. */ + update_sge(ss, len); + /* Need to round up for the last dword in the packet. */ + w = (len + 3) >> 2; + __iowrite32_copy(piobuf, addr, w - 1); + /* must flush early everything before trigger word */ + ipath_flush_wc(); + __raw_writel(addr[w - 1], piobuf + w - 1); + /* be sure trigger word is written */ + ipath_flush_wc(); + ret = 0; + goto bail; + } + copy_io(piobuf, ss, len); + ret = 0; + +bail: + return ret; +} + +int ipath_snapshot_counters(struct ipath_devdata *dd, u64 *swords, + u64 *rwords, u64 *spkts, u64 *rpkts, + u64 *xmit_wait) +{ + int ret; + + if (!(dd->ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. */ + ipath_dbg("unit %u not usable\n", dd->ipath_unit); + ret = -EINVAL; + goto bail; + } + *swords = ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordsendcnt); + *rwords = ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordrcvcnt); + *spkts = ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktsendcnt); + *rpkts = ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktrcvcnt); + *xmit_wait = ipath_snap_cntr(dd, dd->ipath_cregs->cr_sendstallcnt); + + ret = 0; + +bail: + return ret; +} + +/** + * ipath_get_counters - get various chip counters + * @dd: the infinipath device + * @cntrs: counters are placed here + * + * Return the counters needed by recv_pma_get_portcounters(). + */ +int ipath_get_counters(struct ipath_devdata *dd, + struct ipath_verbs_counters *cntrs) +{ + int ret; + + if (!(dd->ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. */ + ipath_dbg("unit %u not usable\n", dd->ipath_unit); + ret = -EINVAL; + goto bail; + } + cntrs->symbol_error_counter = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_ibsymbolerrcnt); + cntrs->link_error_recovery_counter = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkerrrecovcnt); + /* + * The link downed counter counts when the other side downs the + * connection. We add in the number of times we downed the link + * due to local link integrity errors to compensate. 
+ */ + cntrs->link_downed_counter = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkdowncnt); + cntrs->port_rcv_errors = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_rxdroppktcnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvovflcnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_portovflcnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_err_rlencnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_invalidrlencnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_erricrccnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_errvcrccnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_errlpcrccnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_badformatcnt); + cntrs->port_rcv_remphys_errors = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvebpcnt); + cntrs->port_xmit_discards = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_unsupvlcnt); + cntrs->port_xmit_data = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordsendcnt); + cntrs->port_rcv_data = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordrcvcnt); + cntrs->port_xmit_packets = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktsendcnt); + cntrs->port_rcv_packets = + ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktrcvcnt); + cntrs->local_link_integrity_errors = dd->ipath_lli_errors; + cntrs->excessive_buffer_overrun_errors = 0; /* XXX */ + + ret = 0; + +bail: + return ret; } /** @@ -595,9 +955,9 @@ static int ipath_query_device(struct ib_ IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT | IB_DEVICE_SYS_IMAGE_GUID; props->page_size_cap = PAGE_SIZE; - props->vendor_id = ipath_layer_get_vendorid(dev->dd); - props->vendor_part_id = ipath_layer_get_deviceid(dev->dd); - props->hw_ver = ipath_layer_get_pcirev(dev->dd); + props->vendor_id = dev->dd->ipath_vendorid; + props->vendor_part_id = dev->dd->ipath_deviceid; + props->hw_ver = dev->dd->ipath_pcirev; props->sys_image_guid = dev->sys_image_guid; @@ -618,7 +978,7 @@ static int ipath_query_device(struct ib_ props->max_srq_sge = ib_ipath_max_srq_sges; /* props->local_ca_ack_delay */ props->atomic_cap = IB_ATOMIC_HCA; - props->max_pkeys = ipath_layer_get_npkeys(dev->dd); + props->max_pkeys = ipath_get_npkeys(dev->dd); props->max_mcast_grp = ib_ipath_max_mcast_grps; props->max_mcast_qp_attach = ib_ipath_max_mcast_qp_attached; props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * @@ -643,12 +1003,17 @@ const u8 ipath_cvt_physportstate[16] = { [INFINIPATH_IBCS_LT_STATE_RECOVERIDLE] = 6, }; +u32 ipath_get_cr_errpkey(struct ipath_devdata *dd) +{ + return ipath_read_creg32(dd, dd->ipath_cregs->cr_errpkey); +} + static int ipath_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr *props) { struct ipath_ibdev *dev = to_idev(ibdev); enum ib_mtu mtu; - u16 lid = ipath_layer_get_lid(dev->dd); + u16 lid = dev->dd->ipath_lid; u64 ibcstat; memset(props, 0, sizeof(*props)); @@ -656,16 +1021,16 @@ static int ipath_query_port(struct ib_de props->lmc = dev->mkeyprot_resv_lmc & 7; props->sm_lid = dev->sm_lid; props->sm_sl = dev->sm_sl; - ibcstat = ipath_layer_get_lastibcstat(dev->dd); + ibcstat = dev->dd->ipath_lastibcstat; props->state = ((ibcstat >> 4) & 0x3) + 1; /* See phys_state_show() */ props->phys_state = ipath_cvt_physportstate[ - ipath_layer_get_lastibcstat(dev->dd) & 0xf]; + dev->dd->ipath_lastibcstat & 0xf]; props->port_cap_flags = dev->port_cap_flags; props->gid_tbl_len = 1; props->max_msg_sz = 0x80000000; - props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); - props->bad_pkey_cntr = ipath_layer_get_cr_errpkey(dev->dd) - + props->pkey_tbl_len = ipath_get_npkeys(dev->dd); + props->bad_pkey_cntr = ipath_get_cr_errpkey(dev->dd) - 
dev->z_pkey_violations; props->qkey_viol_cntr = dev->qkey_violations; props->active_width = IB_WIDTH_4X; @@ -675,7 +1040,7 @@ static int ipath_query_port(struct ib_de props->init_type_reply = 0; props->max_mtu = IB_MTU_4096; - switch (ipath_layer_get_ibmtu(dev->dd)) { + switch (dev->dd->ipath_ibmtu) { case 4096: mtu = IB_MTU_4096; break; @@ -734,7 +1099,7 @@ static int ipath_modify_port(struct ib_d dev->port_cap_flags |= props->set_port_cap_mask; dev->port_cap_flags &= ~props->clr_port_cap_mask; if (port_modify_mask & IB_PORT_SHUTDOWN) - ipath_layer_set_linkstate(dev->dd, IPATH_IB_LINKDOWN); + ipath_set_linkstate(dev->dd, IPATH_IB_LINKDOWN); if (port_modify_mask & IB_PORT_RESET_QKEY_CNTR) dev->qkey_violations = 0; return 0; @@ -751,7 +1116,7 @@ static int ipath_query_gid(struct ib_dev goto bail; } gid->global.subnet_prefix = dev->gid_prefix; - gid->global.interface_id = ipath_layer_get_guid(dev->dd); + gid->global.interface_id = dev->dd->ipath_guid; ret = 0; @@ -902,24 +1267,49 @@ static int ipath_query_ah(struct ib_ah * return 0; } +/** + * ipath_get_npkeys - return the size of the PKEY table for port 0 + * @dd: the infinipath device + */ +unsigned ipath_get_npkeys(struct ipath_devdata *dd) +{ + return ARRAY_SIZE(dd->ipath_pd[0]->port_pkeys); +} + +/** + * ipath_get_pkey - return the indexed PKEY from the port 0 PKEY table + * @dd: the infinipath device + * @index: the PKEY index + */ +unsigned ipath_get_pkey(struct ipath_devdata *dd, unsigned index) +{ + unsigned ret; + + if (index >= ARRAY_SIZE(dd->ipath_pd[0]->port_pkeys)) + ret = 0; + else + ret = dd->ipath_pd[0]->port_pkeys[index]; + + return ret; +} + static int ipath_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) { struct ipath_ibdev *dev = to_idev(ibdev); int ret; - if (index >= ipath_layer_get_npkeys(dev->dd)) { + if (index >= ipath_get_npkeys(dev->dd)) { ret = -EINVAL; goto bail; } - *pkey = ipath_layer_get_pkey(dev->dd, index); + *pkey = ipath_get_pkey(dev->dd, index); ret = 0; bail: return ret; } - /** * ipath_alloc_ucontext - allocate a ucontest @@ -953,6 +1343,63 @@ static int ipath_dealloc_ucontext(struct static int ipath_verbs_register_sysfs(struct ib_device *dev); +static void __verbs_timer(unsigned long arg) +{ + struct ipath_devdata *dd = (struct ipath_devdata *) arg; + + /* + * If port 0 receive packet interrupts are not available, or + * can be missed, poll the receive queue + */ + if (dd->ipath_flags & IPATH_POLL_RX_INTR) + ipath_kreceive(dd); + + /* Handle verbs layer timeouts. */ + ipath_ib_timer(dd->verbs_dev); + + mod_timer(&dd->verbs_timer, jiffies + 1); +} + +static int enable_timer(struct ipath_devdata *dd) +{ + /* + * Early chips had a design flaw where the chip and kernel idea + * of the tail register don't always agree, and therefore we won't + * get an interrupt on the next packet received. + * If the board supports per packet receive interrupts, use it. + * Otherwise, the timer function periodically checks for packets + * to cover this case. + * Either way, the timer is needed for verbs layer related + * processing. 
+ */ + if (dd->ipath_flags & IPATH_GPIO_INTR) { + ipath_write_kreg(dd, dd->ipath_kregs->kr_debugportselect, + 0x2074076542310ULL); + /* Enable GPIO bit 2 interrupt */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, + (u64) (1 << 2)); + } + + init_timer(&dd->verbs_timer); + dd->verbs_timer.function = __verbs_timer; + dd->verbs_timer.data = (unsigned long)dd; + dd->verbs_timer.expires = jiffies + 1; + add_timer(&dd->verbs_timer); + + return 0; +} + +static int disable_timer(struct ipath_devdata *dd) +{ + /* Disable GPIO bit 2 interrupt */ + if (dd->ipath_flags & IPATH_GPIO_INTR) + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, 0); + + del_timer_sync(&dd->verbs_timer); + + return 0; +} + /** * ipath_register_ib_device - register our device with the infiniband core * @dd: the device data structure @@ -960,7 +1407,7 @@ static int ipath_verbs_register_sysfs(st */ int ipath_register_ib_device(struct ipath_devdata *dd) { - struct ipath_layer_counters cntrs; + struct ipath_verbs_counters cntrs; struct ipath_ibdev *idev; struct ib_device *dev; int ret; @@ -1020,7 +1467,7 @@ int ipath_register_ib_device(struct ipat idev->link_width_enabled = 3; /* 1x or 4x */ /* Snapshot current HW counters to "clear" them. */ - ipath_layer_get_counters(dd, &cntrs); + ipath_get_counters(dd, &cntrs); idev->z_symbol_error_counter = cntrs.symbol_error_counter; idev->z_link_error_recovery_counter = cntrs.link_error_recovery_counter; @@ -1044,14 +1491,14 @@ int ipath_register_ib_device(struct ipat * device types in the system, we can't be sure this is unique. */ if (!sys_image_guid) - sys_image_guid = ipath_layer_get_guid(dd); + sys_image_guid = dd->ipath_guid; idev->sys_image_guid = sys_image_guid; idev->ib_unit = dd->ipath_unit; idev->dd = dd; strlcpy(dev->name, "ipath%d", IB_DEVICE_NAME_MAX); dev->owner = THIS_MODULE; - dev->node_guid = ipath_layer_get_guid(dd); + dev->node_guid = dd->ipath_guid; dev->uverbs_abi_ver = IPATH_UVERBS_ABI_VERSION; dev->uverbs_cmd_mask = (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | @@ -1085,7 +1532,7 @@ int ipath_register_ib_device(struct ipat (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); dev->node_type = IB_NODE_CA; dev->phys_port_cnt = 1; - dev->dma_device = ipath_layer_get_device(dd); + dev->dma_device = &dd->pcidev->dev; dev->class_dev.dev = dev->dma_device; dev->query_device = ipath_query_device; dev->modify_device = ipath_modify_device; @@ -1139,7 +1586,7 @@ int ipath_register_ib_device(struct ipat if (ipath_verbs_register_sysfs(dev)) goto err_class; - ipath_layer_enable_timer(dd); + enable_timer(dd); goto bail; @@ -1164,7 +1611,7 @@ void ipath_unregister_ib_device(struct i { struct ib_device *ibdev = &dev->ibdev; - ipath_layer_disable_timer(dev->dd); + disable_timer(dev->dd); ib_unregister_device(ibdev); @@ -1197,7 +1644,7 @@ static ssize_t show_rev(struct class_dev struct ipath_ibdev *dev = container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - return sprintf(buf, "%x\n", ipath_layer_get_pcirev(dev->dd)); + return sprintf(buf, "%x\n", dev->dd->ipath_pcirev); } static ssize_t show_hca(struct class_device *cdev, char *buf) @@ -1206,7 +1653,7 @@ static ssize_t show_hca(struct class_dev container_of(cdev, struct ipath_ibdev, ibdev.class_dev); int ret; - ret = ipath_layer_get_boardname(dev->dd, buf, 128); + ret = dev->dd->ipath_f_get_boardname(dev->dd, buf, 128); if (ret < 0) goto bail; strcat(buf, "\n"); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 
-0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Aug 25 11:19:45 2006 -0700 @@ -153,19 +153,6 @@ struct ipath_mcast { int n_attached; }; -/* Memory region */ -struct ipath_mr { - struct ib_mr ibmr; - struct ipath_mregion mr; /* must be last */ -}; - -/* Fast memory region */ -struct ipath_fmr { - struct ib_fmr ibfmr; - u8 page_shift; - struct ipath_mregion mr; /* must be last */ -}; - /* Protection domain */ struct ipath_pd { struct ib_pd ibpd; @@ -217,6 +204,54 @@ struct ipath_cq { }; /* + * A segment is a linear region of low physical memory. + * XXX Maybe we should use phys addr here and kmap()/kunmap(). + * Used by the verbs layer. + */ +struct ipath_seg { + void *vaddr; + size_t length; +}; + +/* The number of ipath_segs that fit in a page. */ +#define IPATH_SEGSZ (PAGE_SIZE / sizeof (struct ipath_seg)) + +struct ipath_segarray { + struct ipath_seg segs[IPATH_SEGSZ]; +}; + +struct ipath_mregion { + u64 user_base; /* User's address for this region */ + u64 iova; /* IB start address of this region */ + size_t length; + u32 lkey; + u32 offset; /* offset (bytes) to start of region */ + int access_flags; + u32 max_segs; /* number of ipath_segs in all the arrays */ + u32 mapsz; /* size of the map array */ + struct ipath_segarray *map[0]; /* the segments */ +}; + +/* + * These keep track of the copy progress within a memory region. + * Used by the verbs layer. + */ +struct ipath_sge { + struct ipath_mregion *mr; + void *vaddr; /* current pointer into the segment */ + u32 sge_length; /* length of the SGE */ + u32 length; /* remaining length of the segment */ + u16 m; /* current index: mr->map[m] */ + u16 n; /* current index: mr->map[m]->segs[n] */ +}; + +/* Memory region */ +struct ipath_mr { + struct ib_mr ibmr; + struct ipath_mregion mr; /* must be last */ +}; + +/* * Send work request queue entry. * The size of the sg_list is determined when the QP is created and stored * in qp->s_max_sge. @@ -268,6 +303,12 @@ struct ipath_srq { struct ipath_mmap_info *ip; /* send signal when number of RWQEs < limit */ u32 limit; +}; + +struct ipath_sge_state { + struct ipath_sge *sg_list; /* next SGE to be used if any */ + struct ipath_sge sge; /* progress state for the current SGE */ + u8 num_sge; }; /* @@ -500,18 +541,24 @@ struct ipath_ibdev { struct ipath_opcode_stats opstats[128]; }; -struct ipath_ucontext { - struct ib_ucontext ibucontext; +struct ipath_verbs_counters { + u64 symbol_error_counter; + u64 link_error_recovery_counter; + u64 link_downed_counter; + u64 port_rcv_errors; + u64 port_rcv_remphys_errors; + u64 port_xmit_discards; + u64 port_xmit_data; + u64 port_rcv_data; + u64 port_xmit_packets; + u64 port_rcv_packets; + u32 local_link_integrity_errors; + u32 excessive_buffer_overrun_errors; }; static inline struct ipath_mr *to_imr(struct ib_mr *ibmr) { return container_of(ibmr, struct ipath_mr, ibmr); -} - -static inline struct ipath_fmr *to_ifmr(struct ib_fmr *ibfmr) -{ - return container_of(ibfmr, struct ipath_fmr, ibfmr); } static inline struct ipath_pd *to_ipd(struct ib_pd *ibpd) @@ -551,12 +598,6 @@ int ipath_process_mad(struct ib_device * struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); -static inline struct ipath_ucontext *to_iucontext(struct ib_ucontext - *ibucontext) -{ - return container_of(ibucontext, struct ipath_ucontext, ibucontext); -} - /* * Compare the lower 24 bits of the two values. * Returns an integer <, ==, or > than zero. 
@@ -567,6 +608,13 @@ static inline int ipath_cmp24(u32 a, u32 } struct ipath_mcast *ipath_mcast_find(union ib_gid *mgid); + +int ipath_snapshot_counters(struct ipath_devdata *dd, u64 *swords, + u64 *rwords, u64 *spkts, u64 *rpkts, + u64 *xmit_wait); + +int ipath_get_counters(struct ipath_devdata *dd, + struct ipath_verbs_counters *cntrs); int ipath_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); @@ -598,6 +646,9 @@ void ipath_sqerror_qp(struct ipath_qp *q void ipath_get_credit(struct ipath_qp *qp, u32 aeth); +int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, + u32 *hdr, u32 len, struct ipath_sge_state *ss); + void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig); int ipath_rkey_ok(struct ipath_ibdev *dev, struct ipath_sge_state *ss, @@ -721,6 +772,12 @@ int ipath_ib_piobufavail(struct ipath_ib void ipath_ib_timer(struct ipath_ibdev *); +unsigned ipath_get_npkeys(struct ipath_devdata *); + +u32 ipath_get_cr_errpkey(struct ipath_devdata *); + +unsigned ipath_get_pkey(struct ipath_devdata *, unsigned); + extern const enum ib_wc_opcode ib_ipath_wc_opcode[]; extern const u8 ipath_cvt_physportstate[]; From greg.lindahl at qlogic.com Fri Aug 25 11:55:55 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri, 25 Aug 2006 11:55:55 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <000001c6c84e$e7cfd700$4e019e84@bullseye> References: <20060824221833.GD3670@greglaptop.hotels-on-air.de> <000001c6c84e$e7cfd700$4e019e84@bullseye> Message-ID: <20060825185555.GE10509@greglaptop.hotels-on-air.de> On Fri, Aug 25, 2006 at 10:00:50AM -0400, Thomas Bachman wrote: > Not that I have any stance on this issue, but is this is the text in the > spec that is being debated? > > (page 269, section 9.5, Transaction Ordering): > "An application shall not depend upon the order of data writes to > memory within a message. For example, if an application sets up > data buffers that overlap, for separate data segments within a > message, it is not guaranteed that the last sent data will always > overwrite the earlier." No. The case we're talking about is different from the example. There's text elsewhere which says, basically, that you can't access the data buffer until seeing the completion. > I'm assuming that the spec authors had reason for putting this in there, so > maybe they could provide guidance here? I can't speak for the authors, but as an implementor, this has a huge impact on implementation. For example, on an architecture where you need to do work such as flushing the cache before accessing DMAed data, that's done in the completion. x86 in general is not such an architecture, but they exist. IB is intended to be portable to any CPU architecture. For iWarp, the issue is that packets are frequently reordered. -- greg From Brian.Cain at ge.com Fri Aug 25 12:17:24 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Fri, 25 Aug 2006 15:17:24 -0400 Subject: [openib-general] SRP numbers from gen1 vs gen2 Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB033BEF886@CINMLVEM11.e2k.ad.ge.com> Does anyone have any throughput benchmark data for SRP comparing gen1 and gen2? 
-- -Brian From mlakshmanan at silverstorm.com Fri Aug 25 12:21:20 2006 From: mlakshmanan at silverstorm.com (mlakshmanan at silverstorm.com) Date: Fri, 25 Aug 2006 15:21:20 -0400 Subject: [openib-general] basic IB doubt In-Reply-To: <20060825185555.GE10509@greglaptop.hotels-on-air.de> References: <20060824221833.GD3670@greglaptop.hotels-on-air.de>, <000001c6c84e$e7cfd700$4e019e84@bullseye>, <20060825185555.GE10509@greglaptop.hotels-on-air.de> Message-ID: <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> Date sent: Fri, 25 Aug 2006 11:55:55 -0700 From: "Greg Lindahl" > For example, on an architecture where you need to do work such as > flushing the cache before accessing DMAed data, that's done in the > completion. x86 in general is not such an architecture, but they > exist. IB is intended to be portable to any CPU architecture. > I presume you meant invalidate the cache, not flush it, before accessing DMA'ed data. -madhu -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg.lindahl at qlogic.com Fri Aug 25 12:23:51 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri, 25 Aug 2006 12:23:51 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> Message-ID: <20060825192351.GM10509@greglaptop.hotels-on-air.de> On Fri, Aug 25, 2006 at 03:21:20PM -0400, mlakshmanan at silverstorm.com wrote: > I presume you meant invalidate the cache, not flush it, before accessing DMA'ed > data. Yes, this is what I meant. Sorry! -- greg From rdreier at cisco.com Fri Aug 25 12:35:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 25 Aug 2006 12:35:06 -0700 Subject: [openib-general] [PATCH 22 of 23] IB/ipath - print warning if LID not acquired within one minute In-Reply-To: <1a41dc627c5a1bc2f7e9.1156530287@eng-12.pathscale.com> ( Bryan O'Sullivan's message of "Fri, 25 Aug 2006 11:24:47 -0700") References: <1a41dc627c5a1bc2f7e9.1156530287@eng-12.pathscale.com> Message-ID: 1) What makes ipath special so that we want this warning for ipath devices but not other IB hardware? If this warning is actually useful, then I think it would make more sense to start a timer when any IB device is added, and warn if ports with a physical link don't become active after the timeout time. But I'm having a hard time seeing why we want this message in the kernel log. 2) You do cancel_delayed_work() but not flush_scheduled_work(), so it's possible for your timeout function to be running after the module text is gone. - R. From rdreier at cisco.com Fri Aug 25 12:45:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 25 Aug 2006 12:45:12 -0700 Subject: [openib-general] [PATCH 1 of 23] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: <44809b730ac95b39b672.1156530266@eng-12.pathscale.com> ( Bryan O'Sullivan's message of "Fri, 25 Aug 2006 11:24:26 -0700") References: <44809b730ac95b39b672.1156530266@eng-12.pathscale.com> Message-ID: How did you generate these patches? When I try to apply them with git, I get errors like error: drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:44 2006 -0700: No such file or directory because the line diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile makes git think it's a git diff, but git doesn't put dates on the filename lines. 
In other words, instead of --- a/drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:44 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Fri Aug 25 11:19:44 2006 -0700 the patch should just have --- a/drivers/infiniband/hw/ipath/Makefile +++ b/drivers/infiniband/hw/ipath/Makefile before the Makefile chunks. I fixed this up by deleting the "diff --git" lines, but I'm curious how you created this in the first place. - R. From rdreier at cisco.com Fri Aug 25 12:50:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 25 Aug 2006 12:50:58 -0700 Subject: [openib-general] [PATCH 11 of 23] IB/ipath - add new minor device to allow sending of diag packets In-Reply-To: <8743e6ee09c51e799f0f.1156530276@eng-12.pathscale.com> ( Bryan O'Sullivan's message of "Fri, 25 Aug 2006 11:24:36 -0700") References: <8743e6ee09c51e799f0f.1156530276@eng-12.pathscale.com> Message-ID: > + if (ret < 0) { > + printk(KERN_ERR IPATH_DRV_NAME ": Unable to create " > + "diag data device: error %d\n", -ret); > + goto bail_ipathfs; > + } > + The last line adds trailing whitespace, which git complains about. When patchbombing, can you run your patches through "git apply --check --whitespace=error-all" or the equivalent? Thanks, Roland From Thomas.Talpey at netapp.com Fri Aug 25 12:53:12 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 25 Aug 2006 15:53:12 -0400 Subject: [openib-general] basic IB doubt In-Reply-To: <20060825192351.GM10509@greglaptop.hotels-on-air.de> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> Message-ID: <7.0.1.0.2.20060825155129.08272fc0@netapp.com> At 03:23 PM 8/25/2006, Greg Lindahl wrote: >On Fri, Aug 25, 2006 at 03:21:20PM -0400, mlakshmanan at silverstorm.com wrote: > >> I presume you meant invalidate the cache, not flush it, before >accessing DMA'ed >> data. > >Yes, this is what I meant. Sorry! Flush (sync for_device) before posting. Invalidate (sync for_cpu) before processing. On some architectures, these operations flush and/or invalidate i/o pipeline caches as well. As they should. Tom. From rdreier at cisco.com Fri Aug 25 13:01:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 25 Aug 2006 13:01:25 -0700 Subject: [openib-general] [PATCH 23 of 23] IB/ipath - control receive polarity inversion In-Reply-To: <7a03a7b18dcfe1afeeb1.1156530288@eng-12.pathscale.com> ( Bryan O'Sullivan's message of "Fri, 25 Aug 2006 11:24:48 -0700") References: <7a03a7b18dcfe1afeeb1.1156530288@eng-12.pathscale.com> Message-ID: Applied 1-21 and 23 to my for-2.6.19 branch, and skipped 22 for now. - R. From rdreier at cisco.com Fri Aug 25 13:01:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 25 Aug 2006 13:01:20 -0700 Subject: [openib-general] [PATCH 1 of 23] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: <44809b730ac95b39b672.1156530266@eng-12.pathscale.com> ( Bryan O'Sullivan's message of "Fri, 25 Aug 2006 11:24:26 -0700") References: <44809b730ac95b39b672.1156530266@eng-12.pathscale.com> Message-ID: > Signed-off-by: John Gregor I assume this patch was actually written by John Gregor? If so you should include an extra "From:" line in the body of the email, so that the authorship information gets put into git correctly. - R. 
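Tom's rule of thumb above -- flush (sync for_device) before posting, invalidate (sync for_cpu) before processing -- maps onto the kernel's streaming DMA calls roughly as in the sketch below. This is only an illustration of the ordering being debated in the "basic IB doubt" thread; the device pointer, DMA handles and length names are hypothetical, and the code is not taken from the ipath patches or any driver discussed here.

#include <linux/dma-mapping.h>

/*
 * Minimal sketch, assuming "dev" is the DMA device and each buffer was
 * mapped with dma_map_single() in the matching direction.  On fully
 * cache-coherent platforms these syncs are essentially no-ops; on
 * non-coherent ones they do the flush/invalidate work described above.
 */
static void example_sync_before_post(struct device *dev,
				     dma_addr_t send_buf, size_t len)
{
	/* Flush CPU writes so the HCA reads up-to-date data. */
	dma_sync_single_for_device(dev, send_buf, len, DMA_TO_DEVICE);
	/* ... post the send work request here ... */
}

static void example_sync_after_completion(struct device *dev,
					  dma_addr_t recv_buf, size_t len)
{
	/*
	 * Called only after the completion has been seen: invalidate any
	 * stale cache lines so the CPU reads what the device wrote.
	 */
	dma_sync_single_for_cpu(dev, recv_buf, len, DMA_FROM_DEVICE);
	/* ... safe to look at the received data here ... */
}
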
From bos at pathscale.com Fri Aug 25 13:19:54 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 25 Aug 2006 13:19:54 -0700 Subject: [openib-general] [PATCH 1 of 23] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: References: <44809b730ac95b39b672.1156530266@eng-12.pathscale.com> Message-ID: <1156537194.31531.38.camel@sardonyx> On Fri, 2006-08-25 at 12:45 -0700, Roland Dreier wrote: > How did you generate these patches? Using Mercurial. > because the line > > diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile > > makes git think it's a git diff, but git doesn't put dates on the > filename lines. Ah, interesting. Looks like a bug in the git-compatible patch generator, then. Sorry about that. References: <8743e6ee09c51e799f0f.1156530276@eng-12.pathscale.com> Message-ID: <1156537224.31531.40.camel@sardonyx> On Fri, 2006-08-25 at 12:50 -0700, Roland Dreier wrote: > The last line adds trailing whitespace, which git complains about. > When patchbombing, can you run your patches through "git apply --check > --whitespace=error-all" or the equivalent? Sure. Thanks for spotting that. References: <1a41dc627c5a1bc2f7e9.1156530287@eng-12.pathscale.com> Message-ID: <44EF6053.4010006@pathscale.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Roland Dreier wrote: > 1) What makes ipath special so that we want this warning for ipath > devices but not other IB hardware? There's nothing special about our hardware that requires this. We just wanted that in there so we could direct customers to look at dmesg to see if the warning popped up if they call with a problem. It is useful to have for this purpose. > If this warning is actually > useful, then I think it would make more sense to start a timer when > any IB device is added, and warn if ports with a physical link don't > become active after the timeout time. I'd be OK with doing that, too. > But I'm having a hard time > seeing why we want this message in the kernel log. It's useful when you're trying to track down problems. > 2) You do cancel_delayed_work() but not flush_scheduled_work(), so > it's possible for your timeout function to be running after the module > text is gone. OK - I'll fix this up. Thanks for spotting it. Regards, Robert. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iQEVAwUBRO9gU/zvnpzTd9fxAQJBQwgAkbgrEA4/UpbcD0gsGC+39r5ZAAz+4d3I /QAIKn239juMf8TfrlekAzD9MCj5Rna1bk3yu1gu/Z0Jg5vHvQNmBxDtveQ4bDyu 1DAUbvmclNknzM00LtMHI6AZfYbRgsbCIKXJw0reXctAkbJAvMU0U6Ff1imvO0Tw 38C24ktDalaaKpz4DHO261UHlmtD4wlJojKLYI5yH39JSHK449zjJznrP9W8SPIU RbxGktSsD69gQXmpqgY5KEmbcukZ9AIF4VHTG2uEz1aO7eOQ+1BsUg140EcWXC// R1Jg56WhCYsMDVik7+u994VgQi34beos9pwbLIUkq+315VHN3QFbQg== =XhNA -----END PGP SIGNATURE----- From mrmikeylee at gmail.com Fri Aug 25 14:28:50 2006 From: mrmikeylee at gmail.com (Michael Lee) Date: Fri, 25 Aug 2006 14:28:50 -0700 Subject: [openib-general] test message Message-ID: <9abc48ff0608251428s40f06fd3g5e27b5076746d597@mail.gmail.com> test message -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From venkatesh.babu at 3leafnetworks.com Fri Aug 25 15:02:11 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Fri, 25 Aug 2006 15:02:11 -0700 Subject: [openib-general] OpenSM partition Management Message-ID: <44EF7363.50302@3leafnetworks.com> The document OpenSM_PKey_Mgr.txt under link https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/OpenSM_PKey_Mgr.txt describes the roadmap for OpenSM partition management. It discusses two phase implementation. 1. What functionality is available with OFED version 1.1 ? 2. When each of these two phases are going to be implemented and available ? Thanks, VBabu From sashak at voltaire.com Fri Aug 25 15:03:41 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 26 Aug 2006 01:03:41 +0300 Subject: [openib-general] OpenSM partition Management In-Reply-To: <44EF7363.50302@3leafnetworks.com> References: <44EF7363.50302@3leafnetworks.com> Message-ID: <20060825220341.GB4239@sashak.voltaire.com> Hi, On 15:02 Fri 25 Aug , Venkatesh Babu wrote: > > The document OpenSM_PKey_Mgr.txt under link > https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/OpenSM_PKey_Mgr.txt > describes the roadmap for OpenSM partition management. It discusses two > phase implementation. > > 1. What functionality is available with OFED version 1.1 ? The implemented and available in OFED 1.1 part of partition management is more or the less phase I from the road-map you are referring. For more details about implemented features see: https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/partition-config.txt > 2. When each of these two phases are going to be implemented and available ? Phase I is done. And phase II is TBD. Sasha > > Thanks, > VBabu > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From krause at cup.hp.com Fri Aug 25 15:55:43 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 25 Aug 2006 15:55:43 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <20060825185555.GE10509@greglaptop.hotels-on-air.de> References: <20060824221833.GD3670@greglaptop.hotels-on-air.de> <000001c6c84e$e7cfd700$4e019e84@bullseye> <20060825185555.GE10509@greglaptop.hotels-on-air.de> Message-ID: <6.2.0.14.2.20060825154927.029c6d30@esmail.cup.hp.com> At 11:55 AM 8/25/2006, Greg Lindahl wrote: >On Fri, Aug 25, 2006 at 10:00:50AM -0400, Thomas Bachman wrote: > > > Not that I have any stance on this issue, but is this is the text in the > > spec that is being debated? > > > > (page 269, section 9.5, Transaction Ordering): > > "An application shall not depend upon the order of data writes to > > memory within a message. For example, if an application sets up > > data buffers that overlap, for separate data segments within a > > message, it is not guaranteed that the last sent data will always > > overwrite the earlier." > >No. The case we're talking about is different from the example. >There's text elsewhere which says, basically, that you can't access >the data buffer until seeing the completion. > > I'm assuming that the spec authors had reason for putting this in there, so > > maybe they could provide guidance here? We put that text there to accommodate differing memory controller architectures / coherency protocol capabilities / etc. 
Basically, there is no way to guarantee that the memory is in a usable and correct state until the completion is seen. This was intended to guide software to not peek at memory but to examine a completion queue entry so that if memory is updated out of order, silent data corruption would not occur. >I can't speak for the authors, but as an implementor, this has a huge >impact on implementation. > >For example, on an architecture where you need to do work such as flushing >the cache before accessing DMAed data, that's done in the completion. x86 >in general is not such an architecture, but they exist. IB is intended to >be portable to any CPU architecture. Invalidation protocol is one concern. The other is the a completion notification also often acts as a flush of the local I/O fabric as well. In the case of a RDMA Write, the only way to safely determine complete delivery was to have a RDMA Write / Send with completion combination or a RDMA Write / RDMA Read depending upon which side required such completion knowledge. >For iWarp, the issue is that packets are frequently reordered. Neither IP or Ethernet re-order packets that often in practice. Same is true for packet drop rates (the real issue for packet drop is the impact on performance and recovery times which is why IB was not designed to work over long or diverse topologies where intermediate elements may see what might be termed a high packet loss rate). Mike >-- greg > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Fri Aug 25 15:56:47 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 25 Aug 2006 15:56:47 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <7.0.1.0.2.20060825155129.08272fc0@netapp.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> Message-ID: <6.2.0.14.2.20060825155618.02a3e8a0@esmail.cup.hp.com> At 12:53 PM 8/25/2006, Talpey, Thomas wrote: >At 03:23 PM 8/25/2006, Greg Lindahl wrote: > >On Fri, Aug 25, 2006 at 03:21:20PM -0400, mlakshmanan at silverstorm.com wrote: > > > >> I presume you meant invalidate the cache, not flush it, before > >accessing DMA'ed > >> data. > > > >Yes, this is what I meant. Sorry! > >Flush (sync for_device) before posting. >Invalidate (sync for_cpu) before processing. > >On some architectures, these operations flush and/or invalidate >i/o pipeline caches as well. As they should. Many platforms have coherent I/O components so the explicit requirements on software to participate are often eliminated. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From krause at cup.hp.com Fri Aug 25 16:04:07 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 25 Aug 2006 16:04:07 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <1156527926.25769.39.camel@trinity.ogc.int> References: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com> <7.0.1.0.2.20060825124752.0464b7c8@netapp.com> <1156527926.25769.39.camel@trinity.ogc.int> Message-ID: <6.2.0.14.2.20060825155949.02a0b7d0@esmail.cup.hp.com> At 10:45 AM 8/25/2006, Tom Tucker wrote: >On Fri, 2006-08-25 at 12:51 -0400, Talpey, Thomas wrote: > > At 12:40 PM 8/25/2006, Sean Hefty wrote: > > >> Thomas> How does an adapter guarantee that no bridges or other > > >> Thomas> intervening devices reorder their writes, or for that > > >> Thomas> matter flush them to memory at all!? > > >> > > >>That's a good point. The HCA would have to do a read to flush the > > >>posted writes, and I'm sure it's not doing that (since it would add > > >>horrible latency for no good reason). > > >> > > >>I guess it's not safe to rely on ordering of RDMA writes after all. > > > > > >Couldn't the same point then be made that a CQ entry may come before > the data > > >has been posted? > > > > When the CQ entry arrives, the context that polls it off the queue > > must use the dma_sync_*() api to finalize any associated data > > transactions (known by the uper layer). > > > > This is basic, and it's the reason that a completion is so important. > > The completion, in and of itself, isn't what drives the synchronization. > > It's the transfer of control to the processor. > >This is a giant rat hole. > >On a coherent cache architecture, the CQE write posted to the bus >following the write of the last byte of data will NOT be seen by the >processor prior to the last byte of data. That is, write ordering is >preserved in bridges. > >The dma_sync_* API has to do with processor cache, not transaction >ordering. In fact, per this argument at the time you called dma_sync_*, >the processor may not have seen the reordered transaction yet, so what >would it be syncing? > >Write ordering and read ordering/fence is preserved in intervening >bridges. What you DON'T know is whether or not a write (which was posted >and may be sitting in a bridge FIFO) has been flushed and/or propagated >to memory at the time you submit the next write and/or interrupt the >host. > >If you submit a READ following the write, however, per the PCI bus >ordering rules you know that the data is in the target. > >Unless, of course, I'm wrong ... :-) A PCI read following a write to the same address will result validate that all prior write transactions are flushed to host memory. This is one way that people have used (albeit with a performance penalty) to verify that a transaction it out of the HCA / RNIC fault zone and therefore an acknowledgement to the source means the data is safe and one can survive the HCA / RNIC failing without falling into a non-deterministic state. PCI writes are strongly ordered on any PCI technology offering. Relaxed ordering needs to be taken into account w.r.t. writes vs. reads as well as read completions being weakly ordered as well. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From krause at cup.hp.com Fri Aug 25 16:08:39 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 25 Aug 2006 16:08:39 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F189EB44@NT-SJCA-0751.brcm.a d.broadcom.com> References: <000201c6c865$3be47d80$8698070a@amr.corp.intel.com> <54AD0F12E08D1541B826BE97C98F99F189EB44@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <6.2.0.14.2.20060825160533.02a0ba60@esmail.cup.hp.com> At 09:50 AM 8/25/2006, Caitlin Bestler wrote: >openib-general-bounces at openib.org wrote: > >> Thomas> How does an adapter guarantee that no bridges or other > >> Thomas> intervening devices reorder their writes, or for that > >> Thomas> matter flush them to memory at all!? > >> > >> That's a good point. The HCA would have to do a read to flush the > >> posted writes, and I'm sure it's not doing that (since it would add > >> horrible latency for no good reason). > >> > >> I guess it's not safe to rely on ordering of RDMA writes after all. > > > > Couldn't the same point then be made that a CQ entry may come > > before the data has been posted? > > > >That's why both specs (IBTA and RDMAC) are very explicit that all >prior messages are complete before the CQE is given to the user. > >It is up to the RDMA Device and/or its driver to guarantee this >by whatever means are appropriate. An implementation that allows >a CQE post to pass the data placement that it is reporting on the >PCI bus is in error. > >The critical concept of the Work Completion is that it consolidates >guarantees and notificatins. The implementation can do all sorts >of strange things that it thinks optimize *before* the work completion, >but at the time the work completion is delivered to the user everything >is supposed to be as expected. Caitlin's logic is correct and the basis for why these two specifications call out this issue. And yes, Roland, one cannot rely upon RDMA Write ordering whether for IB or iWARP. iWARP specifically allows out of order delivery. IB while providing in-order delivery due to its strong ordering protocol still has no guarantees when it comes to the memory controller and I/O technology being used. Given not everything was expected to operate over PCI, we made sure that the specifications pointed out these issues so that software would be designed to accommodate all interconnect attach types and usage models. We wanted to maximize the underlying implementation options while providing software with a consistent operating model to enable it to be simplified as well. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpearson at systemfabricworks.com Fri Aug 25 22:40:16 2006 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Sat, 26 Aug 2006 00:40:16 -0500 Subject: [openib-general] bug in libmthca/src/verbs.c? Message-ID: <20060826054029.QHAL1364.rrcs-fep-10.hrndva.rr.com@BOBP> struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector) { struct mthca_create_cq cmd; -------------> snip <-------------- ret = ibv_cmd_create_cq(context, cqe - 1, channel, comp_vector, &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, ^^^^^^^^^^ &resp.ibv_resp, sizeof resp); The command size passed to ibv_cmd_create_cq is the size of the mthca command wrapper which is larger than what is most likely expected. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gri at yourgreeting.com Sat Aug 26 00:24:40 2006 From: gri at yourgreeting.com (YourGreeting.Com) Date: Sat, 26 Aug 2006 00:24:40 -0700 (PDT) Subject: [openib-general] You have a new greeting! Message-ID: <20060826072440.0DD315DD23D@cube.biola.edu> An HTML attachment was scrubbed... URL: From glebn at voltaire.com Sat Aug 26 00:39:57 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Sat, 26 Aug 2006 10:39:57 +0300 Subject: [openib-general] basic IB doubt In-Reply-To: <7.0.1.0.2.20060825155129.08272fc0@netapp.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> Message-ID: <20060826073957.GA1369@minantech.com> On Fri, Aug 25, 2006 at 03:53:12PM -0400, Talpey, Thomas wrote: > At 03:23 PM 8/25/2006, Greg Lindahl wrote: > >On Fri, Aug 25, 2006 at 03:21:20PM -0400, mlakshmanan at silverstorm.com wrote: > > > >> I presume you meant invalidate the cache, not flush it, before > >accessing DMA'ed > >> data. > > > >Yes, this is what I meant. Sorry! > > Flush (sync for_device) before posting. > Invalidate (sync for_cpu) before processing. > So, before touching the data that was RDMAed into the buffer application should cache invalidate the buffer, is this even possible from user space? (Not on x86, but it isn't needed there.) > On some architectures, these operations flush and/or invalidate > i/o pipeline caches as well. As they should. > > Tom. -- Gleb. From mst at mellanox.co.il Sat Aug 26 11:41:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 26 Aug 2006 21:41:50 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <20060825141704.GA3867@sashak.voltaire.com> References: <20060825141704.GA3867@sashak.voltaire.com> Message-ID: <20060826184150.GB21168@mellanox.co.il> Quoting r. Sasha Khapyorsky : > Subject: Re: [openib-general] [PATCH] osm: handle local events > > On 16:28 Thu 24 Aug , Michael S. Tsirkin wrote: > > Quoting r. Yevgeny Kliteynik : > > > Index: libvendor/osm_vendor_ibumad.c > > > =================================================================== > > > --- libvendor/osm_vendor_ibumad.c (revision 8998) > > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > > @@ -72,6 +72,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > > > > /****s* OpenSM: Vendor AL/osm_umad_bind_info_t > > > * NAME > > > > NAK. > > > > This means that the SM becomes dependent on the uverbs module. I don't think > > this is a good idea. Let's not go there - SM should depend just on the umad > > module and libc. > > Agree on this point. I dislike this new libibverbs dependency too. I > think we need to work with umad. > > So more generic question: some application performs blocked read() from > /dev/umadN, should this read() be interrupted and return error (with > appropriate errno value), then the port state becomes DOWN? > I think yes, it should. Other opinions? Sean? One thing seems obvious: if device goes away it seems obvious that we should return ENODEV from any read. Isn't this already done? > > And if yes, then in OpenSM we will need just to check errno value upon > umad_recv() failure. > > Sasha Might be a good idea. Hoever, such an approach by default is an ABI change so it could break some apps. Could this be made an option somehow? Think also about a race where I read *after* the state was changed to DOWN. 
Another question comes to mind: does not SM care about physical link state changes as well? Assuming I disconnect and re-connect the cable, does not SM want to know and try bringing the logical link up? How is this handled currently? Can the same mechanism be used for port state events? -- MST From mst at mellanox.co.il Sat Aug 26 12:24:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 26 Aug 2006 22:24:14 +0300 Subject: [openib-general] [PATCH 0 of 23] IB/ipath - updates for 2.6.19 In-Reply-To: References: Message-ID: <20060826192413.GE21168@mellanox.co.il> Quoting r. Bryan O'Sullivan : > Subject: [PATCH 0 of 23] IB/ipath - updates for 2.6.19 > > Hi, Roland - > > This is a series of patches to bring the ipath driver up to date for 2.6.19. > The patches apply on top of Ralph's mmap patch that you accepted yesterday. > > Please apply. > > Thanks, > > References: Message-ID: <20060826191834.GD21168@mellanox.co.il> Quoting r. Roland Dreier : > Subject: drop mthca from svn? (was: Rollup patch for ipath and OFED) > > Sean> Why not remove your code from SVN? > > Along those lines, how would people feel if I removed the mthca kernel > code from svn, and just maintained mthca in kernel.org git trees? I > am getting heartily sick of double checkins for every mthca change... Ack. -- MST From mst at mellanox.co.il Sat Aug 26 12:31:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 26 Aug 2006 22:31:27 +0300 Subject: [openib-general] [PATCH 22 of 23] IB/ipath - print warning if LID not acquired within one minute In-Reply-To: <44EF6053.4010006@pathscale.com> References: <44EF6053.4010006@pathscale.com> Message-ID: <20060826193126.GF21168@mellanox.co.il> Quoting r. Robert Walsh : > Subject: Re: [PATCH 22 of 23] IB/ipath - print warning if LID not acquired within one minute > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Roland Dreier wrote: > > 1) What makes ipath special so that we want this warning for ipath > > devices but not other IB hardware? > > There's nothing special about our hardware that requires this. We just > wanted that in there so we could direct customers to look at dmesg to > see if the warning popped up if they call with a problem. It is useful > to have for this purpose. > > > If this warning is actually > > useful, then I think it would make more sense to start a timer when > > any IB device is added, and warn if ports with a physical link don't > > become active after the timeout time. > > I'd be OK with doing that, too. Looks like your devices are all single-port. With a multi port device it is quite common to have one port down. > > But I'm having a hard time > > seeing why we want this message in the kernel log. > > It's useful when you're trying to track down problems. How about doing this in userspace by looking at port state in sysfs? You can diagnose a much wider class of problems this way. -- MST From mst at mellanox.co.il Sat Aug 26 12:42:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 26 Aug 2006 22:42:25 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <1156526995.12257.107.camel@fc6.xsintricity.com> References: <1156526995.12257.107.camel@fc6.xsintricity.com> Message-ID: <20060826194225.GH21168@mellanox.co.il> Quoting r. Doug Ledford : > IOW, make use of the infrastructure > provided in U4 instead of working around it. Sorry, I don't really understand what you suggest here. Could you give us an example please? 
-- MST From kliteyn at mellanox.co.il Sun Aug 27 03:16:30 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 27 Aug 2006 13:16:30 +0300 Subject: [openib-general] [PATCH] osm: OSM fails to bind - TRIVIAL message addition Message-ID: Hi Hal This patch just makes the error message more informative for user, since another instance of running SM is most probably the reason why osm_opensm_bind failed. Yevgeny Signed-off-by: Yevgeny Kliteynik Index: opensm/main.c =================================================================== --- opensm/main.c (revision 9107) +++ opensm/main.c (working copy) @@ -875,5 +914,6 @@ main( if( status != IB_SUCCESS ) { printf( "\nError from osm_opensm_bind (0x%X)\n", status ); + printf( "Perhaps another instance of SM is already running\n" ); goto Exit; } From kliteyn at mellanox.co.il Sun Aug 27 04:30:02 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 27 Aug 2006 14:30:02 +0300 Subject: [openib-general] [PATCH] osm: Dynamic verbosity control per file In-Reply-To: <1155656058.29378.8180.camel@hal.voltaire.com> References: <1155656058.29378.8180.camel@hal.voltaire.com> Message-ID: <1156678202.24539.11.camel@kliteynik.yok.mtl.com> Hi Hal. > > By default, the OSM will use the following file: /etc/opensmlog.conf > > Nit: For consistency in naming, this would be better as osmlog.conf > (or > osm-log.conf) rather than opensmlog.conf Right - will use osm-log.conf > Rather than remove osm_log and osm_log_raw, these should be > deprecated. > There are other applications outside of OpenSM (like osmtest and > others) > that need this. You're right, osm_log & osm_log_raw are no longer appear in the API, but they are not removed from headers - they are now macros, so the old code will still compile. > Also, is this functionality needed for OFED 1.1 or is this trunk > only ? It doesn't have to get to 1.1. I'll send a second version of this patch that will address all your comments, including the addition in the osm man pages. Thanks, Yevgeny On Tue, 2006-08-15 at 18:34 +0300, Hal Rosenstock wrote: > Also, is this functionality needed for OFED 1.1 or is this trunk > only ? > > Thanks. > > -- Hal > > > 1. Verbosity configuration file > > ------------------------------- > > > > The user is able to set verbosity level per source code file > > by supplying verbosity configuration file using the following > > command line arguments: > > > > -b filename > > --verbosity_file filename > > > > By default, the OSM will use the following file: /etc/opensmlog.conf > > Nit: For consistency in naming, this would be better as osmlog.conf > (or > osm-log.conf) rather than opensmlog.conf > > > Verbosity configuration file should contain zero or more lines of > > the following pattern: > > > > filename verbosity_level > > > > where 'filename' is the name of the source code file that the > > 'verbosity_level' refers to, and the 'verbosity_level' itself > > should be specified as an integer number (decimal or hexadecimal). > > > > One reserved filename is 'all' - it represents general verbosity > > level, that is used for all the files that are not specified in > > the verbosity configuration file. > > If 'all' is not specified, the verbosity level set in the > > command line will be used instead. > > Note: The 'all' file verbosity level will override any other > > general level that was specified by the command line arguments. > > > > Sending a SIGHUP signal to the OSM will cause it to reload > > the verbosity configuration file. > > > > > > 2. 
Logging source code filename and line number > > ----------------------------------------------- > > > > If command line option -S or --log_source_info is specified, > > OSM will add source code filename and line number to every > > log message that is written to the log file. > > By default, the OSM will not log this additional info. > > > > > > Yevgeny > > > > Signed-off-by: Yevgeny Kliteynik > > > > Index: include/opensm/osm_subnet.h > > > =================================================================== > > --- include/opensm/osm_subnet.h (revision 8614) > > +++ include/opensm/osm_subnet.h (working copy) > > @@ -285,6 +285,8 @@ typedef struct _osm_subn_opt > > osm_qos_options_t qos_sw0_options; > > osm_qos_options_t qos_swe_options; > > osm_qos_options_t qos_rtr_options; > > + boolean_t src_info; > > + char * verbosity_file; > > } osm_subn_opt_t; > > /* > > * FIELDS > > @@ -463,6 +465,27 @@ typedef struct _osm_subn_opt > > * qos_rtr_options > > * QoS options for router ports > > * > > +* src_info > > +* If TRUE - the source code filename and line number will > be > > +* added to each log message. > > +* Default value - FALSE. > > +* > > +* verbosity_file > > +* OSM log configuration file - the file that describes > > +* verbosity level per source code file. > > +* The file may containg zero or more lines of the following > > +* pattern: > > +* filename verbosity_level > > +* where 'filename' is the name of the source code file that > > +* the 'verbosity_level' refers to. > > +* Filename "all" represents general verbosity level, that > is > > +* used for all the files that are not specified in the > > +* verbosity file. > > +* If "all" is not specified, the general verbosity level > will > > +* be used instead. > > +* Note: the "all" file verbosity level will override any > other > > +* general level that was specified by the command line > > arguments. > > +* > > * SEE ALSO > > * Subnet object > > *********/ > > Index: include/opensm/osm_base.h > > > =================================================================== > > --- include/opensm/osm_base.h (revision 8614) > > +++ include/opensm/osm_base.h (working copy) > > @@ -222,6 +222,22 @@ BEGIN_C_DECLS > > #endif > > /***********/ > > > > +/****d* OpenSM: Base/OSM_DEFAULT_VERBOSITY_FILE > > +* NAME > > +* OSM_DEFAULT_VERBOSITY_FILE > > +* > > +* DESCRIPTION > > +* Specifies the default verbosity config file name > > +* > > +* SYNOPSIS > > +*/ > > +#ifdef __WIN__ > > +#define OSM_DEFAULT_VERBOSITY_FILE strcat(GetOsmPath(), " > > opensmlog.conf") > > +#else > > +#define OSM_DEFAULT_VERBOSITY_FILE "/etc/opensmlog.conf" > > +#endif > > +/***********/ > > + > > /****d* OpenSM: Base/OSM_DEFAULT_PARTITION_CONFIG_FILE > > * NAME > > * OSM_DEFAULT_PARTITION_CONFIG_FILE > > Index: include/opensm/osm_log.h > > =================================================================== > > --- include/opensm/osm_log.h (revision 8652) > > +++ include/opensm/osm_log.h (working copy) > > @@ -57,6 +57,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > > > @@ -123,9 +124,45 @@ typedef struct _osm_log > > cl_spinlock_t lock; > > boolean_t flush; > > FILE* out_port; > > + boolean_t src_info; > > + st_table * table; > > } osm_log_t; > > /*********/ > > > > +/****f* OpenSM: Log/osm_log_read_verbosity_file > > +* NAME > > +* osm_log_read_verbosity_file > > +* > > +* DESCRIPTION > > +* This function reads the verbosity configuration file > > +* and constructs a verbosity data structure. 
> > +* > > +* SYNOPSIS > > +*/ > > +void > > +osm_log_read_verbosity_file( > > + IN osm_log_t* p_log, > > + IN const char * const verbosity_file); > > +/* > > +* PARAMETERS > > +* p_log > > +* [in] Pointer to a Log object to construct. > > +* > > +* verbosity_file > > +* [in] verbosity configuration file > > +* > > +* RETURN VALUE > > +* None > > +* > > +* NOTES > > +* If the verbosity configuration file is not found, default > > +* verbosity value is used for all files. > > +* If there is an error in some line of the verbosity > > +* configuration file, the line is ignored. > > +* > > +*********/ > > + > > + > > /****f* OpenSM: Log/osm_log_construct > > * NAME > > * osm_log_construct > > @@ -201,9 +238,13 @@ osm_log_destroy( > > * osm_log_init > > *********/ > > > > -/****f* OpenSM: Log/osm_log_init > > +#define osm_log_init(p_log, flush, log_flags, log_file, > > accum_log_file) \ > > + osm_log_init_ext(p_log, flush, (log_flags), log_file, \ > > + accum_log_file, FALSE, OSM_DEFAULT_VERBOSITY_FILE) > > + > > +/****f* OpenSM: Log/osm_log_init_ext > > * NAME > > -* osm_log_init > > +* osm_log_init_ext > > * > > * DESCRIPTION > > * The osm_log_init function initializes a > > @@ -211,50 +252,15 @@ osm_log_destroy( > > * > > * SYNOPSIS > > */ > > -static inline ib_api_status_t > > -osm_log_init( > > +ib_api_status_t > > +osm_log_init_ext( > > IN osm_log_t* const p_log, > > IN const boolean_t flush, > > IN const uint8_t log_flags, > > IN const char *log_file, > > - IN const boolean_t accum_log_file ) > > -{ > > - p_log->level = log_flags; > > - p_log->flush = flush; > > - > > - if (log_file == NULL || !strcmp(log_file, "-") || > > - !strcmp(log_file, "stdout")) > > - { > > - p_log->out_port = stdout; > > - } > > - else if (!strcmp(log_file, "stderr")) > > - { > > - p_log->out_port = stderr; > > - } > > - else > > - { > > - if (accum_log_file) > > - p_log->out_port = fopen(log_file, "a+"); > > - else > > - p_log->out_port = fopen(log_file, "w+"); > > - > > - if (!p_log->out_port) > > - { > > - if (accum_log_file) > > - printf("Cannot open %s for appending. Permission denied > \n", > > log_file); > > - else > > - printf("Cannot open %s for writing. Permission denied\n", > > log_file); > > These lines above are line wrapped so they don't apply. This is an > email > issue on your side. > > > - > > - return(IB_UNKNOWN_ERROR); > > - } > > - } > > - openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); > > - > > - if (cl_spinlock_init( &p_log->lock ) == CL_SUCCESS) > > - return IB_SUCCESS; > > - else > > - return IB_ERROR; > > -} > > + IN const boolean_t accum_log_file, > > + IN const boolean_t src_info, > > + IN const char *verbosity_file); > > /* > > * PARAMETERS > > * p_log > > @@ -271,6 +277,16 @@ osm_log_init( > > * log_file > > * [in] if not NULL defines the name of the log file. Otherwise > it > > is stdout. > > * > > +* accum_log_file > > +* [in] Whether the log file should be accumulated. > > +* > > +* src_info > > +* [in] Set to TRUE directs the log to add filename and line > > number > > +* to each log message. > > +* > > +* verbosity_file > > +* [in] Log configuration file location. > > +* > > * RETURN VALUES > > * CL_SUCCESS if the Log object was initialized > > * successfully. 
> > @@ -283,26 +299,32 @@ osm_log_init( > > * osm_log_destroy > > *********/ > > > > -/****f* OpenSM: Log/osm_log_get_level > > +#define osm_log_get_level(p_log) \ > > + osm_log_get_level_ext(p_log, __FILE__) > > + > > +/****f* OpenSM: Log/osm_log_get_level_ext > > * NAME > > -* osm_log_get_level > > +* osm_log_get_level_ext > > * > > * DESCRIPTION > > -* Returns the current log level. > > +* Returns the current log level for the file. > > +* If the file is not specified in the log config file, > > +* the general verbosity level will be returned. > > * > > * SYNOPSIS > > */ > > -static inline osm_log_level_t > > -osm_log_get_level( > > - IN const osm_log_t* const p_log ) > > -{ > > - return( p_log->level ); > > -} > > +osm_log_level_t > > +osm_log_get_level_ext( > > + IN const osm_log_t* const p_log, > > + IN const char* const p_filename ); > > /* > > * PARAMETERS > > * p_log > > * [in] Pointer to the log object. > > * > > +* p_filename > > +* [in] Source code file name. > > +* > > * RETURN VALUES > > * Returns the current log level. > > * > > @@ -310,7 +332,7 @@ osm_log_get_level( > > * > > * SEE ALSO > > * Log object, osm_log_construct, > > -* osm_log_destroy > > +* osm_log_destroy, osm_log_get_level > > *********/ > > > > /****f* OpenSM: Log/osm_log_set_level > > @@ -318,7 +340,7 @@ osm_log_get_level( > > * osm_log_set_level > > * > > * DESCRIPTION > > -* Sets the current log level. > > +* Sets the current general log level. > > * > > * SYNOPSIS > > */ > > @@ -338,7 +360,7 @@ osm_log_set_level( > > * [in] New level to set. > > * > > * RETURN VALUES > > -* Returns the current log level. > > +* None. > > * > > * NOTES > > * > > @@ -347,9 +369,12 @@ osm_log_set_level( > > * osm_log_destroy > > *********/ > > > > -/****f* OpenSM: Log/osm_log_is_active > > +#define osm_log_is_active(p_log, level) \ > > + osm_log_is_active_ext(p_log, __FILE__, level) > > + > > +/****f* OpenSM: Log/osm_log_is_active_ext > > * NAME > > -* osm_log_is_active > > +* osm_log_is_active_ext > > * > > * DESCRIPTION > > * Returns TRUE if the specified log level would be logged. > > @@ -357,18 +382,19 @@ osm_log_set_level( > > * > > * SYNOPSIS > > */ > > -static inline boolean_t > > -osm_log_is_active( > > +boolean_t > > +osm_log_is_active_ext( > > IN const osm_log_t* const p_log, > > - IN const osm_log_level_t level ) > > -{ > > - return( (p_log->level & level) != 0 ); > > -} > > + IN const char* const p_filename, > > + IN const osm_log_level_t level ); > > /* > > * PARAMETERS > > * p_log > > * [in] Pointer to the log object. > > * > > +* p_filename > > +* [in] Source code file name. > > +* > > * level > > * [in] Level to check. > > * > > @@ -383,17 +409,125 @@ osm_log_is_active( > > * osm_log_destroy > > *********/ > > > > + > > +#define osm_log(p_log, verbosity, p_str, args...) \ > > + osm_log_ext(p_log, verbosity, __FILE__, __LINE__, p_str , ## > > args) > > + > > +/****f* OpenSM: Log/osm_log_ext > > +* NAME > > +* osm_log_ext > > +* > > +* DESCRIPTION > > +* Logs the formatted specified message. > > +* > > +* SYNOPSIS > > +*/ > > void > > -osm_log( > > +osm_log_ext( > > IN osm_log_t* const p_log, > > IN const osm_log_level_t verbosity, > > + IN const char *p_filename, > > + IN int line, > > IN const char *p_str, ... ); > > +/* > > +* PARAMETERS > > +* p_log > > +* [in] Pointer to the log object. 
> > +* > > +* verbosity > > +* [in] Current message verbosity level > > + > > + p_filename > > + [in] Name of the file that is logging this message > > + > > + line > > + [in] Line number in the file that is logging this message > > + > > + p_str > > + [in] Format string of the message > > +* > > +* RETURN VALUES > > +* None. > > +* > > +* NOTES > > +* > > +* SEE ALSO > > +* Log object, osm_log_construct, > > +* osm_log_destroy > > +*********/ > > > > +#define osm_log_raw(p_log, verbosity, p_buff) \ > > + osm_log_raw_ext(p_log, verbosity, __FILE__, p_buff) > > + > > +/****f* OpenSM: Log/osm_log_raw_ext > > +* NAME > > +* osm_log_ext > > +* > > +* DESCRIPTION > > +* Logs the specified message. > > +* > > +* SYNOPSIS > > +*/ > > void > > -osm_log_raw( > > +osm_log_raw_ext( > > IN osm_log_t* const p_log, > > IN const osm_log_level_t verbosity, > > + IN const char * p_filename, > > IN const char *p_buf ); > > +/* > > +* PARAMETERS > > +* p_log > > +* [in] Pointer to the log object. > > +* > > +* verbosity > > +* [in] Current message verbosity level > > + > > + p_filename > > + [in] Name of the file that is logging this message > > + > > + p_buf > > + [in] Message string > > +* > > +* RETURN VALUES > > +* None. > > +* > > +* NOTES > > +* > > +* SEE ALSO > > +* Log object, osm_log_construct, > > +* osm_log_destroy > > +*********/ > > + > > + > > +/****f* OpenSM: Log/osm_log_flush > > +* NAME > > +* osm_log_flush > > +* > > +* DESCRIPTION > > +* Flushes the log. > > +* > > +* SYNOPSIS > > +*/ > > +static inline void > > +osm_log_flush( > > + IN osm_log_t* const p_log) > > +{ > > + fflush(p_log->out_port); > > +} > > +/* > > +* PARAMETERS > > +* p_log > > +* [in] Pointer to the log object. > > +* > > +* RETURN VALUES > > +* None. > > +* > > +* NOTES > > +* > > +* SEE ALSO > > +* > > +*********/ > > + > > > > #define DBG_CL_LOCK 0 > > > > Index: opensm/osm_subnet.c > > =================================================================== > > --- opensm/osm_subnet.c (revision 8614) > > +++ opensm/osm_subnet.c (working copy) > > @@ -493,6 +493,8 @@ osm_subn_set_default_opt( > > p_opt->ucast_dump_file = NULL; > > p_opt->updn_guid_file = NULL; > > p_opt->exit_on_fatal = TRUE; > > + p_opt->src_info = FALSE; > > + p_opt->verbosity_file = OSM_DEFAULT_VERBOSITY_FILE; > > subn_set_default_qos_options(&p_opt->qos_options); > > subn_set_default_qos_options(&p_opt->qos_hca_options); > > subn_set_default_qos_options(&p_opt->qos_sw0_options); > > @@ -959,6 +961,13 @@ osm_subn_parse_conf_file( > > "honor_guid2lid_file", > > p_key, p_val, &p_opts->honor_guid2lid_file); > > > > + __osm_subn_opts_unpack_boolean( > > + "log_source_info", > > + p_key, p_val, &p_opts->src_info); > > + > > + __osm_subn_opts_unpack_charp( > > + "verbosity_file", p_key, p_val, &p_opts->verbosity_file); > > + > > subn_parse_qos_options("qos", > > p_key, p_val, &p_opts->qos_options); > > > > @@ -1182,7 +1191,11 @@ osm_subn_write_conf_file( > > "# No multicast routing is performed if TRUE\n" > > "disable_multicast %s\n\n" > > "# If TRUE opensm will exit on fatal initialization issues\n" > > - "exit_on_fatal %s\n\n", > > + "exit_on_fatal %s\n\n" > > + "# If TRUE OpenSM will log filename and line numbers\n" > > + "log_source_info %s\n\n" > > + "# Verbosity configuration file to be used\n" > > + "verbosity_file %s\n\n", > > p_opts->log_flags, > > p_opts->force_log_flush ? "TRUE" : "FALSE", > > p_opts->log_file, > > @@ -1190,7 +1203,9 @@ osm_subn_write_conf_file( > > p_opts->dump_files_dir, > > p_opts->no_multicast_option ? 
"TRUE" : "FALSE", > > p_opts->disable_multicast ? "TRUE" : "FALSE", > > - p_opts->exit_on_fatal ? "TRUE" : "FALSE" > > + p_opts->exit_on_fatal ? "TRUE" : "FALSE", > > + p_opts->src_info ? "TRUE" : "FALSE", > > + p_opts->verbosity_file > > ); > > > > fprintf( > > Index: opensm/osm_opensm.c > > =================================================================== > > --- opensm/osm_opensm.c (revision 8614) > > +++ opensm/osm_opensm.c (working copy) > > @@ -180,8 +180,10 @@ osm_opensm_init( > > /* Can't use log macros here, since we're initializing the log. > */ > > osm_opensm_construct( p_osm ); > > > > - status = osm_log_init( &p_osm->log, p_opt->force_log_flush, > > - p_opt->log_flags, p_opt->log_file, > > p_opt->accum_log_file ); > > + status = osm_log_init_ext( &p_osm->log, p_opt->force_log_flush, > > + p_opt->log_flags, p_opt->log_file, > > + p_opt->accum_log_file, p_opt->src_info, > > + p_opt->verbosity_file); > > if( status != IB_SUCCESS ) > > return ( status ); > > > > Index: opensm/libopensm.map > > > =================================================================== > > --- opensm/libopensm.map (revision 8614) > > +++ opensm/libopensm.map (working copy) > > @@ -1,6 +1,11 @@ > > -OPENSM_1.0 { > > +OPENSM_2.0 { > > global: > > - osm_log; > > + osm_log_init_ext; > > + osm_log_ext; > > + osm_log_raw_ext; > > + osm_log_get_level_ext; > > + osm_log_is_active_ext; > > + osm_log_read_verbosity_file; > > osm_is_debug; > > osm_mad_pool_construct; > > osm_mad_pool_destroy; > > @@ -39,7 +44,6 @@ OPENSM_1.0 { > > osm_dump_dr_path; > > osm_dump_smp_dr_path; > > osm_dump_pkey_block; > > - osm_log_raw; > > osm_get_sm_state_str; > > osm_get_sm_signal_str; > > osm_get_disp_msg_str; > > Rather than remove osm_log and osm_log_raw, these should be > deprecated. > There are other applications outside of OpenSM (like osmtest and > others) > that need this. 
> > > @@ -51,5 +55,11 @@ OPENSM_1.0 { > > osm_get_lsa_str; > > osm_get_sm_mgr_signal_str; > > osm_get_sm_mgr_state_str; > > + st_init_strtable; > > + st_delete; > > + st_insert; > > + st_lookup; > > + st_foreach; > > + st_free_table; > > local: *; > > }; > > Index: opensm/osm_log.c > > =================================================================== > > --- opensm/osm_log.c (revision 8614) > > +++ opensm/osm_log.c (working copy) > > @@ -80,17 +80,365 @@ static char *month_str[] = { > > }; > > #endif /* ndef WIN32 */ > > > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > +#define OSM_VERBOSITY_ALL "all" > > + > > +static void > > +__osm_log_free_verbosity_table( > > + IN osm_log_t* p_log); > > +static void > > +__osm_log_print_verbosity_table( > > + IN osm_log_t* const p_log); > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > +osm_log_level_t > > +osm_log_get_level_ext( > > + IN const osm_log_t* const p_log, > > + IN const char* const p_filename ) > > +{ > > + osm_log_level_t * p_curr_file_level = NULL; > > + > > + if (!p_filename || !p_log->table) > > + return p_log->level; > > + > > + if ( st_lookup( p_log->table, > > + (st_data_t) p_filename, > > + (st_data_t*) &p_curr_file_level) ) > > + return *p_curr_file_level; > > + else > > + return p_log->level; > > +} > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > +ib_api_status_t > > +osm_log_init_ext( > > + IN osm_log_t* const p_log, > > + IN const boolean_t flush, > > + IN const uint8_t log_flags, > > + IN const char *log_file, > > + IN const boolean_t accum_log_file, > > + IN const boolean_t src_info, > > + IN const char *verbosity_file) > > +{ > > + p_log->level = log_flags; > > + p_log->flush = flush; > > + p_log->src_info = src_info; > > + p_log->table = NULL; > > + > > + if (log_file == NULL || !strcmp(log_file, "-") || > > + !strcmp(log_file, "stdout")) > > + { > > + p_log->out_port = stdout; > > + } > > + else if (!strcmp(log_file, "stderr")) > > + { > > + p_log->out_port = stderr; > > + } > > + else > > + { > > + if (accum_log_file) > > + p_log->out_port = fopen(log_file, "a+"); > > + else > > + p_log->out_port = fopen(log_file, "w+"); > > + > > + if (!p_log->out_port) > > + { > > + if (accum_log_file) > > + printf("Cannot open %s for appending. Permission denied > \n", > > log_file); > > + else > > + printf("Cannot open %s for writing. 
Permission denied\n", > > log_file); > > + > > + return(IB_UNKNOWN_ERROR); > > + } > > + } > > + openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); > > + > > + if (cl_spinlock_init( &p_log->lock ) != CL_SUCCESS) > > + return IB_ERROR; > > + > > + osm_log_read_verbosity_file(p_log,verbosity_file); > > + return IB_SUCCESS; > > +} > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > +void > > +osm_log_read_verbosity_file( > > + IN osm_log_t* p_log, > > + IN const char * const verbosity_file) > > +{ > > + FILE *infile; > > + char line[500]; > > + struct stat buf; > > + boolean_t table_empty = TRUE; > > + char * tmp_str = NULL; > > + > > + if (p_log->table) > > + { > > + /* > > + * Free the existing table. > > + * Note: if the verbosity config file will not be found, > this > > will > > + * effectivly reset the existing verbosity configuration and > > set > > + * all the files to the same verbosity level > > + */ > > + __osm_log_free_verbosity_table(p_log); > > + } > > + > > + if (!verbosity_file) > > + return; > > + > > + if ( stat(verbosity_file, &buf) != 0 ) > > + { > > + /* > > + * Verbosity configuration file doesn't exist. > > + */ > > + if (strcmp(verbosity_file,OSM_DEFAULT_VERBOSITY_FILE) == 0) > > + { > > + /* > > + * Verbosity configuration file wasn't explicitly > specified. > > + * No need to issue any error message. > > + */ > > + return; > > + } > > + else > > + { > > + /* > > + * Verbosity configuration file was explicitly specified. > > + */ > > + osm_log(p_log, OSM_LOG_SYS, > > + "ERROR: Verbosity configuration file (%s) doesn't > > exist.\n", > > + verbosity_file); > > + osm_log(p_log, OSM_LOG_SYS, > > + " Using general verbosity value.\n"); > > + return; > > + } > > + } > > + > > + infile = fopen(verbosity_file, "r"); > > + if ( infile == NULL ) > > + { > > + osm_log(p_log, OSM_LOG_SYS, > > + "ERROR: Failed opening verbosity configuration file > > (%s).\n", > > + verbosity_file); > > + osm_log(p_log, OSM_LOG_SYS, > > + " Using general verbosity value.\n"); > > + return; > > + } > > + > > + p_log->table = st_init_strtable(); > > + if (p_log->table == NULL) > > + { > > + osm_log(p_log, OSM_LOG_SYS, "ERROR: Verbosity table > > initialization failed.\n"); > > + return; > > + } > > + > > + /* > > + * Read the file line by line, parse the lines, and > > + * add each line to p_log->table. > > + */ > > + while ( fgets(line, sizeof(line), infile) != NULL ) > > + { > > + char * str = line; > > + char * name = NULL; > > + char * value = NULL; > > + osm_log_level_t * p_log_level_value = NULL; > > + int res; > > + > > + name = strtok_r(str," \t\n",&tmp_str); > > + if (name == NULL || strlen(name) == 0) { > > + /* > > + * empty line - ignore it > > + */ > > + continue; > > + } > > + value = strtok_r(NULL," \t\n",&tmp_str); > > + if (value == NULL || strlen(value) == 0) > > + { > > + /* > > + * No verbosity value - wrong syntax. > > + * This line will be ignored. > > + */ > > + continue; > > + } > > + > > + /* > > + * If the conversion will fail, the log_level_value will get > 0, > > + * so the only way to check that the syntax is correct is to > > + * scan value for any non-digit (which we're not doing > here). 
> > + */ > > + p_log_level_value = malloc (sizeof(osm_log_level_t)); > > + if (!p_log_level_value) > > + { > > + osm_log(p_log, OSM_LOG_SYS, "ERROR: malloc failed.\n"); > > + p_log->table = NULL; > > + fclose(infile); > > + return; > > + } > > + *p_log_level_value = strtoul(value, NULL, 0); > > + > > + if (strcasecmp(name,OSM_VERBOSITY_ALL) == 0) > > + { > > + osm_log_set_level(p_log, *p_log_level_value); > > + free(p_log_level_value); > > + } > > + else > > + { > > + res = st_insert( p_log->table, > > + (st_data_t) strdup(name), > > + (st_data_t) p_log_level_value); > > + if (res != 0) > > + { > > + /* > > + * Something is wrong with the verbosity table. > > + * We won't try to free the table, because there's > > + * clearly something corrupted there. > > + */ > > + osm_log(p_log, OSM_LOG_SYS, "ERROR: Failed adding > > verbosity table element.\n"); > > + p_log->table = NULL; > > + fclose(infile); > > + return; > > + } > > + table_empty = FALSE; > > + } > > + > > + } > > + > > + if (table_empty) > > + __osm_log_free_verbosity_table(p_log); > > + > > + fclose(infile); > > + > > + __osm_log_print_verbosity_table(p_log); > > +} > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > +static int > > +__osm_log_print_verbosity_table_element( > > + IN st_data_t key, > > + IN st_data_t val, > > + IN st_data_t arg) > > +{ > > + osm_log( (osm_log_t* const) arg, > > + OSM_LOG_INFO, > > + "[verbosity] File: %s, Level: 0x%x\n", > > + (char *) key, *((osm_log_level_t *) val)); > > + > > + return ST_CONTINUE; > > +} > > + > > +static void > > +__osm_log_print_verbosity_table( > > + IN osm_log_t* const p_log) > > +{ > > + osm_log( p_log, OSM_LOG_INFO, > > + "[verbosity] Verbosity table loaded\n" ); > > + osm_log( p_log, OSM_LOG_INFO, > > + "[verbosity] General level: > > 0x%x\n",osm_log_get_level_ext(p_log,NULL)); > > + > > + if (p_log->table) > > + { > > + st_foreach( p_log->table, > > + __osm_log_print_verbosity_table_element, > > + (st_data_t) p_log ); > > + } > > + osm_log_flush(p_log); > > +} > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > +static int > > +__osm_log_free_verbosity_table_element( > > + IN st_data_t key, > > + IN st_data_t val, > > + IN st_data_t arg) > > +{ > > + free( (char *) key ); > > + free( (osm_log_level_t *) val ); > > + return ST_DELETE; > > +} > > + > > +static void > > +__osm_log_free_verbosity_table( > > + IN osm_log_t* p_log) > > +{ > > + if (!p_log->table) > > + return; > > + > > + st_foreach( p_log->table, > > + __osm_log_free_verbosity_table_element, > > + (st_data_t) NULL); > > + > > + st_free_table(p_log->table); > > + p_log->table = NULL; > > +} > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > +static inline const char * > > +__osm_log_get_base_name( > > + IN const char * const p_filename) > > +{ > > +#ifdef WIN32 > > + char dir_separator = '\\'; > > +#else > > + char dir_separator = '/'; > > +#endif > > + char * tmp_ptr; > > + > > + if (!p_filename) > > + return NULL; > > + > > + tmp_ptr = strrchr(p_filename,dir_separator); > > + > > + if (!tmp_ptr) > > + return p_filename; > > + return tmp_ptr+1; > > +} > > + > > > 
+/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > +boolean_t > > +osm_log_is_active_ext( > > + IN const osm_log_t* const p_log, > > + IN const char* const p_filename, > > + IN const osm_log_level_t level ) > > +{ > > + osm_log_level_t tmp_lvl; > > + tmp_lvl = level & > > + > > osm_log_get_level_ext(p_log,__osm_log_get_base_name(p_filename)); > > + return ( tmp_lvl != 0 ); > > +} > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > static int log_exit_count = 0; > > > > void > > -osm_log( > > +osm_log_ext( > > IN osm_log_t* const p_log, > > IN const osm_log_level_t verbosity, > > + IN const char *p_filename, > > + IN int line, > > IN const char *p_str, ... ) > > { > > char buffer[LOG_ENTRY_SIZE_MAX]; > > va_list args; > > int ret; > > + osm_log_level_t file_verbosity; > > > > #ifdef WIN32 > > SYSTEMTIME st; > > @@ -108,69 +456,89 @@ osm_log( > > localtime_r(&tim, &result); > > #endif /* WIN32 */ > > > > - /* If this is a call to syslog - always print it */ > > - if ( verbosity & OSM_LOG_SYS ) > > + /* > > + * Extract only the file name out of the full path > > + */ > > + p_filename = __osm_log_get_base_name(p_filename); > > + /* > > + * Get the verbosity level for this file. > > + * If the file is not specified in the log config file, > > + * the general verbosity level will be returned. > > + */ > > + file_verbosity = osm_log_get_level_ext(p_log, p_filename); > > + > > + if ( ! (verbosity & OSM_LOG_SYS) && > > + ! (file_verbosity & verbosity) ) > > { > > - /* this is a call to the syslog */ > > + /* > > + * This is not a syslog message (which is always printed) > > + * and doesn't have the required verbosity level. > > + */ > > + return; > > + } > > + > > va_start( args, p_str ); > > vsprintf( buffer, p_str, args ); > > va_end(args); > > - cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); > > > > + > > + if ( verbosity & OSM_LOG_SYS ) > > + { > > + /* this is a call to the syslog */ > > + cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); > > /* SYSLOG should go to stdout too */ > > if (p_log->out_port != stdout) > > { > > - printf("%s\n", buffer); > > + printf("%s", buffer); > > fflush( stdout ); > > } > > + } > > + /* SYSLOG also goes to to the log file */ > > + > > + cl_spinlock_acquire( &p_log->lock ); > > > > - /* send it also to the log file */ > > #ifdef WIN32 > > GetLocalTime(&st); > > - fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> % > s", > > + if (p_log->src_info) > > + { > > + ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] > > [%s:%d] -> %s", > > st.wHour, st.wMinute, st.wSecond, > > st.wMilliseconds, > > - pid, buffer); > > -#else > > - fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] > -> > > %s\n", > > - (result.tm_mon < 12 ? 
month_str[result.tm_mon] : > "???"), > > - result.tm_mday, result.tm_hour, > > - result.tm_min, result.tm_sec, > > - usecs, pid, buffer); > > - fflush( p_log->out_port ); > > -#endif > > + pid, p_filename, line, buffer); > > } > > - > > - /* SYS messages go to the log anyways */ > > - if (p_log->level & verbosity) > > + else > > { > > - > > - va_start( args, p_str ); > > - vsprintf( buffer, p_str, args ); > > - va_end(args); > > - > > - /* regular log to default out_port */ > > - cl_spinlock_acquire( &p_log->lock ); > > -#ifdef WIN32 > > - GetLocalTime(&st); > > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] > -> > > %s", > > st.wHour, st.wMinute, st.wSecond, > > st.wMilliseconds, > > pid, buffer); > > - > > + } > > #else > > pid = pthread_self(); > > tim = time(NULL); > > + if (p_log->src_info) > > + { > > + ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d > > [%04X] [%s:%d] -> %s", > > + ((result.tm_mon < 12) && (result.tm_mon >= > 0) ? > > + month_str[ result.tm_mon] : "???"), > > + result.tm_mday, result.tm_hour, > > + result.tm_min, result.tm_sec, > > + usecs, pid, p_filename, line, buffer); > > + } > > + else > > + { > > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d > > [%04X] -> %s", > > ((result.tm_mon < 12) && (result.tm_mon >= > 0) ? > > month_str[ result.tm_mon] : "???"), > > result.tm_mday, result.tm_hour, > > result.tm_min, result.tm_sec, > > usecs, pid, buffer); > > -#endif /* WIN32 */ > > - > > + } > > +#endif > > /* > > - Flush log on errors too. > > + * Flush log on errors and SYSLOGs too. > > */ > > - if( p_log->flush || (verbosity & OSM_LOG_ERROR) ) > > + if ( p_log->flush || > > + (verbosity & OSM_LOG_ERROR) || > > + (verbosity & OSM_LOG_SYS) ) > > fflush( p_log->out_port ); > > > > cl_spinlock_release( &p_log->lock ); > > @@ -183,15 +551,30 @@ osm_log( > > } > > } > > } > > -} > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > > > void > > -osm_log_raw( > > +osm_log_raw_ext( > > IN osm_log_t* const p_log, > > IN const osm_log_level_t verbosity, > > + IN const char * p_filename, > > IN const char *p_buf ) > > { > > - if( p_log->level & verbosity ) > > + osm_log_level_t file_verbosity; > > + /* > > + * Extract only the file name out of the full path > > + */ > > + p_filename = __osm_log_get_base_name(p_filename); > > + /* > > + * Get the verbosity level for this file. > > + * If the file is not specified in the log config file, > > + * the general verbosity level will be returned. 
> > + */ > > + file_verbosity = osm_log_get_level_ext(p_log, p_filename); > > + > > + if ( file_verbosity & verbosity ) > > { > > cl_spinlock_acquire( &p_log->lock ); > > printf( "%s", p_buf ); > > @@ -205,6 +588,9 @@ osm_log_raw( > > } > > } > > > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > boolean_t > > osm_is_debug(void) > > { > > @@ -214,3 +600,7 @@ osm_is_debug(void) > > return FALSE; > > #endif /* defined( _DEBUG_ ) */ > > } > > + > > > +/*************************************************************************** > > + > > > ***************************************************************************/ > > + > > Index: opensm/main.c > > > =================================================================== > > --- opensm/main.c (revision 8652) > > +++ opensm/main.c (working copy) > > @@ -296,6 +296,33 @@ show_usage(void) > > " -d3 - Disable multicast support\n" > > " -d10 - Put OpenSM in testability mode\n" > > " Without -d, no debug options are enabled\n > \n" ); > > + printf( "-S\n" > > + "--log_source_info\n" > > + " This option tells SM to add source code > > filename\n" > > + " and line number to every log message.\n" > > + " By default, the SM will not log this > additional > > info.\n\n"); > > + printf( "-b\n" > > + "--verbosity_file \n" > > + " This option specifies name of the verbosity > \n" > > + " configuration file, which describes verbosity > > level\n" > > + " per source code file. The file may contain > zero > > or\n" > > + " more lines of the following pattern:\n" > > + " filename verbosity_level\n" > > + " where 'filename' is the name of the source > code > > file\n" > > + " that the 'verbosity_level' refers to, and the > > \n" > > + " 'verbosity_level' itself should be specified > as > > a\n" > > + " number (decimal or hexadecimal).\n" > > + " Filename 'all' represents general verbosity > > level,\n" > > + " that is used for all the files that are not > > specified\n" > > + " in the verbosity file.\n" > > + " Note: The 'all' file verbosity level will > > override any\n" > > + " other general level that was specified by the > > command\n" > > + " line arguments.\n" > > + " By default, the SM will use the following > > file:\n" > > + " %s\n" > > + " Sending a SIGHUP signal to the SM will cause > it > > to\n" > > + " re-read the verbosity configuration file.\n" > > + "\n\n", OSM_DEFAULT_VERBOSITY_FILE); > > printf( "-h\n" > > "--help\n" > > " Display this usage info then exit.\n\n" ); > > @@ -527,7 +554,7 @@ main( > > boolean_t cache_options = FALSE; > > char *ignore_guids_file_name = NULL; > > uint32_t val; > > - const char * const short_option = > > "i:f:ed:g:l:s:t:a:R:U:P:NQvVhorcyx"; > > + const char * const short_option = > > "i:f:ed:g:l:s:t:a:R:U:P:b:SNQvVhorcyx"; > > > > /* > > In the array below, the 2nd parameter specified the number > > @@ -565,6 +592,8 @@ main( > > { "cache-options", 0, NULL, 'c'}, > > { "stay_on_fatal", 0, NULL, 'y'}, > > { "honor_guid2lid", 0, NULL, 'x'}, > > + { "log_source_info",0,NULL, 'S'}, > > + { "verbosity_file",1, NULL, 'b'}, > > { NULL, 0, NULL, 0 } /* Required at the end of > > the array */ > > }; > > > > @@ -808,6 +837,16 @@ main( > > printf (" Honor guid2lid file, if possible\n"); > > break; > > > > + case 'S': > > + opt.src_info = TRUE; > > + printf(" Logging source code filename and line number\n"); > > + break; > > + > > + case 'b': > > + opt.verbosity_file = optarg; > > + printf(" Verbosity 
Configuration File: %s\n", optarg); > > + break; > > + > > case 'h': > > case '?': > > case ':': > > @@ -920,9 +959,13 @@ main( > > > > if (osm_hup_flag) { > > osm_hup_flag = 0; > > - /* a HUP signal should only start a new heavy sweep */ > > + /* > > + * A HUP signal should cause OSM to re-read the log > > + * configuration file and start a new heavy sweep > > + */ > > osm.subn.force_immediate_heavy_sweep = TRUE; > > osm_opensm_sweep( &osm ); > > + osm_log_read_verbosity_file(&osm.log,opt.verbosity_file); > > } > > } > > } > > Index: opensm/Makefile.am > > =================================================================== > > --- opensm/Makefile.am (revision 8614) > > +++ opensm/Makefile.am (working copy) > > @@ -43,7 +43,7 @@ else > > libopensm_version_script = > > endif > > > > -libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c > > +libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c st.c > > libopensm_la_LDFLAGS = -version-info $(opensm_api_version) \ > > -export-dynamic $(libopensm_version_script) > > libopensm_la_DEPENDENCIES = $(srcdir)/libopensm.map > > @@ -90,7 +90,7 @@ opensm_SOURCES = main.c osm_console.c os > > osm_trap_rcv.c osm_trap_rcv_ctrl.c \ > > osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \ > > osm_vl15intf.c osm_vl_arb_rcv.c \ > > - osm_vl_arb_rcv_ctrl.c st.c > > + osm_vl_arb_rcv_ctrl.c > > if OSMV_OPENIB > > opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing > > -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) > > -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > > opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT > > -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > > Index: doc/verbosity-config.txt > > =================================================================== > > --- doc/verbosity-config.txt (revision 0) > > +++ doc/verbosity-config.txt (revision 0) > > @@ -0,0 +1,43 @@ > > + > > +This patch adds new verbosity functionality. > > + > > +1. Verbosity configuration file > > +------------------------------- > > + > > +The user is able to set verbosity level per source code file > > +by supplying verbosity configuration file using the following > > +command line arguments: > > + > > + -b filename > > + --verbosity_file filename > > + > > +By default, the OSM will use the following > file: /etc/opensmlog.conf > > +Verbosity configuration file should contain zero or more lines of > > +the following pattern: > > + > > + filename verbosity_level > > + > > +where 'filename' is the name of the source code file that the > > +'verbosity_level' refers to, and the 'verbosity_level' itself > > +should be specified as an integer number (decimal or > hexadecimal). > > + > > +One reserved filename is 'all' - it represents general verbosity > > +level, that is used for all the files that are not specified in > > +the verbosity configuration file. > > +If 'all' is not specified, the verbosity level set in the > > +command line will be used instead. > > +Note: The 'all' file verbosity level will override any other > > +general level that was specified by the command line arguments. > > + > > +Sending a SIGHUP signal to the OSM will cause it to reload > > +the verbosity configuration file. > > + > > + > > +2. Logging source code filename and line number > > +----------------------------------------------- > > + > > +If command line option -S or --log_source_info is specified, > > +OSM will add source code filename and line number to every > > +log message that is written to the log file. 
> > +By default, the OSM will not log this additional info. > > + > > > > From tziporet at mellanox.co.il Sun Aug 27 07:31:19 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 27 Aug 2006 17:31:19 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <1156526995.12257.107.camel@fc6.xsintricity.com> References: <44EA004F.2060608@mellanox.co.il> <1156526995.12257.107.camel@fc6.xsintricity.com> Message-ID: <44F1ACB7.6050702@mellanox.co.il> Doug Ledford wrote: > > Not supporting ppc is a problem to a certain extent. I can't speak for > SuSE, but at least for Red Hat, ppc is the default and over rides ppc64. > The ppc64 arch is less efficient than the ppc arch on ppc64 processors > except when large memory footprints are involved. So, for things like > opensm, ibv_*, etc. the ppc arch should actually be preferred, and the > ppc64 arch libs should be present for those end user apps that need > large memory access. That fact that dapl doesn't compile on ppc at all > is problematic as well. In addition, what are you guys doing about the > lack of asm/atomic.h (breaking udapl compiles on ppc64 and ia64) going > forward? I'd look in the packages and see for myself but the svn update > is taking forever due to those binary rpms packed into svn...ahh, it's > finally done....ok, still broken. > We don't have here any PPC machine for event for compilation checks :-( > Without getting into an argument over the usage of that include, suffice > it to say that the include file is gone and builds fails on > fc6/rhel5beta. Since the code really only uses low level intrinsics as > opposed to high level atomic ops, I made a ppc and ia64 intrinsics > header for linux and added it to the dapl package itself to work around > the issue. > Please work with James and Arlin to drive these changes to uDAPL. > > Ugh. Each library does not need it's own package. Imagine what X would > do to your RPM count otherwise. For grouped libraries like this, it is > perfectly acceptable to do opensm, opensm-libs, opensm-devel (and that's > in fact what I did for RHEL4 U4). Regardless though, make a decision > and stick to it. Changing package names with each release == not good. > I agree - but we had to change since there was requirement from Cisco to include diagnostic tools and not opensm. It might be that we did it not in the best way and Vlad will check if he can improve it according to your directives. >> 5. RHEL4 up4 is not supported due to problems in the backport patches. > > You should be able to start by pulling the patches that are already > applied out of the RHEL4 U4 kernel rpm, looking at which ones fix up the > core kernel to provide what's needed instead of doing a thousand little > backports all over the kernel tree, and axing any backport patches you > had planned that would undo that. IOW, make use of the infrastructure > provided in U4 instead of working around it. > We already fixed this and RHEL4 will be part of rc3. We will also look at the way you did in RHEL4 up4 later and learn for future releases. 
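[Illustration only, not code from the OFED tree: the kind of small backport shim this thread is about, assuming a stock 2.6.9-based kernel that predates kzalloc() and kstrdup() (both named in the next message). The header guard and function names are invented for the sketch; gfp_t is avoided because it does not exist on 2.6.9.]

	#ifndef OFED_BACKPORT_SLAB_H		/* hypothetical guard */
	#define OFED_BACKPORT_SLAB_H

	#include <linux/slab.h>
	#include <linux/string.h>

	/* Zero-initialized allocation; present in RHEL4 U3+ and in mainline
	 * kernels newer than 2.6.9, so a shim like this is only compiled in
	 * when the target kernel lacks it. */
	static inline void *backport_kzalloc(size_t size, int flags)
	{
		void *p = kmalloc(size, flags);
		if (p)
			memset(p, 0, size);
		return p;
	}

	/* Kernel-space strdup(), likewise absent from a stock 2.6.9 kernel. */
	static inline char *backport_kstrdup(const char *s, int flags)
	{
		size_t len;
		char *p;

		if (!s)
			return NULL;
		len = strlen(s) + 1;
		p = kmalloc(len, flags);
		if (p)
			memcpy(p, s, len);
		return p;
	}

	#endif /* OFED_BACKPORT_SLAB_H */

The point made for RHEL4 U4 in the following messages is that most such shims can simply be dropped, since the vendor kernel already carries the upstream helpers.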
Tziporet From dledford at redhat.com Sun Aug 27 07:51:08 2006 From: dledford at redhat.com (Doug Ledford) Date: Sun, 27 Aug 2006 10:51:08 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <20060826194225.GH21168@mellanox.co.il> References: <1156526995.12257.107.camel@fc6.xsintricity.com> <20060826194225.GH21168@mellanox.co.il> Message-ID: <1156690268.15782.18.camel@fc6.xsintricity.com> On Sat, 2006-08-26 at 22:42 +0300, Michael S. Tsirkin wrote: > Quoting r. Doug Ledford : > > IOW, make use of the infrastructure > > provided in U4 instead of working around it. > > Sorry, I don't really understand what you suggest here. > Could you give us an example please? Sure. The stock 2.6.9 kernel doesn't include kzalloc() or kstrdup(). However, both of those exist in the Red Hat RHEL4 U3 and later kernels (yeah, I know, there isn't an easy way to know this, but unless you just *need* to support a U2 or earlier kernel, then you can just assume Red Hat has this). Likewise, the U4 and later kernels have the proper class functions in the core kernel so you don't need small backports of stuff like class_create in the uverbs_main patch. Ditto for the get_sb_psuedo in the core patch. Ditto for a number of things in the ipath backports. Ditto for quite a few other things as well. In general, there's less need to backport under RHEL4 U4 than previously, and where possible it would be best to make use of the core kernel's enhancements relative to a stock 2.6.9. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From kliteyn at mellanox.co.il Sun Aug 27 07:49:09 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 27 Aug 2006 17:49:09 +0300 Subject: [openib-general] [PATCHv2] osm: Dynamic verbosity control per file Message-ID: Hi Hal. This is a resubmission of the patch that addresses the comments that I got on the first version - using osm-log.conf file instead of opensmlog.conf and osm man page update. Yevgeny Signed-off-by: Yevgeny Kliteynik Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 9107) +++ include/opensm/osm_subnet.h (working copy) @@ -285,6 +285,8 @@ typedef struct _osm_subn_opt osm_qos_options_t qos_sw0_options; osm_qos_options_t qos_swe_options; osm_qos_options_t qos_rtr_options; + boolean_t src_info; + char * verbosity_file; } osm_subn_opt_t; /* * FIELDS @@ -463,6 +465,27 @@ typedef struct _osm_subn_opt * qos_rtr_options * QoS options for router ports * +* src_info +* If TRUE - the source code filename and line number will be +* added to each log message. +* Default value - FALSE. +* +* verbosity_file +* OSM log configuration file - the file that describes +* verbosity level per source code file. +* The file may containg zero or more lines of the following +* pattern: +* filename verbosity_level +* where 'filename' is the name of the source code file that +* the 'verbosity_level' refers to. +* Filename "all" represents general verbosity level, that is +* used for all the files that are not specified in the +* verbosity file. +* If "all" is not specified, the general verbosity level will +* be used instead. 
+* Note: the "all" file verbosity level will override any other +* general level that was specified by the command line arguments. +* * SEE ALSO * Subnet object *********/ Index: include/opensm/osm_base.h =================================================================== --- include/opensm/osm_base.h (revision 9107) +++ include/opensm/osm_base.h (working copy) @@ -222,6 +222,22 @@ BEGIN_C_DECLS #endif /***********/ +/****d* OpenSM: Base/OSM_DEFAULT_VERBOSITY_FILE +* NAME +* OSM_DEFAULT_VERBOSITY_FILE +* +* DESCRIPTION +* Specifies the default verbosity config file name +* +* SYNOPSIS +*/ +#ifdef __WIN__ +#define OSM_DEFAULT_VERBOSITY_FILE strcat(GetOsmPath(), "osm-log.conf") +#else +#define OSM_DEFAULT_VERBOSITY_FILE "/etc/osm-log.conf" +#endif +/***********/ + /****d* OpenSM: Base/OSM_DEFAULT_PARTITION_CONFIG_FILE * NAME * OSM_DEFAULT_PARTITION_CONFIG_FILE Index: include/opensm/osm_log.h =================================================================== --- include/opensm/osm_log.h (revision 9107) +++ include/opensm/osm_log.h (working copy) @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -123,9 +124,45 @@ typedef struct _osm_log cl_spinlock_t lock; boolean_t flush; FILE* out_port; + boolean_t src_info; + st_table * table; } osm_log_t; /*********/ +/****f* OpenSM: Log/osm_log_read_verbosity_file +* NAME +* osm_log_read_verbosity_file +* +* DESCRIPTION +* This function reads the verbosity configuration file +* and constructs a verbosity data structure. +* +* SYNOPSIS +*/ +void +osm_log_read_verbosity_file( + IN osm_log_t* p_log, + IN const char * const verbosity_file); +/* +* PARAMETERS +* p_log +* [in] Pointer to a Log object to construct. +* +* verbosity_file +* [in] verbosity configuration file +* +* RETURN VALUE +* None +* +* NOTES +* If the verbosity configuration file is not found, default +* verbosity value is used for all files. +* If there is an error in some line of the verbosity +* configuration file, the line is ignored. +* +*********/ + + /****f* OpenSM: Log/osm_log_construct * NAME * osm_log_construct @@ -201,9 +238,13 @@ osm_log_destroy( * osm_log_init *********/ -/****f* OpenSM: Log/osm_log_init +#define osm_log_init(p_log, flush, log_flags, log_file, accum_log_file) \ + osm_log_init_ext(p_log, flush, (log_flags), log_file, \ + accum_log_file, FALSE, OSM_DEFAULT_VERBOSITY_FILE) + +/****f* OpenSM: Log/osm_log_init_ext * NAME -* osm_log_init +* osm_log_init_ext * * DESCRIPTION * The osm_log_init function initializes a @@ -211,50 +252,15 @@ osm_log_destroy( * * SYNOPSIS */ -static inline ib_api_status_t -osm_log_init( +ib_api_status_t +osm_log_init_ext( IN osm_log_t* const p_log, IN const boolean_t flush, IN const uint8_t log_flags, IN const char *log_file, - IN const boolean_t accum_log_file ) -{ - p_log->level = log_flags; - p_log->flush = flush; - - if (log_file == NULL || !strcmp(log_file, "-") || - !strcmp(log_file, "stdout")) - { - p_log->out_port = stdout; - } - else if (!strcmp(log_file, "stderr")) - { - p_log->out_port = stderr; - } - else - { - if (accum_log_file) - p_log->out_port = fopen(log_file, "a+"); - else - p_log->out_port = fopen(log_file, "w+"); - - if (!p_log->out_port) - { - if (accum_log_file) - printf("Cannot open %s for appending. Permission denied\n", log_file); - else - printf("Cannot open %s for writing. 
Permission denied\n", log_file); - - return(IB_UNKNOWN_ERROR); - } - } - openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); - - if (cl_spinlock_init( &p_log->lock ) == CL_SUCCESS) - return IB_SUCCESS; - else - return IB_ERROR; -} + IN const boolean_t accum_log_file, + IN const boolean_t src_info, + IN const char *verbosity_file); /* * PARAMETERS * p_log @@ -271,6 +277,16 @@ osm_log_init( * log_file * [in] if not NULL defines the name of the log file. Otherwise it is stdout. * +* accum_log_file +* [in] Whether the log file should be accumulated. +* +* src_info +* [in] Set to TRUE directs the log to add filename and line number +* to each log message. +* +* verbosity_file +* [in] Log configuration file location. +* * RETURN VALUES * CL_SUCCESS if the Log object was initialized * successfully. @@ -283,25 +299,31 @@ osm_log_init( * osm_log_destroy *********/ -/****f* OpenSM: Log/osm_log_get_level +#define osm_log_get_level(p_log) \ + osm_log_get_level_ext(p_log, __FILE__) + +/****f* OpenSM: Log/osm_log_get_level_ext * NAME -* osm_log_get_level +* osm_log_get_level_ext * * DESCRIPTION -* Returns the current log level. +* Returns the current log level for the file. +* If the file is not specified in the log config file, +* the general verbosity level will be returned. * * SYNOPSIS */ -static inline osm_log_level_t -osm_log_get_level( - IN const osm_log_t* const p_log ) -{ - return( p_log->level ); -} +osm_log_level_t +osm_log_get_level_ext( + IN const osm_log_t* const p_log, + IN const char* const p_filename ); /* * PARAMETERS * p_log * [in] Pointer to the log object. +* +* p_filename +* [in] Source code file name. * * RETURN VALUES * Returns the current log level. @@ -310,7 +332,7 @@ osm_log_get_level( * * SEE ALSO * Log object, osm_log_construct, -* osm_log_destroy +* osm_log_destroy, osm_log_get_level *********/ /****f* OpenSM: Log/osm_log_set_level @@ -318,7 +340,7 @@ osm_log_get_level( * osm_log_set_level * * DESCRIPTION -* Sets the current log level. +* Sets the current general log level. * * SYNOPSIS */ @@ -338,7 +360,7 @@ osm_log_set_level( * [in] New level to set. * * RETURN VALUES -* Returns the current log level. +* None. * * NOTES * @@ -347,9 +369,12 @@ osm_log_set_level( * osm_log_destroy *********/ -/****f* OpenSM: Log/osm_log_is_active +#define osm_log_is_active(p_log, level) \ + osm_log_is_active_ext(p_log, __FILE__, level) + +/****f* OpenSM: Log/osm_log_is_active_ext * NAME -* osm_log_is_active +* osm_log_is_active_ext * * DESCRIPTION * Returns TRUE if the specified log level would be logged. @@ -357,18 +382,19 @@ osm_log_set_level( * * SYNOPSIS */ -static inline boolean_t -osm_log_is_active( +boolean_t +osm_log_is_active_ext( IN const osm_log_t* const p_log, - IN const osm_log_level_t level ) -{ - return( (p_log->level & level) != 0 ); -} + IN const char* const p_filename, + IN const osm_log_level_t level ); /* * PARAMETERS * p_log * [in] Pointer to the log object. * +* p_filename +* [in] Source code file name. +* * level * [in] Level to check. * @@ -383,17 +409,125 @@ osm_log_is_active( * osm_log_destroy *********/ + +#define osm_log(p_log, verbosity, p_str, args...) \ + osm_log_ext(p_log, verbosity, __FILE__, __LINE__, p_str , ## args) + +/****f* OpenSM: Log/osm_log_ext +* NAME +* osm_log_ext +* +* DESCRIPTION +* Logs the formatted specified message. +* +* SYNOPSIS +*/ void -osm_log( +osm_log_ext( IN osm_log_t* const p_log, IN const osm_log_level_t verbosity, + IN const char *p_filename, + IN int line, IN const char *p_str, ... 
); +/* +* PARAMETERS +* p_log +* [in] Pointer to the log object. +* +* verbosity +* [in] Current message verbosity level + + p_filename + [in] Name of the file that is logging this message + + line + [in] Line number in the file that is logging this message + p_str + [in] Format string of the message +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* Log object, osm_log_construct, +* osm_log_destroy +*********/ + +#define osm_log_raw(p_log, verbosity, p_buff) \ + osm_log_raw_ext(p_log, verbosity, __FILE__, p_buff) + +/****f* OpenSM: Log/osm_log_raw_ext +* NAME +* osm_log_ext +* +* DESCRIPTION +* Logs the specified message. +* +* SYNOPSIS +*/ void -osm_log_raw( +osm_log_raw_ext( IN osm_log_t* const p_log, IN const osm_log_level_t verbosity, + IN const char * p_filename, IN const char *p_buf ); +/* +* PARAMETERS +* p_log +* [in] Pointer to the log object. +* +* verbosity +* [in] Current message verbosity level + + p_filename + [in] Name of the file that is logging this message + + p_buf + [in] Message string +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* Log object, osm_log_construct, +* osm_log_destroy +*********/ + + +/****f* OpenSM: Log/osm_log_flush +* NAME +* osm_log_flush +* +* DESCRIPTION +* Flushes the log. +* +* SYNOPSIS +*/ +static inline void +osm_log_flush( + IN osm_log_t* const p_log) +{ + fflush(p_log->out_port); +} +/* +* PARAMETERS +* p_log +* [in] Pointer to the log object. +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* +*********/ + #define DBG_CL_LOCK 0 Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 9107) +++ opensm/osm_subnet.c (working copy) @@ -493,6 +493,8 @@ osm_subn_set_default_opt( p_opt->ucast_dump_file = NULL; p_opt->updn_guid_file = NULL; p_opt->exit_on_fatal = TRUE; + p_opt->src_info = FALSE; + p_opt->verbosity_file = OSM_DEFAULT_VERBOSITY_FILE; subn_set_default_qos_options(&p_opt->qos_options); subn_set_default_qos_options(&p_opt->qos_hca_options); subn_set_default_qos_options(&p_opt->qos_sw0_options); @@ -959,6 +961,13 @@ osm_subn_parse_conf_file( "honor_guid2lid_file", p_key, p_val, &p_opts->honor_guid2lid_file); + __osm_subn_opts_unpack_boolean( + "log_source_info", + p_key, p_val, &p_opts->src_info); + + __osm_subn_opts_unpack_charp( + "verbosity_file", p_key, p_val, &p_opts->verbosity_file); + subn_parse_qos_options("qos", p_key, p_val, &p_opts->qos_options); @@ -1182,7 +1191,11 @@ osm_subn_write_conf_file( "# No multicast routing is performed if TRUE\n" "disable_multicast %s\n\n" "# If TRUE opensm will exit on fatal initialization issues\n" - "exit_on_fatal %s\n\n", + "exit_on_fatal %s\n\n" + "# If TRUE OpenSM will log filename and line numbers\n" + "log_source_info %s\n\n" + "# Verbosity configuration file to be used\n" + "verbosity_file %s\n\n", p_opts->log_flags, p_opts->force_log_flush ? "TRUE" : "FALSE", p_opts->log_file, @@ -1190,7 +1203,9 @@ osm_subn_write_conf_file( p_opts->dump_files_dir, p_opts->no_multicast_option ? "TRUE" : "FALSE", p_opts->disable_multicast ? "TRUE" : "FALSE", - p_opts->exit_on_fatal ? "TRUE" : "FALSE" + p_opts->exit_on_fatal ? "TRUE" : "FALSE", + p_opts->src_info ? "TRUE" : "FALSE", + p_opts->verbosity_file ); fprintf( Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 9107) +++ opensm/osm_opensm.c (working copy) @@ -180,8 +180,10 @@ osm_opensm_init( /* Can't use log macros here, since we're initializing the log. 
*/ osm_opensm_construct( p_osm ); - status = osm_log_init( &p_osm->log, p_opt->force_log_flush, - p_opt->log_flags, p_opt->log_file, p_opt->accum_log_file ); + status = osm_log_init_ext( &p_osm->log, p_opt->force_log_flush, + p_opt->log_flags, p_opt->log_file, + p_opt->accum_log_file, p_opt->src_info, + p_opt->verbosity_file); if( status != IB_SUCCESS ) return ( status ); Index: opensm/libopensm.map =================================================================== --- opensm/libopensm.map (revision 9107) +++ opensm/libopensm.map (working copy) @@ -1,6 +1,11 @@ -OPENSM_1.1 { +OPENSM_2.0 { global: - osm_log; + osm_log_init_ext; + osm_log_ext; + osm_log_raw_ext; + osm_log_get_level_ext; + osm_log_is_active_ext; + osm_log_read_verbosity_file; osm_is_debug; osm_mad_pool_construct; osm_mad_pool_destroy; @@ -39,7 +44,6 @@ OPENSM_1.1 { osm_dump_dr_path; osm_dump_smp_dr_path; osm_dump_pkey_block; - osm_log_raw; osm_get_sm_state_str; osm_get_sm_signal_str; osm_get_disp_msg_str; @@ -51,5 +55,11 @@ OPENSM_1.1 { osm_get_lsa_str; osm_get_sm_mgr_signal_str; osm_get_sm_mgr_state_str; + st_init_strtable; + st_delete; + st_insert; + st_lookup; + st_foreach; + st_free_table; local: *; }; Index: opensm/osm_log.c =================================================================== --- opensm/osm_log.c (revision 9107) +++ opensm/osm_log.c (working copy) @@ -81,98 +81,465 @@ static char *month_str[] = { }; #endif /* ndef WIN32 */ -static int log_exit_count = 0; -void -osm_log( - IN osm_log_t* const p_log, - IN const osm_log_level_t verbosity, - IN const char *p_str, ... ) -{ - char buffer[LOG_ENTRY_SIZE_MAX]; - va_list args; - int ret; +/*************************************************************************** + ***************************************************************************/ -#ifdef WIN32 - SYSTEMTIME st; - uint32_t pid = GetCurrentThreadId(); -#else - pid_t pid = 0; - time_t tim; - struct tm result; - uint64_t time_usecs; - uint32_t usecs; - - time_usecs = cl_get_time_stamp(); - tim = time_usecs/1000000; - usecs = time_usecs % 1000000; - localtime_r(&tim, &result); -#endif /* WIN32 */ +#define OSM_VERBOSITY_ALL "all" + +static void +__osm_log_free_verbosity_table( + IN osm_log_t* p_log); +static void +__osm_log_print_verbosity_table( + IN osm_log_t* const p_log); + +/*************************************************************************** + ***************************************************************************/ + +osm_log_level_t +osm_log_get_level_ext( + IN const osm_log_t* const p_log, + IN const char* const p_filename ) +{ + osm_log_level_t * p_curr_file_level = NULL; + + if (!p_filename || !p_log->table) + return p_log->level; + + if ( st_lookup( p_log->table, + (st_data_t) p_filename, + (st_data_t*) &p_curr_file_level) ) + return *p_curr_file_level; + else + return p_log->level; +} + +/*************************************************************************** + ***************************************************************************/ - /* If this is a call to syslog - always print it */ - if ( verbosity & OSM_LOG_SYS ) +ib_api_status_t +osm_log_init_ext( + IN osm_log_t* const p_log, + IN const boolean_t flush, + IN const uint8_t log_flags, + IN const char *log_file, + IN const boolean_t accum_log_file, + IN const boolean_t src_info, + IN const char *verbosity_file) +{ + p_log->level = log_flags; + p_log->flush = flush; + p_log->src_info = src_info; + p_log->table = NULL; + + if (log_file == NULL || !strcmp(log_file, "-") || + !strcmp(log_file, "stdout")) + { + p_log->out_port = 
stdout; + } + else if (!strcmp(log_file, "stderr")) + { + p_log->out_port = stderr; + } + else { - /* this is a call to the syslog */ - va_start( args, p_str ); - vsprintf( buffer, p_str, args ); - va_end(args); - cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); + if (accum_log_file) + p_log->out_port = fopen(log_file, "a+"); + else + p_log->out_port = fopen(log_file, "w+"); - /* SYSLOG should go to stdout too */ - if (p_log->out_port != stdout) + if (!p_log->out_port) { - printf("%s\n", buffer); - fflush( stdout ); + if (accum_log_file) + printf("Cannot open %s for appending. Permission denied\n", log_file); + else + printf("Cannot open %s for writing. Permission denied\n", log_file); + + return(IB_UNKNOWN_ERROR); } + } + openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); + + if (cl_spinlock_init( &p_log->lock ) != CL_SUCCESS) + return IB_ERROR; + + osm_log_read_verbosity_file(p_log,verbosity_file); + return IB_SUCCESS; +} + +/*************************************************************************** + ***************************************************************************/ + +void +osm_log_read_verbosity_file( + IN osm_log_t* p_log, + IN const char * const verbosity_file) +{ + FILE *infile; + char line[500]; + struct stat buf; + boolean_t table_empty = TRUE; + char * tmp_str = NULL; + + if (p_log->table) + { + /* + * Free the existing table. + * Note: if the verbosity config file will not be found, this will + * effectivly reset the existing verbosity configuration and set + * all the files to the same verbosity level + */ + __osm_log_free_verbosity_table(p_log); + } + + if (!verbosity_file) + return; + + if ( stat(verbosity_file, &buf) != 0 ) + { + /* + * Verbosity configuration file doesn't exist. + */ + if (strcmp(verbosity_file,OSM_DEFAULT_VERBOSITY_FILE) == 0) + { + /* + * Verbosity configuration file wasn't explicitly specified. + * No need to issue any error message. + */ + return; + } + else + { + /* + * Verbosity configuration file was explicitly specified. + */ + osm_log(p_log, OSM_LOG_SYS, + "ERROR: Verbosity configuration file (%s) doesn't exist.\n", + verbosity_file); + osm_log(p_log, OSM_LOG_SYS, + " Using general verbosity value.\n"); + return; + } + } + + infile = fopen(verbosity_file, "r"); + if ( infile == NULL ) + { + osm_log(p_log, OSM_LOG_SYS, + "ERROR: Failed opening verbosity configuration file (%s).\n", + verbosity_file); + osm_log(p_log, OSM_LOG_SYS, + " Using general verbosity value.\n"); + return; + } + + p_log->table = st_init_strtable(); + if (p_log->table == NULL) + { + osm_log(p_log, OSM_LOG_SYS, "ERROR: Verbosity table initialization failed.\n"); + return; + } + + /* + * Read the file line by line, parse the lines, and + * add each line to p_log->table. + */ + while ( fgets(line, sizeof(line), infile) != NULL ) + { + char * str = line; + char * name = NULL; + char * value = NULL; + osm_log_level_t * p_log_level_value = NULL; + int res; + + name = strtok_r(str," \t\n",&tmp_str); + if (name == NULL || strlen(name) == 0) { + /* + * empty line - ignore it + */ + continue; + } + value = strtok_r(NULL," \t\n",&tmp_str); + if (value == NULL || strlen(value) == 0) + { + /* + * No verbosity value - wrong syntax. + * This line will be ignored. + */ + continue; + } + + /* + * If the conversion will fail, the log_level_value will get 0, + * so the only way to check that the syntax is correct is to + * scan value for any non-digit (which we're not doing here). 
+ */ + p_log_level_value = malloc (sizeof(osm_log_level_t)); + if (!p_log_level_value) + { + osm_log(p_log, OSM_LOG_SYS, "ERROR: malloc failed.\n"); + p_log->table = NULL; + fclose(infile); + return; + } + *p_log_level_value = strtoul(value, NULL, 0); + + if (strcasecmp(name,OSM_VERBOSITY_ALL) == 0) + { + osm_log_set_level(p_log, *p_log_level_value); + free(p_log_level_value); + } + else + { + res = st_insert( p_log->table, + (st_data_t) strdup(name), + (st_data_t) p_log_level_value); + if (res != 0) + { + /* + * Something is wrong with the verbosity table. + * We won't try to free the table, because there's + * clearly something corrupted there. + */ + osm_log(p_log, OSM_LOG_SYS, "ERROR: Failed adding verbosity table element.\n"); + p_log->table = NULL; + fclose(infile); + return; + } + table_empty = FALSE; + } + + } + + if (table_empty) + __osm_log_free_verbosity_table(p_log); + + fclose(infile); + + __osm_log_print_verbosity_table(p_log); +} + +/*************************************************************************** + ***************************************************************************/ + +static int +__osm_log_print_verbosity_table_element( + IN st_data_t key, + IN st_data_t val, + IN st_data_t arg) +{ + osm_log( (osm_log_t* const) arg, + OSM_LOG_INFO, + "[verbosity] File: %s, Level: 0x%x\n", + (char *) key, *((osm_log_level_t *) val)); + + return ST_CONTINUE; +} + +static void +__osm_log_print_verbosity_table( + IN osm_log_t* const p_log) +{ + osm_log( p_log, OSM_LOG_INFO, + "[verbosity] Verbosity table loaded\n" ); + osm_log( p_log, OSM_LOG_INFO, + "[verbosity] General level: 0x%x\n",osm_log_get_level_ext(p_log,NULL)); + + if (p_log->table) + { + st_foreach( p_log->table, + __osm_log_print_verbosity_table_element, + (st_data_t) p_log ); + } + osm_log_flush(p_log); +} + +/*************************************************************************** + ***************************************************************************/ + +static int +__osm_log_free_verbosity_table_element( + IN st_data_t key, + IN st_data_t val, + IN st_data_t arg) +{ + free( (char *) key ); + free( (osm_log_level_t *) val ); + return ST_DELETE; +} - /* send it also to the log file */ +static void +__osm_log_free_verbosity_table( + IN osm_log_t* p_log) +{ + if (!p_log->table) + return; + + st_foreach( p_log->table, + __osm_log_free_verbosity_table_element, + (st_data_t) NULL); + + st_free_table(p_log->table); + p_log->table = NULL; +} + +/*************************************************************************** + ***************************************************************************/ + +static inline const char * +__osm_log_get_base_name( + IN const char * const p_filename) +{ #ifdef WIN32 - GetLocalTime(&st); - fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", - st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, - pid, buffer); + char dir_separator = '\\'; #else - fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s\n", - (result.tm_mon < 12 ? 
month_str[result.tm_mon] : "???"), - result.tm_mday, result.tm_hour, - result.tm_min, result.tm_sec, - usecs, pid, buffer); - fflush( p_log->out_port ); -#endif - } + char dir_separator = '/'; +#endif + char * tmp_ptr; + + if (!p_filename) + return NULL; + + tmp_ptr = strrchr(p_filename,dir_separator); + + if (!tmp_ptr) + return p_filename; + return tmp_ptr+1; +} + +/*************************************************************************** + ***************************************************************************/ + +boolean_t +osm_log_is_active_ext( + IN const osm_log_t* const p_log, + IN const char* const p_filename, + IN const osm_log_level_t level ) +{ + osm_log_level_t tmp_lvl; + tmp_lvl = level & + osm_log_get_level_ext(p_log,__osm_log_get_base_name(p_filename)); + return ( tmp_lvl != 0 ); +} + +/*************************************************************************** + ***************************************************************************/ + +static int log_exit_count = 0; + +void +osm_log_ext( + IN osm_log_t* const p_log, + IN const osm_log_level_t verbosity, + IN const char *p_filename, + IN int line, + IN const char *p_str, ... ) +{ + char buffer[LOG_ENTRY_SIZE_MAX]; + va_list args; + int ret; + osm_log_level_t file_verbosity; - /* SYS messages go to the log anyways */ - if (p_log->level & verbosity) - { - - va_start( args, p_str ); - vsprintf( buffer, p_str, args ); - va_end(args); - - /* regular log to default out_port */ - cl_spinlock_acquire( &p_log->lock ); #ifdef WIN32 - GetLocalTime(&st); - _retry: - ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", - st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, - pid, buffer); - + SYSTEMTIME st; + uint32_t pid = GetCurrentThreadId(); #else - pid = pthread_self(); - tim = time(NULL); - _retry: - ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", - ((result.tm_mon < 12) && (result.tm_mon >= 0) ? - month_str[result.tm_mon] : "???"), - result.tm_mday, result.tm_hour, - result.tm_min, result.tm_sec, - usecs, pid, buffer); -#endif /* WIN32 */ + pid_t pid = 0; + time_t tim; + struct tm result; + uint64_t time_usecs; + uint32_t usecs; + + time_usecs = cl_get_time_stamp(); + tim = time_usecs/1000000; + usecs = time_usecs % 1000000; + localtime_r(&tim, &result); +#endif /* WIN32 */ + + /* + * Extract only the file name out of the full path + */ + p_filename = __osm_log_get_base_name(p_filename); + /* + * Get the verbosity level for this file. + * If the file is not specified in the log config file, + * the general verbosity level will be returned. + */ + file_verbosity = osm_log_get_level_ext(p_log, p_filename); + + if ( ! (verbosity & OSM_LOG_SYS) && + ! (file_verbosity & verbosity) ) + { + /* + * This is not a syslog message (which is always printed) + * and doesn't have the required verbosity level. 
+ */ + return; + } + + va_start( args, p_str ); + vsprintf( buffer, p_str, args ); + va_end(args); + + + if ( verbosity & OSM_LOG_SYS ) + { + /* this is a call to the syslog */ + cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); + /* SYSLOG should go to stdout too */ + if (p_log->out_port != stdout) + { + printf("%s", buffer); + fflush( stdout ); + } + } + /* SYSLOG also goes to to the log file */ + + cl_spinlock_acquire( &p_log->lock ); - if (ret >= 0) +#ifdef WIN32 + GetLocalTime(&st); +_retry: + if (p_log->src_info) + { + ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] [%s:%d] -> %s", + st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, + pid, p_filename, line, buffer); + } + else + { + ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", + st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, + pid, buffer); + } +#else + pid = pthread_self(); + tim = time(NULL); +_retry: + if (p_log->src_info) + { + ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] [%s:%d] -> %s", + ((result.tm_mon < 12) && (result.tm_mon >= 0) ? + month_str[result.tm_mon] : "???"), + result.tm_mday, result.tm_hour, + result.tm_min, result.tm_sec, + usecs, pid, p_filename, line, buffer); + } + else + { + ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", + ((result.tm_mon < 12) && (result.tm_mon >= 0) ? + month_str[result.tm_mon] : "???"), + result.tm_mday, result.tm_hour, + result.tm_min, result.tm_sec, + usecs, pid, buffer); + } +#endif + + if (ret >= 0) log_exit_count = 0; - else if (errno == ENOSPC && log_exit_count < 3) { + else if (errno == ENOSPC && log_exit_count < 3) { int fd = fileno(p_log->out_port); fprintf(stderr, "osm_log write failed: %s. Truncating log file.\n", strerror(errno)); @@ -180,38 +547,58 @@ osm_log( lseek(fd, 0, SEEK_SET); log_exit_count++; goto _retry; - } - - /* - Flush log on errors too. + } + + /* + * Flush log on errors and SYSLOGs too. */ - if( p_log->flush || (verbosity & OSM_LOG_ERROR) ) + if ( p_log->flush || + (verbosity & OSM_LOG_ERROR) || + (verbosity & OSM_LOG_SYS) ) fflush( p_log->out_port ); - - cl_spinlock_release( &p_log->lock ); - } + + cl_spinlock_release( &p_log->lock ); } +/*************************************************************************** + ***************************************************************************/ + void -osm_log_raw( - IN osm_log_t* const p_log, - IN const osm_log_level_t verbosity, - IN const char *p_buf ) +osm_log_raw_ext( + IN osm_log_t* const p_log, + IN const osm_log_level_t verbosity, + IN const char * p_filename, + IN const char *p_buf ) { - if( p_log->level & verbosity ) - { - cl_spinlock_acquire( &p_log->lock ); - printf( "%s", p_buf ); - cl_spinlock_release( &p_log->lock ); - - /* - Flush log on errors too. + osm_log_level_t file_verbosity; + /* + * Extract only the file name out of the full path + */ + p_filename = __osm_log_get_base_name(p_filename); + /* + * Get the verbosity level for this file. + * If the file is not specified in the log config file, + * the general verbosity level will be returned. */ - if( p_log->flush || (verbosity & OSM_LOG_ERROR) ) - fflush( stdout ); - } + file_verbosity = osm_log_get_level_ext(p_log, p_filename); + + if ( file_verbosity & verbosity ) + { + cl_spinlock_acquire( &p_log->lock ); + printf( "%s", p_buf ); + cl_spinlock_release( &p_log->lock ); + + /* + Flush log on errors too. 
+ */ + if ( p_log->flush || (verbosity & OSM_LOG_ERROR) ) + fflush( stdout ); + } } +/*************************************************************************** + ***************************************************************************/ + boolean_t osm_is_debug(void) { @@ -221,3 +608,7 @@ osm_is_debug(void) return FALSE; #endif /* defined( _DEBUG_ ) */ } + +/*************************************************************************** + ***************************************************************************/ + Index: opensm/main.c =================================================================== --- opensm/main.c (revision 9107) +++ opensm/main.c (working copy) @@ -296,6 +296,33 @@ show_usage(void) " -d3 - Disable multicast support\n" " -d10 - Put OpenSM in testability mode\n" " Without -d, no debug options are enabled\n\n" ); + printf( "-S\n" + "--log_source_info\n" + " This option tells SM to add source code filename\n" + " and line number to every log message.\n" + " By default, the SM will not log this additional info.\n\n"); + printf( "-b\n" + "--verbosity_file \n" + " This option specifies name of the verbosity\n" + " configuration file, which describes verbosity level\n" + " per source code file. The file may contain zero or\n" + " more lines of the following pattern:\n" + " filename verbosity_level\n" + " where 'filename' is the name of the source code file\n" + " that the 'verbosity_level' refers to, and the \n" + " 'verbosity_level' itself should be specified as a\n" + " number (decimal or hexadecimal).\n" + " Filename 'all' represents general verbosity level,\n" + " that is used for all the files that are not specified\n" + " in the verbosity file.\n" + " Note: The 'all' file verbosity level will override any\n" + " other general level that was specified by the command\n" + " line arguments.\n" + " By default, the SM will use the following file:\n" + " %s\n" + " Sending a SIGHUP signal to the SM will cause it to\n" + " re-read the verbosity configuration file.\n" + "\n\n", OSM_DEFAULT_VERBOSITY_FILE); printf( "-h\n" "--help\n" " Display this usage info then exit.\n\n" ); @@ -527,7 +554,7 @@ main( boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:R:U:P:NQvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:s:t:a:R:U:P:b:SNQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -564,7 +591,9 @@ main( { "add_guid_file", 1, NULL, 'a'}, { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, - { "honor_guid2lid", 0, NULL, 'x'}, + { "honor_guid2lid",0, NULL, 'x'}, + { "log_source_info",0,NULL, 'S'}, + { "verbosity_file",1, NULL, 'b'}, { NULL, 0, NULL, 0 } /* Required at the end of the array */ }; @@ -808,6 +837,16 @@ main( printf (" Honor guid2lid file, if possible\n"); break; + case 'S': + opt.src_info = TRUE; + printf(" Logging source code filename and line number\n"); + break; + + case 'b': + opt.verbosity_file = optarg; + printf(" Verbosity Configuration File: %s\n", optarg); + break; + case 'h': case '?': case ':': @@ -920,9 +959,13 @@ main( if (osm_hup_flag) { osm_hup_flag = 0; - /* a HUP signal should only start a new heavy sweep */ + /* + * A HUP signal should cause OSM to re-read the log + * configuration file and start a new heavy sweep + */ osm.subn.force_immediate_heavy_sweep = TRUE; osm_opensm_sweep( &osm ); + osm_log_read_verbosity_file(&osm.log,opt.verbosity_file); } } } Index: opensm/Makefile.am 
=================================================================== --- opensm/Makefile.am (revision 9107) +++ opensm/Makefile.am (working copy) @@ -43,7 +43,7 @@ else libopensm_version_script = endif -libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c +libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c st.c libopensm_la_LDFLAGS = -version-info $(opensm_api_version) \ -export-dynamic $(libopensm_version_script) libopensm_la_DEPENDENCIES = $(srcdir)/libopensm.map @@ -90,7 +90,7 @@ opensm_SOURCES = main.c osm_console.c os osm_trap_rcv.c osm_trap_rcv_ctrl.c \ osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ - osm_vl_arb_rcv_ctrl.c st.c + osm_vl_arb_rcv_ctrl.c if OSMV_OPENIB opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 Index: doc/verbosity-config.txt =================================================================== --- doc/verbosity-config.txt (revision 0) +++ doc/verbosity-config.txt (revision 0) @@ -0,0 +1,43 @@ + +This patch adds new verbosity functionality. + +1. Verbosity configuration file +------------------------------- + +The user is able to set verbosity level per source code file +by supplying verbosity configuration file using the following +command line arguments: + + -b filename + --verbosity_file filename + +By default, the OSM will use the following file: /etc/osm-log.conf +Verbosity configuration file should contain zero or more lines of +the following pattern: + + filename verbosity_level + +where 'filename' is the name of the source code file that the +'verbosity_level' refers to, and the 'verbosity_level' itself +should be specified as an integer number (decimal or hexadecimal). + +One reserved filename is 'all' - it represents general verbosity +level, that is used for all the files that are not specified in +the verbosity configuration file. +If 'all' is not specified, the verbosity level set in the +command line will be used instead. +Note: The 'all' file verbosity level will override any other +general level that was specified by the command line arguments. + +Sending a SIGHUP signal to the OSM will cause it to reload +the verbosity configuration file. + + +2. Logging source code filename and line number +----------------------------------------------- + +If command line option -S or --log_source_info is specified, +OSM will add source code filename and line number to every +log message that is written to the log file. +By default, the OSM will not log this additional info. + Index: man/opensm.8 =================================================================== --- man/opensm.8 (revision 9107) +++ man/opensm.8 (working copy) @@ -5,7 +5,7 @@ opensm \- InfiniBand subnet manager and .SH SYNOPSIS .B opensm -[\-c(ache_options)] [\-g(uid)[=]] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-routing_engine ] [\-U | \-ucast_file ] [\-a(dd_guid_file) ] [\-o(nce)] [\-s(weep) ] [\-t(imeout) ] [\-maxsmps ] [\-console] [\-i(gnore-guids) ] [\-f | \-\-log_file] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-no_qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-v(erbose)] [\-V] [\-D ] [\-d(ebug) ] [\-h(elp)] [\-?] 
+[\-c(ache_options)] [\-g(uid)[=]] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-routing_engine ] [\-U | \-ucast_file ] [\-a(dd_guid_file) ] [\-o(nce)] [\-s(weep) ] [\-t(imeout) ] [\-maxsmps ] [\-console] [\-i(gnore-guids) ] [\-f | \-\-log_file] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-no_qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-v(erbose)] [\-V] [\-D ] [\-d(ebug) ] [\-S | \-log_source_info] [\-b | \-verbosity_file ] [\-h(elp)] [\-?] .SH DESCRIPTION .PP @@ -208,6 +208,33 @@ option to enable as follows: -d2 - Force log flushing after each log message -d3 - Disable multicast support .TP +\fB\-S\fR, \fB\-\-log_source_info\fR +This option tells SM to add source code filename +and line number to every log message. +By default, the SM will not log this additional info. +.TP +\fB\-b\fR, \fB\-\-verbosity_file\fR +This option specifies name of the verbosity +configuration file, which describes verbosity level +per source code file. +By default, the SM will use /etc/osm-log.conf. +The file may contain zero or more lines of the +following pattern: + + filename verbosity_level + +where 'filename' is the name of the source code file +that the 'verbosity_level' refers to, and the 'verbosity_level' +itself should be specified as a number (decimal or hexadecimal). +Filename 'all' represents general verbosity level, +that is used for all the files that are not specified +in the verbosity file. +Note: The 'all' file verbosity level will override any +other general level that was specified by the command +line arguments. +Sending a SIGHUP signal to the SM will cause it to +re-read the verbosity configuration file. +.TP \fB\-h\fR, \fB\-\-help\fR Display this usage info then exit. .TP From dledford at redhat.com Sun Aug 27 08:09:37 2006 From: dledford at redhat.com (Doug Ledford) Date: Sun, 27 Aug 2006 11:09:37 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <44F1ACB7.6050702@mellanox.co.il> References: <44EA004F.2060608@mellanox.co.il> <1156526995.12257.107.camel@fc6.xsintricity.com> <44F1ACB7.6050702@mellanox.co.il> Message-ID: <1156691377.15782.23.camel@fc6.xsintricity.com> On Sun, 2006-08-27 at 17:31 +0300, Tziporet Koren wrote: > > Doug Ledford wrote: > > > > > Not supporting ppc is a problem to a certain extent. I can't speak for > > SuSE, but at least for Red Hat, ppc is the default and over rides ppc64. > > The ppc64 arch is less efficient than the ppc arch on ppc64 processors > > except when large memory footprints are involved. So, for things like > > opensm, ibv_*, etc. the ppc arch should actually be preferred, and the > > ppc64 arch libs should be present for those end user apps that need > > large memory access. That fact that dapl doesn't compile on ppc at all > > is problematic as well. In addition, what are you guys doing about the > > lack of asm/atomic.h (breaking udapl compiles on ppc64 and ia64) going > > forward? I'd look in the packages and see for myself but the svn update > > is taking forever due to those binary rpms packed into svn...ahh, it's > > finally done....ok, still broken. > > > > We don't have here any PPC machine for event for compilation checks :-( Hmmm..probably should talk to Roland about getting sponsored as a Fedora Extras developer, then you could kick off builds through the Fedora build system, which allows you to compile on all the arches. 
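[Illustration tied to the OpenSM verbosity patch earlier in this digest, not part of any posted message: a hypothetical /etc/osm-log.conf in the 'filename verbosity_level' format that the -b/--verbosity_file documentation above describes. The source file names and hex levels are made up; 'all' sets the general level and per-file entries override it, matched against the basename of __FILE__.]

	# general verbosity for every file not listed below
	all              0x03
	# extra debug output for just these source files
	osm_ucast_mgr.c  0xff
	osm_state_mgr.c  0x07

A matching invocation that also turns on file:line prefixes via the -S option from the same patch might be:

	opensm -S -b /etc/osm-log.conf

Sending SIGHUP to that opensm process would make it re-read the file, as the patch's main loop does.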
> > Without getting into an argument over the usage of that include, suffice > > it to say that the include file is gone and builds fails on > > fc6/rhel5beta. Since the code really only uses low level intrinsics as > > opposed to high level atomic ops, I made a ppc and ia64 intrinsics > > header for linux and added it to the dapl package itself to work around > > the issue. > > > > Please work with James and Arlin to drive these changes to uDAPL. Sure. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Sun Aug 27 08:10:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 27 Aug 2006 18:10:05 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <1156690268.15782.18.camel@fc6.xsintricity.com> References: <1156690268.15782.18.camel@fc6.xsintricity.com> Message-ID: <20060827151005.GG25554@mellanox.co.il> Quoting r. Doug Ledford : > Subject: Re: [openfabrics-ewg] OFED 1.1-rc2 is ready > > On Sat, 2006-08-26 at 22:42 +0300, Michael S. Tsirkin wrote: > > Quoting r. Doug Ledford : > > > IOW, make use of the infrastructure > > > provided in U4 instead of working around it. > > > > Sorry, I don't really understand what you suggest here. > > Could you give us an example please? > > Sure. The stock 2.6.9 kernel doesn't include kzalloc() or kstrdup(). > However, both of those exist in the Red Hat RHEL4 U3 and later kernels > (yeah, I know, there isn't an easy way to know this, but unless you just > *need* to support a U2 or earlier kernel, then you can just assume Red > Hat has this). Likewise, the U4 and later kernels have the proper class > functions in the core kernel so you don't need small backports of stuff > like class_create in the uverbs_main patch. Ditto for the get_sb_psuedo > in the core patch. Ditto for a number of things in the ipath backports. > Ditto for quite a few other things as well. In general, there's less > need to backport under RHEL4 U4 than previously, and where possible it > would be best to make use of the core kernel's enhancements relative to > a stock 2.6.9. Oh. Yes, generally we do try to use these and make the patches minimal. I am Cc-ing Jack Morgenstein who did the lion's share of work implementing the backport. I guess what you are saying is that we had missed quite a bit and they could still be made smaller - in this case, would you care to post improved patches? Of course, it is unfortunate that these proposals for sweeping changes come at such a late stage in the game - we are close to code freeze. But let's see the code and we'll be able to decide whether this change seems safe. -- MST From bos at pathscale.com Sun Aug 27 09:50:20 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Sun, 27 Aug 2006 09:50:20 -0700 Subject: [openib-general] [PATCH 0 of 23] IB/ipath - updates for 2.6.19 In-Reply-To: <20060826192413.GE21168@mellanox.co.il> References: <20060826192413.GE21168@mellanox.co.il> Message-ID: <1156697420.11913.22.camel@pelerin.serpentine.com> On Sat, 2006-08-26 at 22:24 +0300, Michael S. Tsirkin wrote: > This looks much better than the monolitic patch that sits in the ofed fixes > directory. Does this cover all the fixes there, and if so can > it be replaced with this? 
I'll check on Monday. I have a split-up series of backport patches, too.
From mst at mellanox.co.il Sun Aug 27 10:59:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 27 Aug 2006 20:59:43 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <1156690268.15782.18.camel@fc6.xsintricity.com> References: <1156690268.15782.18.camel@fc6.xsintricity.com> Message-ID: <20060827175943.GA27248@mellanox.co.il> Quoting r. Doug Ledford : > Subject: Re: [openfabrics-ewg] OFED 1.1-rc2 is ready > > On Sat, 2006-08-26 at 22:42 +0300, Michael S. Tsirkin wrote: > > Quoting r. Doug Ledford : > > > IOW, make use of the infrastructure > > > provided in U4 instead of working around it. > > > > Sorry, I don't really understand what you suggest here. > > Could you give us an example please? > > Sure. I get the idea now, but that's exactly what we tried to do. I went over the patches, and here's what I found: > The stock 2.6.9 kernel doesn't include kzalloc() or kstrdup(). > However, both of those exist in the Red Hat RHEL4 U3 and later kernels > (yeah, I know, there isn't an easy way to know this, but unless you just > *need* to support a U2 or earlier kernel, then you can just assume Red > Hat has this). I just checked and we don't override kzalloc except in the ipath driver - see below about that. Are there other specific examples? > Likewise, the U4 and later kernels have the proper class > functions in the core kernel so you don't need small backports of stuff > like class_create in the uverbs_main patch. Again, U4 patch for uverbs main is smaller than for U3, we removed all the class_device_create re-implementation. Can it be made even smaller? How? > Ditto for the get_sb_psuedo > in the core patch. Ditto, get_sb_pseudo is not re-implemented in our U4 patch. So what's the issue? > Ditto for a number of things in the ipath backports. Please address this to ipath developers - they seem to prefer keeping a single backport patch for multiple OS-es. > Ditto for quite a few other things as well. list? > In general, there's less > need to backport under RHEL4 U4 than previously, and where possible it > would be best to make use of the core kernel's enhancements relative to > a stock 2.6.9. In general, ipath guys insist on keeping the same patch across distributions. For other parts of the kernel that is exactly what we were trying to do. Can patches be made even smaller? Please go ahead and suggest how. -- MST
From dledford at redhat.com Sun Aug 27 11:23:37 2006 From: dledford at redhat.com (Doug Ledford) Date: Sun, 27 Aug 2006 14:23:37 -0400 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <20060827175943.GA27248@mellanox.co.il> References: <1156690268.15782.18.camel@fc6.xsintricity.com> <20060827175943.GA27248@mellanox.co.il> Message-ID: <1156703017.15782.82.camel@fc6.xsintricity.com> On Sun, 2006-08-27 at 20:59 +0300, Michael S. Tsirkin wrote: > I get the idea now, but that's exactly what we tried to do. > I went over the patches, and here's what I found: Great. I don't have the U4 patches you guys have made, so I didn't see that you had taken those things out (I've been going on the patches in the OFED-1.1rc2 package, and it doesn't have the U4 stuff yet). So, if you took care of it in your later stuff, I'm perfectly happy with that. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Sun Aug 27 11:39:09 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 27 Aug 2006 21:39:09 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc2 is ready In-Reply-To: <1156703017.15782.82.camel@fc6.xsintricity.com> References: <1156703017.15782.82.camel@fc6.xsintricity.com> Message-ID: <20060827183909.GC27248@mellanox.co.il> Quoting r. Doug Ledford : > Subject: Re: [openfabrics-ewg] OFED 1.1-rc2 is ready > > On Sun, 2006-08-27 at 20:59 +0300, Michael S. Tsirkin wrote: > > > I get the idea now, but that's exactly what we tried to do. > > I went over the patches, and here's what I found: > > Great. I don't have the u4 patches you guys have made, so I didn't see > that you had taken those things out (I've been going on the patches in > the OFED-1.1rc2 package, and it doesn't have the U4 stuff yet). So, if > you took care of it in your later stuff, I'm perfectly happy with that. Oh. You can usually get updated stuff from git://www.mellanox.co.il/~git/infiniband ofed_addons -- MST
From rdreier at cisco.com Sun Aug 27 15:25:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 27 Aug 2006 15:25:58 -0700 Subject: [openib-general] bug in libmthca/src/verbs.c? References: <20060826054029.QHAL1364.rrcs-fep-10.hrndva.rr.com@BOBP> Message-ID: Robert> The command size passed to ibv_cmd_create_cq is the size Robert> of the mthca command wrapper which is larger than what is Robert> most likely expected.
No, that is the whole point of passing the size into that function: it lets low-level drivers pass extra device-specific data in. - R. From rdreier at cisco.com Sun Aug 27 15:25:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 27 Aug 2006 15:25:56 -0700 Subject: [openib-general] [PATCH 22 of 23] IB/ipath - print warning if LID not acquired within one minute References: <44EF6053.4010006@pathscale.com> <20060826193126.GF21168@mellanox.co.il> Message-ID: Michael> Looks like your devices are all single-port. With a multi Michael> port device it is quite common to have one port down. My reading of the patch is that it warns if the link is up physically but does not come up logically. Which would still be reasonable for a multi-port device. But I am still wondering about when this is really useful. - R. From dledford at redhat.com Sun Aug 27 15:28:06 2006 From: dledford at redhat.com (Doug Ledford) Date: Sun, 27 Aug 2006 18:28:06 -0400 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <20060820171808.GZ18411@sashak.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> Message-ID: <1156717686.15782.93.camel@fc6.xsintricity.com> On Sun, 2006-08-20 at 20:18 +0300, Sasha Khapyorsky wrote: > On 13:01 Sun 20 Aug , Hal Rosenstock wrote: > > Hi Sasha, > > > > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote: > > > In case when OpenSM log file overflows filesystem and write() fails with > > > 'No space left on device' try to truncate the log file and wrap-around > > > logging. > > > > Should it be an (admin) option as to whether to truncate the file or not > > or is there no way to continue without logging (other than this) once > > the log file fills the disk ? > > In theory OpenSM may continue, but don't think it is good idea to leave > overflowed disk on the SM machine (by default it is '/var/log'). For me > truncating there looks as reasonable default behavior, don't think we > need the option. I would definitely put the option in, and in fact would default it to *NOT* truncate. If the disk is full, you have no idea why. It *might* be your logs, or it might be a mail bomb filling /var/spool/mail. I'm sure as an admin the last thing I would want is my apps deciding, based upon incomplete information, that wiping out their log files is the right thing to do. To me that sounds more like an intruder covering his tracks than a reasonable thing to do when confronted with ENOSPC. Truncating logs is something best left up to the admin that's dealing with the disk full problem in the first place. After all, if it is something like an errant app filling the mail spool, truncating the logs just looses valuable logs while at the same time making room for the app to keep on adding more to /var/spool/mail. That's just wrong. If you run out of space, just quit logging things until the admin clears the problem up. If you put this code in, make the admin turn it on. That will keep opensm friendly to appliance like devices that are single task subnet managers. But I don't think having this patch always on makes any sense on a multi task server. > > > > See comment below as well. 
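A minimal sketch of the opt-in behaviour Doug is asking for, wrapped around the ENOSPC retry from the patch quoted below. The option name (truncate_log_on_enospc) and the log_write() helper are hypothetical, not existing OpenSM code; the ftruncate()/lseek() wrap-around is the part taken from Sasha's patch.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

struct log_opts {
	int truncate_log_on_enospc;	/* off by default, per Doug's argument */
};

static int log_write(FILE *out, const struct log_opts *opt, const char *msg)
{
	int retries = 0;
	int ret;

retry:
	ret = fprintf(out, "%s", msg);
	if (ret >= 0 || errno != ENOSPC ||
	    !opt->truncate_log_on_enospc || retries++ >= 3)
		return ret;	/* stop logging quietly; let the admin clean up */

	/* Admin explicitly opted in: wrap the log around instead of stopping. */
	fprintf(stderr, "log write failed: ENOSPC, truncating log file\n");
	if (ftruncate(fileno(out), 0) < 0)
		;	/* keep going; we still try to wrap via lseek */
	lseek(fileno(out), 0, SEEK_SET);
	clearerr(out);
	goto retry;
}

Under such an option the default behaviour stays "quit logging until the admin clears the problem up", which is what the rest of this message argues for.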
> > > > -- Hal > > > > > Signed-off-by: Sasha Khapyorsky > > > --- > > > > > > osm/opensm/osm_log.c | 23 +++++++++++++++-------- > > > 1 files changed, 15 insertions(+), 8 deletions(-) > > > > > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c > > > index 668e9a6..b4700c8 100644 > > > --- a/osm/opensm/osm_log.c > > > +++ b/osm/opensm/osm_log.c > > > @@ -58,6 +58,7 @@ #include > > > #include > > > #include > > > #include > > > +#include > > > > > > #ifndef WIN32 > > > #include > > > @@ -152,6 +153,7 @@ #endif > > > cl_spinlock_acquire( &p_log->lock ); > > > #ifdef WIN32 > > > GetLocalTime(&st); > > > + _retry: > > > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", > > > st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, > > > pid, buffer); > > > @@ -159,6 +161,7 @@ #ifdef WIN32 > > > #else > > > pid = pthread_self(); > > > tim = time(NULL); > > > + _retry: > > > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", > > > ((result.tm_mon < 12) && (result.tm_mon >= 0) ? > > > month_str[result.tm_mon] : "???"), > > > @@ -166,6 +169,18 @@ #else > > > result.tm_min, result.tm_sec, > > > usecs, pid, buffer); > > > #endif /* WIN32 */ > > > + > > > + if (ret >= 0) > > > + log_exit_count = 0; > > > + else if (errno == ENOSPC && log_exit_count < 3) { > > > + int fd = fileno(p_log->out_port); > > > + fprintf(stderr, "log write failed: %s. Will truncate the log file.\n", > > > + strerror(errno)); > > > + ftruncate(fd, 0); > > > > Should return from ftruncate be checked here ? > > May be checked, but I don't think that potential ftruncate() failure > should change the flow - in case of failure we will try to continue > with lseek() anyway (in order to wrap around the file at least). > > Sasha > > > > > > + lseek(fd, 0, SEEK_SET); > > > + log_exit_count++; > > > + goto _retry; > > > + } > > > > > > /* > > > Flush log on errors too. > > > @@ -174,14 +189,6 @@ #endif /* WIN32 */ > > > fflush( p_log->out_port ); > > > > > > cl_spinlock_release( &p_log->lock ); > > > - > > > - if (ret < 0) > > > - { > > > - if (log_exit_count++ < 10) > > > - { > > > - fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n"); > > > - } > > > - } > > > } > > > } > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Sun Aug 27 15:30:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 27 Aug 2006 15:30:56 -0700 Subject: [openib-general] basic IB doubt References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> Message-ID: glebn> So, before touching the data that was RDMAed into the glebn> buffer application should cache invalidate the buffer, is glebn> this even possible from user space? (Not on x86, but it glebn> isn't needed there.) 
Yes, on any architecture that is not cache-coherent with PCI DMA, some cache invalidation/flushing will be necessary. And this probably won't be possible from userspace if the cache is physically tagged. (Are there any such architectures in real use, ie non-coherent with PCI and physically tagged cache?) - R. From rjwalsh at pathscale.com Sun Aug 27 18:41:08 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sun, 27 Aug 2006 18:41:08 -0700 Subject: [openib-general] [PATCH 22 of 23] IB/ipath - print warning if LID not acquired within one minute In-Reply-To: References: <44EF6053.4010006@pathscale.com> <20060826193126.GF21168@mellanox.co.il> Message-ID: <44F249B4.501@pathscale.com> Roland Dreier wrote: > Michael> Looks like your devices are all single-port. With a multi > Michael> port device it is quite common to have one port down. > > My reading of the patch is that it warns if the link is up physically > but does not come up logically. Which would still be reasonable for a > multi-port device. > > But I am still wondering about when this is really useful. Well, either you think it is or it isn't. We like it: it's easier than pointing customers at something in /sys. Regards, Robert. From jgunthorpe at obsidianresearch.com Sun Aug 27 23:18:49 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 28 Aug 2006 00:18:49 -0600 Subject: [openib-general] basic IB doubt In-Reply-To: References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> Message-ID: <20060828061849.GC13774@obsidianresearch.com> On Sun, Aug 27, 2006 at 03:30:56PM -0700, Roland Dreier wrote: > glebn> So, before touching the data that was RDMAed into the > glebn> buffer application should cache invalidate the buffer, is > glebn> this even possible from user space? (Not on x86, but it > glebn> isn't needed there.) > Yes, on any architecture that is not cache-coherent with PCI DMA, some > cache invalidation/flushing will be necessary. And this probably > won't be possible from userspace if the cache is physically tagged. > (Are there any such architectures in real use, ie non-coherent with > PCI and physically tagged cache?) It depends on the arch if it is a problem or not.. Ie PPC Book-E has 'dcba' which is available from user space. It operates on virtual addresses and is a flush and invalidate combined. So it is safe, but less effecient than the pure invalidate that the kernel has access to. So long as cache ops that works on virtual addresses are present it should be fine from userspace, but in some cases the necessary sequence of cache ops can be quite elaborate and hardware dependent, so a syscall, or at least a vdso function would be needed to support eveything. In my experience most real architectures that have this problem these days are embedded targetted lower performance processors. If you are in the embedded space and using IB hardware then presumably you care about performance and will avoid such things. (Although long ago, this wasn't a choice and I actually have built an embedded IB capable system with non-coherent PCI.. It is a big pain, I don't recommend it.) 
Jason From jgunthorpe at obsidianresearch.com Sun Aug 27 23:23:35 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 28 Aug 2006 00:23:35 -0600 Subject: [openib-general] [PATCH 22 of 23] IB/ipath - print warning if LID not acquired within one minute In-Reply-To: <44F249B4.501@pathscale.com> References: <44EF6053.4010006@pathscale.com> <20060826193126.GF21168@mellanox.co.il> <44F249B4.501@pathscale.com> Message-ID: <20060828062335.GD13774@obsidianresearch.com> On Sun, Aug 27, 2006 at 06:41:08PM -0700, Robert Walsh wrote: > Roland Dreier wrote: > > Michael> Looks like your devices are all single-port. With a multi > > Michael> port device it is quite common to have one port down. > > > > My reading of the patch is that it warns if the link is up physically > > but does not come up logically. Which would still be reasonable for a > > multi-port device. > > > > But I am still wondering about when this is really useful. > > Well, either you think it is or it isn't. We like it: it's easier than > pointing customers at something in /sys. Why not copy what most ethernet drivers do and just log a message on link negotiation state changes? ie like the: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1 Maybe something like: ib0: link up, 10Gbps, port active, lid 0x10/16 Jason From rjwalsh at pathscale.com Mon Aug 28 00:31:33 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 28 Aug 2006 00:31:33 -0700 Subject: [openib-general] [PATCH 22 of 23] IB/ipath - print warning if LID not acquired within one minute In-Reply-To: <20060828062335.GD13774@obsidianresearch.com> References: <44EF6053.4010006@pathscale.com> <20060826193126.GF21168@mellanox.co.il> <44F249B4.501@pathscale.com> <20060828062335.GD13774@obsidianresearch.com> Message-ID: <44F29BD5.30304@pathscale.com> J> Why not copy what most ethernet drivers do and just log a message on > link negotiation state changes? ie like the: > > eth0: link up, 100Mbps, full-duplex, lpa 0x45E1 > > Maybe something like: > > ib0: link up, 10Gbps, port active, lid 0x10/16 That's a neat idea. From glebn at voltaire.com Mon Aug 28 00:58:05 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Mon, 28 Aug 2006 10:58:05 +0300 Subject: [openib-general] basic IB doubt In-Reply-To: References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> Message-ID: <20060828075805.GO26450@minantech.com> On Sun, Aug 27, 2006 at 03:30:56PM -0700, Roland Dreier wrote: > glebn> So, before touching the data that was RDMAed into the > glebn> buffer application should cache invalidate the buffer, is > glebn> this even possible from user space? (Not on x86, but it > glebn> isn't needed there.) > > Yes, on any architecture that is not cache-coherent with PCI DMA, some > cache invalidation/flushing will be necessary. And this probably > won't be possible from userspace if the cache is physically tagged. > (Are there any such architectures in real use, ie non-coherent with > PCI and physically tagged cache?) > AFAIR at least PPC603 with bus snooping disabled is such architecture. But the point is in order to write really portable code you can't just assume cache coherency, so data receive order is not the only assumption we are making currently. -- Gleb. 
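On the kernel side, the rule Tom stated earlier in this thread (sync for_device before posting, sync for_cpu before reading) maps onto the standard DMA API. A minimal sketch, assuming a hypothetical buffer that was mapped with dma_map_single(..., DMA_BIDIRECTIONAL); it is not taken from any driver discussed here, and the thread's point is precisely that no equivalent call exists for userspace memory regions.

#include <linux/dma-mapping.h>

/*
 * Sketch only: dev/buf/len/dma_addr are hypothetical and assumed to come
 * from an earlier dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL).
 */
static void rdma_buffer_roundtrip(struct device *dev, void *buf,
				  size_t len, dma_addr_t dma_addr)
{
	/* CPU has filled buf: hand ownership to the device (flush). */
	dma_sync_single_for_device(dev, dma_addr, len, DMA_BIDIRECTIONAL);

	/* ... post the work request and wait for a completion that
	 * indicates the remote data has been DMAed into buf ... */

	/* Take ownership back before the CPU reads (invalidate). */
	dma_sync_single_for_cpu(dev, dma_addr, len, DMA_BIDIRECTIONAL);

	/* Only now is it safe to dereference buf on a non-coherent platform. */
}

On cache-coherent x86 these calls are essentially no-ops, which is why the problem only shows up on the architectures being discussed here.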
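Jason's log-on-state-change suggestion earlier in this thread would amount to a few lines in a port event path. A rough sketch, assuming a hypothetical handler that is already given the netdev, rate and LID; only the IB_EVENT_* values come from ib_verbs.h, everything else is made up for illustration.

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <rdma/ib_verbs.h>

/*
 * Sketch only: report_link_state() and its rate_gbps/lid/lmc arguments are
 * hypothetical; the message format follows Jason's
 * "ib0: link up, 10Gbps, port active, lid 0x10/16" example.
 */
static void report_link_state(struct net_device *dev, struct ib_event *event,
			      int rate_gbps, u16 lid, u8 lmc)
{
	switch (event->event) {
	case IB_EVENT_PORT_ACTIVE:
		printk(KERN_INFO "%s: link up, %dGbps, port active, lid 0x%x/%d\n",
		       dev->name, rate_gbps, lid, 1 << lmc);
		break;
	case IB_EVENT_PORT_ERR:
		printk(KERN_INFO "%s: link down\n", dev->name);
		break;
	default:
		break;
	}
}

The "/16" in Jason's example is read here as the number of LIDs implied by the LMC; that is a guess at his intent.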
From glebn at voltaire.com Mon Aug 28 01:11:08 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Mon, 28 Aug 2006 11:11:08 +0300 Subject: [openib-general] basic IB doubt In-Reply-To: <20060828061849.GC13774@obsidianresearch.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> <20060828061849.GC13774@obsidianresearch.com> Message-ID: <20060828081108.GP26450@minantech.com> On Mon, Aug 28, 2006 at 12:18:49AM -0600, Jason Gunthorpe wrote: > On Sun, Aug 27, 2006 at 03:30:56PM -0700, Roland Dreier wrote: > > glebn> So, before touching the data that was RDMAed into the > > glebn> buffer application should cache invalidate the buffer, is > > glebn> this even possible from user space? (Not on x86, but it > > glebn> isn't needed there.) > > > Yes, on any architecture that is not cache-coherent with PCI DMA, some > > cache invalidation/flushing will be necessary. And this probably > > won't be possible from userspace if the cache is physically tagged. > > (Are there any such architectures in real use, ie non-coherent with > > PCI and physically tagged cache?) > > It depends on the arch if it is a problem or not.. Ie PPC Book-E > has 'dcba' which is available from user space. It operates on virtual > addresses and is a flush and invalidate combined. So it is safe, > but less effecient than the pure invalidate that the kernel has access > to. > This is from PPC instruction book: The dcba instruction executes as follows: If the cache block containing the byte addressed by EA is in the data cache, the contents of all bytes are made undefined but the cache block is still considered valid. Note that programming errors can occur if the data in this cache block is subsequently read or used inadvertently. If the cache block containing the byte addressed by EA is not in the data cache and the corresponding memory page or block is caching-allowed, the cache block is allocated (and made valid) in the data cache without fetching the block from main memory, and the value of all bytes is undefined. This doesn't look like this instruction is doing flush or invalidate. It makes cache line present without accessing underlying memory. AFAIR uboot uses this + cache locking to create C stack before SDRAM is initialised. > So long as cache ops that works on virtual addresses are present > it should be fine from userspace, but in some cases the necessary > sequence of cache ops can be quite elaborate and hardware dependent, > so a syscall, or at least a vdso function would be needed to support > eveything. And then you'll need to do syscall for every IB verb. > > In my experience most real architectures that have this problem these > days are embedded targetted lower performance processors. If you are > in the embedded space and using IB hardware then presumably you care > about performance and will avoid such things. (Although long ago, this > wasn't a choice and I actually have built an embedded IB capable > system with non-coherent PCI.. It is a big pain, I don't recommend it.) > Agreed. -- Gleb. From kliteyn at mellanox.co.il Mon Aug 28 01:11:53 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 28 Aug 2006 11:11:53 +0300 Subject: [openib-general] [PATCH] osm: TRIVIAL code cleanup Message-ID: Hi Hal. 
I noticed that there are some unused defaults: OSM_DEFAULT_MGRP_MTU and OSM_DEFAULT_MGRP_RATE. The corresponding values in the code are hadcoded. Fixed the code to use these defaults, and updated the OSM_DEFAULT_MGRP_MTU to the value that was hardcoded. Yevgeny Signed-off-by: Yevgeny Kliteynik Index: include/opensm/osm_sa_mcmember_record.h =================================================================== --- include/opensm/osm_sa_mcmember_record.h (revision 9107) +++ include/opensm/osm_sa_mcmember_record.h (working copy) @@ -374,12 +374,12 @@ osm_mcmr_rcv_find_or_create_new_mgrp( * OSM_DEFAULT_MGRP_MTU * * DESCRIPTION -* Default MTU used for new MGRP creation (256 bytes) +* Default MTU used for new MGRP creation (2048 bytes) * Note it includes the MTUSelector which is set to "Greater Than" * * SYNOPSIS */ -#define OSM_DEFAULT_MGRP_MTU 0x01 +#define OSM_DEFAULT_MGRP_MTU 0x04 /***********/ /****d* OpenSM: MC Member Record Receiver/OSM_DEFAULT_MGRP_RATE Index: opensm/osm_prtn.c =================================================================== --- opensm/osm_prtn.c (revision 9107) +++ opensm/osm_prtn.c (working copy) @@ -216,10 +216,10 @@ ib_api_status_t osm_prtn_add_mcgroup(osm memcpy(&mc_rec.mgid.raw[4], &pkey, sizeof(pkey)); mc_rec.qkey = CL_HTON32(0x0b1b); - mc_rec.mtu = (mtu ? mtu : 4) | (2 << 6); /* 2048 Bytes */ + mc_rec.mtu = (mtu ? mtu : OSM_DEFAULT_MGRP_MTU) | (2 << 6); mc_rec.tclass = 0; mc_rec.pkey = pkey; - mc_rec.rate = (rate ? rate : 0x3) | (2 << 6); /* 10Gb/sec */ + mc_rec.rate = (rate ? rate : OSM_DEFAULT_MGRP_RATE) | (2 << 6); mc_rec.pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; mc_rec.sl_flow_hop = ib_member_set_sl_flow_hop(p->sl, 0, 0); /* Note: scope needs to be consistent with MGID */ From mst at mellanox.co.il Mon Aug 28 03:57:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 13:57:55 +0300 Subject: [openib-general] CMA oops Message-ID: <20060828105755.GB23639@mellanox.co.il> I've observed the following oops with CMA Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [] :rdma_cm:cma_detach_from_dev+0x1a/0x58 PGD 135abd067 PUD 133ed3067 PMD 0 Oops: 0002 [1] SMP CPU 1 Modules linked in: ib_sdp rdma_cm ib_addr i2c_dev i2c_core ib_ipoib ib_mthca ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core Pid: 6389, comm: sdp Not tainted 2.6.18-rc2-devel #7 RIP: 0010:[] [] :rdma_cm:cma_detach_from_dev+0x1a/0x58 RSP: 0018:ffff8101351cbdf8 EFLAGS: 00010246 RAX: ffff810134fd3ef0 RBX: ffff810137202200 RCX: ffff8101372022f0 RDX: 0000000000000000 RSI: ffff81013acdf510 RDI: ffff810137202200 RBP: ffff810137202200 R08: ffff8101351ca000 R09: ffff810137d2cc80 R10: 0000000000000068 R11: ffff810138604540 R12: ffff810138604e40 R13: 0000000000000293 R14: ffff810135098800 R15: ffffffff8808910e FS: 0000000000000000(0000) GS:ffff81013b876cc0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000013520f000 CR4: 00000000000006e0 Process sdp (pid: 6389, threadinfo ffff8101351ca000, task ffff81013acdf510) Stack: ffff810137202200 ffffffff88081717 ffff810138604e88 ffff810135098800 ffff810137202200 ffffffff880884bb ffff8101340fd800 ffff810135098800 ffffffff88092500 ffffffff80495579 ffff8101340fd800 ffff810135098bc8 Call Trace: [] :rdma_cm:rdma_destroy_id+0x5f/0x107 Apparently, list->prev pointer in CMA id_priv structure is NULL which causes a crash in list_del. I note that rdma_destroy_id tests outside the mutex lock. Could that be the problem? The problem is not unfortunately easily reproducible. 
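One way to end up with a bad prev pointer in list_del() is exactly the pattern pointed at above: a state test done outside the lock lets two paths both decide to unlink the same entry, and the second unlink then dereferences pointers the first one already rewrote. A toy sketch of that shape, with hypothetical structures rather than the actual rdma_cm code:

#include <linux/list.h>
#include <linux/mutex.h>

struct conn {
	struct list_head node;
	int attached;			/* still on the device list? */
};

static DEFINE_MUTEX(conn_mutex);

/* Racy: two callers can both see attached != 0 here ... */
static void detach_conn_racy(struct conn *c)
{
	if (c->attached) {		/* checked outside the lock */
		mutex_lock(&conn_mutex);
		list_del(&c->node);	/* second caller hits NULL/poisoned pointers */
		c->attached = 0;
		mutex_unlock(&conn_mutex);
	}
}

/* Safe: check and unlink under the same lock. */
static void detach_conn_safe(struct conn *c)
{
	mutex_lock(&conn_mutex);
	if (c->attached) {
		list_del(&c->node);
		c->attached = 0;
	}
	mutex_unlock(&conn_mutex);
}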
-- MST From Thomas.Talpey at netapp.com Mon Aug 28 05:24:26 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 28 Aug 2006 08:24:26 -0400 Subject: [openib-general] basic IB doubt In-Reply-To: <20060826073957.GA1369@minantech.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> Message-ID: <7.0.1.0.2.20060828081032.04133618@netapp.com> At 03:39 AM 8/26/2006, Gleb Natapov wrote: >On Fri, Aug 25, 2006 at 03:53:12PM -0400, Talpey, Thomas wrote: >> Flush (sync for_device) before posting. >> Invalidate (sync for_cpu) before processing. >> >So, before touching the data that was RDMAed into the buffer application >should cache invalidate the buffer, is this even possible from user >space? (Not on x86, but it isn't needed there.) Interesting you should mention that. :-) There isn't a user verb for dma_sync, there's only deregister. The kernel can perform this for receive completions, and signaled RDMA Reads, but it can't do so for remote RDMA Writes. Only the upper layer knows where those went. There are two practical solutions: 1) (practical solution) user mappings must be fully consistent, within the capability of the hardware. Still, don't go depending on any specific ordering here. 2) user must deregister any mapping before inspecting the result. I doubt any of them do this, for that reason anyway. MO is that this will bite us in the a** some day. If anybody was running this code on the Sparc architecture it already would have. Tom. From glebn at voltaire.com Mon Aug 28 06:00:52 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Mon, 28 Aug 2006 16:00:52 +0300 Subject: [openib-general] basic IB doubt In-Reply-To: <7.0.1.0.2.20060828081032.04133618@netapp.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> <7.0.1.0.2.20060828081032.04133618@netapp.com> Message-ID: <20060828130052.GA4921@minantech.com> On Mon, Aug 28, 2006 at 08:24:26AM -0400, Talpey, Thomas wrote: > At 03:39 AM 8/26/2006, Gleb Natapov wrote: > >On Fri, Aug 25, 2006 at 03:53:12PM -0400, Talpey, Thomas wrote: > >> Flush (sync for_device) before posting. > >> Invalidate (sync for_cpu) before processing. > >> > >So, before touching the data that was RDMAed into the buffer application > >should cache invalidate the buffer, is this even possible from user > >space? (Not on x86, but it isn't needed there.) > > Interesting you should mention that. :-) There isn't a user verb for > dma_sync, there's only deregister. > > The kernel can perform this for receive completions, and signaled > RDMA Reads, but it can't do so for remote RDMA Writes. Only the > upper layer knows where those went. > I think that kernel gets interrupt about completion only if notification is requested otherwise cqe directly placed in CQ buffer. By the way CQ buffer also has to be cache-invalidated before each cq_poll. > There are two practical solutions: > > 1) (practical solution) user mappings must be fully consistent, > within the capability of the hardware. Still, don't go depending > on any specific ordering here. Memory registration should work on any memory allocated for user process. 
Can we change memory to be consistent after it was allocated? Based on linux DMA api - no, because we only have dma_alloc/free_consistent() and not something that gets existing pointer as a parameter. > > 2) user must deregister any mapping before inspecting the result. I > doubt any of them do this, for that reason anyway. > This may have big performance impact. > MO is that this will bite us in the a** some day. If anybody was > running this code on the Sparc architecture it already would have. > AFAIK SUN runs MPI over UDAPL, but they have their own IB implementation, so maybe they handle all coherency issues in the UDAPL itself. -- Gleb. From mst at mellanox.co.il Mon Aug 28 06:32:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 16:32:50 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state Message-ID: <20060828133250.GB24261@mellanox.co.il> IB spec, section 12.4, says: CMs shall maintain enough connection state information to detect an attempt to initiate a connection on a remote QP/EEC that has not been released from a connection with a local QP/EEC, or that is in the TimeWait state. Such an event could occur if the remote CM had dropped the connection and sent DREQ, but the DREQ was not received by the local CM. If the local CM receives a REQ that includes a QPN (or EECN if REQ:RDC Exists is not set), that it believes to be connected to a local QP/EEC, the local CM shall act as defined in section 12.9.8.3. Note here, that while CM must maintain QPs in TimeWait state (to enable detection of TimeWait packets, as explained in 9.7.1 PACKET SEQUENCE NUMBERS), such QPs are not connected (they are normally in reset state). Thus even if a local QP was connected to a specific remote QPN, once the connection enters the timewait state CM must not reject the connection request even if it includes the specific remote QPN. The behaviour described in 12.9.8.3 is as follows: 12.9.8.3.1 REQ RECEIVED / REP RECEIVED (RC, UC) A CM may receive a REQ/REP specifying a remote QPN in .REQ:local QPN./.REP:local QPN. that the CM already considers connected to a local QP. A local CM may receive such a REQ/REP if its local QP has a stale connection, as described in section 12.4.1. When a CM receives such a REQ/REP it shall abort the connection establishment by issuing REJ to the REQ/REP. It shall then issue DREQ, with .DREQ:remote QPN. set to the remote QPN from the REQ/REP, until DREP is received or Max Retries is exceeded, and place the local QP in the TimeWait state. .... If a CM receives a REQ/REP as described above, if the REQ/REP has the same Local Communication ID and Remote Communication ID as are present in the existing connection and if the REQ/REP arrives within the window of time during which the active/passive side could be legally retransmitting REQ/REP, the CM should treat the REQ/REP as a retry and not initiate stale connection processing as described above. Note how all this does not make any sense for connections in timewait state. Finally, let me quote the definition of the stale connection: 12.4.1 STALE CONNECTION A QP/EEC is said to have a stale connection when only one side has connection information. A stale connection may result if the remote CM had dropped the connection and sent a DREQ but the DREQ was never received by the local CM.
Alternatively the remote CM may have lost all record of past connections because its node crashed and rebooted, while the local CM did not become aware of the remote node's reboot and therefore did not clean up stale connections. Note how, again, a connection in TimeWait state does not match the definition of the Stale connection since we arrive there after graceful DREQ/DREP exchange. Our CM implementation violates this requirement - even after the connection was torn down gracefully, and after QP was moved to timewait, CM still rejects connection requests that happen to share the same remote QPN, until timewait exit. I actually see a lot of such bogus rejects when QPs are open/closed at a high rate. The following patch addresses this issue for me, but also seems to trigger crashes under stress - I am still debugging these. Comments appreciated. --- IB/cm: do not track remote QPN in TimeWait, since QP is not connected Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index f85c97f..e270311 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -679,6 +679,8 @@ static void cm_enter_timewait(struct cm_ { int wait_time; + cm_cleanup_timewait(cm_id_priv->timewait_info); + /* * The cm_id could be destroyed by the user before we exit timewait. * To protect against this, we search for the cm_id after exiting -- MST From jlentini at netapp.com Mon Aug 28 07:10:10 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 28 Aug 2006 10:10:10 -0400 (EDT) Subject: [openib-general] [PATCH] remove unnecessary include Message-ID: Index: hw/mthca/mthca_mcg.c =================================================================== --- hw/mthca/mthca_mcg.c (revision 9120) +++ hw/mthca/mthca_mcg.c (working copy) @@ -32,8 +32,6 @@ * $Id$ */ -#include - #include "mthca_dev.h" #include "mthca_cmd.h" From Thomas.Talpey at netapp.com Mon Aug 28 07:38:43 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 28 Aug 2006 10:38:43 -0400 Subject: [openib-general] basic IB doubt In-Reply-To: <20060828130052.GA4921@minantech.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> <7.0.1.0.2.20060828081032.04133618@netapp.com> <20060828130052.GA4921@minantech.com> Message-ID: <7.0.1.0.2.20060828102208.07d5b648@netapp.com> At 09:00 AM 8/28/2006, Gleb Natapov wrote: >> 2) user must deregister any mapping before inspecting the result. I >> doubt any of them do this, for that reason anyway. >> >This may have big performance impact. You think? :-) >> MO is that this will bite us in the a** some day. If anybody was >> running this code on the Sparc architecture it already would have. >> >AFAIK SUN runs MPI over UDAPL, but they have their own IB >implementation, so may be they handle all coherency issues in the UDAPL >itself. The Sparc IOMMU supports consistent mappings, in which the i/o streaming caches are not used. There is a performance impact to using this mode however. The best throughput is achieved using streaming with explicit software consistency. However, even in consistent mode, the Sparc API requires that the synchronization calls be made. 
I have never gotten a completely satisfactory answer as to why, but on the high-end server platforms, I think it's possible that the busses can't always snoop one another and the calls provide a "push". Will turning on the Opteron's IOMMU introduce some of these issues to x86? Tom. From felix at chelsio.com Mon Aug 28 08:23:00 2006 From: felix at chelsio.com (Felix Marti) Date: Mon, 28 Aug 2006 08:23:00 -0700 Subject: [openib-general] basic IB doubt Message-ID: <8A71B368A89016469F72CD08050AD334B205FC@maui.asicdesigners.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of glebn at voltaire.com > Sent: Monday, August 28, 2006 1:11 AM > To: Jason Gunthorpe > Cc: Roland Dreier; openib-general at openib.org > Subject: Re: [openib-general] basic IB doubt > > On Mon, Aug 28, 2006 at 12:18:49AM -0600, Jason Gunthorpe wrote: > > On Sun, Aug 27, 2006 at 03:30:56PM -0700, Roland Dreier wrote: > > > glebn> So, before touching the data that was RDMAed into the > > > glebn> buffer application should cache invalidate the buffer, is > > > glebn> this even possible from user space? (Not on x86, but it > > > glebn> isn't needed there.) > > > > > Yes, on any architecture that is not cache-coherent with PCI DMA, some > > > cache invalidation/flushing will be necessary. And this probably > > > won't be possible from userspace if the cache is physically tagged. > > > (Are there any such architectures in real use, ie non-coherent with > > > PCI and physically tagged cache?) > > > > It depends on the arch if it is a problem or not.. Ie PPC Book-E > > has 'dcba' which is available from user space. It operates on virtual > > addresses and is a flush and invalidate combined. So it is safe, > > but less effecient than the pure invalidate that the kernel has access > > to. > > > This is from PPC instruction book: > > The dcba instruction executes as follows: > If the cache block containing the byte addressed by EA is in the data > cache, the contents of all bytes are made undefined but the cache block > is > still considered valid. Note that programming errors can occur if the > data > in this cache block is subsequently read or used inadvertently. > > If the cache block containing the byte addressed by EA is not in > the data cache and the corresponding memory page or block is caching- > allowed, > the cache block is allocated (and made valid) in the data cache without > fetching the block from main memory, and the value of all bytes is > undefined. > > This doesn't look like this instruction is doing flush or invalidate. It > makes cache line present without accessing underlying memory. AFAIR > uboot uses this + cache locking to create C stack before SDRAM is > initialised. Might be a typo: dcbf does flush & invalidate and is a non-privileged instruction. dcbi does invalidate only but it is privileged. > > > So long as cache ops that works on virtual addresses are present > > it should be fine from userspace, but in some cases the necessary > > sequence of cache ops can be quite elaborate and hardware dependent, > > so a syscall, or at least a vdso function would be needed to support > > eveything. > And then you'll need to do syscall for every IB verb. > > > > > In my experience most real architectures that have this problem these > > days are embedded targetted lower performance processors. If you are > > in the embedded space and using IB hardware then presumably you care > > about performance and will avoid such things. 
(Although long ago, this > > wasn't a choice and I actually have built an embedded IB capable > > system with non-coherent PCI.. It is a big pain, I don't recommend it.) > > > Agreed. > > -- > Gleb. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From caitlinb at broadcom.com Mon Aug 28 09:05:23 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 28 Aug 2006 09:05:23 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <7.0.1.0.2.20060828081032.04133618@netapp.com> Message-ID: <54AD0F12E08D1541B826BE97C98F99F189EC63@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > At 03:39 AM 8/26/2006, Gleb Natapov wrote: >> On Fri, Aug 25, 2006 at 03:53:12PM -0400, Talpey, Thomas wrote: >>> Flush (sync for_device) before posting. >>> Invalidate (sync for_cpu) before processing. >>> >> So, before touching the data that was RDMAed into the buffer >> application should cache invalidate the buffer, is this even possible >> from user space? (Not on x86, but it isn't needed there.) > > Interesting you should mention that. :-) There isn't a user > verb for dma_sync, there's only deregister. > > The kernel can perform this for receive completions, and > signaled RDMA Reads, but it can't do so for remote RDMA > Writes. Only the upper layer knows where those went. > > There are two practical solutions: > > 1) (practical solution) user mappings must be fully > consistent, within the capability of the hardware. Still, > don't go depending on any specific ordering here. > > 2) user must deregister any mapping before inspecting the > result. I doubt any of them do this, for that reason anyway. > > MO is that this will bite us in the a** some day. If anybody > was running this code on the Sparc architecture it already would have. > The consensus I have seen in other forums is that the RDMA device is expected to be at least as coherent as another CPU at the point when a completion is delivered. That is, it is reasonable to expect an application to treat memory updated by the RDMA device just as it would treat memory updated by another processor under platform appropriate rules. But it is unreasonable to expect that application to do *more* to access the memory, other than actually waiting for a completion. DAPL, IT-API and RNIC-PI all include support for an exception, which was totally related to a specific bus architecture that is no longer sold. From jackm at mellanox.co.il Mon Aug 28 09:08:53 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 28 Aug 2006 19:08:53 +0300 Subject: [openib-general] [PATCH] mthca: fix default static rate returned for Tavor in av Message-ID: <200608281908.53402.jackm@mellanox.co.il> When default static rate is returned for Tavor, need to translate it to an ib rate value. 
Signed-off-by: Jack Morgenstein Index: ofed_1_1/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-08-03 14:30:21.000000000 +0300 +++ ofed_1_1/drivers/infiniband/hw/mthca/mthca_av.c 2006-08-23 17:53:01.227220000 +0300 @@ -90,7 +90,7 @@ static enum ib_rate tavor_rate_to_ib(u8 case MTHCA_RATE_TAVOR_1X: return IB_RATE_2_5_GBPS; case MTHCA_RATE_TAVOR_1X_DDR: return IB_RATE_5_GBPS; case MTHCA_RATE_TAVOR_4X: return IB_RATE_10_GBPS; - default: return port_rate; + default: return mult_to_ib_rate(port_rate); } } From jackm at mellanox.co.il Mon Aug 28 09:10:34 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 28 Aug 2006 19:10:34 +0300 Subject: [openib-general] [PATCH] mthca: return port number for unconnected QPs as well in query_qp Message-ID: <200608281910.35021.jackm@mellanox.co.il> port_num was not being returned for unconnected QPs. Signed-off-by: Jack Morgenstein Index: ofed_1_1/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-08-23 10:33:04.000000000 +0300 +++ ofed_1_1/drivers/infiniband/hw/mthca/mthca_qp.c 2006-08-23 18:46:08.330885000 +0300 @@ -468,10 +474,14 @@ int mthca_query_qp(struct ib_qp *ibqp, s if (qp->transport == RC || qp->transport == UC) { to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + qp_attr->alt_pkey_index = + be32_to_cpu(context->alt_path.port_pkey) & 0x7f; + qp_attr->alt_port_num = qp_attr->alt_ah_attr.port_num; } - qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; - qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; + qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; + qp_attr->port_num = + (be32_to_cpu(context->pri_path.port_pkey) >> 24) & 0x3; /* qp_attr->en_sqd_async_notify is only applicable in modify qp */ qp_attr->sq_draining = mthca_state == MTHCA_QP_STATE_DRAINING; @@ -482,11 +492,9 @@ int mthca_query_qp(struct ib_qp *ibqp, s 1 << ((be32_to_cpu(context->params2) >> 21) & 0x7); qp_attr->min_rnr_timer = (be32_to_cpu(context->rnr_nextrecvpsn) >> 24) & 0x1f; - qp_attr->port_num = qp_attr->ah_attr.port_num; qp_attr->timeout = context->pri_path.ackto >> 3; qp_attr->retry_cnt = (be32_to_cpu(context->params1) >> 16) & 0x7; qp_attr->rnr_retry = context->pri_path.rnr_retry >> 5; - qp_attr->alt_port_num = qp_attr->alt_ah_attr.port_num; qp_attr->alt_timeout = context->alt_path.ackto >> 3; qp_init_attr->cap = qp_attr->cap; From jackm at mellanox.co.il Mon Aug 28 09:12:39 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 28 Aug 2006 19:12:39 +0300 Subject: [openib-general] [PATCH] mthca: return correct number of bits for static rate in query_qp Message-ID: <200608281912.39896.jackm@mellanox.co.il> Incorrect number of bits was taken for static_rate field. 
Signed-off-by: Jack Morgenstein Index: ofed_1_1/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- ofed_1_1.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-08-23 10:33:04.000000000 +0300 +++ ofed_1_1/drivers/infiniband/hw/mthca/mthca_qp.c 2006-08-23 18:46:08.330885000 +0300 @@ -404,7 +410,7 @@ static void to_ib_ah_attr(struct mthca_d ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28; ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f; ib_ah_attr->static_rate = mthca_rate_to_ib(dev, - path->static_rate & 0x7, + path->static_rate & 0xf, ib_ah_attr->port_num); ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? IB_AH_GRH : 0; if (ib_ah_attr->ah_flags) { From jgunthorpe at obsidianresearch.com Mon Aug 28 09:22:52 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 28 Aug 2006 10:22:52 -0600 Subject: [openib-general] basic IB doubt In-Reply-To: <7.0.1.0.2.20060828102208.07d5b648@netapp.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> <7.0.1.0.2.20060828081032.04133618@netapp.com> <20060828130052.GA4921@minantech.com> <7.0.1.0.2.20060828102208.07d5b648@netapp.com> Message-ID: <20060828162252.GF13774@obsidianresearch.com> On Mon, Aug 28, 2006 at 10:38:43AM -0400, Talpey, Thomas wrote: > Will turning on the Opteron's IOMMU introduce some of these > issues to x86? No, definately not. The Opteron IOMMU (the GART) is a pure address translation mechanism and doesn't change the operation of the caches. If Sun has a problem on larger systems I wonder if SGI Altix also has a problem? SGI Altix is definately a real system that people use IB cards in today and it would be easy to imagine such a large system could have coherence issues with memory polling.. Jason From jgunthorpe at obsidianresearch.com Mon Aug 28 09:25:52 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 28 Aug 2006 10:25:52 -0600 Subject: [openib-general] basic IB doubt In-Reply-To: <8A71B368A89016469F72CD08050AD334B205FC@maui.asicdesigners.com> References: <8A71B368A89016469F72CD08050AD334B205FC@maui.asicdesigners.com> Message-ID: <20060828162552.GG13774@obsidianresearch.com> On Mon, Aug 28, 2006 at 08:23:00AM -0700, Felix Marti wrote: > Might be a typo: dcbf does flush & invalidate and is a non-privileged > instruction. dcbi does invalidate only but it is privileged. Er yes, this is correct, sorry about the typo. Jason From mshefty at ichips.intel.com Mon Aug 28 09:45:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 09:45:02 -0700 Subject: [openib-general] CMA oops In-Reply-To: <20060828105755.GB23639@mellanox.co.il> References: <20060828105755.GB23639@mellanox.co.il> Message-ID: <44F31D8E.3090306@ichips.intel.com> Michael S. Tsirkin wrote: > Apparently, list->prev pointer in CMA id_priv structure is NULL > which causes a crash in list_del. > > I note that rdma_destroy_id tests outside the mutex lock. > Could that be the problem? > The problem is not unfortunately easily reproducible. I'll see if I see a problem. Can you describe what was happening when the crash occurred? Was SDP the user of the CMA? 
- Sean From mshefty at ichips.intel.com Mon Aug 28 09:53:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 09:53:19 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060828133250.GB24261@mellanox.co.il> References: <20060828133250.GB24261@mellanox.co.il> Message-ID: <44F31F7F.8030907@ichips.intel.com> Michael S. Tsirkin wrote: > Comments appreciated. I will look at the spec in more details, but I thought that timewait was included as part of the life of a connection. I.e. the connection wasn't released until it returned to idle. Also, isn't the purpose behind timewait to prevent re-connecting a QP while there are outstanding packets on the fabric that could be associated with the QP? - Sean From rdreier at cisco.com Mon Aug 28 10:12:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 10:12:24 -0700 Subject: [openib-general] [PATCH] remove unnecessary include In-Reply-To: (James Lentini's message of "Mon, 28 Aug 2006 10:10:10 -0400 (EDT)") References: Message-ID: > -#include that file declares a function as __devinit, so is definitely needed. What is pulling it in implicitly? - R. From mst at mellanox.co.il Mon Aug 28 10:13:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 20:13:23 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F31F7F.8030907@ichips.intel.com> References: <44F31F7F.8030907@ichips.intel.com> Message-ID: <20060828171323.GA25491@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > > Comments appreciated. > > I will look at the spec in more details, but I thought that timewait was > included as part of the life of a connection. I.e. the connection wasn't > released until it returned to idle. Here's a quote: 12.9.4 STATE DIAGRAM NOTES When the Consumer wishes to destroy a QP or EEC that is in the Established CM state, it is good practice for the Consumer to first release the connection before destroying the QP or EEC. Doing so allows any state maintained by CM related to the QP or EEC in question to be cleaned up. A connection is released by moving from the Established state to the TimeWait state using one of the state transition sequences described in the sections that follow. So yes, connection in TimeWait is released. > Also, isn't the purpose behind timewait to > prevent re-connecting a QP while there are outstanding packets on the fabric > that could be associated with the QP? Yes, and for that we need *not* track the remote QPN in timewait - instead, we must keep the local QP around in reset, error or init state: When the CM state is IDLE, LISTEN, or TimeWait, the QP or EE Context is allowed to be in any of the Error, Reset, or Initialized states. That's exactly my point: - timewait is to flush out outstanding packets (even after DREQ/DREP completed) - stale connection detection is for when one of the sides loses connection information - needed only if DREQ was lost These two are not related. 
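The policy being argued for here can be stated in a few lines. A toy sketch with hypothetical names (an illustration of the argument, not the ib_cm implementation): reject a REQ as stale only while the remote QPN is still believed to be part of a live connection, never merely because the local QP that used to talk to it is waiting out TimeWait.

#include <stdio.h>

enum remote_qpn_state {
	REMOTE_QPN_UNKNOWN,	/* no record of it: accept */
	REMOTE_QPN_CONNECTED,	/* still believed connected: stale, reject */
	REMOTE_QPN_TIMEWAIT,	/* released via DREQ/DREP, local QP in TimeWait */
};

/*
 * Reject only when we still consider the remote QP connected; a QPN whose
 * connection was released gracefully is free to be reused by the remote CM
 * even while our own QP sits in TimeWait.
 */
static int reject_as_stale(enum remote_qpn_state s)
{
	return s == REMOTE_QPN_CONNECTED;
}

int main(void)
{
	printf("timewait REQ rejected? %d\n",
	       reject_as_stale(REMOTE_QPN_TIMEWAIT));	/* 0 under this policy */
	return 0;
}

Under this policy the graceful DREQ/DREP scenario described elsewhere in the thread (the remote side reuses its QP while the local QP is still in TimeWait) would be accepted rather than rejected as stale.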
-- MST From rdreier at cisco.com Mon Aug 28 10:39:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 10:39:59 -0700 Subject: [openib-general] [PATCH] mthca: return correct number of bits for static rate in query_qp In-Reply-To: <200608281912.39896.jackm@mellanox.co.il> (Jack Morgenstein's message of "Mon, 28 Aug 2006 19:12:39 +0300") References: <200608281912.39896.jackm@mellanox.co.il> Message-ID: Thanks, applied all 3 to for-2.6.19, and pushed out my git tree. From mshefty at ichips.intel.com Mon Aug 28 10:43:35 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 10:43:35 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060828133250.GB24261@mellanox.co.il> References: <20060828133250.GB24261@mellanox.co.il> Message-ID: <44F32B47.10302@ichips.intel.com> Michael S. Tsirkin wrote: > IB spec, section 12.4, says: > > CMs shall maintain enough connection state information to detect an attempt > to initiate a connection on a remote QP/EEC that has not been released > from a connection with a local QP/EEC, or that is in the TimeWait > state. Such an event could occur if the remote CM had dropped the connection > and sent DREQ, but the DREQ was not received by the local CM. > If the local CM receives a REQ that includes a QPN (or EECN if > REQ:RDC Exists is not set), that it believes to be connected to a local > QP/EEC, the local CM shall act as defined in section 12.9.8.3. > > Note here, that while CM must maintain QPs in TimeWait state (to enable > detection of TimeWait packets, as explained in 9.7.1 PACKET SEQUENCE NUMBERS), > such QPs are not connected (they are normally in reset state). > Thus even if a local QP was connected to a specific remote QPN, once the > connection enters the timewait state CM must not reject the connection request > even if it includes the specific remote QPN. My interpretation of 12.4 is: The CM should track remote QPs that are either: 1. Part of an active connection, or 2. A connection that has been placed into timewait. The CM should detect attempts to connect such remote QPs, and reject them. The entire paragraph is referring to stale connection handling, and I believe the reference to timewait is included as part of that general discussion. - Sean From Thomas.Talpey at netapp.com Mon Aug 28 10:49:27 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 28 Aug 2006 13:49:27 -0400 Subject: [openib-general] basic IB doubt In-Reply-To: <20060828162252.GF13774@obsidianresearch.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> <7.0.1.0.2.20060828081032.04133618@netapp.com> <20060828130052.GA4921@minantech.com> <7.0.1.0.2.20060828102208.07d5b648@netapp.com> <20060828162252.GF13774@obsidianresearch.com> Message-ID: <7.0.1.0.2.20060828134057.07d5b648@netapp.com> At 12:22 PM 8/28/2006, Jason Gunthorpe wrote: >On Mon, Aug 28, 2006 at 10:38:43AM -0400, Talpey, Thomas wrote: > >> Will turning on the Opteron's IOMMU introduce some of these >> issues to x86? > >No, definately not. The Opteron IOMMU (the GART) is a pure address >translation mechanism and doesn't change the operation of the caches. Okay, that's good. However, doesn't it delay reads and writes until the necessary table walk / mapping is resolved? 
Because it passes all other cycles through, it seems to me that an interrupt may pass data, meaning that ordering (at least) may be somewhat different when it's present. And, those pending writes are not in the cache's consistency domain (i.e. they can't be snooped yet, right?). >If Sun has a problem on larger systems I wonder if SGI Altix also has a >problem? SGI Altix is definately a real system that people use IB >cards in today and it would be easy to imagine such a large system >could have coherence issues with memory polling.. I'd be interested in this too. Tom. From jgunthorpe at obsidianresearch.com Mon Aug 28 11:03:10 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 28 Aug 2006 12:03:10 -0600 Subject: [openib-general] basic IB doubt In-Reply-To: <7.0.1.0.2.20060828134057.07d5b648@netapp.com> References: <20060825185555.GE10509@greglaptop.hotels-on-air.de> <44EF1570.31467.1581582@mlakshmanan.silverstorm.com> <20060825192351.GM10509@greglaptop.hotels-on-air.de> <7.0.1.0.2.20060825155129.08272fc0@netapp.com> <20060826073957.GA1369@minantech.com> <7.0.1.0.2.20060828081032.04133618@netapp.com> <20060828130052.GA4921@minantech.com> <7.0.1.0.2.20060828102208.07d5b648@netapp.com> <20060828162252.GF13774@obsidianresearch.com> <7.0.1.0.2.20060828134057.07d5b648@netapp.com> Message-ID: <20060828180310.GI1624@obsidianresearch.com> On Mon, Aug 28, 2006 at 01:49:27PM -0400, Talpey, Thomas wrote: > Okay, that's good. However, doesn't it delay reads and writes until the > necessary table walk / mapping is resolved? Because it passes all other > cycles through, it seems to me that an interrupt may pass data, meaning > that ordering (at least) may be somewhat different when it's present. > And, those pending writes are not in the cache's consistency domain > (i.e. they can't be snooped yet, right?). I've never asked AMD this kind of question directly, but my guess would be that either the HT queues or the SRQ stalls while the table walk is performed and that maintains the in-order requirements of PCI/HT. Otherwise, like you say, the ordering guarentees of MSI could be lost.. Jason From mst at mellanox.co.il Mon Aug 28 11:08:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 21:08:55 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F32B47.10302@ichips.intel.com> References: <44F32B47.10302@ichips.intel.com> Message-ID: <20060828180855.GC25491@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > > IB spec, section 12.4, says: > > > > CMs shall maintain enough connection state information to detect an attempt > > to initiate a connection on a remote QP/EEC that has not been released > > from a connection with a local QP/EEC, or that is in the TimeWait > > state. Such an event could occur if the remote CM had dropped the connection > > and sent DREQ, but the DREQ was not received by the local CM. > > If the local CM receives a REQ that includes a QPN (or EECN if > > REQ:RDC Exists is not set), that it believes to be connected to a local > > QP/EEC, the local CM shall act as defined in section 12.9.8.3. > > > > Note here, that while CM must maintain QPs in TimeWait state (to enable > > detection of TimeWait packets, as explained in 9.7.1 PACKET SEQUENCE NUMBERS), > > such QPs are not connected (they are normally in reset state). 
> > Thus even if a local QP was connected to a specific remote QPN, once the > > connection enters the timewait state CM must not reject the connection request > > even if it includes the specific remote QPN. > > My interpretation of 12.4 is: > > The CM should track remote QPs that are either: > > 1. Part of an active connection, or > 2. A connection that has been placed into timewait. > > The CM should detect attempts to connect such remote QPs, and reject them. The > entire paragraph is referring to stale connection handling, and I believe the > reference to timewait is included as part of that general discussion. So, you must somehow detect that the remote QP is in timewait state. I don't see any way to do this, and this is not what the CM currently does. Our CM tracks local QPs in timewait state, which is obviously not what the spec intends since remote QP could be reused even though local QP is in timewait. -- MST From mst at mellanox.co.il Mon Aug 28 11:11:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 21:11:02 +0300 Subject: [openib-general] CMA oops In-Reply-To: <44F31D8E.3090306@ichips.intel.com> References: <44F31D8E.3090306@ichips.intel.com> Message-ID: <20060828181102.GD25491@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] CMA oops > > Michael S. Tsirkin wrote: > > Apparently, list->prev pointer in CMA id_priv structure is NULL > > which causes a crash in list_del. > > > > I note that rdma_destroy_id tests outside the mutex lock. > > Could that be the problem? > > The problem is not unfortunately easily reproducible. > > I'll see if I see a problem. Can you describe what was happening when the crash Lots of connections opening/closing. > occurred? > Was SDP the user of the CMA? yes -- MST From jlentini at netapp.com Mon Aug 28 11:20:56 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 28 Aug 2006 14:20:56 -0400 (EDT) Subject: [openib-general] [PATCH] remove unnecessary include In-Reply-To: References: Message-ID: On Mon, 28 Aug 2006, Roland Dreier wrote: > > -#include > > that file declares a function as __devinit, so is > definitely needed. What is pulling it in implicitly? Here's the include sequence: mthca_mcg.c includes mthca_cmd.h includes rdma/ib_verbs.h includes linux/device.h includes linux/module.h includes linux/moduleparam.h includes linux/init.h Given how many levels down this is, I'm going to change my mind. I think the current explicit include is better than what I suggested. From rdreier at cisco.com Mon Aug 28 11:25:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 11:25:00 -0700 Subject: [openib-general] [PATCH] remove unnecessary include In-Reply-To: (James Lentini's message of "Mon, 28 Aug 2006 14:20:56 -0400 (EDT)") References: Message-ID: James> Given how many levels down this is, I'm going to change my James> mind. I think the current explicit include is better than James> what I suggested. Thanks, I agree. No sense introducing such a fragile dependency on an implicit include for such a minimal gain. - R. From mst at mellanox.co.il Mon Aug 28 11:24:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 21:24:31 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F32B47.10302@ichips.intel.com> References: <44F32B47.10302@ichips.intel.com> Message-ID: <20060828182431.GA25979@mellanox.co.il> Quoting r. Sean Hefty : > The CM should track remote QPs that are either: > > 1. 
Part of an active connection, or > 2. A connection that has been placed into timewait. Agree here. > The CM should detect attempts to connect such remote QPs, and reject them. That's stretching what the spec says though. It says reject in case 1. > The > entire paragraph is referring to stale connection handling, Not really, stale connection handling is the one below. This paragraph talks about release in general. > and I believe the > reference to timewait is included as part of that general discussion. So CM currently tracks the remote QPN that QP in timewait *as* connected to. What you get then is, e.g.: - local side sends dreq - remote side sends drep, places qp in timewait - local side gets drep places qp in timewait - remote side gets out of timewait reuses qp and sends req - local side sees remote QP was used as peer for local qp that now is in timewait, and rejects the connection And here we are, connection request that was part of a perfectly valid connection setup sequence gets rejected as supposedly stale. In your interpretation, I see no way *not* to get rejects which breaks applications such as SDP which are careful to perform DREQ/DREP sequence gracefully, but report rejects directly to the user. -- MST From mshefty at ichips.intel.com Mon Aug 28 11:37:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 11:37:33 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060828180855.GC25491@mellanox.co.il> References: <44F32B47.10302@ichips.intel.com> <20060828180855.GC25491@mellanox.co.il> Message-ID: <44F337ED.904@ichips.intel.com> Michael S. Tsirkin wrote: > So, you must somehow detect that the remote QP is in timewait state. > I don't see any way to do this, and this is not what the CM > currently does. > > Our CM tracks local QPs in timewait state, which is obviously not > what the spec intends since remote QP could be reused even though > local QP is in timewait. The CM tracks the remote QP, not the local. The spec (12.4) seems to state that this is required by the CM. Do you disagree with my interpretation of 12.4, or why do you think that this is obviously not what the spec intends? - Sean From mshefty at ichips.intel.com Mon Aug 28 12:03:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 12:03:40 -0700 Subject: [openib-general] drop mthca from svn? In-Reply-To: References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> Message-ID: <44F33E0C.6010905@ichips.intel.com> Roland Dreier wrote: > James> If the code is moved, how can the OpenFabrics community be > James> guaranteed that the entire software stack will remain under > James> a dual BSD/GPL license? > > You can't guarantee that someone won't come along and write some IB > driver and get it merged upstream without a BSD license. So there's > not much we can do anyway. Such a driver wouldn't be an OpenFabrics driver though. My only concern with this is its effect on backport patches, since most users are not running the later kernels. I think only using the kernel org tree is ideal for the more mature code though. Developing code still requires a separate repository. 
- Sean From bugzilla-daemon at openib.org Mon Aug 28 12:21:02 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 28 Aug 2006 12:21:02 -0700 (PDT) Subject: [openib-general] [Bug 211] New: libibcm.so is unversioned Message-ID: <20060828192102.1FBD02283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=211 Summary: libibcm.so is unversioned Product: OpenFabrics Linux Version: 1.1rc4 Platform: All OS/Version: All Status: NEW Severity: blocker Priority: P2 Component: RDMA CM AssignedTo: bugzilla at openib.org ReportedBy: dledford at redhat.com The Makefile.am setup in libibcm passes -avoid-version to the libtool script. Generally speaking lib.so files should be avoided except as -devel files, and even then the .so file should be a link to the then current versioned library name. To make matters worse, because of this additional typo/thinko in Makefile.am: if HAVE_LD_VERSION_SCRIPT ibcm_version_script = -Wl,--version-script=$(srcdir)/src/libibcm.map else ... src_libibcm_la_LDFLAGS = -avoid-version $(ucm_version_script) ^^^oops the existing setup produces 100% completely unversioned library symbols, both at the soname level and the individual symbol map level. Setting to blocker severity. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Mon Aug 28 12:23:46 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 28 Aug 2006 12:23:46 -0700 (PDT) Subject: [openib-general] [Bug 212] New: librdmacm is unversioned Message-ID: <20060828192346.E9CA92283D8@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=212 Summary: librdmacm is unversioned Product: OpenFabrics Linux Version: 1.1rc4 Platform: All OS/Version: All Status: NEW Severity: major Priority: P2 Component: RDMA CM AssignedTo: bugzilla at openib.org ReportedBy: dledford at redhat.com It is generally bad form to have a shared lib without any soname versioning. Makefile.am for librdmacm should be updated to provide a suitable so version for the libtool script. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Mon Aug 28 12:34:55 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 28 Aug 2006 12:34:55 -0700 (PDT) Subject: [openib-general] [Bug 213] New: librdmacm uses deprecated libsysfs function Message-ID: <20060828193455.CDD3C2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=213 Summary: librdmacm uses deprecated libsysfs function Product: OpenFabrics Linux Version: 1.1rc4 Platform: Other OS/Version: Other Status: NEW Severity: blocker Priority: P2 Component: RDMA CM AssignedTo: bugzilla at openib.org ReportedBy: dledford at redhat.com In libsysfs-2.0 and later the sysfs_read_attribute_value() function no longer exists, so compilation of librdmacm fails. The attached patch makes librdmacm use the open_attribute/read_attribute/close_attribute method supported in both 1.x and 2.x versions of the library. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
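For reference, a minimal sketch of the open/read/close attribute access that this bug describes as portable across libsysfs 1.x and 2.x. This is not the attached patch; the helper name and the attribute path below are examples only.

#include <stdio.h>
#include <string.h>
#include <sysfs/libsysfs.h>

/* Example helper: read one sysfs attribute into a caller-supplied buffer,
 * using only calls present in both libsysfs 1.x and 2.x. */
static int read_sysfs_attr(const char *path, char *buf, size_t size)
{
        struct sysfs_attribute *attr;
        int ret = -1;

        attr = sysfs_open_attribute(path);
        if (!attr)
                return -1;

        /* sysfs_read_attribute() returns 0 on success and fills
         * attr->value / attr->len */
        if (!sysfs_read_attribute(attr)) {
                strncpy(buf, attr->value, size - 1);
                buf[size - 1] = '\0';
                ret = 0;
        }

        sysfs_close_attribute(attr);
        return ret;
}

int main(void)
{
        char val[64];

        /* example path only */
        if (!read_sysfs_attr("/sys/class/infiniband_verbs/abi_version",
                             val, sizeof val))
                printf("abi_version: %s", val);
        return 0;
}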
From bugzilla-daemon at openib.org Mon Aug 28 12:35:53 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 28 Aug 2006 12:35:53 -0700 (PDT) Subject: [openib-general] [Bug 213] librdmacm uses deprecated libsysfs function Message-ID: <20060828193553.476D62283D8@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=213 ------- Comment #1 from dledford at redhat.com 2006-08-28 12:35 ------- Created an attachment (id=39) --> (http://openib.org/bugzilla/attachment.cgi?id=39&action=view) Fix for use of dead function This would need to be slightly modified to check for multiple sysfs attribute file locations, but otherwise will work. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mst at mellanox.co.il Mon Aug 28 12:30:46 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 22:30:46 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F337ED.904@ichips.intel.com> References: <44F337ED.904@ichips.intel.com> Message-ID: <20060828193046.GB25979@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > > So, you must somehow detect that the remote QP is in timewait state. > > I don't see any way to do this, and this is not what the CM > > currently does. > > > > Our CM tracks local QPs in timewait state, which is obviously not > > what the spec intends since remote QP could be reused even though > > local QP is in timewait. > > The CM tracks the remote QP, not the local. I might not have been clear. For connection in timewait state, spec explicitly says local QP must be in reset, error or init. Only after it goes out of timewait can you destroy the QP. That's the tracking I think spec means CM needs to do. > The spec (12.4) seems to state > that this is required by the CM. Tracking, yes. But the not rejecting connections. > Do you disagree with my interpretation of > 12.4, or why do you think that this is obviously not what the spec intends? It seems I disagree with your interpretation of the spec. I think what the spec intends is CM must track 2 kinds of QPs: 1. remote QPN used in Connected QPs - to detect stale connections 2. Local QPs in timewait state - QP must not be destroyed immediately, but must stay in reset, error or init so that harware discards timewait packets These 2 are mutually exclusive. In case 1 a new REQ with same remote QPN means connection got stale. In case 2 we exchanged DREQ/DREP so there's no issue. -- MST From swise at opengridcomputing.com Mon Aug 28 12:59:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 28 Aug 2006 14:59:52 -0500 Subject: [openib-general] dapltest compiler error on FC5/X86_64 Message-ID: <1156795192.6987.5.camel@stevo-desktop> Anybody run into this compiling dapltest? 
gcc -g3 -Wall -Wmissing-prototypes -Wstrict-prototypes -Wmissing-declarations -Werror -pipe -I/home/swise/openib/intel_demo1b_userspace/dapl/test/dapltest/udapl/../../../dat/include/ -I../mdep/linux -I../include -D__LINUX__ -D__PENTIUM__ -o Obj/dapl_mdep_user.o -c ../mdep/linux/dapl_mdep_user.c ../mdep/linux/dapl_mdep_user.c: In function ‘DT_Mdep_GetTime’: ../mdep/linux/dapl_mdep_user.c:182: error: ‘CLK_TCK’ undeclared (first use in this function) ../mdep/linux/dapl_mdep_user.c:182: error: (Each undeclared identifier is reported only once ../mdep/linux/dapl_mdep_user.c:182: error: for each function it appears in.) cc1: warnings being treated as errors ../mdep/linux/dapl_mdep_user.c:183: warning: control reaches end of non-void function make: *** [Obj/dapl_mdep_user.o] Error 1 [root at vic8 udapl]# From asgeir_eiriksson at hotmail.com Mon Aug 28 13:00:02 2006 From: asgeir_eiriksson at hotmail.com (Asgeir Eiriksson) Date: Mon, 28 Aug 2006 13:00:02 -0700 Subject: [openib-general] basic IB doubt Message-ID: > Date: Mon, 28 Aug 2006 10:22:52 -0600> From: jgunthorpe at obsidianresearch.com> To: Thomas.Talpey at netapp.com> CC: glebn at voltaire.com; openib-general at openib.org> Subject: Re: [openib-general] basic IB doubt> > On Mon, Aug 28, 2006 at 10:38:43AM -0400, Talpey, Thomas wrote:> > > Will turning on the Opteron's IOMMU introduce some of these> > issues to x86?> > No, definately not. The Opteron IOMMU (the GART) is a pure address> translation mechanism and doesn't change the operation of the caches.> > If Sun has a problem on larger systems I wonder if SGI Altix also has a> problem? SGI Altix is definately a real system that people use IB> cards in today and it would be easy to imagine such a large system> could have coherence issues with memory polling..> Jason Yes, there's an issue with polling on the last byte of data, but not polling on a properly implemented CQ. The SGI machines have fully cache coherent I/O, e.g. as part of a DMA write an I/O controller will invalidate any cached copies of a cache line being DMA written. There is still an ordering issue with respect to polling on the last byte of data. For example assume an incoming RDMA WRITE results in two DMA writes to two different pages, then the ordering of these DMA write vis-a-vis being visible in the coherent domain is not guaranteed. Instead these machines have an I/O sync operation to implement ordering guarantees, e.g. when such an I/O sync is used in our example to implement the CQ then it's guaranteed that both DMA writes have completed vis-a-vis being visible in the coherent domain. Hope that helps, Asgeir Eiriksson CTO Chelsio Communications, Inc. _________________________________________________________________ Express yourself with gadgets on Windows Live Spaces http://discoverspaces.live.com?source=hmtag1&loc=us -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Aug 28 13:29:32 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 13:29:32 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060828193046.GB25979@mellanox.co.il> References: <44F337ED.904@ichips.intel.com> <20060828193046.GB25979@mellanox.co.il> Message-ID: <44F3522C.2050207@ichips.intel.com> Michael S. Tsirkin wrote: >>The CM tracks the remote QP, not the local. > > > I might not have been clear. > For connection in timewait state, spec explicitly says local QP > must be in reset, error or init. 
> Only after it goes out of timewait can you destroy the QP. > That's the tracking I think spec means CM needs to do. I believe that this tracking is done, and is reported to the user by the timewait exit event. QP transitions are the responsibility of the user. This is related to a problem that Arlin and I have been discussing. There's nothing that the CM does to prevent the QP from being destroyed, especially for a usermode application. The CM invokes a callback once a connection exits timewait, indicating to the user that the QP may now be destroyed. But if an application crashes, uverbs automatically destroys the QP. We may need better coordination between the CM and verbs wrt timewait to handle userspace QPs, but this depends on this change. >>The spec (12.4) seems to state >>that this is required by the CM. > > > Tracking, yes. But the not rejecting connections. Section 12.4 indicates that the CM shall put both the local and remote QPNs into timewait. I was assuming that the remote QPN was tracked, in part, for rejecting a stale connection. I can see where it would only be needed to validate repeated DREQs, which carry the remote QPN. >> Do you disagree with my interpretation of >>12.4, or why do you think that this is obviously not what the spec intends? > > > It seems I disagree with your interpretation of the spec. > I think what the spec intends is CM must track 2 kinds of QPs: > 1. remote QPN used in Connected QPs - to detect stale connections > 2. Local QPs in timewait state - QP must not be destroyed immediately, but must > stay in reset, error or init so that harware discards timewait packets > > These 2 are mutually exclusive. > In case 1 a new REQ with same remote QPN means connection got stale. > In case 2 we exchanged DREQ/DREP so there's no issue. > From 12.9.7.1 and 12.9.7.2, there's no action indicated that the CM should take when receiving a REQ when in timewait. A stale connection check is explicitly listed under the established state. This may help clarify stale connections. - Sean From mst at mellanox.co.il Mon Aug 28 13:41:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 23:41:18 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F3522C.2050207@ichips.intel.com> References: <44F3522C.2050207@ichips.intel.com> Message-ID: <20060828204118.GA26768@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > >>The CM tracks the remote QP, not the local. > > > > > > I might not have been clear. > > For connection in timewait state, spec explicitly says local QP > > must be in reset, error or init. > > Only after it goes out of timewait can you destroy the QP. > > That's the tracking I think spec means CM needs to do. > > I believe that this tracking is done, and is reported to the user by the > timewait exit event. QP transitions are the responsibility of the user. > > This is related to a problem that Arlin and I have been discussing. There's > nothing that the CM does to prevent the QP from being destroyed, especially for > a usermode application. The CM invokes a callback once a connection exits > timewait, indicating to the user that the QP may now be destroyed. But if an > application crashes, uverbs automatically destroys the QP. > > We may need better coordination between the CM and verbs wrt timewait to handle > userspace QPs, but this depends on this change. 
> > >>The spec (12.4) seems to state > >>that this is required by the CM. > > > > > > Tracking, yes. But the not rejecting connections. > > Section 12.4 indicates that the CM shall put both the local and remote QPNs into > timewait. I was assuming that the remote QPN was tracked, in part, for > rejecting a stale connection. I can see where it would only be needed to > validate repeated DREQs, which carry the remote QPN. I believe communication id should be checked to detect duplicates. Right? Remote QPN stale connection rule is only to avoid a case where we keep connection in established state forever if the remote side rebooted. > >> Do you disagree with my interpretation of > >>12.4, or why do you think that this is obviously not what the spec intends? > > > > > > It seems I disagree with your interpretation of the spec. > > I think what the spec intends is CM must track 2 kinds of QPs: > > 1. remote QPN used in Connected QPs - to detect stale connections > > 2. Local QPs in timewait state - QP must not be destroyed immediately, but must > > stay in reset, error or init so that harware discards timewait packets > > > > These 2 are mutually exclusive. > > In case 1 a new REQ with same remote QPN means connection got stale. > > In case 2 we exchanged DREQ/DREP so there's no issue. > > > > From 12.9.7.1 and 12.9.7.2, there's no action indicated that the CM should take > when receiving a REQ when in timewait. But if the ID that REQ uses is not in timewait the usual rules apply. > A stale connection check is explicitly > listed under the established state. This may help clarify stale connections. So we agree stale connection rule only applies if connection is in established state? -- MST From mst at mellanox.co.il Mon Aug 28 13:47:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Aug 2006 23:47:27 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F3522C.2050207@ichips.intel.com> References: <44F3522C.2050207@ichips.intel.com> Message-ID: <20060828204727.GB26768@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > >>The CM tracks the remote QP, not the local. > > > > > > I might not have been clear. > > For connection in timewait state, spec explicitly says local QP > > must be in reset, error or init. > > Only after it goes out of timewait can you destroy the QP. > > That's the tracking I think spec means CM needs to do. > > I believe that this tracking is done, and is reported to the user by the > timewait exit event. QP transitions are the responsibility of the user. > > This is related to a problem that Arlin and I have been discussing. There's > nothing that the CM does to prevent the QP from being destroyed, especially for > a usermode application. The CM invokes a callback once a connection exits > timewait, indicating to the user that the QP may now be destroyed. But if an > application crashes, uverbs automatically destroys the QP. > > We may need better coordination between the CM and verbs wrt timewait to handle > userspace QPs, but this depends on this change. Another problem that I see is that CMA currently seems to completely mask timewait exit. So there's no way to properly handle timewait on top of cma that I can see. -- MST From mst at mellanox.co.il Mon Aug 28 13:54:02 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 28 Aug 2006 23:54:02 +0300 Subject: [openib-general] drop mthca from svn? In-Reply-To: <44F33E0C.6010905@ichips.intel.com> References: <44F33E0C.6010905@ichips.intel.com> Message-ID: <20060828205335.GC25979@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: drop mthca from svn? > > Roland Dreier wrote: > > James> If the code is moved, how can the OpenFabrics community be > > James> guaranteed that the entire software stack will remain under > > James> a dual BSD/GPL license? > > > > You can't guarantee that someone won't come along and write some IB > > driver and get it merged upstream without a BSD license. So there's > > not much we can do anyway. > > Such a driver wouldn't be an OpenFabrics driver though. > > My only concern with this is its effect on backport patches, since most users > are not running the later kernels. I agree. However most users really want a stale kernel - so we really should take bugfixes from 2.6.18 and backport to 2.6.17.x and 2.6.16.x. I might do this eventually - but I'm too busy now. Our backport patches in OFED also solve this, in another way. > I think only using the kernel org tree is > ideal for the more mature code though. Developing code still requires a > separate repository. Right. In git each developer has his own, and they get merged as they mature. -- MST From mshefty at ichips.intel.com Mon Aug 28 14:56:50 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 14:56:50 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060828204727.GB26768@mellanox.co.il> References: <44F3522C.2050207@ichips.intel.com> <20060828204727.GB26768@mellanox.co.il> Message-ID: <44F366A2.9010207@ichips.intel.com> Michael S. Tsirkin wrote: > Another problem that I see is that CMA currently seems to completely > mask timewait exit. This is correct. > So there's no way to properly handle timewait on top of cma that I can see. I don't think so, which is what brought up the problem with Arlin. (He's using DAPL above the userspace CMA.) I'm not sure what the proper fix is for this though because of the disconnect between the CM states and the destruction of the QP. - Sean From mshefty at ichips.intel.com Mon Aug 28 15:09:39 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 15:09:39 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060828204118.GA26768@mellanox.co.il> References: <44F3522C.2050207@ichips.intel.com> <20060828204118.GA26768@mellanox.co.il> Message-ID: <44F369A3.1010506@ichips.intel.com> Michael S. Tsirkin wrote: > I believe communication id should be checked to detect duplicates. Right? Can you clarify this? Check the remote comm id of an incoming REQ against a value in timewait? > Remote QPN stale connection rule is only to avoid a case where we keep > connection in established state forever if the remote side rebooted. We can still end up keeping the connection state forever. If the remote node reboots and re-uses the QP to connect to some other system, we'll never see the stale connection. At some point, the user of the QP needs to notice that the connection isn't in use anymore. Will an RC QP automatically go into the error state if this happens? (I want to say no here...) >>A stale connection check is explicitly >>listed under the established state. This may help clarify stale connections. 
> > So we agree stale connection rule only applies if connection is in established > state? I agree. I'm trying to understand the full impact to the code (including users such as the CMA) given this interpretation, and why your patch could be failing. - Sean From rdreier at cisco.com Mon Aug 28 15:37:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 15:37:55 -0700 Subject: [openib-general] drop mthca from svn? In-Reply-To: <44F33E0C.6010905@ichips.intel.com> (Sean Hefty's message of "Mon, 28 Aug 2006 12:03:40 -0700") References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> <44F33E0C.6010905@ichips.intel.com> Message-ID: Roland> You can't guarantee that someone won't come along and Roland> write some IB driver and get it merged upstream without a Roland> BSD license. So there's not much we can do anyway. Sean> Such a driver wouldn't be an OpenFabrics driver though. Well, what is an "OpenFabrics driver" anyway? I'm interesting in writing Linux drivers to be honest. Sean> My only concern with this is its effect on backport patches, Sean> since most users are not running the later kernels. I think Sean> only using the kernel org tree is ideal for the more mature Sean> code though. Developing code still requires a separate Sean> repository. I think that using git makes it much easier for the developers, since merges with the trunk and handled far better than with svn. And in a way it's easier for testers also -- they can just merge the branches that they're interested in, without having to run some bleeding edge tree with everything under development thrown in (although -mm kernels exist for people that want to do that as well). - R. From tom at opengridcomputing.com Mon Aug 28 15:53:39 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 28 Aug 2006 17:53:39 -0500 Subject: [openib-general] drop mthca from svn? In-Reply-To: References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> <44F33E0C.6010905@ichips.intel.com> Message-ID: <1156805619.8530.9.camel@trinity.ogc.int> [...snip...] > I think that using git makes it much easier for the developers, I'm using stg on git. It's absolutely beautiful for tracking the kernel. If I had to go back to SVN, I'd shoot myself. My 2 cents... > since > merges with the trunk and handled far better than with svn. And in a > way it's easier for testers also -- they can just merge the branches > that they're interested in, without having to run some bleeding edge > tree with everything under development thrown in (although -mm kernels > exist for people that want to do that as well). > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Mon Aug 28 16:00:58 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 16:00:58 -0700 Subject: [openib-general] drop mthca from svn? In-Reply-To: Message-ID: <000101c6caf5$d3b99130$8698070a@amr.corp.intel.com> >Well, what is an "OpenFabrics driver" anyway? I'm interesting in >writing Linux drivers to be honest. It's often ignored, but OpenFabrics does include Windows. My understanding is that the requirement for lower level components is that they must be licensed using dual GPL / BSD. 
This agreement was made by all members of OpenFabrics. I'm assuming that the companies who agreed to be members of OF did so because joining provided some benefit, just as I'm assuming that including the OF code with the kernel provides some benefit. No, there's nothing that prevents someone from contributing code directly to Linux or shipping code on their own. A company just has to decide if that provides a greater benefit than integrating with OF. >I think that using git makes it much easier for the developers, since >merges with the trunk and handled far better than with svn. And in a >way it's easier for testers also -- they can just merge the branches >that they're interested in, without having to run some bleeding edge >tree with everything under development thrown in (although -mm kernels >exist for people that want to do that as well). No disagreement here. - Sean From rdreier at cisco.com Mon Aug 28 16:07:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 16:07:15 -0700 Subject: [openib-general] [PATCH v4 7/7] AMSO1100 Makefiles and Kconfig changes. In-Reply-To: <20060803210738.16572.74032.stgit@dell3.ogc.int> (Steve Wise's message of "Thu, 03 Aug 2006 16:07:38 -0500") References: <20060803210723.16572.34829.stgit@dell3.ogc.int> <20060803210738.16572.74032.stgit@dell3.ogc.int> Message-ID: I'm finally getting around to merging this up, and: > --- /dev/null > +++ b/drivers/infiniband/hw/amso1100/README > @@ -0,0 +1,11 @@ > +This is the OpenFabrics provider driver for the > +AMSO1100 1Gb RNIC adapter. > + > +This adapter is available in limited quantities > +for development purposes from Open Grid Computing. > + > +This driver requires the IWCM and CMA mods necessary > +to support iWARP. > + > +Contact tom at opengridcomputing.com for more information. > + I don't think this belongs in the drivers directory. In fact, is it worth having this in the kernel at all? How about if I just add a MAINTAINERS entry for amso1100 pointing at tom at opengridcomputing.com ? - R. From ftillier at silverstorm.com Mon Aug 28 16:12:26 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Mon, 28 Aug 2006 16:12:26 -0700 Subject: [openib-general] [openfabrics-ewg] drop mthca from svn? In-Reply-To: References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> <44F33E0C.6010905@ichips.intel.com> Message-ID: <79ae2f320608281612x59bcb366v4c52b53535745aab@mail.gmail.com> Hi Roland, On 8/28/06, Roland Dreier wrote: > Roland> You can't guarantee that someone won't come along and > Roland> write some IB driver and get it merged upstream without a > Roland> BSD license. So there's not much we can do anyway. > > Sean> Such a driver wouldn't be an OpenFabrics driver though. > > Well, what is an "OpenFabrics driver" anyway? I'm interesting in > writing Linux drivers to be honest. Mellanox is currently tracking the MTHCA code base for Windows, and moving it out of SVN could make that harder, even impossible if it were to lose the BSD license. - Fab From rdreier at cisco.com Mon Aug 28 16:16:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 16:16:52 -0700 Subject: [openib-general] [openfabrics-ewg] drop mthca from svn? 
In-Reply-To: <79ae2f320608281612x59bcb366v4c52b53535745aab@mail.gmail.com> (Fabian Tillier's message of "Mon, 28 Aug 2006 16:12:26 -0700") References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> <44F33E0C.6010905@ichips.intel.com> <79ae2f320608281612x59bcb366v4c52b53535745aab@mail.gmail.com> Message-ID: Fabian> Mellanox is currently tracking the MTHCA code base for Fabian> Windows, and moving it out of SVN could make that harder, Fabian> even impossible if it were to lose the BSD license. There's no thought of changing the license. I'm sure that would be a discussion at a much higher temperature. With that said, why would maintaining mthca exclusively in git make it harder to track? If anything I would think it would make it slightly easier, since "git log rev1..rev2 drivers/infiniband/hw/mthca" and "git diff rev1..rev2 drivers/infiniband/hw/mthca" are a lot faster than the svn equivalents. - R. From sean.hefty at intel.com Mon Aug 28 16:18:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 16:18:11 -0700 Subject: [openib-general] [PATCHES] for 2.6.19 Message-ID: <000201c6caf8$3acd21a0$8698070a@amr.corp.intel.com> Roland, Attached are 6 git patches pulled from SVN to queue for 2.6.19. They're from SVN versions: 4578 - include atomic as default QP attribute 8628 - fix reject message if GID is invalid 8434 - add dual-sided RMPP 8826 - remove unnecessary include 8827 - remove unnecessary include 9088 - randomize starting local comm id Let me know if you'd prefer these in another format (such as inline). - Sean -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0006-Randomize-the-starting-local-comm-ID-to-avoid-getting-a-rejected-connection.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0001-Enable-atomic-operations-along-with-RDMA-reads-if-a-local-RDMA-read-atomic.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0002-Set-the-reject-code-properly-when-rejecting-a-request-that-contains-an.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0003-Add-support-for-dual-sided-RMPP-transfers.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0004-The-ib_mad-module-does-not-use-a-kthread-function-but-mad_priv.h-includes.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0005-Remove-unnecessary-include.txt URL: From ftillier at silverstorm.com Mon Aug 28 16:21:50 2006 From: ftillier at silverstorm.com (Fab Tillier) Date: Mon, 28 Aug 2006 16:21:50 -0700 Subject: [openib-general] [openfabrics-ewg] drop mthca from svn? In-Reply-To: Message-ID: <000701c6caf8$be2abd50$af5aa8c0@infiniconsys.com> > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Monday, August 28, 2006 4:17 PM > > With that said, why would maintaining mthca exclusively > in git make it harder to track? If anything I would think > it would make it slightly easier, since "git log rev1..rev2 > drivers/infiniband/hw/mthca" and "git diff rev1..rev2 > drivers/infiniband/hw/mthca" are a lot faster than the > svn equivalents. Is git supported in Windows? Right now, with MTHCA in SVN, it's possible to do all development under Windows. 
I don't know jack about git, so if there's a Windows client that concern is moot. - Fab From mshefty at ichips.intel.com Mon Aug 28 16:25:58 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 16:25:58 -0700 Subject: [openib-general] [PATCH] libibcm: modify API to support multi-threaded event processing In-Reply-To: <000101c6c62d$78431cd0$c7cc180a@amr.corp.intel.com> References: <000101c6c62d$78431cd0$c7cc180a@amr.corp.intel.com> Message-ID: <44F37B86.3060203@ichips.intel.com> Sean Hefty wrote: > Modify the libibcm API to provide better support for multi-threaded > event processing. CM devices are no longer tied to verb devices > and hidden from the user. This should allow an application to direct > events to specific threads for processing. > > This patch also removes the libibcm's dependency on libsysfs. > > The changes do not break the kernel ABI, but do break the library's > API in such a way that requires (hopefully minor) changes to all > existing users. I have committed this change in revision 9128. - Sean From rdreier at cisco.com Mon Aug 28 16:30:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 16:30:36 -0700 Subject: [openib-general] [openfabrics-ewg] drop mthca from svn? In-Reply-To: <000701c6caf8$be2abd50$af5aa8c0@infiniconsys.com> (Fab Tillier's message of "Mon, 28 Aug 2006 16:21:50 -0700") References: <000701c6caf8$be2abd50$af5aa8c0@infiniconsys.com> Message-ID: Fab> Is git supported in Windows? Right now, with MTHCA in SVN, Fab> it's possible to do all development under Windows. I don't Fab> know jack about git, so if there's a Windows client that Fab> concern is moot. Yes, I believe there is a cygwin package of it, although I've never tried it personally. - R. From bugzilla-daemon at openib.org Mon Aug 28 16:39:30 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 28 Aug 2006 16:39:30 -0700 (PDT) Subject: [openib-general] [Bug 214] New: IB Stack ASSERTS while handling stale connections. Message-ID: <20060828233930.41C2A2283D8@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=214 Summary: IB Stack ASSERTS while handling stale connections. Product: OpenFabrics Windows Version: unspecified Platform: X86 OS/Version: Other Status: NEW Severity: critical Priority: P1 Component: Core AssignedTo: bugzilla at openib.org ReportedBy: pgarg at xsigo.com We are encountering a serious bug in the stack which happens while there is a stale connection in the list. Here is the call stack: ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached. 
DEFAULT_BUCKET_ID: STATUS_BREAKPOINT BUGCHECK_STR: 0x0 CURRENT_IRQL: 2 ASSERT_DATA: p_item->p_map == p_map ASSERT_FILE_LOCATION: k:\windows-openib\src\winib-461\core\complib\cl_map.c at Line 422 LAST_CONTROL_TRANSFER: from 80873046 to 8087163c STACK_TEXT: f78a2978 80873046 ffdffa40 00000003 bac99480 nt!DbgBreakPoint f78a2c60 ba82e054 ba82ded0 ba82de98 000001a6 nt!RtlAssert+0xba f78a2c98 ba90bb85 8a1741ac 8928ad54 f78a2ce4 ibbus!cl_rbmap_remove_item+0xb4 [k:\windows-openib\src\winib-461\core\complib\cl_map.c @ 422] f78a2ca8 ba9019af 8928ace8 ffdffa40 bac99480 ibbus!__remove_cep+0xb5 [k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 2825] f78a2ce4 ba900e0f 8928ace8 89b0dc00 20000001 ibbus!__process_rej+0x5ef [k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 939] f78a2d08 ba903aae 8928ace8 bac8ba5b 20000001 ibbus!__process_stale+0x10f [k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 1019] f78a2d44 ba8fd748 89b3c248 89adc5b8 f78a2d6c ibbus!__rep_handler+0x54e [k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 1436] f78a2d70 ba8b35fe 8a116008 ffffffff 89b3c248 ibbus!__cep_mad_recv_cb+0x1e8 [k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 1969] f78a2da4 ba8a8caf 8a116008 ffffffff 89adc5b8 ibbus!__mad_svc_recv_done+0xa8e [k:\windows-openib\src\winib-461\core\al\al_mad.c @ 2215] f78a2e04 ba85356b 89ba6228 89adc5b8 8a1597e0 ibbus!mad_disp_recv_done+0x130f [k:\windows-openib\src\winib-461\core\al\al_mad.c @ 1013] f78a2e34 ba852dc6 8a0c7720 89adc5b8 88deb8c8 ibbus!process_mad_recv+0x34b [k:\windows-openib\src\winib-461\core\al\kernel\al_smi.c @ 2309] f78a2ec4 ba8526eb 8a0c7720 8a1578c8 ffffffff ibbus!spl_qp_comp+0x2a6 [k:\windows-openib\src\winib-461\core\al\kernel\al_smi.c @ 2135] f78a2eec ba8683ab 8a1578c8 ffffffff 8a0c7720 ibbus!spl_qp_recv_comp_cb+0x11b [k:\windows-openib\src\winib-461\core\al\kernel\al_smi.c @ 2005] f78a2f08 bac723ca 8a1578c8 f78a2f18 00000000 ibbus!ci_ca_comp_cb+0x6b [k:\windows-openib\src\winib-461\core\al\kernel\al_ci_ca.c @ 329] f78a2f2c bac96e5f 8a1341a8 8a20f250 85000000 mthca!cq_comp_handler+0xca [c:\winib-461\hw\mthca\kernel\hca_data.c @ 329] f78a2f44 bac99701 8a159210 00000085 8a17a008 mthca!mthca_cq_completion+0xcf [c:\winib-461\hw\mthca\kernel\mthca_cq.c @ 239] f78a2f78 bac994b6 8a159210 8a159768 8a159210 mthca!mthca_eq_int+0x81 [c:\winib-461\hw\mthca\kernel\mthca_eq.c @ 328] f78a2f9c 80831cb2 8a1597e0 8a159768 00000000 mthca!mthca_tavor_dpc+0x36 [c:\winib-461\hw\mthca\kernel\mthca_eq.c @ 455] f78a2ff4 8088cf9f b94d7b1c 00000000 00000000 nt!KiRetireDpcList+0xca f78a2ff8 b94d7b1c 00000000 00000000 00000000 nt!KiDispatchInterrupt+0x3f WARNING: Frame IP not in any known module. Following frames may be wrong. 8088cf9f 00000000 0000000a bb837775 00000128 0xb94d7b1c STACK_COMMAND: kb FOLLOWUP_IP: ibbus!cl_rbmap_remove_item+b4 [k:\windows-openib\src\winib-461\core\complib\cl_map.c @ 422] ba82e054 c745e800000000 mov dword ptr [ebp-18h],0 FAULTING_SOURCE_CODE: 418: 419: CL_ASSERT( p_map ); 420: CL_ASSERT( p_map->state == CL_INITIALIZED ); 421: CL_ASSERT( p_item ); > 422: CL_ASSERT( p_item->p_map == p_map ); 423: 424: if( p_item == cl_rbmap_end( p_map ) ) 425: return; 426: 427: if( p_item->p_right == &p_map->nil ) The problem seems to be that when in function __rep_handler the following line of code fails the check if( __insert_cep( p_cep ) != p_cep ) This seems to mean we have something stale in the list. We call the function status = __process_stale( p_cep ); which calls the function __process_rej. 
__process_rej then calls __remove_cep which tries to remove the p_cep from list. We think the problem is right here. This is the pointer to the new p_cep which was never inserted in the list because the check in _insert_cep function failed. Now instead of removing the old p_cep from the list, we are removing the new one. The cl_rbmap_remove_item function doest really validate the pointer given to it and always assumes the item was in the list. This also begs the question that why was an item present in the list already. We are seeing this behavior when we try to make q-pairs to a target repeatedly i.e create a q-pair and then destroy it and then re-create it. It seems like if we recreate the q-pair within a few seconds (3) then the probelem happens and if we wait for 5-10 seconds the problem seems to go away. Is there a design limitation with the stack that a q-pair connection to the same target can not be made again with a certain time period? If yes what is the time perio. If not, what should we be doing to ensure proper cleanup? I guess even if there was a limitation there is still a bug here that the stack should be able to handle. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mshefty at ichips.intel.com Mon Aug 28 16:59:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Aug 2006 16:59:44 -0700 Subject: [openib-general] [PATCH v3] ib_sa: require SA registration In-Reply-To: <000501c6c57b$2594dd00$8698070a@amr.corp.intel.com> References: <000501c6c57b$2594dd00$8698070a@amr.corp.intel.com> Message-ID: <44F38370.7050809@ichips.intel.com> Roland, Not sure if you've had a chance to review the SA patches, but any comments on any of the SA related patches? (SA registration, generic RMPP query support, or userspace SA) - Sean From rdreier at cisco.com Mon Aug 28 17:04:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 17:04:30 -0700 Subject: [openib-general] [PATCHES] for 2.6.19 In-Reply-To: <000201c6caf8$3acd21a0$8698070a@amr.corp.intel.com> (Sean Hefty's message of "Mon, 28 Aug 2006 16:18:11 -0700") References: <000201c6caf8$3acd21a0$8698070a@amr.corp.intel.com> Message-ID: Thanks, I applied all 6 to for-2.6.19 (although I folded 0004 and 0005 into a single patch) > Let me know if you'd prefer these in another format (such as inline). I handled it all myself this time, but in the future it is easier for me if each patch is inline in a separate email. A couple of other things that would also make my life easier: - Try to keep authorship information intact -- I assume 0004 and 0005 were written by James, so I added a "From:" line with his info in it to get the write author in git. - Have the Subject: of the email be a short description with "IB/: " at the beginning, and put the patch description in the body. For example, for patch 0006, I had to munge Subject: [PATCH] Randomize the starting local comm ID to avoid getting a rejected connection due to a stale connection after a system reboot or reloading of the ib_cm. into Subject: [PATCH] IB/cm: Randomize starting comm ID Randomize the starting local comm ID to avoid getting a rejected connection due to a stale connection after a system reboot or reloading of the ib_cm. - a couple of patches had trailing whitespace, which I had to fix up by hand (since I have git set up to error out in that case, as maintainers are supposed to). Adding [apply] whitespace = error-all to your git config would help avoid this. 
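For anyone applying the same setting, the stanza quoted above goes into .git/config (per repository) or ~/.gitconfig (per user), written out as:

[apply]
        whitespace = error-all

The same value can also be set from the command line with git config (git repo-config in older releases), e.g. "git config apply.whitespace error-all". With this in place, git-apply and git-am report every whitespace error, including trailing whitespace, and refuse to apply the patch instead of only warning.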
Also, if you want to set up an externally visible git tree somewhere (I think you could probably get a kernel.org account if you asked), then I would be glad to just pull from you to get the patches. - R. From rdreier at cisco.com Mon Aug 28 17:15:35 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 Aug 2006 17:15:35 -0700 Subject: [openib-general] [PATCH v3] ib_sa: require SA registration In-Reply-To: <44F38370.7050809@ichips.intel.com> (Sean Hefty's message of "Mon, 28 Aug 2006 16:59:44 -0700") References: <000501c6c57b$2594dd00$8698070a@amr.corp.intel.com> <44F38370.7050809@ichips.intel.com> Message-ID: Sean> Roland, Not sure if you've had a chance to review the SA Sean> patches, but any comments on any of the SA related patches? Sean> (SA registration, generic RMPP query support, or userspace Sean> SA) I haven't really read the later patches but I am planning on merging at least the registration stuff for 2.6.19. - R. From swise at opengridcomputing.com Mon Aug 28 17:25:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 28 Aug 2006 19:25:36 -0500 Subject: [openib-general] [PATCH v4 7/7] AMSO1100 Makefiles and Kconfig changes. References: <20060803210723.16572.34829.stgit@dell3.ogc.int><20060803210738.16572.74032.stgit@dell3.ogc.int> Message-ID: <007301c6cb01$aacc2830$020010ac@haggard> Sounds good to me. ----- Original Message ----- From: "Roland Dreier" To: "Steve Wise" Cc: ; Sent: Monday, August 28, 2006 6:07 PM Subject: Re: [PATCH v4 7/7] AMSO1100 Makefiles and Kconfig changes. > I'm finally getting around to merging this up, and: > > > --- /dev/null > > +++ b/drivers/infiniband/hw/amso1100/README > > @@ -0,0 +1,11 @@ > > +This is the OpenFabrics provider driver for the > > +AMSO1100 1Gb RNIC adapter. > > + > > +This adapter is available in limited quantities > > +for development purposes from Open Grid Computing. > > + > > +This driver requires the IWCM and CMA mods necessary > > +to support iWARP. > > + > > +Contact tom at opengridcomputing.com for more information. > > + > > I don't think this belongs in the drivers directory. In fact, is it > worth having this in the kernel at all? > > How about if I just add a MAINTAINERS entry for amso1100 pointing at > tom at opengridcomputing.com ? > > - R. > From krause at cup.hp.com Mon Aug 28 20:18:19 2006 From: krause at cup.hp.com (Michael Krause) Date: Mon, 28 Aug 2006 20:18:19 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <1156353294.25846.32.camel@brick.pathscale.com> References: <54AD0F12E08D1541B826BE97C98F99F189E8D5@NT-SJCA-0751.brcm.ad.broadcom.com> <1156353294.25846.32.camel@brick.pathscale.com> Message-ID: <6.2.0.14.2.20060828201421.0287a2b8@esmail.cup.hp.com> At 10:14 AM 8/23/2006, Ralph Campbell wrote: >On Wed, 2006-08-23 at 09:47 -0700, Caitlin Bestler wrote: > > openib-general-bounces at openib.org wrote: > > > Quoting r. john t : > > >> Subject: basic IB doubt > > >> > > >> Hi > > >> > > >> I have a very basic doubt. Suppose Host A is doing RDMA write (say 8 > > >> MB) to Host B. When data is copied into Host B's local > > > buffer, is it guaranteed that data will be copied starting > > > from the first location (first buffer address) to the last > > > location (last buffer address)? or it could be in any order? > > > > > > Once B gets a completion (e.g. of a subsequent send), data in > > > its buffer matches that of A, byte for byte. > > > > An excellent and concise answer. That is exactly what the application > > should rely upon, and nothing else. 
With iWARP this is very explicit, > > because portions of the message not only MAY be placed out of > > order, they SHOULD be when packets have been re-ordered by the > > network. But for *any* RDMA adapter there is no guarantee on > > what order the adapter flushes things to host memory or particularly > > when old contents that may be cached are invalidated or updated. > > The role of the completion is to limit the frequency with which > > the RDMA adapter MUST guarantee coherency with application visible > > buffers. The completion not only indicates that the entire message > > was received, but that it has been entirely delivered to host memory. > >Actually, A knows the data is in B's memory when A gets the completion >notice. This is incorrect for both iWARP and IB. A completion by A only means that the receiving HCA / RNIC has the data and has generated an acknowledgement. It does not indicate that B has flushed the data to host memory. Hence, the fault zone remains the HCA / RNIC and while A may free the associated buffer for other usage, it should not rely upon the data being delivered to host memory on B. This is one of the fault scenarios I raised during the initial RDS transparent recovery assertions. If A were to issue a RDMA Read to the B targeting the associated RDMA Write memory location, then it can know the data has been placed in B's memory. > B can't rely on anything unless A uses the RDMA write with >immediate which puts a completion event in B's CQ. >Most applications on B ignore this requirement and test for the last >memory location being modified which usually works but doesn't >guarantee that all the data is in memory. B cannot rely on anything until a completion is seen either through an immediate or a subsequent Send. It is not wise to rely upon IHV-specific behaviors when designing an application as even an IHV can change things over time or due to interoperability requirements, things may not work as desired which is definitely a customer complaint that many would like to avoid. BTW, the reason immediate data is 4 bytes in length is that was what was defined in VIA. Many within the IBTA wanted to get rid of immediate data but due to the requirement to support legacy VIA applications, the immediate value was left in place. The need to support a larger value was not apparent. One needs to keep in mind where the immediate resides within the wire protocol and its usage model. The past usage was to signal a PID or some other unique identifier that could be used to comprehend which thread of execution should be informed of a particular completion event. Four bytes is sufficient to communicate such information without significantly complicating or making the wire protocol too inefficient. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Mon Aug 28 20:35:01 2006 From: krause at cup.hp.com (Michael Krause) Date: Mon, 28 Aug 2006 20:35:01 -0700 Subject: [openib-general] basic IB doubt In-Reply-To: <000301c6c7c8$688a7990$ff0da8c0@amr.corp.intel.com> References: <20060824213225.GI2962@greglaptop.hotels-on-air.de> <000301c6c7c8$688a7990$ff0da8c0@amr.corp.intel.com> Message-ID: <6.2.0.14.2.20060828202756.028bd410@esmail.cup.hp.com> At 02:58 PM 8/24/2006, Sean Hefty wrote: > >We're trying to create *inter-operable* hardware and > >software in this community. So we follow the IB standard. > >Atomic operations and RDD are optional, yet still part of the IB >"standard". 
An >application that makes use of either of these isn't guaranteed to operate with >all IB hardware. I'm not even sure that CAs are required to implement RDMA >reads. A TCA is not required to support RDMA Read. A HCA is required. It is correct that atomic and reliable datagram are optional. However, that does not mean they can be used or will not work in an interoperable manner. The movement to a software multiplexing over a RC (a technique HP delivered to some ISV years ago) may make RD obsolete from an execution perspective but that does mean it is not interoperable. As for atomics, well, they are part of IB and many within MPI would like to see their support. Their usage should also be interoperable. > >> It's up to the application to verify that the hardware that they're > >> using provides the required features, or adjust accordingly, and > >> publish those requirements to the end users. > > > >If that was being done (and it isn't), it would still be bad for the > >ecosystem as a whole. > >Applications should drive the requirements. Some poll on memory today. A lot >of existing hardware provides support for this by guaranteeing that the last >byte will always be written last. This doesn't mean that data cannot be >placed >out of order, only that the last byte is deferred. Seems much of this debate is really about how software chose to implement polling of a CQ versus polling of memory. Changing IB or iWARP semantics to compensate for what some might view as a sub-optimal implementation does not seem logical as others have been able to poll CQ without such overheads in other environments. In fact, during the definition of IB and iWARP, it was with this knowledge that we felt the need to change the semantics was not required. >Again, if a vendor wants to work with applications written this way, then this >is a feature that should be provided. If a vendor doesn't care about working >with those applications, or wants to require that the apps be rewritten, then >this feature isn't important. > >But I do not see an issue with a vendor adding value beyond what's defined >in the spec. It all comes down to how much of the solution needs to be fully interoperable and how much needs to be communicated as optional semantics. You could always define API for applications to communicate their capabilities that go beyond a specification. This is in part the logic behind an iSCSI login or SDP Hello exchange where the capabilities are communicated in a standard way so software does the right thing based on the components involved. Changing fundamentals of IB and iWARP seems a bit much when it is much easier to have the ULP provide such an exchange of capabilities if people feel they are truly required. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jackm at mellanox.co.il Tue Aug 29 01:24:45 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 29 Aug 2006 11:24:45 +0300 Subject: [openib-general] [PATCH] libibcm: Need to include stddef.h in cm.c for SLES10 compilations Message-ID: <200608291124.45816.jackm@mellanox.co.il> Fix compilation on SLES10: cm.c uses offsetof, so it must include stddef.h Signed-off-by: Jack Morgenstein Index: l/src/userspace/libibcm/src/cm.c =================================================================== --- l/src/userspace/libibcm/src/cm.c (revision 9128) +++ l/src/userspace/libibcm/src/cm.c (working copy) @@ -44,6 +44,7 @@ #include #include #include +#include #include #include From zhushisongzhu at yahoo.com Tue Aug 29 03:52:41 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Tue, 29 Aug 2006 03:52:41 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060823090653.GA5877@mellanox.co.il> Message-ID: <20060829105242.87155.qmail@web36912.mail.mud.yahoo.com> Have you resolved the problem? zhu --- "Michael S. Tsirkin" wrote: > Yes, I have reproduced the connection refusal > problem and I am looking into it. > Thanks! > > MST > > Quoting r. zhu shi song : > Subject: Re: why sdp connections cost so much memory > > I haven't met kernel crashes using rc2. But there > always occurred connection refusal when max > concurrent > connections set above 200. All is right when max > concurrent connections is set to below 200. ( If > using TCP to take the same test, there is no > problem.) > (1) > This is ApacheBench, Version 2.0.41-dev <$Revision: > 1.141 $> apache-2.0 > Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, > http://www.zeustech.net/ > Copyright (c) 1998-2002 The Apache Software > Foundation, http://www.apache.org/ > > Benchmarking www.google.com [through > 193.12.10.14:3129] (be patient) > Completed 100 requests > Completed 200 requests > apr_recv: Connection refused (111) > Total of 257 requests completed > (2) > This is ApacheBench, Version 2.0.41-dev <$Revision: > 1.141 $> apache-2.0 > Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, > http://www.zeustech.net/ > Copyright (c) 1998-2002 The Apache Software > Foundation, http://www.apache.org/ > > Benchmarking www.google.com [through > 193.12.10.14:3129] (be patient) > Completed 100 requests > Completed 200 requests > apr_recv: Connection refused (111) > Total of 256 requests completed > [root at IB-TEST squid.test]# > > zhu > > > > > --- "Michael S. Tsirkin" wrote: > > > Quoting r. zhu shi song : > > > --- "Michael S. Tsirkin" > > wrote: > > > > > > > Quoting r. zhu shi song > > : > > > > > (3) one time linux kernel on the client > > crashed. I > > > > > copy the output from the screen. 
> > > > > Process sdp (pid:4059, threadinfo > > 0000010036384000 > > > > > task 000001003ea10030) > > > > > Call > > > > > > > > > > > > Trace:{:ib_sdp:sdp_destroy_workto} > > > > > > {:ib_sdp:sdp_destroy_qp+77} > > > > > > > > > > > > > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > > > > > > > > > > > > > {worker_thread+419}{default_wake_function+0} > > > > > > > > > > > > > > > {default_wake_function+0}{keventd_create_kthread+0} > > > > > > > > > > > > > > > {worker_thread+0}{keventd_create_kthread+0} > > > > > > > > > > > > > > > {kthread+200}{child_rip+8} > > > > > > > > > > > > > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > > > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 > 31 > > ff > > > > 45 > > > > > 31 ed 4c 89 > > > > > > > > > > > > > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > > > > CR2:0000000000000004 > > > > > <0>kernel panic-not syncing:Oops > > > > > > > > > > zhu > > > > > > > > Hmm, the stack dump does not match my sources. > > Is > > > > this OFED rc1? > > > > Could you send me the sdp_main.o and > sdp_main.c > > > > files from your system please? > > > > --- > > > > > Subject: Re: why sdp connections cost so much > > memory > > > > > > please see the attachment. > > > zhu > > > > Ugh, so its crashing inside sdp_bcopy ... > > > > By the way, could you please re-test with OFED > rc2? > > We've solved a couple of bugs there ... > > > > If this still crashes, could you please post the > > whole > > sdp directory, with .o and .c files? > > > > Thanks, > > > > -- > > MST > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam > protection around > http://mail.yahoo.com > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Tue Aug 29 04:04:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 14:04:27 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060829105242.87155.qmail@web36912.mail.mud.yahoo.com> References: <20060829105242.87155.qmail@web36912.mail.mud.yahoo.com> Message-ID: <20060829110427.GA23560@mellanox.co.il> I did - this is the spec bug we are discussing with Sean. Quoting r. zhu shi song : > Subject: Re: why sdp connections cost so much memory > > Have you resolved the problem? > zhu > > --- "Michael S. Tsirkin" wrote: > > > Yes, I have reproduced the connection refusal > > problem and I am looking into it. > > Thanks! > > > > MST > > > > Quoting r. zhu shi song : > > Subject: Re: why sdp connections cost so much memory > > > > I haven't met kernel crashes using rc2. But there > > always occurred connection refusal when max > > concurrent > > connections set above 200. All is right when max > > concurrent connections is set to below 200. ( If > > using TCP to take the same test, there is no > > problem.) 
> > (1) > > This is ApacheBench, Version 2.0.41-dev <$Revision: > > 1.141 $> apache-2.0 > > Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, > > http://www.zeustech.net/ > > Copyright (c) 1998-2002 The Apache Software > > Foundation, http://www.apache.org/ > > > > Benchmarking www.google.com [through > > 193.12.10.14:3129] (be patient) > > Completed 100 requests > > Completed 200 requests > > apr_recv: Connection refused (111) > > Total of 257 requests completed > > (2) > > This is ApacheBench, Version 2.0.41-dev <$Revision: > > 1.141 $> apache-2.0 > > Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, > > http://www.zeustech.net/ > > Copyright (c) 1998-2002 The Apache Software > > Foundation, http://www.apache.org/ > > > > Benchmarking www.google.com [through > > 193.12.10.14:3129] (be patient) > > Completed 100 requests > > Completed 200 requests > > apr_recv: Connection refused (111) > > Total of 256 requests completed > > [root at IB-TEST squid.test]# > > > > zhu > > > > > > > > > > --- "Michael S. Tsirkin" wrote: > > > > > Quoting r. zhu shi song : > > > > --- "Michael S. Tsirkin" > > > wrote: > > > > > > > > > Quoting r. zhu shi song > > > : > > > > > > (3) one time linux kernel on the client > > > crashed. I > > > > > > copy the output from the screen. > > > > > > Process sdp (pid:4059, threadinfo > > > 0000010036384000 > > > > > > task 000001003ea10030) > > > > > > Call > > > > > > > > > > > > > > > > Trace:{:ib_sdp:sdp_destroy_workto} > > > > > > > > {:ib_sdp:sdp_destroy_qp+77} > > > > > > > > > > > > > > > > > > > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > > > > > > > > > > > > > > > > > > > {worker_thread+419}{default_wake_function+0} > > > > > > > > > > > > > > > > > > > > > {default_wake_function+0}{keventd_create_kthread+0} > > > > > > > > > > > > > > > > > > > > > {worker_thread+0}{keventd_create_kthread+0} > > > > > > > > > > > > > > > > > > > > > {kthread+200}{child_rip+8} > > > > > > > > > > > > > > > > > > > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > > > > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b 45 > > 31 > > > ff > > > > > 45 > > > > > > 31 ed 4c 89 > > > > > > > > > > > > > > > > > > > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > > > > > CR2:0000000000000004 > > > > > > <0>kernel panic-not syncing:Oops > > > > > > > > > > > > zhu > > > > > > > > > > Hmm, the stack dump does not match my sources. > > > Is > > > > > this OFED rc1? > > > > > Could you send me the sdp_main.o and > > sdp_main.c > > > > > files from your system please? > > > > > > --- > > > > > > > Subject: Re: why sdp connections cost so much > > > memory > > > > > > > > please see the attachment. > > > > zhu > > > > > > Ugh, so its crashing inside sdp_bcopy ... > > > > > > By the way, could you please re-test with OFED > > rc2? > > > We've solved a couple of bugs there ... > > > > > > If this still crashes, could you please post the > > > whole > > > sdp directory, with .o and .c files? > > > > > > Thanks, -- MST From thomas.bub at thomson.net Tue Aug 29 04:35:43 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Tue, 29 Aug 2006 13:35:43 +0200 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 Message-ID: Made some progress today: The file OFED-1.0.1/SOURCES/openib-1.0.1/src/userspace/libibcm/README talks about "mknod /dev/infiniband/ucm0 c 231 255" This is not working for me. 
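When the node exists but was created with the wrong numbers, a quick standalone check makes the failure mode visible. The sketch below is only an illustrative diagnostic, not part of libibcm; it stats the node discussed in this thread and tries to open it. In my experience a mismatched major/minor typically shows up as ENODEV or ENXIO, while a permissions problem shows up as EACCES:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(void)
{
	const char *path = "/dev/infiniband/ucm0";
	struct stat st;
	int fd;

	if (stat(path, &st) == 0)
		printf("%s is char device %u:%u\n", path,
		       major(st.st_rdev), minor(st.st_rdev));
	else
		printf("stat %s: %s\n", path, strerror(errno));

	fd = open(path, O_RDWR);
	if (fd < 0) {
		/* ENODEV/ENXIO usually means the node's major/minor does
		   not match what the ib_ucm module actually registered. */
		printf("open %s: %s\n", path, strerror(errno));
		return 1;
	}
	printf("open succeeded\n");
	close(fd);
	return 0;
}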
I'm getting the device /dev/infiniband/ucm0 created, but libibcm still could not open the device.

Looked into the openIB kernel sources and found that the minor number seems to be wrong in the README file. With a minor number "224" and the creation like:

"mknod /dev/infiniband/ucm0 c 231 224"

I got the device created and the libibcm could open the device.

So I assume this to be a bug, at least in the README file. I'm not the kernel expert to tell where else this has to be changed. Should I open a bug on this?

I got my first SLES10 system earlier this week; there I'm using the ofed-1.0-sles10-rpms_x86_64.tar.gz instead of the OFED-1.0.1.tgz. Interestingly enough, the /dev/infiniband/ucm0 got created and was usable automatically.

Thomas

-----Original Message-----
From: Sean Hefty [mailto:sean.hefty at intel.com]
Sent: Tuesday, August 22, 2006 5:47 PM
To: Bub Thomas; Sean Hefty
Cc: openib-general at openib.org
Subject: RE: [openib-general] libibcm can't open /dev/infiniband/ucm0

>https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3&highlight=udev

The udev information for this link looks correct.

>Or is there another way/description?

You can run mknod to manually create the file. (See the README file in the libibcm directory.)

>Additionally I didn't find the udev rules for the already existing
>/dev/infiniband devices in the /etc/udev/udev.rules file.

Is there any chance these were created manually at some point? What files are in the directory? (I would expect something like issm*, umad*, ucm*, uverbs*.)

- Sean

From dotanb at dev.mellanox.co.il Tue Aug 29 05:20:46 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 29 Aug 2006 15:20:46 +0300
Subject: [openib-general] Ib question
In-Reply-To: <44F2A341.7030407@mellnox.co.il>
References: <44F2A341.7030407@mellnox.co.il>
Message-ID: <44F4311E.2000400@dev.mellanox.co.il>

Resending the response (after a problem in my mail client):

> john t wrote:
>>
>> What is the meaning of the above fields, or where can I find the
>> definition of the above fields? Can I change the value of fields like
>> "timeout", or should it always be set to a fixed value?
> You can find the description and encoding of those values in the IB spec.
>>
>> Like TCP, I guess there would be a state transition diagram for IB (QP
>> state machine). Can someone point me to that?
> IB spec sections 10.3.1 and 11.2.4.2
>>
>> In my application, I get an error message "IBV_WC_WR_FLUSH_ERR" and
>> sometimes "IBV_WC_RETRY_EXC_ERR" while polling a CQ after posting
>> some write commands. What could be the reason for that?
> IBV_WC_WR_FLUSH_ERR means that the completion was created when the QP
> was already in error.
> IBV_WC_RETRY_EXC_ERR means that the remote side didn't respond to the
> messages that were sent (there may be several reasons for it:
> the remote QP does not exist at all, or it is configured with wrong
> parameters).
>
> I guess that you have a race in your code (and some more issues after
> you'll deal with the race ..)
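To make those two cases easier to tell apart while debugging, a minimal polling loop that inspects the returned status could look like the sketch below. It is illustrative only and assumes a CQ the application has already created with libibverbs:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Poll up to one completion and report its status.
   Returns the number of completions polled, or -1 on error. */
static int poll_and_check(struct ibv_cq *cq)
{
	struct ibv_wc wc;
	int n = ibv_poll_cq(cq, 1, &wc);

	if (n < 0) {
		fprintf(stderr, "ibv_poll_cq failed\n");
		return -1;
	}
	if (n == 0)
		return 0;	/* CQ empty */

	switch (wc.status) {
	case IBV_WC_SUCCESS:
		break;
	case IBV_WC_WR_FLUSH_ERR:
		/* The QP had already moved to the error state; an earlier
		   completion carries the original failure reason. */
		fprintf(stderr, "wr_id %llu was flushed\n",
			(unsigned long long) wc.wr_id);
		break;
	case IBV_WC_RETRY_EXC_ERR:
		/* The remote QP never answered: wrong QPN/LID/PSN, or the
		   remote side was not yet in RTR/RTS (the race above). */
		fprintf(stderr, "wr_id %llu: retries exhausted\n",
			(unsigned long long) wc.wr_id);
		break;
	default:
		fprintf(stderr, "wr_id %llu: status %d\n",
			(unsigned long long) wc.wr_id, wc.status);
		break;
	}
	return n;
}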
>> >> Also, can someone point me to a document where I can find the meaning >> of different error status values (enum ibv_wc_status) returned by >> "ibv_poll_cq" > IB spec section 11.6.2 Dotan From halr at voltaire.com Tue Aug 29 05:19:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 08:19:55 -0400 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <44ECC7F5.3000300@ichips.intel.com> References: <000901c6c582$ad09f890$8698070a@amr.corp.intel.com> <44ECC7F5.3000300@ichips.intel.com> Message-ID: <1156853995.4509.23170.camel@hal.voltaire.com> On Wed, 2006-08-23 at 17:26, Sean Hefty wrote: > Roland Dreier wrote: > > What's the plan for how this would be used? We can't let unprivileged > > userspace processes talk to the SA, because they could cause problems > > like deleting someone else's multicast group membership. And I don't > > think we want to try to do some elaborate filtering in the kernel, do we? > > The ibv_sa_send_mad() routine can only be used to issue the following methods: > > GET, SEND, GET_TABLE, GET_MULTI, and GET_TRACE_TABLE Why SEND ? In general, couldn't it be used like SET/DELETE (in addition to being used like the GET method variants) ? Also, the SA doesn't use the SEND method. -- Hal > I do check for this in the kernel, but that is the extent of any filtering > that's done. Multicast operations must go through the multicast join / free > calls, which drop into the kernel to interface with the ib_multicast module. > > I would expect that other SET / DELETE type operations would be treated similar > to how multicast is handled. > > I'm expecting that the labs will use at least the multicast interfaces based on > e-mail conversations, but without path record query support, the userspace CM > interface isn't all that useful. > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dotanb at dev.mellanox.co.il Tue Aug 29 05:21:49 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 29 Aug 2006 15:21:49 +0300 Subject: [openib-general] [PATCH] libibcm: modify API to support multi-threaded event processing In-Reply-To: <44F37B86.3060203@ichips.intel.com> References: <000101c6c62d$78431cd0$c7cc180a@amr.corp.intel.com> <44F37B86.3060203@ichips.intel.com> Message-ID: <44F4315D.4080508@dev.mellanox.co.il> Sean Hefty wrote: >> Modify the libibcm API to provide better support for multi-threaded >> event processing. CM devices are no longer tied to verb devices >> and hidden from the user. This should allow an application to direct >> events to specific threads for processing. >> >> This patch also removes the libibcm's dependency on libsysfs. >> >> The changes do not break the kernel ABI, but do break the library's >> API in such a way that requires (hopefully minor) changes to all >> existing users. > > I have committed this change in revision 9128. > There are compilation errors with this patch when using gcc 4.1.0: make all-am make[1]: Entering directory `/tmp/openib_gen2/last_stable/src/userspace/libibcm' if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I. 
-I./include -I../libibverbs/include -g -Wall -D_GNU_SOURCE -g -O2 -MT cm.lo -MD -MP -MF ".deps/cm.Tpo" -c -o cm.lo `test -f 'src/cm.c' || echo './'`src/cm.c; \ then mv -f ".deps/cm.Tpo" ".deps/cm.Plo"; else rm -f ".deps/cm.Tpo"; exit 1; fi gcc -DHAVE_CONFIG_H -I. -I. -I. -I./include -I../libibverbs/include -g -Wall -D_GNU_SOURCE -g -O2 -MT cm.lo -MD -MP -MF .deps/cm.Tpo -c src/cm.c -fPIC -DPIC -o .libs/cm.o src/cm.c: In function ‘ib_cm_destroy_id’: src/cm.c:250: warning: implicit declaration of function ‘offsetof’ src/cm.c:250: error: expected expression before ‘struct’ src/cm.c: In function ‘ib_cm_send_rep’: src/cm.c:404: error: expected expression before ‘struct’ src/cm.c: In function ‘ib_cm_ack_event’: src/cm.c:917: error: expected expression before ‘struct’ src/cm.c:921: error: expected expression before ‘struct’ src/cm.c:939: error: expected expression before ‘struct’ make[1]: *** [cm.lo] Error 1 make[1]: Leaving directory `/tmp/openib_gen2/last_stable/src/userspace/libibcm' make: *** [all] Error 2 thanks Dotan From mst at mellanox.co.il Tue Aug 29 06:09:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 16:09:08 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F3522C.2050207@ichips.intel.com> References: <44F3522C.2050207@ichips.intel.com> Message-ID: <20060829130908.GA24322@mellanox.co.il> Quoting r. Sean Hefty : > I believe that this tracking is done, and is reported to the user by the > timewait exit event. QP transitions are the responsibility of the user. > > This is related to a problem that Arlin and I have been discussing. There's > nothing that the CM does to prevent the QP from being destroyed, especially for > a usermode application. The CM invokes a callback once a connection exits > timewait, indicating to the user that the QP may now be destroyed. But if an > application crashes, uverbs automatically destroys the QP. > > We may need better coordination between the CM and verbs wrt timewait to handle > userspace QPs, but this depends on this change. And userspace is not the only one affected - CMA also is missing timewait handling, and it is quite hard to fit one there. Here's an idea: how about we move the whole timewait thing to low level driver, starting timer automatically upon QP destroy? At least in mthca, it makes sense: actual QP in reset state takes up resources, while all we need for QP in timewait is a slot in QP table, to prevent the QP number from being reused. So we'll be saving a lot of memory and all ULPs will be much simpler since they'll be able to just destroy the QP and forget about it, without headache of TIMEWAIT_EXIT events. This is actually very easy to implement: all we need is a per-device list of QPNs to free, and a work structure to schedule delayed work on QP destroy. Roland, what do you say? -- MST From jlentini at netapp.com Tue Aug 29 06:25:35 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 29 Aug 2006 09:25:35 -0400 (EDT) Subject: [openib-general] [openfabrics-ewg] drop mthca from svn? 
In-Reply-To: References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> <44F33E0C.6010905@ichips.intel.com> <79ae2f320608281612x59bcb366v4c52b53535745aab@mail.gmail.com> Message-ID: On Mon, 28 Aug 2006, Roland Dreier wrote: > Fabian> Mellanox is currently tracking the MTHCA code base for > Fabian> Windows, and moving it out of SVN could make that harder, > Fabian> even impossible if it were to lose the BSD license. > > There's no thought of changing the license. I'm sure that would be a > discussion at a much higher temperature. > > With that said, why would maintaining mthca exclusively in git make it > harder to track? If anything I would think it would make it slightly > easier, since "git log rev1..rev2 drivers/infiniband/hw/mthca" and > "git diff rev1..rev2 drivers/infiniband/hw/mthca" are a lot faster > than the svn equivalents. OpenFabrics can provide git repositories. If OpenFabrics.org had git repositories, why would we remove the code from the OpenFabrics.org? From swise at opengridcomputing.com Tue Aug 29 06:47:44 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 29 Aug 2006 08:47:44 -0500 Subject: [openib-general] [PATCH DAPLTEST] - compile failure on FC5/X86_64 Message-ID: <1156859264.31129.12.camel@stevo-desktop> Dunno if this is the correct fix for all platforms/distros, but it worked for me on FC5/X86_64... CLK_TCK wasn't getting defined for this distro... Signed-off-by: Steve Wise --- Index: test/dapltest/mdep/linux/dapl_mdep_user.c =================================================================== --- test/dapltest/mdep/linux/dapl_mdep_user.c (revision 9096) +++ test/dapltest/mdep/linux/dapl_mdep_user.c (working copy) @@ -178,7 +178,7 @@ { struct tms ts; clock_t t = times (&ts); - return (unsigned long) ((DAT_UINT64) t * 1000 / CLK_TCK); + return (unsigned long) ((DAT_UINT64) t * 1000 / CLOCKS_PER_SEC); } double From tziporet at dev.mellanox.co.il Tue Aug 29 07:15:37 2006 From: tziporet at dev.mellanox.co.il (Tziporet Cohen) Date: Tue, 29 Aug 2006 17:15:37 +0300 Subject: [openib-general] [openfabrics-ewg] drop mthca from svn? In-Reply-To: References: <1156356250.10010.22.camel@sardonyx> <44EC99F1.9060102@ichips.intel.com> <1156445387.17908.6.camel@chalcedony.pathscale.com> <44F33E0C.6010905@ichips.intel.com> <79ae2f320608281612x59bcb366v4c52b53535745aab@mail.gmail.com> Message-ID: <44F44C09.2050606@dev.mellanox.co.il> James Lentini wrote: > If OpenFabrics.org had git repositories, why would we remove the code > from the OpenFabrics.org? > > > Matt from Sandia is checking this since I also asked him to provide git repository for OFED release. Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: From tziporet at dev.mellanox.co.il Tue Aug 29 07:21:35 2006 From: tziporet at dev.mellanox.co.il (Tziporet Cohen) Date: Tue, 29 Aug 2006 17:21:35 +0300 Subject: [openib-general] OFED 1.1-rc3 is delayd in 1-2 days Message-ID: <44F44D6F.9010505@dev.mellanox.co.il> Hi All, RC3 will not be available today. There are 2 items that are gating this: 1. SDP last issue with CM resolution 2. ipath patches update So if everything goes well it will be available tomorrow or on Thursday. 
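A side note on the dapltest CLK_TCK change above: times() is scaled by the kernel tick rate, which is sysconf(_SC_CLK_TCK), not by CLOCKS_PER_SEC (the latter is the scale for clock() and is fixed at 1000000 on XSI systems). Where CLK_TCK is not defined, something along these lines keeps the millisecond conversion correct; this is only an illustrative sketch with made-up names, not the committed dapltest code:

#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>

/* Milliseconds derived from times(), using the real tick rate
   rather than CLOCKS_PER_SEC. */
static unsigned long get_time_ms(void)
{
	struct tms ts;
	clock_t t = times(&ts);
	long ticks_per_sec = sysconf(_SC_CLK_TCK);

	if (ticks_per_sec <= 0)
		ticks_per_sec = 100;	/* conservative fallback */
	return (unsigned long)
		((unsigned long long) t * 1000 / ticks_per_sec);
}

int main(void)
{
	printf("elapsed ms since an arbitrary point: %lu\n", get_time_ms());
	return 0;
}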
Tziporet From tziporet at dev.mellanox.co.il Tue Aug 29 07:49:32 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 29 Aug 2006 17:49:32 +0300 Subject: [openib-general] problems to regiser memory as a reglar user on SLES9 SP3 Message-ID: <44F453FC.4070300@dev.mellanox.co.il> Hi All, In testing today we found that on SLES9 SP3 memory locking as a regular user fails. Although I changed /etc/security/limits.conf and added the following two lines: * soft memlock * hard memlock Note that same change does work in SLES10. Another change I tried (that worked in gen1) was to add the following line to the file/etc/sysctl.conf: vm.disable_cap_mlock=1. However nothing helped in SLES9 Does anyone have any idea how to solve this? Thanks, Tziporet From kliteyn at mellanox.co.il Tue Aug 29 07:50:55 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 29 Aug 2006 17:50:55 +0300 Subject: [openib-general] [PATCH] osm: TRIVIAL fix in usage message Message-ID: Hi Hal. This patch is just fixing some error in the OSM usage message. Instead of old '-vf' option, there should be '-D'. Thanks. Yevgeny Signed-off-by: Yevgeny Kliteynik Index: osm/opensm/main.c =================================================================== --- osm/opensm/main.c (revision 9140) +++ osm/opensm/main.c (working copy) @@ -254,13 +254,13 @@ show_usage(void) " This option increases the log verbosity level.\n" " The -v option may be specified multiple times\n" " to further increase the verbosity level.\n" - " See the -vf option for more information about.\n" + " See the -D option for more information about.\n" " log verbosity.\n\n" ); printf( "-V\n" " This option sets the maximum verbosity level and\n" " forces log flushing.\n" - " The -V is equivalent to '-vf 0xFF -d 2'.\n" - " See the -vf option for more information about.\n" + " The -V is equivalent to '-D 0xFF -d 2'.\n" + " See the -D option for more information about.\n" " log verbosity.\n\n" ); printf( "-D \n" " This option sets the log verbosity level.\n" From jackm at dev.mellanox.co.il Tue Aug 29 08:07:05 2006 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 29 Aug 2006 18:07:05 +0300 Subject: [openib-general] [PATCH] libmthca: include stddef.h in mthca.h for SLES10 compilations Message-ID: <200608291807.06146.jackm@dev.mellanox.co.il> Fix compilation on SLES10 RC2: mthca.h uses offsetof so it must include stddef.h Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: l/src/userspace/libmthca/src/mthca.h =================================================================== --- l/src/userspace/libmthca/src/mthca.h (revision 7569) +++ l/src/userspace/libmthca/src/mthca.h (working copy) @@ -36,6 +36,7 @@ #ifndef MTHCA_H #define MTHCA_H +#include #include #include From krause at cup.hp.com Tue Aug 29 07:53:55 2006 From: krause at cup.hp.com (Michael Krause) Date: Tue, 29 Aug 2006 07:53:55 -0700 Subject: [openib-general] A critique of RDMA PUT/GET in HPC In-Reply-To: <20060825155652.GG8380@greglaptop.hotels-on-air.de> References: <20060824225343.GD3927@greglaptop.hotels-on-air.de> <1156518781.25769.22.camel@trinity.ogc.int> <20060825155652.GG8380@greglaptop.hotels-on-air.de> Message-ID: <6.2.0.14.2.20060829072937.02b2fa90@esmail.cup.hp.com> At 08:56 AM 8/25/2006, Greg Lindahl wrote: >On Fri, Aug 25, 2006 at 10:13:01AM -0500, Tom Tucker wrote: > > > He does say this, but his analysis does not support this conclusion. His > > analysis revolves around MPI send/recv, not the MPI 2.0 get/put > > services. 
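As a concrete illustration of the persistently registered envelope idea, a rough libibverbs sketch is below. It is not taken from any particular MPI; the pool sizes are made up and it assumes an ibv_pd the application has already allocated. The point is that ibv_reg_mr is paid once for the whole pool, so the per-message cost is only picking a free slot:

#include <stdlib.h>
#include <infiniband/verbs.h>

#define ENVELOPE_SIZE 1024
#define NUM_ENVELOPES 64

struct envelope_pool {
	void          *buf;	/* NUM_ENVELOPES contiguous 1KB slots */
	struct ibv_mr *mr;	/* registered once, reused for every send */
};

static struct envelope_pool *pool_create(struct ibv_pd *pd)
{
	struct envelope_pool *p = malloc(sizeof *p);

	if (!p)
		return NULL;
	p->buf = calloc(NUM_ENVELOPES, ENVELOPE_SIZE);
	if (!p->buf)
		goto err_buf;
	p->mr = ibv_reg_mr(pd, p->buf, NUM_ENVELOPES * ENVELOPE_SIZE,
			   IBV_ACCESS_LOCAL_WRITE);
	if (!p->mr)
		goto err_mr;
	return p;

err_mr:
	free(p->buf);
err_buf:
	free(p);
	return NULL;
}

static void pool_destroy(struct envelope_pool *p)
{
	ibv_dereg_mr(p->mr);
	free(p->buf);
	free(p);
}

Each send then points its scatter/gather entry at one slot using p->mr->lkey, while large payloads go through a separately registered region with RDMA, which matches the split described above.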
> >Nobody uses MPI put/get anyway, so leaving out analyzing that doesn't >change reality much. Is this due to legacy or other reasons? One reason cited from Winsocks Direct for using the bcopy vs. the RDMA zcopy operations was the cost to register memory if done on a per operation basis, i.e. single use. The bcopy threshold was ~9KB. With the new verbs developed for iWARP and then added to IB v1.2, the bcopy threshold was reduced to ~1KB. Now, if I recall correctly, many MPI implementations split their buffer usage between what are often 1KB envelopes and what are large regions. One can persistently register the envelopes so their size does not really matter and thus could use send / receive or RDMA semantics for their update depending upon how the completions are managed. The larger data movements can be RDMA semantics if desired as these are typically large in size. > > A valid conclusion IMO is that "MPI send/recv can > > be most efficiently implemented over an unconnected reliable datagram > > protocol that supports 64bit tag matching at the data sink." And not > > coincidentally, Myricom has this ;-) > >As do all of the non-VIA-family interconnects he mentions. Since "we" >all landed on the same conclusion, you might think we're on to >something. Or not. We've had this argument multiple times and examined all of the known and relatively volume usage models which includes the suite of MPI benchmarks used to evaluate and drive implementations. Any interconnect architecture is one of compromise if it is to be used in a volume environment - the goal for the architects is to insure the compromises do not result in a brain-dead or too diminished technology that will not meet customer requirements. With respect to reliable datagram, unless one does software multiplexing over what amounts to a reliable connection which comes with a performance penalty as well as complexity in terms of error recover, etc. logic it really does not buy one anything better than a RC model used today. Given the application mix and the customer usage model, IB provided four transport types to meet different application needs and allow people to make choices. iWARP reduced this to one since the target applications really were met with RC and reliable datagram as defined in IB simply was not being picked up or demanded by the targeted ISV. While some of us had argued for the software multiplex model, others wanted everything to be implemented in hardware so IB is what it is today. In any case, it is one of a set of reasonable compromises and for the most part, I contend it is difficult to argue that these interconnect technologies are so compromised that they are brain dead or broken. >However, that's only part of the argument. Another part is that the >buffer space needed to use RDMA put/get for all data links is huge. >And there are some other interesting points. The buffer and context differences to track RDMA vs. Send are not significant in terms of hardware. In terms of software, memory needs to be registered in some capacity to perform DMA to it and hence, there is a cost from the OS / application perspective. Our goals were to be able to use application buffers to provide zero copy data movements as well as OS bypass. RDMA vs. Send does not incrementally differ in terms of resource costs in the end. > > I DO agree that it is interesting reading. :-), it's definitely got > > people fired up. > >Heh. Glad you found it interesting. 
The article is somewhat interesting but does not really present anything novel in this on-going debate on how interconnects should be designed. There will always be someone pointing out a particular issue here and there and in the end, many of these amount to mouse nuts when placed into the larger context. When they don't, a new interconnect is defined or extensions are made to compensate as nothing is ever permanent or perfect. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Aug 29 08:35:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 11:35:36 -0400 Subject: [openib-general] [PATCH] osm: fix memory leak in vendor ibumad Message-ID: <1156865725.4509.28892.camel@hal.voltaire.com> Hi Eitan, > Hi Hal > These are two trivial fixes for memory leaks in the > ibumad vendor. > Thanks > Eitan > Signed-off-by: Eitan Zahavi Thanks. Applied (to both trunk and 1.1). -- Hal From halr at voltaire.com Tue Aug 29 08:46:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 11:46:05 -0400 Subject: [openib-general] [PATCH] osm: OSM fails to bind - TRIVIAL message addition In-Reply-To: References: Message-ID: <1156866364.4509.29209.camel@hal.voltaire.com> On Sun, 2006-08-27 at 06:16, Yevgeny Kliteynik wrote: > Hi Hal > > This patch just makes the error message more informative for user, > since another instance of running SM is most probably the reason > why osm_opensm_bind failed. > > Yevgeny > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to both trunk and 1.1). -- Hal From sean.hefty at intel.com Tue Aug 29 08:49:22 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 08:49:22 -0700 Subject: [openib-general] [PATCH] libibcm: modify API to support multi-threaded event processing In-Reply-To: <44F4315D.4080508@dev.mellanox.co.il> Message-ID: <000901c6cb82$b2caf3f0$dcc8180a@amr.corp.intel.com> >There are compilation errors with this patch when using gcc 4.1.0: Hmmm... I will look into this. - Sean From sean.hefty at intel.com Tue Aug 29 08:58:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 08:58:01 -0700 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <1156853995.4509.23170.camel@hal.voltaire.com> Message-ID: <000a01c6cb83$e846dbb0$dcc8180a@amr.corp.intel.com> >Why SEND ? In general, couldn't it be used like SET/DELETE (in addition >to being used like the GET method variants) ? Also, the SA doesn't use >the SEND method. The latest version of the patch only allows GET or GET_TABLE for PathRecords ServiceRecords, and MCMemberRecords, and GET_MULTI for MultiPath queries if using the default access mode. Raw access mode doesn't filter the request, but is intended for privileged applications. - Sean From sean.hefty at intel.com Tue Aug 29 09:12:24 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 09:12:24 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060829130908.GA24322@mellanox.co.il> Message-ID: <000b01c6cb85$ea7828b0$dcc8180a@amr.corp.intel.com> >Here's an idea: >how about we move the whole timewait thing to low level driver, >starting timer automatically upon QP destroy? I've thought about this too, and I think this may end up making the most sense. How would the driver determine how long the QP should remain in timewait, and how would you sync that with the CM's timewait? 
(Userspace QPs are often connected using TCP, rather than through the IB CM.) - Sean From mst at mellanox.co.il Tue Aug 29 09:14:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 19:14:50 +0300 Subject: [openib-general] [PATCH] libibcm: modify API to support multi-threaded event processing In-Reply-To: <000901c6cb82$b2caf3f0$dcc8180a@amr.corp.intel.com> References: <000901c6cb82$b2caf3f0$dcc8180a@amr.corp.intel.com> Message-ID: <20060829161450.GA26963@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] libibcm: modify API to support multi-threaded event processing > > >There are compilation errors with this patch when using gcc 4.1.0: > > > Hmmm... I will look into this. I think offsetof is defined in stddef.h, so you must include that. -- MST From sean.hefty at intel.com Tue Aug 29 09:16:45 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 09:16:45 -0700 Subject: [openib-general] [PATCHES] for 2.6.19 In-Reply-To: Message-ID: <000c01c6cb86$86253780$dcc8180a@amr.corp.intel.com> >I handled it all myself this time, but in the future it is easier for >me if each patch is inline in a separate email. A couple of other >things that would also make my life easier: That's not a problem. I think in the past I've just referred you to the svn revision numbers. I was just trying to pull out the svn patches, apply them to git, and send you the git-format-patch output instead. I didn't clean-up that output. >Also, if you want to set up an externally visible git tree somewhere >(I think you could probably get a kernel.org account if you asked), >then I would be glad to just pull from you to get the patches. I will look into getting an externally visible git tree. Thanks, - Sean From sashak at voltaire.com Tue Aug 29 09:32:43 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 29 Aug 2006 19:32:43 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <20060826184150.GB21168@mellanox.co.il> References: <20060825141704.GA3867@sashak.voltaire.com> <20060826184150.GB21168@mellanox.co.il> Message-ID: <20060829163243.GA12948@sashak.voltaire.com> On 21:41 Sat 26 Aug , Michael S. Tsirkin wrote: > Quoting r. Sasha Khapyorsky : > > Subject: Re: [openib-general] [PATCH] osm: handle local events > > > > On 16:28 Thu 24 Aug , Michael S. Tsirkin wrote: > > > Quoting r. Yevgeny Kliteynik : > > > > Index: libvendor/osm_vendor_ibumad.c > > > > =================================================================== > > > > --- libvendor/osm_vendor_ibumad.c (revision 8998) > > > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > > > @@ -72,6 +72,7 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > > > > > /****s* OpenSM: Vendor AL/osm_umad_bind_info_t > > > > * NAME > > > > > > NAK. > > > > > > This means that the SM becomes dependent on the uverbs module. I don't think > > > this is a good idea. Let's not go there - SM should depend just on the umad > > > module and libc. > > > > Agree on this point. I dislike this new libibverbs dependency too. I > > think we need to work with umad. > > > > So more generic question: some application performs blocked read() from > > /dev/umadN, should this read() be interrupted and return error (with > > appropriate errno value), then the port state becomes DOWN? > > I think yes, it should. Other opinions? Sean? > > One thing seems obvious: if device goes away it seems obvious that we should > return ENODEV from any read. Isn't this already done? 
I think it does nothing when the port becomes down (or the driver sends PORT_ERR event). > > > > > And if yes, then in OpenSM we will need just to check errno value upon > > umad_recv() failure. > > > > Sasha > > Might be a good idea. Hoever, such an approach by default is an ABI change so it > could break some apps. It should not change ABI, only slight change for read()/write() behaviors. I think more important question is "is proposed umad behavior right?" > Could this be made an option somehow? With fcntl() (or similar)? > Think also about > a race where I read *after* the state was changed to DOWN. Which race? If blocked read() starts on DOWN port we may decide how to handle this, I may think about two options: or return error, or work as usual.. Both may be acceptable (but later does not required any changes). > > Another question comes to mind: > does not SM care about physical link state changes as well? Not explicitly AFAIK. > Assuming I disconnect and re-connect the cable, does not > SM want to know and try bringing the logical link up? AFAIK OpenSM does not detect the local port disconnection explicitly (however will detect this at next periodic sweep if such configured), but it will get report from the switch when the local port will be re-connected back. Sasha > How is this handled currently? Can the same mechanism be used > for port state events? > > -- > MST From mst at mellanox.co.il Tue Aug 29 09:33:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 19:33:44 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <000b01c6cb85$ea7828b0$dcc8180a@amr.corp.intel.com> References: <20060829130908.GA24322@mellanox.co.il> <000b01c6cb85$ea7828b0$dcc8180a@amr.corp.intel.com> Message-ID: <20060829163344.GA27121@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > >Here's an idea: > >how about we move the whole timewait thing to low level driver, > >starting timer automatically upon QP destroy? > > I've thought about this too, and I think this may end up making the most sense. > How would the driver determine how long the QP should remain in timewait, Need to look into this - likely we can just add a call for that. Roland? > and > how would you sync that with the CM's timewait? There's no need for CM or any other entity to track timewait QP in this setup - we can remove TIMEWAIT_EXIT event or pass it up immediately. The role of timewait as I see it is to prevent reuse of the QP number too soon. If verbs do that, CM does not need to. > (Userspace QPs are often > connected using TCP, rather than through the IB CM.) These actually want timewait too, they just don't know it :) -- MST From mshefty at ichips.intel.com Tue Aug 29 09:35:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 09:35:34 -0700 Subject: [openib-general] [PATCH] libibcm: modify API to support multi-threaded event processing In-Reply-To: <20060829161450.GA26963@mellanox.co.il> References: <000901c6cb82$b2caf3f0$dcc8180a@amr.corp.intel.com> <20060829161450.GA26963@mellanox.co.il> Message-ID: <44F46CD6.3060508@ichips.intel.com> Michael S. Tsirkin wrote: > I think offsetof is defined in stddef.h, so you must include that. Dotan, Can you see if adding this include works for you? I just re-tested the build on my system, and it worked fine without it (gcc 3.3.3). Jack posted a patch for this earlier if you need one. 
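For anyone who wants to reproduce the failure in isolation, a minimal test case is below (illustrative only, the struct is made up). offsetof is a macro that <stddef.h> provides, so if nothing includes that header, the compiler reports an undeclared identifier followed by the "expected expression before 'struct'" errors seen in the build log; on older toolchains another header presumably pulled stddef.h in indirectly, which is why the build happened to work there:

#include <stddef.h>
#include <stdio.h>

struct cm_example {
	int  fd;
	char data[16];
};

int main(void)
{
	/* Without <stddef.h> this line is what breaks the build. */
	printf("offsetof(struct cm_example, data) = %zu\n",
	       offsetof(struct cm_example, data));
	return 0;
}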
- Sean From mshefty at ichips.intel.com Tue Aug 29 09:39:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 09:39:31 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060829163344.GA27121@mellanox.co.il> References: <20060829130908.GA24322@mellanox.co.il> <000b01c6cb85$ea7828b0$dcc8180a@amr.corp.intel.com> <20060829163344.GA27121@mellanox.co.il> Message-ID: <44F46DC3.8050406@ichips.intel.com> Michael S. Tsirkin wrote: >>I've thought about this too, and I think this may end up making the most sense. >>How would the driver determine how long the QP should remain in timewait, > > > Need to look into this - likely we can just add a call for that. > Roland? The Intel gen1 code passed this into the modify QP call for either reset or error. My question was more, does the driver have enough information to calculate the timewait duration based on other modify QP parameters? I don't know that you want the timewait duration determined by a userspace app. > There's no need for CM or any other entity to track timewait QP in this setup - > we can remove TIMEWAIT_EXIT event or pass it up immediately. The CM still needs to be able to respond to duplicate DREQs, so it will be tracking timewait internally anyway. - Sean From mst at mellanox.co.il Tue Aug 29 09:39:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 19:39:13 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <20060829163243.GA12948@sashak.voltaire.com> References: <20060825141704.GA3867@sashak.voltaire.com> <20060826184150.GB21168@mellanox.co.il> <20060829163243.GA12948@sashak.voltaire.com> Message-ID: <20060829163913.GB27121@mellanox.co.il> Quoting r. Sasha Khapyorsky : > > Assuming I disconnect and re-connect the cable, does not > > SM want to know and try bringing the logical link up? > > AFAIK OpenSM does not detect the local port disconnection explicitly > (however will detect this at next periodic sweep if such configured), Wouldn't it be a good idea though? > but > it will get report from the switch when the local port will be > re-connected back. But only AFTER it brings the logical link up, by which time it's already not relevant. -- MST From halr at voltaire.com Tue Aug 29 09:42:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 12:42:45 -0400 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <000a01c6cb83$e846dbb0$dcc8180a@amr.corp.intel.com> References: <000a01c6cb83$e846dbb0$dcc8180a@amr.corp.intel.com> Message-ID: <1156869763.4509.30929.camel@hal.voltaire.com> On Tue, 2006-08-29 at 11:58, Sean Hefty wrote: > >Why SEND ? In general, couldn't it be used like SET/DELETE (in addition > >to being used like the GET method variants) ? Also, the SA doesn't use > >the SEND method. > > The latest version of the patch only allows GET or GET_TABLE for PathRecords > ServiceRecords, and MCMemberRecords, and GET_MULTI for MultiPath queries if > using the default access mode. OK. So shouldn't IBV_SA_METHOD_SEND be removed from sa_net.h ? 
Index: libibsa/include/infiniband/sa_net.h =================================================================== --- libibsa/include/infiniband/sa_net.h (revision 0) +++ libibsa/include/infiniband/sa_net.h (revision 0) +enum { + IBV_SA_METHOD_GET = 0x01, + IBV_SA_METHOD_SET = 0x02, + IBV_SA_METHOD_GET_RESP = 0x81, + IBV_SA_METHOD_SEND = 0x03, > Raw access mode doesn't filter the request, but > is intended for privileged applications. I'm unclear about what you mean in terms of the relation of this to my original comment. By raw access, do you mean SEND_MAD operation ? How do those applications gain this privilege ? -- Hal > - Sean From mst at mellanox.co.il Tue Aug 29 09:49:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 19:49:56 +0300 Subject: [openib-general] [PATCH] libibcm: modify API to support multi-threaded event processing In-Reply-To: <44F46CD6.3060508@ichips.intel.com> References: <44F46CD6.3060508@ichips.intel.com> Message-ID: <20060829164956.GA27478@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] libibcm: modify API to support multi-threaded event processing > > Michael S. Tsirkin wrote: > > I think offsetof is defined in stddef.h, so you must include that. > > Dotan, > > Can you see if adding this include works for you? I just re-tested the build on > my system, and it worked fine without it (gcc 3.3.3). Jack posted a patch for > this earlier if you need one. Just tested on a system similiar to Dotan's and this seems to help. -- MST From mshefty at ichips.intel.com Tue Aug 29 10:01:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 10:01:44 -0700 Subject: [openib-general] [PATCH] libibsa: userspace SA query and multicast support In-Reply-To: <1156869763.4509.30929.camel@hal.voltaire.com> References: <000a01c6cb83$e846dbb0$dcc8180a@amr.corp.intel.com> <1156869763.4509.30929.camel@hal.voltaire.com> Message-ID: <44F472F8.3020606@ichips.intel.com> Hal Rosenstock wrote: > OK. So shouldn't IBV_SA_METHOD_SEND be removed from sa_net.h ? I was just defining the well known methods. I can remove this. > By raw access, do you mean SEND_MAD operation ? > > How do those applications gain this privilege ? The kernel module exports two files to perform operations on: ib_usa_raw and ib_usa_default. An administrator can control access to these files however they like. Operations performed through ib_usa_default are limited to specific SA queries. Sends performed through ib_usa_raw are not filtered in the kernel, so behave more like sending requests through the userspace MAD interfaces. - Sean From mshefty at ichips.intel.com Tue Aug 29 10:05:10 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 10:05:10 -0700 Subject: [openib-general] [PATCH] libibcm: Need to include stddef.h in cm.c for SLES10 compilations In-Reply-To: <200608291124.45816.jackm@mellanox.co.il> References: <200608291124.45816.jackm@mellanox.co.il> Message-ID: <44F473C6.9040204@ichips.intel.com> Jack Morgenstein wrote: > Fix compilation on SLES10: > cm.c uses offsetof, so it must include stddef.h Thanks - committed in 9150. - Sean From sean.hefty at intel.com Tue Aug 29 10:16:27 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 10:16:27 -0700 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 In-Reply-To: Message-ID: <000d01c6cb8e$dd406fa0$dcc8180a@amr.corp.intel.com> >Looked into the openIB kernel sources and found that the minor number >seems to be wrong in the README file. 
With a minor number "224" and the >creation like: > > "mknod /dev/infiniband/ucm0 c 231 224" The README file was never updated when the userspace CM added per device handling. I've updated this in SVN on the trunk. Thanks. >So I assume this to be a bug at least in the README file. I'm not the >kernel expert to tell where else this has to be changed. Only the README file should need updating. >Should I open a bug on this? For the OFED release, it's probably best to open a bug on this. - Sean From mst at mellanox.co.il Tue Aug 29 10:50:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 20:50:14 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F46DC3.8050406@ichips.intel.com> References: <20060829130908.GA24322@mellanox.co.il> <000b01c6cb85$ea7828b0$dcc8180a@amr.corp.intel.com> <20060829163344.GA27121@mellanox.co.il> <44F46DC3.8050406@ichips.intel.com> Message-ID: <20060829175014.GA28200@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > >>I've thought about this too, and I think this may end up making the most sense. > >>How would the driver determine how long the QP should remain in timewait, > > > > > > Need to look into this - likely we can just add a call for that. > > Roland? > > The Intel gen1 code passed this into the modify QP call for either reset or > error. My question was more, does the driver have enough information to > calculate the timewait duration based on other modify QP parameters? I don't > know that you want the timewait duration determined by a userspace app. > > > There's no need for CM or any other entity to track timewait QP in this setup - > > we can remove TIMEWAIT_EXIT event or pass it up immediately. > > The CM still needs to be able to respond to duplicate DREQs, so it will be Correct of course - I should have said that CM will only need to track he remote QP/ID and not the local QP, therefore there won't be need for synchronisation beetween timewait in verbs layer and CM. -- MST From sashak at voltaire.com Tue Aug 29 11:15:35 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 29 Aug 2006 21:15:35 +0300 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156717686.15782.93.camel@fc6.xsintricity.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> Message-ID: <20060829181535.GE12948@sashak.voltaire.com> On 18:28 Sun 27 Aug , Doug Ledford wrote: > On Sun, 2006-08-20 at 20:18 +0300, Sasha Khapyorsky wrote: > > On 13:01 Sun 20 Aug , Hal Rosenstock wrote: > > > Hi Sasha, > > > > > > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote: > > > > In case when OpenSM log file overflows filesystem and write() fails with > > > > 'No space left on device' try to truncate the log file and wrap-around > > > > logging. > > > > > > Should it be an (admin) option as to whether to truncate the file or not > > > or is there no way to continue without logging (other than this) once > > > the log file fills the disk ? > > > > In theory OpenSM may continue, but don't think it is good idea to leave > > overflowed disk on the SM machine (by default it is '/var/log'). For me > > truncating there looks as reasonable default behavior, don't think we > > need the option. 
> > I would definitely put the option in, and in fact would default it to > *NOT* truncate. If the disk is full, you have no idea why. It *might* > be your logs, or it might be a mail bomb filling /var/spool/mail. I'm > sure as an admin the last thing I would want is my apps deciding, based > upon incomplete information, that wiping out their log files is the > right thing to do. To me that sounds more like an intruder covering his > tracks than a reasonable thing to do when confronted with ENOSPC. > > Truncating logs is something best left up to the admin that's dealing > with the disk full problem in the first place. After all, if it is > something like an errant app filling the mail spool, truncating the logs > just looses valuable logs while at the same time making room for the app > to keep on adding more to /var/spool/mail. That's just wrong. If you > run out of space, just quit logging things until the admin clears the > problem up. If you put this code in, make the admin turn it on. That > will keep opensm friendly to appliance like devices that are single task > subnet managers. But I don't think having this patch always on makes > any sense on a multi task server. My expectation is that when OpenSM is running it will generate ENOSPC more frequently than mail bombs, or other activities. But I see your point - don't take this control from an admin... I will do this ENOSPC handling optional - actually there is another patch was submitted, there is the option which limits OpenSM log file size. Will add ENOSPC processing under same option. Hal, I will resend the patch soon. Sasha > > > > > > > See comment below as well. > > > > > > -- Hal > > > > > > > Signed-off-by: Sasha Khapyorsky > > > > --- > > > > > > > > osm/opensm/osm_log.c | 23 +++++++++++++++-------- > > > > 1 files changed, 15 insertions(+), 8 deletions(-) > > > > > > > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c > > > > index 668e9a6..b4700c8 100644 > > > > --- a/osm/opensm/osm_log.c > > > > +++ b/osm/opensm/osm_log.c > > > > @@ -58,6 +58,7 @@ #include > > > > #include > > > > #include > > > > #include > > > > +#include > > > > > > > > #ifndef WIN32 > > > > #include > > > > @@ -152,6 +153,7 @@ #endif > > > > cl_spinlock_acquire( &p_log->lock ); > > > > #ifdef WIN32 > > > > GetLocalTime(&st); > > > > + _retry: > > > > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", > > > > st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, > > > > pid, buffer); > > > > @@ -159,6 +161,7 @@ #ifdef WIN32 > > > > #else > > > > pid = pthread_self(); > > > > tim = time(NULL); > > > > + _retry: > > > > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", > > > > ((result.tm_mon < 12) && (result.tm_mon >= 0) ? > > > > month_str[result.tm_mon] : "???"), > > > > @@ -166,6 +169,18 @@ #else > > > > result.tm_min, result.tm_sec, > > > > usecs, pid, buffer); > > > > #endif /* WIN32 */ > > > > + > > > > + if (ret >= 0) > > > > + log_exit_count = 0; > > > > + else if (errno == ENOSPC && log_exit_count < 3) { > > > > + int fd = fileno(p_log->out_port); > > > > + fprintf(stderr, "log write failed: %s. Will truncate the log file.\n", > > > > + strerror(errno)); > > > > + ftruncate(fd, 0); > > > > > > Should return from ftruncate be checked here ? > > > > May be checked, but I don't think that potential ftruncate() failure > > should change the flow - in case of failure we will try to continue > > with lseek() anyway (in order to wrap around the file at least). 
> > > > Sasha > > > > > > > > > + lseek(fd, 0, SEEK_SET); > > > > + log_exit_count++; > > > > + goto _retry; > > > > + } > > > > > > > > /* > > > > Flush log on errors too. > > > > @@ -174,14 +189,6 @@ #endif /* WIN32 */ > > > > fflush( p_log->out_port ); > > > > > > > > cl_spinlock_release( &p_log->lock ); > > > > - > > > > - if (ret < 0) > > > > - { > > > > - if (log_exit_count++ < 10) > > > > - { > > > > - fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n"); > > > > - } > > > > - } > > > > } > > > > } > > > > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- > Doug Ledford > GPG KeyID: CFBFF194 > http://people.redhat.com/dledford > > Infiniband specific RPMs available at > http://people.redhat.com/dledford/Infiniband From greg.lindahl at qlogic.com Tue Aug 29 08:21:27 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Tue, 29 Aug 2006 08:21:27 -0700 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156717686.15782.93.camel@fc6.xsintricity.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> Message-ID: <20060829152126.GA1749@greglaptop> On Sun, Aug 27, 2006 at 06:28:06PM -0400, Doug Ledford wrote: > I would definitely put the option in, and in fact would default it to > *NOT* truncate. I agree. I have never seen any other daemon with a logfile do this, why are we out to surprise the admin? The admin might want the start of the long instead of the end. And so on. -- g From mshefty at ichips.intel.com Tue Aug 29 11:53:11 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 11:53:11 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <000b01c6cb85$ea7828b0$dcc8180a@amr.corp.intel.com> References: <000b01c6cb85$ea7828b0$dcc8180a@amr.corp.intel.com> Message-ID: <44F48D17.6040406@ichips.intel.com> Sean Hefty wrote: > How would the driver determine how long the QP should remain in timewait The spec isn't totally clear to me on this, but here's what I can gather: timewait = packet lifetime x 2 + remote ack delay local_ack_timeout (in CM REQ) = packet lifetime x 2 + local ack delay Verbs gets local_ack_timeout through qp_attr.timeout when modifying the QP to RTS. But from 12.7.34, I believe that the local_ack_timeout used by verbs is for the remote QP, meaning that: local_ack_timeout (verbs) = packet lifetime x 2 + remote ack delay which is also the timewait value. Do you concur? If so, then verbs already has the timewait value. 
- Sean From sashak at voltaire.com Tue Aug 29 12:01:40 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 29 Aug 2006 22:01:40 +0300 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156875510.4509.33584.camel@hal.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829181535.GE12948@sashak.voltaire.com> <1156875510.4509.33584.camel@hal.voltaire.com> Message-ID: <20060829190140.GI12948@sashak.voltaire.com> On 14:18 Tue 29 Aug , Hal Rosenstock wrote: > Hi Sasha, > > On Tue, 2006-08-29 at 14:15, Sasha Khapyorsky wrote: > > On 18:28 Sun 27 Aug , Doug Ledford wrote: > > > On Sun, 2006-08-20 at 20:18 +0300, Sasha Khapyorsky wrote: > > > > On 13:01 Sun 20 Aug , Hal Rosenstock wrote: > > > > > Hi Sasha, > > > > > > > > > > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote: > > > > > > In case when OpenSM log file overflows filesystem and write() fails with > > > > > > 'No space left on device' try to truncate the log file and wrap-around > > > > > > logging. > > > > > > > > > > Should it be an (admin) option as to whether to truncate the file or not > > > > > or is there no way to continue without logging (other than this) once > > > > > the log file fills the disk ? > > > > > > > > In theory OpenSM may continue, but don't think it is good idea to leave > > > > overflowed disk on the SM machine (by default it is '/var/log'). For me > > > > truncating there looks as reasonable default behavior, don't think we > > > > need the option. > > > > > > I would definitely put the option in, and in fact would default it to > > > *NOT* truncate. If the disk is full, you have no idea why. It *might* > > > be your logs, or it might be a mail bomb filling /var/spool/mail. I'm > > > sure as an admin the last thing I would want is my apps deciding, based > > > upon incomplete information, that wiping out their log files is the > > > right thing to do. To me that sounds more like an intruder covering his > > > tracks than a reasonable thing to do when confronted with ENOSPC. > > > > > > Truncating logs is something best left up to the admin that's dealing > > > with the disk full problem in the first place. After all, if it is > > > something like an errant app filling the mail spool, truncating the logs > > > just looses valuable logs while at the same time making room for the app > > > to keep on adding more to /var/spool/mail. That's just wrong. If you > > > run out of space, just quit logging things until the admin clears the > > > problem up. If you put this code in, make the admin turn it on. That > > > will keep opensm friendly to appliance like devices that are single task > > > subnet managers. But I don't think having this patch always on makes > > > any sense on a multi task server. > > > > My expectation is that when OpenSM is running it will generate ENOSPC > > more frequently than mail bombs, or other activities. > > > > But I see your point - don't take this control from an admin... I will > > do this ENOSPC handling optional - actually there is another patch was > > submitted, there is the option which limits OpenSM log file size. Will > > add ENOSPC processing under same option. > > > > Hal, I will resend the patch soon. 
> > I'd prefer an incremental one off the last patch related to this if that > isn't too much work as I'm close to committing the previous one now (and > it'd be more work to start over on this). Ok. There is: Optional log file truncating upon ENOSPC errors. Signed-off-by: Sasha Khapyorsky diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c index bc5f25c..e1c43d1 100644 --- a/osm/opensm/osm_log.c +++ b/osm/opensm/osm_log.c @@ -174,9 +174,11 @@ #endif if (ret < 0 && errno == ENOSPC && log_exit_count < 3) { fprintf(stderr, "osm_log write failed: %s. Truncating log file.\n", strerror(errno)); - truncate_log_file(p_log); log_exit_count++; - goto _retry; + if (p_log->max_size) { + truncate_log_file(p_log); + goto _retry; + } } else { log_exit_count = 0; Sasha > > -- Hal > > > Sasha > > > > > > > > > > > > > > > See comment below as well. > > > > > > > > > > -- Hal > > > > > > > > > > > Signed-off-by: Sasha Khapyorsky > > > > > > --- > > > > > > > > > > > > osm/opensm/osm_log.c | 23 +++++++++++++++-------- > > > > > > 1 files changed, 15 insertions(+), 8 deletions(-) > > > > > > > > > > > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c > > > > > > index 668e9a6..b4700c8 100644 > > > > > > --- a/osm/opensm/osm_log.c > > > > > > +++ b/osm/opensm/osm_log.c > > > > > > @@ -58,6 +58,7 @@ #include > > > > > > #include > > > > > > #include > > > > > > #include > > > > > > +#include > > > > > > > > > > > > #ifndef WIN32 > > > > > > #include > > > > > > @@ -152,6 +153,7 @@ #endif > > > > > > cl_spinlock_acquire( &p_log->lock ); > > > > > > #ifdef WIN32 > > > > > > GetLocalTime(&st); > > > > > > + _retry: > > > > > > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", > > > > > > st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, > > > > > > pid, buffer); > > > > > > @@ -159,6 +161,7 @@ #ifdef WIN32 > > > > > > #else > > > > > > pid = pthread_self(); > > > > > > tim = time(NULL); > > > > > > + _retry: > > > > > > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", > > > > > > ((result.tm_mon < 12) && (result.tm_mon >= 0) ? > > > > > > month_str[result.tm_mon] : "???"), > > > > > > @@ -166,6 +169,18 @@ #else > > > > > > result.tm_min, result.tm_sec, > > > > > > usecs, pid, buffer); > > > > > > #endif /* WIN32 */ > > > > > > + > > > > > > + if (ret >= 0) > > > > > > + log_exit_count = 0; > > > > > > + else if (errno == ENOSPC && log_exit_count < 3) { > > > > > > + int fd = fileno(p_log->out_port); > > > > > > + fprintf(stderr, "log write failed: %s. Will truncate the log file.\n", > > > > > > + strerror(errno)); > > > > > > + ftruncate(fd, 0); > > > > > > > > > > Should return from ftruncate be checked here ? > > > > > > > > May be checked, but I don't think that potential ftruncate() failure > > > > should change the flow - in case of failure we will try to continue > > > > with lseek() anyway (in order to wrap around the file at least). > > > > > > > > Sasha > > > > > > > > > > > > > > > + lseek(fd, 0, SEEK_SET); > > > > > > + log_exit_count++; > > > > > > + goto _retry; > > > > > > + } > > > > > > > > > > > > /* > > > > > > Flush log on errors too. > > > > > > @@ -174,14 +189,6 @@ #endif /* WIN32 */ > > > > > > fflush( p_log->out_port ); > > > > > > > > > > > > cl_spinlock_release( &p_log->lock ); > > > > > > - > > > > > > - if (ret < 0) > > > > > > - { > > > > > > - if (log_exit_count++ < 10) > > > > > > - { > > > > > > - fprintf(stderr, "OSM LOG FAILURE! 
Quota probably exceeded\n"); > > > > > > - } > > > > > > - } > > > > > > } > > > > > > } > > > > > > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > -- > > > Doug Ledford > > > GPG KeyID: CFBFF194 > > > http://people.redhat.com/dledford > > > > > > Infiniband specific RPMs available at > > > http://people.redhat.com/dledford/Infiniband > > > > > From sashak at voltaire.com Tue Aug 29 12:07:39 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 29 Aug 2006 22:07:39 +0300 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <20060829152126.GA1749@greglaptop> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829152126.GA1749@greglaptop> Message-ID: <20060829190739.GJ12948@sashak.voltaire.com> On 08:21 Tue 29 Aug , Greg Lindahl wrote: > On Sun, Aug 27, 2006 at 06:28:06PM -0400, Doug Ledford wrote: > > > I would definitely put the option in, and in fact would default it to > > *NOT* truncate. > > I agree. I have never seen any other daemon with a logfile do this, OpenSM is not a real daemon. > why are we out to surprise the admin? The admin might want the start > of the log instead of the end. And so on. Ok. Agree. Sasha > > -- g From mst at mellanox.co.il Tue Aug 29 12:08:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 22:08:41 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F48D17.6040406@ichips.intel.com> References: <44F48D17.6040406@ichips.intel.com> Message-ID: <20060829190840.GA28559@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Sean Hefty wrote: > > How would the driver determine how long the QP should remain in timewait > > The spec isn't totally clear to me on this, but here's what I can gather: > > timewait = packet lifetime x 2 + remote ack delay > local_ack_timeout (in CM REQ) = packet lifetime x 2 + local ack delay > > Verbs gets local_ack_timeout through qp_attr.timeout when modifying the QP to > RTS. Isn't that RTR? > But from 12.7.34, I believe that the local_ack_timeout used by verbs is > for the remote QP, meaning that: > > local_ack_timeout (verbs) = packet lifetime x 2 + remote ack delay > > which is also the timewait value. Do you concur? If so, then verbs already has > the timewait value. I agree. 11.2.4.3 which defines Local Ack Timeout in QP even has a link to this REQ field. So it seems we won't need any API changes. This begins to look good. I wonder what Roland and other low level driver maintainers think.
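To put the timewait arithmetic above in concrete terms, here is a minimal sketch (illustrative only, not part of any patch in this thread) of turning the 5-bit qp_attr.timeout / Local ACK Timeout exponent into an interval, assuming the usual IBA encoding of 4.096 usec * 2^value; the helper name and the sample loop are made up:

#include <stdio.h>

/* 4.096 usec * 2^timeout; timeout is a 5-bit exponent, 0 meaning "none" */
static unsigned long long local_ack_timeout_to_usec(unsigned timeout)
{
	if (!timeout)
		return 0;
	return (4096ULL << (timeout & 0x1f)) / 1000;	/* 4096 ns * 2^t -> usec */
}

int main(void)
{
	unsigned t;

	/* e.g. an exponent of 14 is roughly 67 msec, 20 is roughly 4.3 sec */
	for (t = 12; t <= 20; t++)
		printf("local_ack_timeout=%u -> ~%llu usec\n",
		       t, local_ack_timeout_to_usec(t));
	return 0;
}

If the low-level driver held the QPN out of reuse for an interval derived this way after the QP is destroyed, the CM would not need to pass any additional timewait information down, which is the point being made above.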
-- MST From halr at voltaire.com Tue Aug 29 12:13:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 15:13:31 -0400 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <20060829152126.GA1749@greglaptop> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829152126.GA1749@greglaptop> Message-ID: <1156878810.4509.35185.camel@hal.voltaire.com> On Tue, 2006-08-29 at 11:21, Greg Lindahl wrote: > On Sun, Aug 27, 2006 at 06:28:06PM -0400, Doug Ledford wrote: > > > I would definitely put the option in, and in fact would default it to > > *NOT* truncate. > > I agree. I have never seen any other daemon with a logfile do this, > why are we out to surprise the admin? The admin might want the start > of the long instead of the end. And so on. I agree too but there is perhaps a difference here per degree: OpenSM can spew copious logging and fill /var/log readily. -- Hal > -- g > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Tue Aug 29 12:23:47 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 12:23:47 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060829190840.GA28559@mellanox.co.il> References: <44F48D17.6040406@ichips.intel.com> <20060829190840.GA28559@mellanox.co.il> Message-ID: <44F49443.5000908@ichips.intel.com> Michael S. Tsirkin wrote: >>Verbs gets local_ack_timeout through qp_attr.timeout when modifying the QP to >>RTS. > > > Isn't that RTR? It's the transition from RTR to RTS. > So it seems we won't need any API changes. This begins to look good. > I waner what Roland and other low level driver maintainers think. The requirement then is that a lower level driver will not allocate a QP until it exits timewait. Note that this doesn't prevent a user from re-using a QP with the CM too soon, but that should only affect that user. - Sean From mst at mellanox.co.il Tue Aug 29 12:37:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 22:37:33 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F49443.5000908@ichips.intel.com> References: <44F49443.5000908@ichips.intel.com> Message-ID: <20060829193733.GB28559@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > >>Verbs gets local_ack_timeout through qp_attr.timeout when modifying the QP to > >>RTS. > > > > > > Isn't that RTR? > > It's the transition from RTR to RTS. Hmm. But you need timewait already after you get to RTR, right? -- MST From halr at voltaire.com Tue Aug 29 12:36:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 15:36:11 -0400 Subject: [openib-general] [PATCH] osm: TRIVIAL fix in usage message In-Reply-To: <1156873147.4509.32653.camel@hal.voltaire.com> References: <1156873147.4509.32653.camel@hal.voltaire.com> Message-ID: <1156880171.4509.35741.camel@hal.voltaire.com> On Tue, 2006-08-29 at 10:50, Yevgeny Kliteynik wrote: > Hi Hal. > > This patch is just fixing some error in the OSM usage message. 
> Instead of old '-vf' option, there should be '-D'. > > Thanks. > > Yevgeny > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to both trunk and 1.1). -- Hal From halr at voltaire.com Tue Aug 29 12:38:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 15:38:49 -0400 Subject: [openib-general] [PATCH] osm: TRIVIAL fix in usage message In-Reply-To: <1156875797.4509.33741.camel@hal.voltaire.com> References: <1156873147.4509.32653.camel@hal.voltaire.com> <1156875797.4509.33741.camel@hal.voltaire.com> Message-ID: <1156880328.4509.35858.camel@hal.voltaire.com> On Tue, 2006-08-29 at 13:39, Hal Rosenstock wrote: > On Tue, 2006-08-29 at 10:50, Yevgeny Kliteynik wrote: > > Hi Hal. > > > > This patch is just fixing some error in the OSM usage message. > > Instead of old '-vf' option, there should be '-D'. > > > > Thanks. > > > > Yevgeny > > > > Signed-off-by: Yevgeny Kliteynik > > Thanks. Applied (to both trunk and 1.1). Also, I fixed the opensm man page to go along with this change. -- Hal > -- Hal From mshefty at ichips.intel.com Tue Aug 29 12:49:49 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 12:49:49 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060829193733.GB28559@mellanox.co.il> References: <44F49443.5000908@ichips.intel.com> <20060829193733.GB28559@mellanox.co.il> Message-ID: <44F49A5D.2060707@ichips.intel.com> Michael S. Tsirkin wrote: > Hmm. But you need timewait already after you get to RTR, right? The active side looks fine. The passive side can enter timewait without moving through RTS if it gets an RTU timeout. I'm not sure how much going into timewait really helps in this case though. If we completely ignore timewait, what conditions are required to have a problem occur? And, can we meet those conditions if we connect over the IB CM, given the CM's three-way handshake? - Sean From halr at voltaire.com Tue Aug 29 12:54:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 15:54:10 -0400 Subject: [openib-general] [PATCH] osm: TRIVIAL code cleanup In-Reply-To: <1156880123.4509.35739.camel@hal.voltaire.com> References: <1156871042.4509.31597.camel@hal.voltaire.com> <1156880123.4509.35739.camel@hal.voltaire.com> Message-ID: <1156881245.4509.36186.camel@hal.voltaire.com> On Mon, 2006-08-28 at 04:11, Yevgeny Kliteynik wrote: > Hi Hal. > > I noticed that there are some unused defaults: > OSM_DEFAULT_MGRP_MTU and OSM_DEFAULT_MGRP_RATE. > The corresponding values in the code are hardcoded. > > Fixed the code to use these defaults, and updated the > OSM_DEFAULT_MGRP_MTU to the value that was hardcoded. > > Yevgeny > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to both trunk and 1.1). 
-- Hal From halr at voltaire.com Tue Aug 29 12:55:01 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 15:55:01 -0400 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156880236.4509.35780.camel@hal.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829181535.GE12948@sashak.voltaire.com> <1156875510.4509.33584.camel@hal.voltaire.com> <1156880236.4509.35780.camel@hal.voltaire.com> Message-ID: <1156881299.4509.36227.camel@hal.voltaire.com> Hi Sasha, On Tue, 2006-08-29 at 14:15, Sasha Khapyorsky wrote: > On 18:28 Sun 27 Aug , Doug Ledford wrote: > > On Sun, 2006-08-20 at 20:18 +0300, Sasha Khapyorsky wrote: > > > On 13:01 Sun 20 Aug , Hal Rosenstock wrote: > > > > Hi Sasha, > > > > > > > > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote: > > > > > In case when OpenSM log file overflows filesystem and write() fails with > > > > > 'No space left on device' try to truncate the log file and wrap-around > > > > > logging. > > > > > > > > Should it be an (admin) option as to whether to truncate the file or not > > > > or is there no way to continue without logging (other than this) once > > > > the log file fills the disk ? > > > > > > In theory OpenSM may continue, but don't think it is good idea to leave > > > overflowed disk on the SM machine (by default it is '/var/log'). For me > > > truncating there looks as reasonable default behavior, don't think we > > > need the option. > > > > I would definitely put the option in, and in fact would default it to > > *NOT* truncate. If the disk is full, you have no idea why. It *might* > > be your logs, or it might be a mail bomb filling /var/spool/mail. I'm > > sure as an admin the last thing I would want is my apps deciding, based > > upon incomplete information, that wiping out their log files is the > > right thing to do. To me that sounds more like an intruder covering his > > tracks than a reasonable thing to do when confronted with ENOSPC. > > > > Truncating logs is something best left up to the admin that's dealing > > with the disk full problem in the first place. After all, if it is > > something like an errant app filling the mail spool, truncating the logs > > just looses valuable logs while at the same time making room for the app > > to keep on adding more to /var/spool/mail. That's just wrong. If you > > run out of space, just quit logging things until the admin clears the > > problem up. If you put this code in, make the admin turn it on. That > > will keep opensm friendly to appliance like devices that are single task > > subnet managers. But I don't think having this patch always on makes > > any sense on a multi task server. > > My expectation is that when OpenSM is running it will generate ENOSPC > more frequently than mail bombs, or other activities. > > But I see your point - don't take this control from an admin... I will > do this ENOSPC handling optional - actually there is another patch was > submitted, there is the option which limits OpenSM log file size. Will > add ENOSPC processing under same option. > > Hal, I will resend the patch soon. I'd prefer an incremental one off the last patch related to this if that isn't too much work as I'm close to committing the previous one now (and it'd be more work to start over on this). -- Hal > Sasha > > > > > > > > > > > See comment below as well. 
> > > > > > > > -- Hal > > > > > > > > > Signed-off-by: Sasha Khapyorsky > > > > > --- > > > > > > > > > > osm/opensm/osm_log.c | 23 +++++++++++++++-------- > > > > > 1 files changed, 15 insertions(+), 8 deletions(-) > > > > > > > > > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c > > > > > index 668e9a6..b4700c8 100644 > > > > > --- a/osm/opensm/osm_log.c > > > > > +++ b/osm/opensm/osm_log.c > > > > > @@ -58,6 +58,7 @@ #include > > > > > #include > > > > > #include > > > > > #include > > > > > +#include > > > > > > > > > > #ifndef WIN32 > > > > > #include > > > > > @@ -152,6 +153,7 @@ #endif > > > > > cl_spinlock_acquire( &p_log->lock ); > > > > > #ifdef WIN32 > > > > > GetLocalTime(&st); > > > > > + _retry: > > > > > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", > > > > > st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, > > > > > pid, buffer); > > > > > @@ -159,6 +161,7 @@ #ifdef WIN32 > > > > > #else > > > > > pid = pthread_self(); > > > > > tim = time(NULL); > > > > > + _retry: > > > > > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s", > > > > > ((result.tm_mon < 12) && (result.tm_mon >= 0) ? > > > > > month_str[result.tm_mon] : "???"), > > > > > @@ -166,6 +169,18 @@ #else > > > > > result.tm_min, result.tm_sec, > > > > > usecs, pid, buffer); > > > > > #endif /* WIN32 */ > > > > > + > > > > > + if (ret >= 0) > > > > > + log_exit_count = 0; > > > > > + else if (errno == ENOSPC && log_exit_count < 3) { > > > > > + int fd = fileno(p_log->out_port); > > > > > + fprintf(stderr, "log write failed: %s. Will truncate the log file.\n", > > > > > + strerror(errno)); > > > > > + ftruncate(fd, 0); > > > > > > > > Should return from ftruncate be checked here ? > > > > > > May be checked, but I don't think that potential ftruncate() failure > > > should change the flow - in case of failure we will try to continue > > > with lseek() anyway (in order to wrap around the file at least). > > > > > > Sasha > > > > > > > > > > > > + lseek(fd, 0, SEEK_SET); > > > > > + log_exit_count++; > > > > > + goto _retry; > > > > > + } > > > > > > > > > > /* > > > > > Flush log on errors too. > > > > > @@ -174,14 +189,6 @@ #endif /* WIN32 */ > > > > > fflush( p_log->out_port ); > > > > > > > > > > cl_spinlock_release( &p_log->lock ); > > > > > - > > > > > - if (ret < 0) > > > > > - { > > > > > - if (log_exit_count++ < 10) > > > > > - { > > > > > - fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n"); > > > > > - } > > > > > - } > > > > > } > > > > > } > > > > > > > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > -- > > Doug Ledford > > GPG KeyID: CFBFF194 > > http://people.redhat.com/dledford > > > > Infiniband specific RPMs available at > > http://people.redhat.com/dledford/Infiniband > > From mst at mellanox.co.il Tue Aug 29 13:09:20 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Aug 2006 23:09:20 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F49A5D.2060707@ichips.intel.com> References: <44F49A5D.2060707@ichips.intel.com> Message-ID: <20060829200920.GA29183@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > > Hmm. 
But you need timewait already after you get to RTR, right? > > The active side looks fine. The passive side can enter timewait without moving > through RTS if it gets an RTU timeout. I'm not sure how much going into > timewait really helps in this case though. > > If we completely ignore timewait, what conditions are required to have a problem > occur? Outstanding packets with PSNs and QP numbers coinside between the 2 connections. Look for "Stale packet" in IB spec. > And, can we meet those conditions if we connect over the IB CM, given > the CMs three-way handshake? Hmm. We can ask user not to post sends if he rejects the REP. Then there won't be stale packets. But is there anything in spec that forbids this? Maybe an extra call is better than assuming things beyond spec requirements? -- MST From sashak at voltaire.com Tue Aug 29 13:34:06 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 29 Aug 2006 23:34:06 +0300 Subject: [openib-general] [PATCH] osm: handle local events In-Reply-To: <20060829163913.GB27121@mellanox.co.il> References: <20060825141704.GA3867@sashak.voltaire.com> <20060826184150.GB21168@mellanox.co.il> <20060829163243.GA12948@sashak.voltaire.com> <20060829163913.GB27121@mellanox.co.il> Message-ID: <20060829203406.GM12948@sashak.voltaire.com> On 19:39 Tue 29 Aug , Michael S. Tsirkin wrote: > Quoting r. Sasha Khapyorsky : > > > Assuming I disconnect and re-connect the cable, does not > > > SM want to know and try bringing the logical link up? > > > > AFAIK OpenSM does not detect the local port disconnection explicitly > > (however will detect this at next periodic sweep if such configured), > > Wouldn't it be a good idea though? Yes, but if disconnection is reported by umad. Today it does not. Sasha > > > but > > it will get report from the switch when the local port will be > > re-connected back. > > But only AFTER it brings the logical link up, by which time it's > already not relevant. > > -- > MST From bugzilla-daemon at openib.org Tue Aug 29 13:40:25 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 29 Aug 2006 13:40:25 -0700 (PDT) Subject: [openib-general] [Bug 214] IB Stack ASSERTS while handling stale connections. Message-ID: <20060829204025.1C7E22283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=214 ftillier at silverstorm.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from ftillier at silverstorm.com 2006-08-29 13:40 ------- Fixed in revision 466. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From rdreier at cisco.com Tue Aug 29 13:39:13 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 Aug 2006 13:39:13 -0700 Subject: [openib-general] [PATCH] libmthca: include stddef.h in mthca.h for SLES10 compilations In-Reply-To: <200608291807.06146.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 29 Aug 2006 18:07:05 +0300") References: <200608291807.06146.jackm@dev.mellanox.co.il> Message-ID: > --- l/src/userspace/libmthca/src/mthca.h (revision 7569) > +++ l/src/userspace/libmthca/src/mthca.h (working copy) > @@ -36,6 +36,7 @@ > #ifndef MTHCA_H > #define MTHCA_H > > +#include > #include > #include svn blame shows me that mthca.h has included stddef.h since r8819. - R. 
From halr at voltaire.com Tue Aug 29 13:51:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 16:51:50 -0400 Subject: [openib-general] [PATCH] opensm: option to limit size of OpenSM log file In-Reply-To: <20060822211855.GD10446@sashak.voltaire.com> References: <20060822211855.GD10446@sashak.voltaire.com> Message-ID: <1156884708.4509.38019.camel@hal.voltaire.com> On Tue, 2006-08-22 at 17:18, Sasha Khapyorsky wrote: > Hi Hal, > > There is new option which specified max size of OpenSM log file. The > default is '0' (not-limited). Please note osm_log_init() has new > parameter now. > > We already saw the problems with FS overflowing in real life - we may > want those related fixes in OFED too. > > Sasha > > > opensm: option to limit size of OpenSM log file > > New option '-L' will limit size of OpenSM log file. If specified the log > file will be truncated upon reaching this limit. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to both trunk and 1.1) -- Hal From mshefty at ichips.intel.com Tue Aug 29 14:02:43 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Aug 2006 14:02:43 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060829200920.GA29183@mellanox.co.il> References: <44F49A5D.2060707@ichips.intel.com> <20060829200920.GA29183@mellanox.co.il> Message-ID: <44F4AB73.2070208@ichips.intel.com> Michael S. Tsirkin wrote: >>If we completely ignore timewait, what conditions are required to have a problem >>occur? > > Outstanding packets with PSNs and QP numbers coinside between the 2 connections. > Look for "Stale packet" in IB spec. From what I can tell, a QP will receive an incoming packet incorrectly if the SLID and PSN match that of its current connection, which matches with your statement. Stale packets could cause this, but so can misconfigured QPs. (I'm just trying to understand how large the problem is, and how much of it does timewait solve.) > Hmm. We can ask user not to post sends if he rejects the REP. > Then there won't be stale packets. But is there anything in spec that > forbids this? See page 690 of the spec. It implies that the QP should go to RTS only if an RTU is sent. Note that if the DREQ is lost, it's possible for the remote side to initiate a send after the local QP has exited timewait, which seems to defeat its purpose in this case. > Maybe an extra call is better than assuming things beyond spec > requirements? I'm still trying to determine who provides the timewait duration. Verbs allows users to connect QPs without going through the CM, and several apps do this. Timewait provides only partial protection against this problem, so maybe we only restrict it to handling the most common case, which is after the QP has transitioned to RTS. Another alternative to solving this problem is to select a PSN value that is likely to discard stale packets. Can the lower level driver be of any assistance here? I.e. would it know what the last PSN value was received on a QP? 
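On the last alternative mentioned (choosing a PSN that is unlikely to match stale packets), a minimal sketch; rand() stands in for whatever entropy source real code would use, and nothing here is from a posted patch:

#include <stdlib.h>

/* PSNs are 24-bit quantities.  A stale packet from an old incarnation of
 * the same QPN is only dangerous if its PSN happens to land inside the new
 * connection's expected window, so a (pseudo)random starting PSN makes the
 * collision unlikely -- though, unlike timewait, it does not rule it out. */
static unsigned int random_starting_psn(void)
{
	return (unsigned int) rand() & 0xffffff;
}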
- Sean From halr at voltaire.com Tue Aug 29 14:03:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 17:03:47 -0400 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <20060829190140.GI12948@sashak.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829181535.GE12948@sashak.voltaire.com> <1156875510.4509.33584.camel@hal.voltaire.com> <20060829190140.GI12948@sashak.voltaire.com> Message-ID: <1156885419.4509.38402.camel@hal.voltaire.com> On Tue, 2006-08-29 at 15:01, Sasha Khapyorsky wrote: > > I'd prefer an incremental one off the last patch related to this if that > > isn't too much work as I'm close to committing the previous one now (and > > it'd be more work to start over on this). > > Ok. There is: > > > Optional log file truncating upon ENOSPC errors. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to trunk and 1.1) -- Hal From mvharish at gmail.com Tue Aug 29 14:46:23 2006 From: mvharish at gmail.com (harish) Date: Tue, 29 Aug 2006 14:46:23 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC Message-ID: Hi, The interruptThresholdRate module parameter allows you to control the maximum number of interrupts/sec for an e1000 Intel NIC for example. Is there an equivalent parameter for Infiniband NICs. I am using a Mellanox Infiniband NIC. Please let me know if you need any more information. Noticed from a simple netperf TCP_STREAM test that I get around 50-80K interrupts/sec and my machine gets very unresponsive. Want to control the maximum number of interrupts that are generated by the NIC. Quick look at the code showed that there is a way you can use the notify* functions in the driver to let ipoib module know when a certain number of packets are received instead of notifying for every packet received. Is that something in the lines of what will achieve my goal? Any help/pointers are welcome. Thanks harish -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Tue Aug 29 14:59:35 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 30 Aug 2006 00:59:35 +0300 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156885419.4509.38402.camel@hal.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829181535.GE12948@sashak.voltaire.com> <1156875510.4509.33584.camel@hal.voltaire.com> <20060829190140.GI12948@sashak.voltaire.com> <1156885419.4509.38402.camel@hal.voltaire.com> Message-ID: <20060829215935.GQ12948@sashak.voltaire.com> On 17:03 Tue 29 Aug , Hal Rosenstock wrote: > On Tue, 2006-08-29 at 15:01, Sasha Khapyorsky wrote: > > > > I'd prefer an incremental one off the last patch related to this if that > > > isn't too much work as I'm close to committing the previous one now (and > > > it'd be more work to start over on this). > > > > Ok. There is: > > > > > > Optional log file truncating upon ENOSPC errors. > > > > Signed-off-by: Sasha Khapyorsky > > Thanks. Applied (to trunk and 1.1) Thanks. And there is more - this will not reset error counter, and the error messages will be cleaner. 
Signed-off-by: Sasha Khapyorsky diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c index 05b1185..a5dac10 100644 --- a/osm/opensm/osm_log.c +++ b/osm/opensm/osm_log.c @@ -173,20 +173,21 @@ #endif fflush( p_log->out_port ) < 0) ret = -1; - if (ret < 0 && errno == ENOSPC && log_exit_count < 3) + if (ret >= 0) + { + log_exit_count = 0; + p_log->count += ret; + } + else if (log_exit_count < 3) { - fprintf(stderr, "osm_log: write failed: %s. Truncating log file.\n", - strerror(errno)); log_exit_count++; - if (p_log->max_size) { + if (errno == ENOSPC && p_log->max_size) { + fprintf(stderr, "osm_log: write failed: %s. Truncating log file.\n", + strerror(errno)); truncate_log_file(p_log); goto _retry; } - } - else - { - log_exit_count = 0; - p_log->count += ret; + fprintf(stderr, "osm_log: write failed: %s\n", strerror(errno)); } cl_spinlock_release( &p_log->lock ); From rsalmon at tulane.edu Tue Aug 29 15:03:45 2006 From: rsalmon at tulane.edu (Rene Salmon) Date: Tue, 29 Aug 2006 17:03:45 -0500 Subject: [openib-general] compile problems suse 10.1 on x86_64 Message-ID: <44F4B9C1.9090000@tulane.edu> Hello, Trying to build OFED-1.0.tgz on a suse 10.1 x86_64 box with kernel 2.6.16.21-0.13. I am using the build.sh script but it stops while building the kernel modules. build.sh fails with this message: error: too few arguments to function 'sk_eat_skb' From what I can tell there is patch for sles 10 to fix this: http://openib.org/pipermail/openib-commits/2006-May/007244.html It is also located in my tar ball here: /var/tmp/OFED/tmp/openib/openib/patches/2.6.16_sles10 Can anyone point me in the right direction as to maybe applying some of the 2.6.16_sles10 patches to suse 10.1? Thanks in advance for any help on this. Following is the error message I get from in the make_kernel.log file. Rene make -f scripts/Makefile.build obj=/var/tmp/OFED/tmp/openib/openib/src/linux-kerne l/infiniband/ulp/sdp gcc -Wp,-MD,/var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/sdp/ .sdp_main.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include - D__KERNEL__ -I/var/tmp/OFED/tmp/openib/openib/include -I/var/tmp/OFED/tmp/openib/ openib/src/linux-kernel/infiniband/include -Iinclude -Wall -Wundef -Wstrict-pr ototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -march=k8 -m64 -mno-red-z one -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynch ronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdec laration-after-statement -Wno-pointer-sign -I/var/tmp/OFED/tmp/openib/openib/src/ linux-kernel/infiniband/include -I/var/tmp/OFED/tmp/openib/openib/src/linux-kerne l/infiniband/ulp/ipoib -I/var/tmp/OFED/tmp/openib/openib/drivers/infiniband/debug -Idrivers/infiniband/include -ggdb -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASEN AME=KBUILD_STR(sdp_main)" -D"KBUILD_MODNAME=KBUILD_STR(ib_sdp)" -c -o /var/tmp/OF ED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/sdp/.tmp_sdp_main.o /var/tmp/ OFED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/sdp/sdp_main.c /var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/sdp/sdp_main.c: In function ?sdp_recvmsg?: /var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/sdp/sdp_main.c:117 1: error: too few arguments to function ?sk_eat_skb? /var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/sdp/sdp_main.c:117 9: error: too few arguments to function ?sk_eat_skb? 
make[3]: *** [/var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/sdp/ sdp_main.o] Error 1 -- - -- Rene Salmon Tulane University Center for Computational Science http://www.ccs.tulane.edu rsalmon at tulane.edu Tel 504-862-8393 Tel 504-988-8552 Fax 504-862-8392 From dledford at redhat.com Tue Aug 29 15:25:15 2006 From: dledford at redhat.com (Doug Ledford) Date: Tue, 29 Aug 2006 22:25:15 +0000 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156878810.4509.35185.camel@hal.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829152126.GA1749@greglaptop> <1156878810.4509.35185.camel@hal.voltaire.com> Message-ID: <1156890315.28832.75.camel@fc6.xsintricity.com> On Tue, 2006-08-29 at 15:13 -0400, Hal Rosenstock wrote: > On Tue, 2006-08-29 at 11:21, Greg Lindahl wrote: > > On Sun, Aug 27, 2006 at 06:28:06PM -0400, Doug Ledford wrote: > > > > > I would definitely put the option in, and in fact would default it to > > > *NOT* truncate. > > > > I agree. I have never seen any other daemon with a logfile do this, > > why are we out to surprise the admin? The admin might want the start > > of the long instead of the end. And so on. > > I agree too but there is perhaps a difference here per degree: OpenSM > can spew copious logging and fill /var/log readily. In which case the -L option to limit the maximum log file size makes sense (as does making sure that the default logging level is warnings and above, not all sorts of informational stuff unless requested in order to keep logs more manageable under normal circumstances). In response to another statement made in another email, for better or worse, OpenSM *is* a system daemon at this point. If you don't have (or elect not to use) a switch embedded SM, then OpenSM is necessary to keep your fabric operational. Or to put it another way: if you need to start it during init for your system to operate properly, and if it needs to keep running all the time, and if you need elevated permissions to run it, and if it has an init script...you get the point...the rest of the world is going to tell you this is a system daemon with console capabilities, not a console program with daemon capabilities. As such, always thinking of it from the perspective of a daemon would be wise in terms of avoiding surprising users down the road IMO. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Tue Aug 29 15:29:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 Aug 2006 15:29:58 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC In-Reply-To: (harish's message of "Tue, 29 Aug 2006 14:46:23 -0700") References: Message-ID: harish> Hi, The interruptThresholdRate module parameter allows you harish> to control the maximum number of interrupts/sec for an harish> e1000 Intel NIC for example. Is there an equivalent harish> parameter for Infiniband NICs. I am using a Mellanox harish> Infiniband NIC. Please let me know if you need any more harish> information. 
There is no such equivalent for IB. However I am thinking about implementing NAPI for IPoIB, which should help in your situation. - R. From halr at voltaire.com Tue Aug 29 15:29:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 18:29:43 -0400 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <20060829215935.GQ12948@sashak.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829181535.GE12948@sashak.voltaire.com> <1156875510.4509.33584.camel@hal.voltaire.com> <20060829190140.GI12948@sashak.voltaire.com> <1156885419.4509.38402.camel@hal.voltaire.com> <20060829215935.GQ12948@sashak.voltaire.com> Message-ID: <1156890578.4509.40744.camel@hal.voltaire.com> On Tue, 2006-08-29 at 17:59, Sasha Khapyorsky wrote: > On 17:03 Tue 29 Aug , Hal Rosenstock wrote: > > On Tue, 2006-08-29 at 15:01, Sasha Khapyorsky wrote: > > > > > > I'd prefer an incremental one off the last patch related to this if that > > > > isn't too much work as I'm close to committing the previous one now (and > > > > it'd be more work to start over on this). > > > > > > Ok. There is: > > > > > > > > > Optional log file truncating upon ENOSPC errors. > > > > > > Signed-off-by: Sasha Khapyorsky > > > > Thanks. Applied (to trunk and 1.1) > > Thanks. And there is more - this will not reset error counter, and the > error messages will be cleaner. > > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to both trunk and 1.1). -- Hal From mvharish at gmail.com Tue Aug 29 15:40:43 2006 From: mvharish at gmail.com (harish) Date: Tue, 29 Aug 2006 15:40:43 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC In-Reply-To: References: Message-ID: Hi Roland, Thanks a lot for the prompt response. Could you please let me know as to what is the expected time frame for having NAPI for IPoIB implemented. Also as regards to the first question, would it make sense to play around with the cq notification frequency. Will it help to reduce the CPU utilization/ interrupt overhead? Sincerely thanks again, Harish On 8/29/06, Roland Dreier wrote: > > harish> Hi, The interruptThresholdRate module parameter allows you > harish> to control the maximum number of interrupts/sec for an > harish> e1000 Intel NIC for example. Is there an equivalent > harish> parameter for Infiniband NICs. I am using a Mellanox > harish> Infiniband NIC. Please let me know if you need any more > harish> information. > > There is no such equivalent for IB. > > However I am thinking about implementing NAPI for IPoIB, which should > help in your situation. > > - R. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Aug 29 15:55:05 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 Aug 2006 15:55:05 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC In-Reply-To: (harish's message of "Tue, 29 Aug 2006 15:40:43 -0700") References: Message-ID: harish> Hi Roland, Thanks a lot for the prompt response. Could you harish> please let me know as to what is the expected time frame harish> for having NAPI for IPoIB implemented. Also as regards to harish> the first question, would it make sense to play around harish> with the cq notification frequency. Will it help to reduce harish> the CPU utilization/ interrupt overhead? 
I can't tell you when NAPI will be implemented, since I don't know if anyone is working on it. You could try playing around with the CQ notification frequency but if you're going to hack the driver then I think it would be just as easy to implement NAPI. Anyway, give it a try and let us know your results. Thanks, Roland From halr at voltaire.com Tue Aug 29 15:54:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 18:54:42 -0400 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156890315.28832.75.camel@fc6.xsintricity.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829152126.GA1749@greglaptop> <1156878810.4509.35185.camel@hal.voltaire.com> <1156890315.28832.75.camel@fc6.xsintricity.com> Message-ID: <1156892081.4509.41264.camel@hal.voltaire.com> On Tue, 2006-08-29 at 18:25, Doug Ledford wrote: > On Tue, 2006-08-29 at 15:13 -0400, Hal Rosenstock wrote: > > On Tue, 2006-08-29 at 11:21, Greg Lindahl wrote: > > > On Sun, Aug 27, 2006 at 06:28:06PM -0400, Doug Ledford wrote: > > > > > > > I would definitely put the option in, and in fact would default it to > > > > *NOT* truncate. > > > > > > I agree. I have never seen any other daemon with a logfile do this, > > > why are we out to surprise the admin? The admin might want the start > > > of the long instead of the end. And so on. > > > > I agree too but there is perhaps a difference here per degree: OpenSM > > can spew copious logging and fill /var/log readily. > > In which case the -L option to limit the maximum log file size makes > sense (as does making sure that the default logging level is warnings > and above, not all sorts of informational stuff unless requested in > order to keep logs more manageable under normal circumstances). > > In response to another statement made in another email, for better or > worse, OpenSM *is* a system daemon at this point. If you don't have (or > elect not to use) a switch embedded SM, then OpenSM is necessary to keep > your fabric operational. > > Or to put it another way: if you need to start it during init for your > system to operate properly, and if it needs to keep running all the > time, and if you need elevated permissions to run it, and if it has an > init script...you get the point...the rest of the world is going to tell > you this is a system daemon with console capabilities, not a console > program with daemon capabilities. As such, always thinking of it from > the perspective of a daemon would be wise in terms of avoiding > surprising users down the road IMO. Well stated and I concur with that viewpoint. I may be mistaken but I think Sasha may have been referring to some things that could/might be done to make it behave better as a daemon. -- Hal From brendan at kublai.com Tue Aug 29 16:04:06 2006 From: brendan at kublai.com (Brendan Cully) Date: Tue, 29 Aug 2006 16:04:06 -0700 Subject: [openib-general] [PATCH 1 of 23] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: <1156537194.31531.38.camel@sardonyx> References: <44809b730ac95b39b672.1156530266@eng-12.pathscale.com> <1156537194.31531.38.camel@sardonyx> Message-ID: <20060829230406.GA3223@xanadu.kublai.com> On Friday, 25 August 2006 at 13:19, Bryan O'Sullivan wrote: > On Fri, 2006-08-25 at 12:45 -0700, Roland Dreier wrote: > > How did you generate these patches? 
> > Using Mercurial. > > > because the line > > > > diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile > > > > makes git think it's a git diff, but git doesn't put dates on the > > filename lines. > > Ah, interesting. Looks like a bug in the git-compatible patch > generator, then. Sorry about that. I've just posted a fix to the mercurial list. From mvharish at gmail.com Tue Aug 29 16:21:33 2006 From: mvharish at gmail.com (harish) Date: Tue, 29 Aug 2006 16:21:33 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC In-Reply-To: References: Message-ID: Hi Roland, As regards the CQ notification frequency, I noticed that the function req_ncomp_notif used by ib_req_ncomp_notif is not implemented yet. I was hoping that if this was implemented, I would just use ib_req_ncomp_notif with a count of 10 in place of ib_req_notify_cq. Please share your comments on this approach and it will be great if you have a patch with the ib_req_ncomp_notif implementation. Thanks in advance, harish On 8/29/06, Roland Dreier wrote: > > harish> Hi Roland, Thanks a lot for the prompt response. Could you > harish> please let me know as to what is the expected time frame > harish> for having NAPI for IPoIB implemented. Also as regards to > harish> the first question, would it make sense to play around > harish> with the cq notification frequency. Will it help to reduce > harish> the CPU utilization/ interrupt overhead? > > I can't tell you when NAPI will be implemented, since I don't know if > anyone is working on it. > > You could try playing around with the CQ notification frequency but if > you're going to hack the driver then I think it would be just as easy > to implement NAPI. Anyway, give it a try and let us know your results. > > Thanks, > Roland > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Aug 29 17:09:38 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2006 20:09:38 -0400 Subject: [openib-general] [PATCH] opensm: libibmad: rpc API which supports more than one ports. In-Reply-To: <20060825131734.3786.74359.stgit@sashak.voltaire.com> References: <20060825131734.3786.74359.stgit@sashak.voltaire.com> Message-ID: <1156896546.4509.43618.camel@hal.voltaire.com> Hi Sasha, On Fri, 2006-08-25 at 09:17, Sasha Khapyorsky wrote: > This provides RPC like API which may work with several ports. I think you mean "can work" rather "may work" :-) > Signed-off-by: Sasha Khapyorsky > --- > > libibmad/include/infiniband/mad.h | 9 +++ > libibmad/src/libibmad.map | 4 + > libibmad/src/register.c | 20 +++++-- > libibmad/src/rpc.c | 106 +++++++++++++++++++++++++++++++++++-- > libibumad/src/umad.c | 4 + ../doc/libibmad.txt should also be updated appropriately for the new routines. 
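For reference while reading the review, here is a sketch of how a caller might use the proposed per-port API; the prototypes are the ones quoted from the patch, but the device name, class list, attribute constants and field assignments are assumptions taken from libibmad's existing headers, not tested code:

#include <infiniband/mad.h>

int nodeinfo_query_example(void)
{
	int mgmt_classes[] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS };
	ib_portid_t dport = { 0 };
	ib_rpc_t rpc = { 0 };
	uint8_t buf[IB_SMP_DATA_SIZE];
	void *srcport;

	/* open HCA "mthca0" port 1 and register the listed classes on it */
	srcport = mad_rpc_open_port("mthca0", 1, mgmt_classes, 3);
	if (!srcport)
		return -1;

	dport.lid = 1;			/* destination LID, assumed */
	rpc.mgtclass = IB_SMI_CLASS;	/* field names as used in the quoted rpc.c */
	rpc.method = IB_MAD_METHOD_GET;
	rpc.attr.id = IB_ATTR_NODE_INFO;
	rpc.attr.mod = 0;
	rpc.datasz = IB_SMP_DATA_SIZE;
	rpc.dataoffs = IB_SMP_DATA_OFFS;
	rpc.timeout = 1000;

	/* same semantics as madrpc(), but against this port handle */
	if (!mad_rpc(srcport, &rpc, &dport, NULL, buf)) {
		mad_rpc_close_port(srcport);
		return -1;
	}

	mad_rpc_close_port(srcport);
	return 0;
}

The point of the handle-based calls is that two such ports can be open in one process without touching the global madrpc state.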
> 5 files changed, 130 insertions(+), 13 deletions(-) > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index 45ff572..bd8a80b 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -660,6 +660,7 @@ uint64_t mad_trid(void); > int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); > > /* register.c */ > +int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); > int mad_register_client(int mgmt, uint8_t rmpp_version); > int mad_register_server(int mgmt, uint8_t rmpp_version, > uint32_t method_mask[4], uint32_t class_oui); > @@ -704,6 +705,14 @@ void madrpc_lock(void); > void madrpc_unlock(void); > void madrpc_show_errors(int set); > > +void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > + int num_classes); > +void mad_rpc_close_port(void *ibmad_port); > +void * mad_rpc(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > + void *payload, void *rcvdata); > +void * mad_rpc_rmpp(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > + ib_rmpp_hdr_t *rmpp, void *data); > + > /* smp.c */ > uint8_t * smp_query(void *buf, ib_portid_t *id, uint attrid, uint mod, > uint timeout); > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > index bf81bd1..78b7ff0 100644 > --- a/libibmad/src/libibmad.map > +++ b/libibmad/src/libibmad.map > @@ -62,6 +62,10 @@ IBMAD_1.0 { This should be 1.1 > ib_resolve_self; > ib_resolve_smlid; > ibdebug; > + mad_rpc_open_port; > + mad_rpc_close_port; > + mad_rpc; > + mad_rpc_rmpp; > madrpc; > madrpc_def_timeout; > madrpc_init; What about mad_register_port_client ? Should that be included here ? > diff --git a/libibmad/src/register.c b/libibmad/src/register.c > index 4f44625..52d6989 100644 > --- a/libibmad/src/register.c > +++ b/libibmad/src/register.c > @@ -43,6 +43,7 @@ #include > #include > #include > #include > +#include > > #include > #include "mad.h" > @@ -118,7 +119,7 @@ mad_agent_class(int agent) > } > > int > -mad_register_client(int mgmt, uint8_t rmpp_version) > +mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version) > { > int vers, agent; > > @@ -126,7 +127,7 @@ mad_register_client(int mgmt, uint8_t rm > DEBUG("Unknown class %d mgmt_class", mgmt); > return -1; > } > - if ((agent = umad_register(madrpc_portid(), mgmt, > + if ((agent = umad_register(port_id, mgmt, > vers, rmpp_version, 0)) < 0) { > DEBUG("Can't register agent for class %d", mgmt); > return -1; > @@ -137,13 +138,22 @@ mad_register_client(int mgmt, uint8_t rm > return -1; > } > > - if (register_agent(agent, mgmt) < 0) > - return -1; > - > return agent; > } > > int > +mad_register_client(int mgmt, uint8_t rmpp_version) > +{ > + int agent; > + > + agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version); > + if (agent < 0) > + return agent; > + > + return register_agent(agent, mgmt); > +} > + > +int > mad_register_server(int mgmt, uint8_t rmpp_version, > uint32_t method_mask[4], uint32_t class_oui) > { > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > index b2d3e77..ac4f361 100644 > --- a/libibmad/src/rpc.c > +++ b/libibmad/src/rpc.c > @@ -48,6 +48,13 @@ #include > #include > #include "mad.h" > > +#define MAX_CLASS 256 > + > +struct ibmad_port { > + int port_id; /* file descriptor returned by umad_open() */ > + int class_agents[MAX_CLASS]; /* class2agent mapper */ > +}; > + > int ibdebug; > > static int mad_portid = -1; > @@ -105,7 +112,8 @@ madrpc_portid(void) > } > > static int > 
-_do_madrpc(void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) > +_do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > + int timeout) > { > uint32_t trid; /* only low 32 bits */ > int retries; > @@ -133,7 +141,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > } > > length = len; > - if (umad_send(mad_portid, agentid, sndbuf, length, timeout, 0) < 0) { > + if (umad_send(port_id, agentid, sndbuf, length, timeout, 0) < 0) { > IBWARN("send failed; %m"); > return -1; > } > @@ -141,7 +149,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > /* Use same timeout on receive side just in case */ > /* send packet is lost somewhere. */ > do { > - if (umad_recv(mad_portid, rcvbuf, &length, timeout) < 0) { > + if (umad_recv(port_id, rcvbuf, &length, timeout) < 0) { > IBWARN("recv failed: %m"); > return -1; > } > @@ -164,8 +172,10 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > } > > void * > -madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) > +mad_rpc(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, void *payload, > + void *rcvdata) > { > + struct ibmad_port *p = port_id; > int status, len; > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > @@ -175,7 +185,8 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > return 0; > > - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), > + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > + p->class_agents[rpc->mgtclass], > len, rpc->timeout)) < 0) > return 0; > > @@ -198,8 +209,10 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport > } > > void * > -madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > +mad_rpc_rmpp(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, > + ib_rmpp_hdr_t *rmpp, void *data) > { > + struct ibmad_port *p = port_id; > int status, len; > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > @@ -210,7 +223,8 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > return 0; > > - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), > + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > + p->class_agents[rpc->mgtclass], > len, rpc->timeout)) < 0) > return 0; > > @@ -249,6 +263,24 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * > return data; > } > > +void * > +madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) > +{ > + struct ibmad_port port; > + port.port_id = mad_portid; > + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); > + return mad_rpc(&port, rpc, dport, payload, rcvdata); > +} > + > +void * > +madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > +{ > + struct ibmad_port port; > + port.port_id = mad_portid; > + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); > + return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); > +} > + > static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; > > void > @@ -282,3 +314,63 @@ madrpc_init(char *dev_name, int dev_port > IBPANIC("client_register for mgmt %d failed", mgmt); > } > } > + > +void * > +mad_rpc_open_port(char *dev_name, int dev_port, > + int *mgmt_classes, int num_classes) > +{ > + struct ibmad_port *p; > + int port_id; Should there be some validation on num_classes < MAX_CLASS ? 
> + if (umad_init() < 0) { > + IBWARN("can't init UMAD library"); > + errno = ENODEV; > + return NULL; > + } > + > + p = malloc(sizeof(*p)); > + if (!p) { > + errno = ENOMEM; > + return NULL; > + } > + memset(p, 0, sizeof(*p)); > + > + if ((port_id = umad_open_port(dev_name, dev_port)) < 0) { > + IBWARN("can't open UMAD port (%s:%d)", dev_name, dev_port); > + if (!errno) > + errno = EIO; > + free(p); > + return NULL; > + } > + > + while (num_classes--) { > + int rmpp_version = 0; > + int mgmt = *mgmt_classes++; > + int agent; > + > + if (mgmt == IB_SA_CLASS) > + rmpp_version = 1; There are other classes which can use RMPP. How are they handled ? > + if (mgmt < 0 || mgmt >= MAX_CLASS || > + (agent = mad_register_port_client(port_id, mgmt, > + rmpp_version)) < 0) { > + IBWARN("client_register for mgmt %d failed", mgmt); > + if(!errno) > + errno = EINVAL; > + umad_close_port(port_id); > + free(p); > + return NULL; > + } > + p->class_agents[mgmt] = agent; > + } > + > + p->port_id = port_id; > + return p; > +} > + > +void > +mad_rpc_close_port(void *port_id) > +{ > + struct ibmad_port *p = port_id; > + umad_close_port(p->port_id); > + free(p); > +} > diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c > index a99fb5a..cb9eef6 100644 > --- a/libibumad/src/umad.c > +++ b/libibumad/src/umad.c > @@ -93,12 +93,14 @@ port_alloc(int portid, char *dev, int po > > if (portid < 0 || portid >= UMAD_MAX_PORTS) { > IBWARN("bad umad portid %d", portid); > + errno = EINVAL; > return 0; > } > > if (port->dev_name[0]) { > IBWARN("umad port id %d is already allocated for %s %d", > portid, port->dev_name, port->dev_port); > + errno = EBUSY; > return 0; > } > > @@ -567,7 +569,7 @@ umad_open_port(char *ca_name, int portnu > return -EINVAL; > > if (!(port = port_alloc(umad_id, ca_name, portnum))) > - return -EINVAL; > + return -errno; > > snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", > UMAD_DEV_DIR , umad_id); Is the umad.c change really a separate change from the rest ? If so, this patch should be broken into two parts and that is the first part. No need to resubmit for this. -- Hal From rdreier at cisco.com Tue Aug 29 17:18:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 Aug 2006 17:18:36 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC In-Reply-To: (harish's message of "Tue, 29 Aug 2006 16:21:33 -0700") References: Message-ID: harish> Hi Roland, As regards the CQ notification frequency, I harish> noticed that the function req_ncomp_notif used by harish> ib_req_ncom_notif is not implemented yet. I was hoping harish> that if this was implemented, I would just use harish> ib_req_ncom_notif with a count of 10 in place of harish> ib_req_com_notif. Please share your comments on this harish> approach and it will be great if you have a patch with the harish> ib_req_ncom_notif implementation. No, I have not implemented that function. But if you request an event after 10 completions then you run into a problem of getting stuck if only 9 completions are generated. - R. From sashak at voltaire.com Tue Aug 29 18:29:56 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 30 Aug 2006 04:29:56 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: rpc API which supports more than one ports. 
In-Reply-To: <1156896546.4509.43618.camel@hal.voltaire.com> References: <20060825131734.3786.74359.stgit@sashak.voltaire.com> <1156896546.4509.43618.camel@hal.voltaire.com> Message-ID: <20060830012956.GA12356@sashak.voltaire.com> Hi Hal, On 20:09 Tue 29 Aug , Hal Rosenstock wrote: > Hi Sasha, > > On Fri, 2006-08-25 at 09:17, Sasha Khapyorsky wrote: > > This provides RPC like API which may work with several ports. > > I think you mean "can work" rather "may work" :-) Yes. Some limitation we will have from libumad - this tracks already open ports. I'm not sure why (the same port can be opened from another process or by forking current). I think this may be the next improvement there. > > > Signed-off-by: Sasha Khapyorsky > > --- > > > > libibmad/include/infiniband/mad.h | 9 +++ > > libibmad/src/libibmad.map | 4 + > > libibmad/src/register.c | 20 +++++-- > > libibmad/src/rpc.c | 106 +++++++++++++++++++++++++++++++++++-- > > libibumad/src/umad.c | 4 + > > ../doc/libibmad.txt should also be updated appropriately for the new > routines. Sure, I thought to stabilize this API first. > > > 5 files changed, 130 insertions(+), 13 deletions(-) > > > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > > index 45ff572..bd8a80b 100644 > > --- a/libibmad/include/infiniband/mad.h > > +++ b/libibmad/include/infiniband/mad.h > > @@ -660,6 +660,7 @@ uint64_t mad_trid(void); > > int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); > > > > /* register.c */ > > +int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); > > int mad_register_client(int mgmt, uint8_t rmpp_version); > > int mad_register_server(int mgmt, uint8_t rmpp_version, > > uint32_t method_mask[4], uint32_t class_oui); > > @@ -704,6 +705,14 @@ void madrpc_lock(void); > > void madrpc_unlock(void); > > void madrpc_show_errors(int set); > > > > +void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > > + int num_classes); > > +void mad_rpc_close_port(void *ibmad_port); > > +void * mad_rpc(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > > + void *payload, void *rcvdata); > > +void * mad_rpc_rmpp(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > > + ib_rmpp_hdr_t *rmpp, void *data); > > + > > /* smp.c */ > > uint8_t * smp_query(void *buf, ib_portid_t *id, uint attrid, uint mod, > > uint timeout); > > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > > index bf81bd1..78b7ff0 100644 > > --- a/libibmad/src/libibmad.map > > +++ b/libibmad/src/libibmad.map > > @@ -62,6 +62,10 @@ IBMAD_1.0 { > > This should be 1.1 Ok. > > > ib_resolve_self; > > ib_resolve_smlid; > > ibdebug; > > + mad_rpc_open_port; > > + mad_rpc_close_port; > > + mad_rpc; > > + mad_rpc_rmpp; > > madrpc; > > madrpc_def_timeout; > > madrpc_init; > > What about mad_register_port_client ? Should that be included here ? It is not used externally - all registrations are done in _open(). So I don't see this as part of the new "API". Maybe if we will decide to extend it later we will need to "export" this symbol. 
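To make the two points above concrete (the 1.0 vs 1.1 question and which of the new symbols are exported), one conventional way to handle it in libibmad.map is a new version node that inherits from the existing one rather than growing IBMAD_1.0; this is only a sketch of the usual linker version-script idiom, not necessarily what was committed:

IBMAD_1.1 {
	global:
		mad_rpc_open_port;
		mad_rpc_close_port;
		mad_rpc;
		mad_rpc_rmpp;
} IBMAD_1.0;

Keeping the old node untouched preserves the ABI for existing callers, and mad_register_port_client can stay private until there is a reason to export it.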
> > > diff --git a/libibmad/src/register.c b/libibmad/src/register.c > > index 4f44625..52d6989 100644 > > --- a/libibmad/src/register.c > > +++ b/libibmad/src/register.c > > @@ -43,6 +43,7 @@ #include > > #include > > #include > > #include > > +#include > > > > #include > > #include "mad.h" > > @@ -118,7 +119,7 @@ mad_agent_class(int agent) > > } > > > > int > > -mad_register_client(int mgmt, uint8_t rmpp_version) > > +mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version) > > { > > int vers, agent; > > > > @@ -126,7 +127,7 @@ mad_register_client(int mgmt, uint8_t rm > > DEBUG("Unknown class %d mgmt_class", mgmt); > > return -1; > > } > > - if ((agent = umad_register(madrpc_portid(), mgmt, > > + if ((agent = umad_register(port_id, mgmt, > > vers, rmpp_version, 0)) < 0) { > > DEBUG("Can't register agent for class %d", mgmt); > > return -1; > > @@ -137,13 +138,22 @@ mad_register_client(int mgmt, uint8_t rm > > return -1; > > } > > > > - if (register_agent(agent, mgmt) < 0) > > - return -1; > > - > > return agent; > > } > > > > int > > +mad_register_client(int mgmt, uint8_t rmpp_version) > > +{ > > + int agent; > > + > > + agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version); > > + if (agent < 0) > > + return agent; > > + > > + return register_agent(agent, mgmt); > > +} > > + > > +int > > mad_register_server(int mgmt, uint8_t rmpp_version, > > uint32_t method_mask[4], uint32_t class_oui) > > { > > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > > index b2d3e77..ac4f361 100644 > > --- a/libibmad/src/rpc.c > > +++ b/libibmad/src/rpc.c > > @@ -48,6 +48,13 @@ #include > > #include > > #include "mad.h" > > > > +#define MAX_CLASS 256 > > + > > +struct ibmad_port { > > + int port_id; /* file descriptor returned by umad_open() */ > > + int class_agents[MAX_CLASS]; /* class2agent mapper */ > > +}; > > + > > int ibdebug; > > > > static int mad_portid = -1; > > @@ -105,7 +112,8 @@ madrpc_portid(void) > > } > > > > static int > > -_do_madrpc(void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) > > +_do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > > + int timeout) > > { > > uint32_t trid; /* only low 32 bits */ > > int retries; > > @@ -133,7 +141,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > } > > > > length = len; > > - if (umad_send(mad_portid, agentid, sndbuf, length, timeout, 0) < 0) { > > + if (umad_send(port_id, agentid, sndbuf, length, timeout, 0) < 0) { > > IBWARN("send failed; %m"); > > return -1; > > } > > @@ -141,7 +149,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > /* Use same timeout on receive side just in case */ > > /* send packet is lost somewhere. 
*/ > > do { > > - if (umad_recv(mad_portid, rcvbuf, &length, timeout) < 0) { > > + if (umad_recv(port_id, rcvbuf, &length, timeout) < 0) { > > IBWARN("recv failed: %m"); > > return -1; > > } > > @@ -164,8 +172,10 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > } > > > > void * > > -madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) > > +mad_rpc(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, void *payload, > > + void *rcvdata) > > { > > + struct ibmad_port *p = port_id; > > int status, len; > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > @@ -175,7 +185,8 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport > > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > > return 0; > > > > - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), > > + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > + p->class_agents[rpc->mgtclass], > > len, rpc->timeout)) < 0) > > return 0; > > > > @@ -198,8 +209,10 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport > > } > > > > void * > > -madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > > +mad_rpc_rmpp(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, > > + ib_rmpp_hdr_t *rmpp, void *data) > > { > > + struct ibmad_port *p = port_id; > > int status, len; > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > @@ -210,7 +223,8 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * > > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > > return 0; > > > > - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), > > + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > + p->class_agents[rpc->mgtclass], > > len, rpc->timeout)) < 0) > > return 0; > > > > @@ -249,6 +263,24 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * > > return data; > > } > > > > +void * > > +madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) > > +{ > > + struct ibmad_port port; > > + port.port_id = mad_portid; > > + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); > > + return mad_rpc(&port, rpc, dport, payload, rcvdata); > > +} > > + > > +void * > > +madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > > +{ > > + struct ibmad_port port; > > + port.port_id = mad_portid; > > + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); > > + return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); > > +} > > + > > static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; > > > > void > > @@ -282,3 +314,63 @@ madrpc_init(char *dev_name, int dev_port > > IBPANIC("client_register for mgmt %d failed", mgmt); > > } > > } > > + > > +void * > > +mad_rpc_open_port(char *dev_name, int dev_port, > > + int *mgmt_classes, int num_classes) > > +{ > > + struct ibmad_port *p; > > + int port_id; > > Should there be some validation on num_classes < MAX_CLASS ? Such check is cheap and may be performed (it was not done in madrpc_init()). Without this the function will "work" (will fail), but in longer way (this will fail to register an agent when MAX_CLASS will be overflowed). 
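To make the suggested check concrete, a guard near the top of mad_rpc_open_port() could look roughly like the fragment below (illustrative only, not part of the posted patch; it reuses the IBWARN/errno style already used in that function):

	if (num_classes < 0 || num_classes >= MAX_CLASS) {
		IBWARN("invalid number of management classes %d", num_classes);
		errno = EINVAL;
		return NULL;
	}

With such a check the caller fails fast with EINVAL instead of hitting a later agent registration failure.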
> > > + if (umad_init() < 0) { > > + IBWARN("can't init UMAD library"); > > + errno = ENODEV; > > + return NULL; > > + } > > + > > + p = malloc(sizeof(*p)); > > + if (!p) { > > + errno = ENOMEM; > > + return NULL; > > + } > > + memset(p, 0, sizeof(*p)); > > + > > + if ((port_id = umad_open_port(dev_name, dev_port)) < 0) { > > + IBWARN("can't open UMAD port (%s:%d)", dev_name, dev_port); > > + if (!errno) > > + errno = EIO; > > + free(p); > > + return NULL; > > + } > > + > > + while (num_classes--) { > > + int rmpp_version = 0; > > + int mgmt = *mgmt_classes++; > > + int agent; > > + > > + if (mgmt == IB_SA_CLASS) > > + rmpp_version = 1; > > There are other classes which can use RMPP. How are they handled ? This is copy & paste from madrpc_init(). This problem is generic for libibmad and I think should be fixed separately (maybe in mad_register_port_client()). > > > + if (mgmt < 0 || mgmt >= MAX_CLASS || > > + (agent = mad_register_port_client(port_id, mgmt, > > + rmpp_version)) < 0) { > > + IBWARN("client_register for mgmt %d failed", mgmt); > > + if(!errno) > > + errno = EINVAL; > > + umad_close_port(port_id); > > + free(p); > > + return NULL; > > + } > > + p->class_agents[mgmt] = agent; > > + } > > + > > + p->port_id = port_id; > > + return p; > > +} > > + > > +void > > +mad_rpc_close_port(void *port_id) > > +{ > > + struct ibmad_port *p = port_id; > > + umad_close_port(p->port_id); > > + free(p); > > +} > > diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c > > index a99fb5a..cb9eef6 100644 > > --- a/libibumad/src/umad.c > > +++ b/libibumad/src/umad.c > > @@ -93,12 +93,14 @@ port_alloc(int portid, char *dev, int po > > > > if (portid < 0 || portid >= UMAD_MAX_PORTS) { > > IBWARN("bad umad portid %d", portid); > > + errno = EINVAL; > > return 0; > > } > > > > if (port->dev_name[0]) { > > IBWARN("umad port id %d is already allocated for %s %d", > > portid, port->dev_name, port->dev_port); > > + errno = EBUSY; > > return 0; > > } > > > > @@ -567,7 +569,7 @@ umad_open_port(char *ca_name, int portnu > > return -EINVAL; > > > > if (!(port = port_alloc(umad_id, ca_name, portnum))) > > - return -EINVAL; > > + return -errno; > > > > snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", > > UMAD_DEV_DIR , umad_id); > > Is the umad.c change really a separate change from the rest ? It was done in order to provide the meanfull errno value in case of mad_rpc_open() failure (not needed with madrpc_init() because it does exit() if something is wrong) and this can be separated. > If so, > this patch should be broken into two parts and that is the first part. Agree. > No need to resubmit for this. Ok. And for the rest of changes? Sasha > > -- Hal > From mvharish at gmail.com Tue Aug 29 18:28:43 2006 From: mvharish at gmail.com (harish) Date: Tue, 29 Aug 2006 18:28:43 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC In-Reply-To: References: Message-ID: Hi Roland, I was hoping that the function will take care of that by introducing some kind of timer since last notification. The logic I had in mind was that the notification is triggered in either of the two events is true: ->We have 10+ completions or ->Time since last notification > (some specified value) && (there is atleast one item in the cq queue). Is this feasible? 
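For illustration, the policy sketched above can only be approximated in software with today's verbs, e.g. by deferring CQ re-arming in the consumer's poll routine. The fragment below is such an approximation (not an existing implementation); NOTIFY_BUDGET, NOTIFY_TIMEOUT and last_notify are made-up names, and an external timer is still needed to revisit the CQ when neither condition holds.

#include <linux/jiffies.h>
#include <rdma/ib_verbs.h>

#define NOTIFY_BUDGET	10		/* "10+ completions" */
#define NOTIFY_TIMEOUT	(HZ / 100)	/* "some specified value" */

static unsigned long last_notify;

static void poll_and_maybe_rearm(struct ib_cq *cq)
{
	struct ib_wc wc;
	int drained = 0;

	/* Drain whatever is currently queued on the CQ. */
	while (ib_poll_cq(cq, 1, &wc) > 0) {
		/* ... handle the completion ... */
		drained++;
	}

	/* Ask for another completion event only after a full batch or
	 * after the time budget expires; otherwise a timer has to call
	 * this routine again so completions are not left unserviced. */
	if (drained >= NOTIFY_BUDGET ||
	    time_after(jiffies, last_notify + NOTIFY_TIMEOUT)) {
		last_notify = jiffies;
		ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
	}
}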
Thanks for your time, harish On 8/29/06, Roland Dreier wrote: > > harish> Hi Roland, As regards the CQ notification frequency, I > harish> noticed that the function req_ncomp_notif used by > harish> ib_req_ncom_notif is not implemented yet. I was hoping > harish> that if this was implemented, I would just use > harish> ib_req_ncom_notif with a count of 10 in place of > harish> ib_req_com_notif. Please share your comments on this > harish> approach and it will be great if you have a patch with the > harish> ib_req_ncom_notif implementation. > > No, I have not implemented that function. > > But if you request an event after 10 completions then you run into a > problem of getting stuck if only 9 completions are generated. > > - R. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Aug 29 18:36:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 Aug 2006 18:36:23 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC In-Reply-To: (harish's message of "Tue, 29 Aug 2006 18:28:43 -0700") References: Message-ID: harish> Hi Roland, I was hoping that the function will take care harish> of that by introducing some kind of timer since last harish> notification. The logic I had in mind was that the harish> notification is triggered in either of the two events is harish> true: -> We have 10+ completions or Time since last notification > (some -> specified value) && (there is atleast harish> one item in the cq queue). Is this feasible? Not with any existing hardware that I know of. - R. From sashak at voltaire.com Tue Aug 29 18:47:24 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 30 Aug 2006 04:47:24 +0300 Subject: [openib-general] [PATCH] opensm: truncate log file when fs is overflowed In-Reply-To: <1156892081.4509.41264.camel@hal.voltaire.com> References: <20060820160538.12435.23041.stgit@sashak.voltaire.com> <1156093312.9855.162745.camel@hal.voltaire.com> <20060820171808.GZ18411@sashak.voltaire.com> <1156717686.15782.93.camel@fc6.xsintricity.com> <20060829152126.GA1749@greglaptop> <1156878810.4509.35185.camel@hal.voltaire.com> <1156890315.28832.75.camel@fc6.xsintricity.com> <1156892081.4509.41264.camel@hal.voltaire.com> Message-ID: <20060830014724.GB12356@sashak.voltaire.com> On 18:54 Tue 29 Aug , Hal Rosenstock wrote: > On Tue, 2006-08-29 at 18:25, Doug Ledford wrote: > > On Tue, 2006-08-29 at 15:13 -0400, Hal Rosenstock wrote: > > > On Tue, 2006-08-29 at 11:21, Greg Lindahl wrote: > > > > On Sun, Aug 27, 2006 at 06:28:06PM -0400, Doug Ledford wrote: > > > > > > > > > I would definitely put the option in, and in fact would default it to > > > > > *NOT* truncate. > > > > > > > > I agree. I have never seen any other daemon with a logfile do this, > > > > why are we out to surprise the admin? The admin might want the start > > > > of the long instead of the end. And so on. > > > > > > I agree too but there is perhaps a difference here per degree: OpenSM > > > can spew copious logging and fill /var/log readily. > > > > In which case the -L option to limit the maximum log file size makes > > sense (as does making sure that the default logging level is warnings > > and above, not all sorts of informational stuff unless requested in > > order to keep logs more manageable under normal circumstances). > > > > In response to another statement made in another email, for better or > > worse, OpenSM *is* a system daemon at this point. 
If you don't have (or > > elect not to use) a switch embedded SM, then OpenSM is necessary to keep > > your fabric operational. > > > > Or to put it another way: if you need to start it during init for your > > system to operate properly, and if it needs to keep running all the > > time, and if you need elevated permissions to run it, and if it has an > > init script...you get the point...the rest of the world is going to tell > > you this is a system daemon with console capabilities, not a console > > program with daemon capabilities. As such, always thinking of it from > > the perspective of a daemon would be wise in terms of avoiding > > surprising users down the road IMO. > > Well stated and I concur with that viewpoint. > > I may be mistaken but I think Sasha may have been referring to some > things that could/might be done to make it behave better as a daemon. Yes, exactly. My point was to state that OpenSM as it is today cannot be called a daemon. And I agree that there is functional requirement for this. Sasha > > -- Hal > From dfrench at mtknox.com Tue Aug 29 20:27:17 2006 From: dfrench at mtknox.com (dfrench at mtknox.com) Date: Tue, 29 Aug 2006 22:27:17 -0500 (CDT) Subject: [openib-general] Mt Knox Datacenter in Franklin TN Message-ID: <20060830032717.4788910955@mtknox05> Can you please help me to contact your Information Technology Director? My company has just opened a new computer datacenter in Franklin, TN for the purpose of providing disaster recovery, business continuity, and outsourcing services to small and medium size businesses in the Nashville area. This facility can also aid companies such as yours in compliance with HIPAA and Sarbanes/Oxley. If you can provide me with contact information for your IT director, it would be greatly appreciated. Regards, Dana French 615.556.0456 President, Mt Knox Availability Services dfrench at mtknox.com http://www.mtknox.com From zhushisongzhu at yahoo.com Tue Aug 29 20:55:31 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Tue, 29 Aug 2006 20:55:31 -0700 (PDT) Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060829110427.GA23560@mellanox.co.il> Message-ID: <20060830035531.76121.qmail@web36907.mail.mud.yahoo.com> If it's spec bug is it difficult to solve? And how long will it take you to complete the bugfix? I really hope SDP can work as stably as TCP. zhu --- "Michael S. Tsirkin" wrote: > I did - this is the spec bug we are discussing with > Sean. > > Quoting r. zhu shi song : > > Subject: Re: why sdp connections cost so much > memory > > > > Have you resolved the problem? > > zhu > > > > --- "Michael S. Tsirkin" > wrote: > > > > > Yes, I have reproduced the connection refusal > > > problem and I am looking into it. > > > Thanks! > > > > > > MST > > > > > > Quoting r. zhu shi song > : > > > Subject: Re: why sdp connections cost so much > memory > > > > > > I haven't met kernel crashes using rc2. But > there > > > always occurred connection refusal when max > > > concurrent > > > connections set above 200. All is right when max > > > concurrent connections is set to below 200. ( > If > > > using TCP to take the same test, there is no > > > problem.) 
> > > (1) > > > This is ApacheBench, Version 2.0.41-dev > <$Revision: > > > 1.141 $> apache-2.0 > > > Copyright (c) 1996 Adam Twiss, Zeus Technology > Ltd, > > > http://www.zeustech.net/ > > > Copyright (c) 1998-2002 The Apache Software > > > Foundation, http://www.apache.org/ > > > > > > Benchmarking www.google.com [through > > > 193.12.10.14:3129] (be patient) > > > Completed 100 requests > > > Completed 200 requests > > > apr_recv: Connection refused (111) > > > Total of 257 requests completed > > > (2) > > > This is ApacheBench, Version 2.0.41-dev > <$Revision: > > > 1.141 $> apache-2.0 > > > Copyright (c) 1996 Adam Twiss, Zeus Technology > Ltd, > > > http://www.zeustech.net/ > > > Copyright (c) 1998-2002 The Apache Software > > > Foundation, http://www.apache.org/ > > > > > > Benchmarking www.google.com [through > > > 193.12.10.14:3129] (be patient) > > > Completed 100 requests > > > Completed 200 requests > > > apr_recv: Connection refused (111) > > > Total of 256 requests completed > > > [root at IB-TEST squid.test]# > > > > > > zhu > > > > > > > > > > > > > > > --- "Michael S. Tsirkin" > wrote: > > > > > > > Quoting r. zhu shi song > : > > > > > --- "Michael S. Tsirkin" > > > > > wrote: > > > > > > > > > > > Quoting r. zhu shi song > > > > : > > > > > > > (3) one time linux kernel on the client > > > > crashed. I > > > > > > > copy the output from the screen. > > > > > > > Process sdp (pid:4059, threadinfo > > > > 0000010036384000 > > > > > > > task 000001003ea10030) > > > > > > > Call > > > > > > > > > > > > > > > > > > > > > Trace:{:ib_sdp:sdp_destroy_workto} > > > > > > > > > > {:ib_sdp:sdp_destroy_qp+77} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {worker_thread+419}{default_wake_function+0} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {default_wake_function+0}{keventd_create_kthread+0} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {worker_thread+0}{keventd_create_kthread+0} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {kthread+200}{child_rip+8} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > > > > > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b > 45 > > > 31 > > > > ff > > > > > > 45 > > > > > > > 31 ed 4c 89 > > > > > > > > > > > > > > > > > > > > > > > > > > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > > > > > > CR2:0000000000000004 > > > > > > > <0>kernel panic-not syncing:Oops > > > > > > > > > > > > > > zhu > > > > > > > > > > > > Hmm, the stack dump does not match my > sources. > > > > Is > > > > > > this OFED rc1? > > > > > > Could you send me the sdp_main.o and > > > sdp_main.c > > > > > > files from your system please? > > > > > > > > --- > > > > > > > > > Subject: Re: why sdp connections cost so > much > > > > memory > > > > > > > > > > please see the attachment. > > > > > zhu > > > > > > > > Ugh, so its crashing inside sdp_bcopy ... > > > > > > > > By the way, could you please re-test with OFED > > > rc2? > > > > We've solved a couple of bugs there ... > > > > > > > > If this still crashes, could you please post > the > > > > whole > > > > sdp directory, with .o and .c files? > > > > > > > > Thanks, > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Tue Aug 29 21:57:26 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Aug 2006 07:57:26 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F4AB73.2070208@ichips.intel.com> References: <44F4AB73.2070208@ichips.intel.com> Message-ID: <20060830045726.GA25478@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > >>If we completely ignore timewait, what conditions are required to have a problem > >>occur? > > > > Outstanding packets with PSNs and QP numbers coinside between the 2 connections. > > Look for "Stale packet" in IB spec. > > From what I can tell, a QP will receive an incoming packet incorrectly if the > SLID and PSN match that of its current connection, which matches with your > statement. Stale packets could cause this, but so can misconfigured QPs. (I'm > just trying to understand how large the problem is, and how much of it does > timewait solve.) > > > Hmm. We can ask user not to post sends if he rejects the REP. > > Then there won't be stale packets. But is there anything in spec that > > forbids this? > > See page 690 of the spec. It implies that the QP should go to RTS only if an > RTU is sent. > > Note that if the DREQ is lost, it's possible for the remote side to initiate a > send after the local QP has exited timewait, which seems to defeat its purpose > in this case. And so can RTU, in which case again QP will be in RTR. So it seems lost CM packets aren't protected by timewait. > > Maybe an extra call is better than assuming things beyond spec > > requirements? > > I'm still trying to determine who provides the timewait duration. Verbs allows > users to connect QPs without going through the CM, and several apps do this. > Timewait provides only partial protection against this problem, so maybe we only > restrict it to handling the most common case, which is after the QP has > transitioned to RTS. > > Another alternative to solving this problem is to select a PSN value that is > likely to discard stale packets. Can the lower level driver be of any > assistance here? I.e. would it know what the last PSN value was received on a QP? At least in case of mthca, I think we do have the last PSN. So I guess we could have a special wildcard PSN value that let's low level driver select it. What would a good value for the PSN be? -- MST From mst at mellanox.co.il Tue Aug 29 21:59:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Aug 2006 07:59:28 +0300 Subject: [openib-general] why sdp connections cost so much memory In-Reply-To: <20060830035531.76121.qmail@web36907.mail.mud.yahoo.com> References: <20060830035531.76121.qmail@web36907.mail.mud.yahoo.com> Message-ID: <20060830045927.GB25478@mellanox.co.il> We will have a work-around in OFED 1.1. Quoting r. zhu shi song : Subject: Re: why sdp connections cost so much memory If it's spec bug is it difficult to solve? And how long will it take you to complete the bugfix? I really hope SDP can work as stably as TCP. zhu --- "Michael S. Tsirkin" wrote: > I did - this is the spec bug we are discussing with > Sean. > > Quoting r. zhu shi song : > > Subject: Re: why sdp connections cost so much > memory > > > > Have you resolved the problem? > > zhu > > > > --- "Michael S. 
Tsirkin" > wrote: > > > > > Yes, I have reproduced the connection refusal > > > problem and I am looking into it. > > > Thanks! > > > > > > MST > > > > > > Quoting r. zhu shi song > : > > > Subject: Re: why sdp connections cost so much > memory > > > > > > I haven't met kernel crashes using rc2. But > there > > > always occurred connection refusal when max > > > concurrent > > > connections set above 200. All is right when max > > > concurrent connections is set to below 200. ( > If > > > using TCP to take the same test, there is no > > > problem.) > > > (1) > > > This is ApacheBench, Version 2.0.41-dev > <$Revision: > > > 1.141 $> apache-2.0 > > > Copyright (c) 1996 Adam Twiss, Zeus Technology > Ltd, > > > http://www.zeustech.net/ > > > Copyright (c) 1998-2002 The Apache Software > > > Foundation, http://www.apache.org/ > > > > > > Benchmarking www.google.com [through > > > 193.12.10.14:3129] (be patient) > > > Completed 100 requests > > > Completed 200 requests > > > apr_recv: Connection refused (111) > > > Total of 257 requests completed > > > (2) > > > This is ApacheBench, Version 2.0.41-dev > <$Revision: > > > 1.141 $> apache-2.0 > > > Copyright (c) 1996 Adam Twiss, Zeus Technology > Ltd, > > > http://www.zeustech.net/ > > > Copyright (c) 1998-2002 The Apache Software > > > Foundation, http://www.apache.org/ > > > > > > Benchmarking www.google.com [through > > > 193.12.10.14:3129] (be patient) > > > Completed 100 requests > > > Completed 200 requests > > > apr_recv: Connection refused (111) > > > Total of 256 requests completed > > > [root at IB-TEST squid.test]# > > > > > > zhu > > > > > > > > > > > > > > > --- "Michael S. Tsirkin" > wrote: > > > > > > > Quoting r. zhu shi song > : > > > > > --- "Michael S. Tsirkin" > > > > > wrote: > > > > > > > > > > > Quoting r. zhu shi song > > > > : > > > > > > > (3) one time linux kernel on the client > > > > crashed. I > > > > > > > copy the output from the screen. > > > > > > > Process sdp (pid:4059, threadinfo > > > > 0000010036384000 > > > > > > > task 000001003ea10030) > > > > > > > Call > > > > > > > > > > > > > > > > > > > > > Trace:{:ib_sdp:sdp_destroy_workto} > > > > > > > > > > {:ib_sdp:sdp_destroy_qp+77} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {:ib_sdp:sdp_destruct+279}{sk_free+28} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {worker_thread+419}{default_wake_function+0} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {default_wake_function+0}{keventd_create_kthread+0} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {worker_thread+0}{keventd_create_kthread+0} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {kthread+200}{child_rip+8} > > > > > > > > > > > > > > > > > > > > > > > > > > > > {keventd_create_kthread+0}{kthread+0}{child_rip+0} > > > > > > > Code:8b 40 04 41 39 c6 89 44 24 0c 7d 3b > 45 > > > 31 > > > > ff > > > > > > 45 > > > > > > > 31 ed 4c 89 > > > > > > > > > > > > > > > > > > > > > > > > > > > > RIP:{:ib_sdp:sdp_recv_completion+127}RSP<0000010036385dc8> > > > > > > > CR2:0000000000000004 > > > > > > > <0>kernel panic-not syncing:Oops > > > > > > > > > > > > > > zhu > > > > > > > > > > > > Hmm, the stack dump does not match my > sources. > > > > Is > > > > > > this OFED rc1? > > > > > > Could you send me the sdp_main.o and > > > sdp_main.c > > > > > > files from your system please? > > > > > > > > --- > > > > > > > > > Subject: Re: why sdp connections cost so > much > > > > memory > > > > > > > > > > please see the attachment. 
> > > > > zhu > > > > > > > > Ugh, so its crashing inside sdp_bcopy ... > > > > > > > > By the way, could you please re-test with OFED > > > rc2? > > > > We've solved a couple of bugs there ... > > > > > > > > If this still crashes, could you please post > the > > > > whole > > > > sdp directory, with .o and .c files? > > > > > > > > Thanks, > > -- > MST > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com -- MST From mvharish at gmail.com Tue Aug 29 23:49:36 2006 From: mvharish at gmail.com (harish) Date: Tue, 29 Aug 2006 23:49:36 -0700 Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC In-Reply-To: References: Message-ID: Hi Roland, Thanks. I guess then the only neat solution is NAPI. I shall keep you posted if we can get NAPI implemented in the meanwhile. thanks, harish On 8/29/06, Roland Dreier wrote: > > harish> Hi Roland, I was hoping that the function will take care > harish> of that by introducing some kind of timer since last > harish> notification. The logic I had in mind was that the > harish> notification is triggered in either of the two events is > harish> true: > -> We have 10+ completions or Time since last notification > (some > -> specified value) && (there is atleast > harish> one item in the cq queue). Is this feasible? > > Not with any existing hardware that I know of. > > - R. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From HNGUYEN at de.ibm.com Wed Aug 30 02:13:26 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Wed, 30 Aug 2006 11:13:26 +0200 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: Message-ID: > Christoph Raisch wrote on 18.08.2006 17:35:54: > we'll change these EDEBs to a wrapper around dev_err, dev_dbg and > dev_warn as it's done in the mthca driver. > All EDEB_EN and EDEB_EX will be removed, that type of tracing can be > done if needed by kprobes. > There are a few cases where we won't get to a dev, for these few > places we'll use a simple wrapper around printk, as done in ipoib. We incorporated those changes throughout ehca code, which is accessible from Roland's git tree: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-2.6.19 Further comments/suggestions are appreciated! Regards Hoang-Nam Nguyen From arnd.bergmann at de.ibm.com Wed Aug 30 02:43:34 2006 From: arnd.bergmann at de.ibm.com (Arnd Bergmann) Date: Wed, 30 Aug 2006 11:43:34 +0200 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: References: Message-ID: <200608301143.35320.arnd.bergmann@de.ibm.com> On Wednesday 30 August 2006 11:13, Hoang-Nam Nguyen wrote: > Further comments/suggestions are appreciated! There are a few places in the driver where you declare external variables (mostly ehca_module and ehca_debug_level) from C files instead of a header. This sometimes leads to bugs when a type changes and is therefore considered bad style. ehca_debug_level is already declared in a header so you should not need any other declaration. For ehca_module, the usage pattern is very uncommon. Declaring the structure in a header helps a bit, but I don't really see the need for this structure at all. Each member of the struct seems to be used mostly in a single file, so I would declare it statically in there. E.g. 
in drivers/infiniband/hw/ehca/ehca_pd.c, you can do static struct kmem_cache *ehca_pd_cache; int ehca_init_pd_cache(void) { ehca_pd_cache = kmem_cache_init("ehca_cache_pd", sizeof(struct ehca_pd), 0, SLAB_HWCACHE_ALIGN, NULL, NULL); if (!ehca_pd_cache) return -ENOMEM; return 0; } void ehca_cleanup_pd_cache(void) { if (ehca_pd_cache) kmem_cache_destroy(ehca_pd_cache); } Moreover, for some of your more heavily used caches, you may want to look into using constructor/destructor calls to speed up allocation. Arnd <>< From halr at voltaire.com Wed Aug 30 06:16:26 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2006 09:16:26 -0400 Subject: [openib-general] Utilities for sending traffic with different SL In-Reply-To: <200608242042.k7OKgBWG010194@mail.baymicrosystems.com> References: <200608242042.k7OKgBWG010194@mail.baymicrosystems.com> Message-ID: <1156943786.4504.843.camel@hal.voltaire.com> Hi Suri, On Thu, 2006-08-24 at 16:42, Suresh Shelvapille wrote: > Folks: > > Is there a utility within the OFED1.0 package which can be used for > generating traffic on different SL (akin to the Voltaire perf_main utility)? With OpenSM, you can configure an SL per IPoIB based partition and use any IP based program for SL based testing. It likely isn't hard to add SL to any of the test examples in OpenIB as a parameter. -- Hal > Many thanks, > Suri > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Wed Aug 30 06:21:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2006 09:21:21 -0400 Subject: [openib-general] IPoIB In-Reply-To: References: Message-ID: <1156944081.4504.993.camel@hal.voltaire.com> Hi John, On Thu, 2006-08-24 at 06:29, john t wrote: > Hi, > > Does IPoIB work across IB subnets. IPoIB architecture is capable of having an IPoIB subnet span multiple IB subnets. (You can obviously route between 2 IPoIB subnets or other IP subnets as long as the IP routing is setup properly). However, IB architecture for IB routers is currently incomplete. > For example if there are 4 IB subnets and all the IB subnets are > reachable (meaning there is one host that is in all the IB subnets), > then is it possible to ping/ssh to hosts in other IB subnets using > IPoIB (assuming there is no ethernet) A complete solution would support this. Can you say what specific implementation components are being used here ? -- Hal > Regards, > John T. > > ______________________________________________________________________ > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Aug 30 06:30:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2006 09:30:19 -0400 Subject: [openib-general] [PATCH] osm: Dynamic verbosity control per file In-Reply-To: <1156678202.24539.11.camel@kliteynik.yok.mtl.com> References: <1155656058.29378.8180.camel@hal.voltaire.com> <1156678202.24539.11.camel@kliteynik.yok.mtl.com> Message-ID: <1156944616.4504.1284.camel@hal.voltaire.com> Hi again Yevgeny, On Sun, 2006-08-27 at 07:30, Yevgeny Kliteynik wrote: > Hi Hal. 
> > > > By default, the OSM will use the following file: /etc/opensmlog.conf > > > > Nit: For consistency in naming, this would be better as osmlog.conf > > (or > > osm-log.conf) rather than opensmlog.conf > > Right - will use osm-log.conf > > > Rather than remove osm_log and osm_log_raw, these should be > > deprecated. > > There are other applications outside of OpenSM (like osmtest and > > others) > > that need this. > > You're right, osm_log & osm_log_raw are no longer appear in the > API, but they are not removed from headers - they are now macros, > so the old code will still compile. It's still an API change and old executables won't work with this new opensm library. > > Also, is this functionality needed for OFED 1.1 or is this trunk > > only ? > > It doesn't have to get to 1.1. > I'll send a second version of this patch that will address all your > comments, including the addition in the osm man pages. I see it. I'm still catching back up on my emails after vacation. -- Hal > Thanks, > > Yevgeny > > On Tue, 2006-08-15 at 18:34 +0300, Hal Rosenstock wrote: > > > Also, is this functionality needed for OFED 1.1 or is this trunk > > only ? > > > > Thanks. > > > > -- Hal > > > > > 1. Verbosity configuration file > > > -- > > > > > > The user is able to set verbosity level per source code file > > > by supplying verbosity configuration file using the following > > > command line arguments: > > > > > > -b filename > > > --verbosity_file filename > > > > > > By default, the OSM will use the following file: /etc/opensmlog.conf > > > > Nit: For consistency in naming, this would be better as osmlog.conf > > (or > > osm-log.conf) rather than opensmlog.conf > > > > > Verbosity configuration file should contain zero or more lines of > > > the following pattern: > > > > > > filename verbosity_level > > > > > > where 'filename' is the name of the source code file that the > > > 'verbosity_level' refers to, and the 'verbosity_level' itself > > > should be specified as an integer number (decimal or hexadecimal). > > > > > > One reserved filename is 'all' - it represents general verbosity > > > level, that is used for all the files that are not specified in > > > the verbosity configuration file. > > > If 'all' is not specified, the verbosity level set in the > > > command line will be used instead. > > > Note: The 'all' file verbosity level will override any other > > > general level that was specified by the command line arguments. > > > > > > Sending a SIGHUP signal to the OSM will cause it to reload > > > the verbosity configuration file. > > > > > > > > > 2. Logging source code filename and line number > > > -- > > > > > > If command line option -S or --log_source_info is specified, > > > OSM will add source code filename and line number to every > > > log message that is written to the log file. > > > By default, the OSM will not log this additional info. 
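As an illustration of the format described above (the file names and levels here are arbitrary examples, not taken from the patch), a verbosity configuration file could look like:

	all 0x03
	osm_ucast_mgr.c 0xFF
	osm_state_mgr.c 0x07

where 'all' sets the general level and the per-file entries override it for those source files; sending SIGHUP to opensm reloads the file as described above.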
> > > > > > > > > Yevgeny > > > > > > Signed-off-by: Yevgeny Kliteynik > > > > > > Index: include/opensm/osm_subnet.h > > > > > =================================================================== > > > -- include/opensm/osm_subnet.h (revision 8614) > > > +++ include/opensm/osm_subnet.h (working copy) > > > @@ -285,6 +285,8 @@ typedef struct _osm_subn_opt > > > osm_qos_options_t qos_sw0_options; > > > osm_qos_options_t qos_swe_options; > > > osm_qos_options_t qos_rtr_options; > > > + boolean_t src_info; > > > + char * verbosity_file; > > > } osm_subn_opt_t; > > > /* > > > * FIELDS > > > @@ -463,6 +465,27 @@ typedef struct _osm_subn_opt > > > * qos_rtr_options > > > * QoS options for router ports > > > * > > > +* src_info > > > +* If TRUE - the source code filename and line number will > > be > > > +* added to each log message. > > > +* Default value - FALSE. > > > +* > > > +* verbosity_file > > > +* OSM log configuration file - the file that describes > > > +* verbosity level per source code file. > > > +* The file may containg zero or more lines of the following > > > +* pattern: > > > +* filename verbosity_level > > > +* where 'filename' is the name of the source code file that > > > +* the 'verbosity_level' refers to. > > > +* Filename "all" represents general verbosity level, that > > is > > > +* used for all the files that are not specified in the > > > +* verbosity file. > > > +* If "all" is not specified, the general verbosity level > > will > > > +* be used instead. > > > +* Note: the "all" file verbosity level will override any > > other > > > +* general level that was specified by the command line > > > arguments. > > > +* > > > * SEE ALSO > > > * Subnet object > > > *********/ > > > Index: include/opensm/osm_base.h > > > > > =================================================================== > > > -- include/opensm/osm_base.h (revision 8614) > > > +++ include/opensm/osm_base.h (working copy) > > > @@ -222,6 +222,22 @@ BEGIN_C_DECLS > > > #endif > > > /***********/ > > > > > > +/****d* OpenSM: Base/OSM_DEFAULT_VERBOSITY_FILE > > > +* NAME > > > +* OSM_DEFAULT_VERBOSITY_FILE > > > +* > > > +* DESCRIPTION > > > +* Specifies the default verbosity config file name > > > +* > > > +* SYNOPSIS > > > +*/ > > > +#ifdef __WIN__ > > > +#define OSM_DEFAULT_VERBOSITY_FILE strcat(GetOsmPath(), " > > > opensmlog.conf") > > > +#else > > > +#define OSM_DEFAULT_VERBOSITY_FILE "/etc/opensmlog.conf" > > > +#endif > > > +/***********/ > > > + > > > /****d* OpenSM: Base/OSM_DEFAULT_PARTITION_CONFIG_FILE > > > * NAME > > > * OSM_DEFAULT_PARTITION_CONFIG_FILE > > > Index: include/opensm/osm_log.h > > > =================================================================== > > > -- include/opensm/osm_log.h (revision 8652) > > > +++ include/opensm/osm_log.h (working copy) > > > @@ -57,6 +57,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > > > > @@ -123,9 +124,45 @@ typedef struct _osm_log > > > cl_spinlock_t lock; > > > boolean_t flush; > > > FILE* out_port; > > > + boolean_t src_info; > > > + st_table * table; > > > } osm_log_t; > > > /*********/ > > > > > > +/****f* OpenSM: Log/osm_log_read_verbosity_file > > > +* NAME > > > +* osm_log_read_verbosity_file > > > +* > > > +* DESCRIPTION > > > +* This function reads the verbosity configuration file > > > +* and constructs a verbosity data structure. 
> > > +* > > > +* SYNOPSIS > > > +*/ > > > +void > > > +osm_log_read_verbosity_file( > > > + IN osm_log_t* p_log, > > > + IN const char * const verbosity_file); > > > +/* > > > +* PARAMETERS > > > +* p_log > > > +* [in] Pointer to a Log object to construct. > > > +* > > > +* verbosity_file > > > +* [in] verbosity configuration file > > > +* > > > +* RETURN VALUE > > > +* None > > > +* > > > +* NOTES > > > +* If the verbosity configuration file is not found, default > > > +* verbosity value is used for all files. > > > +* If there is an error in some line of the verbosity > > > +* configuration file, the line is ignored. > > > +* > > > +*********/ > > > + > > > + > > > /****f* OpenSM: Log/osm_log_construct > > > * NAME > > > * osm_log_construct > > > @@ -201,9 +238,13 @@ osm_log_destroy( > > > * osm_log_init > > > *********/ > > > > > > -/****f* OpenSM: Log/osm_log_init > > > +#define osm_log_init(p_log, flush, log_flags, log_file, > > > accum_log_file) \ > > > + osm_log_init_ext(p_log, flush, (log_flags), log_file, \ > > > + accum_log_file, FALSE, OSM_DEFAULT_VERBOSITY_FILE) > > > + > > > +/****f* OpenSM: Log/osm_log_init_ext > > > * NAME > > > -* osm_log_init > > > +* osm_log_init_ext > > > * > > > * DESCRIPTION > > > * The osm_log_init function initializes a > > > @@ -211,50 +252,15 @@ osm_log_destroy( > > > * > > > * SYNOPSIS > > > */ > > > -static inline ib_api_status_t > > > -osm_log_init( > > > +ib_api_status_t > > > +osm_log_init_ext( > > > IN osm_log_t* const p_log, > > > IN const boolean_t flush, > > > IN const uint8_t log_flags, > > > IN const char *log_file, > > > - IN const boolean_t accum_log_file ) > > > -{ > > > - p_log->level = log_flags; > > > - p_log->flush = flush; > > > - > > > - if (log_file == NULL || !strcmp(log_file, "-") || > > > - !strcmp(log_file, "stdout")) > > > - { > > > - p_log->out_port = stdout; > > > - } > > > - else if (!strcmp(log_file, "stderr")) > > > - { > > > - p_log->out_port = stderr; > > > - } > > > - else > > > - { > > > - if (accum_log_file) > > > - p_log->out_port = fopen(log_file, "a+"); > > > - else > > > - p_log->out_port = fopen(log_file, "w+"); > > > - > > > - if (!p_log->out_port) > > > - { > > > - if (accum_log_file) > > > - printf("Cannot open %s for appending. Permission denied > > \n", > > > log_file); > > > - else > > > - printf("Cannot open %s for writing. Permission denied\n", > > > log_file); > > > > These lines above are line wrapped so they don't apply. This is an > > email > > issue on your side. > > > > > - > > > - return(IB_UNKNOWN_ERROR); > > > - } > > > - } > > > - openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); > > > - > > > - if (cl_spinlock_init( &p_log->lock ) == CL_SUCCESS) > > > - return IB_SUCCESS; > > > - else > > > - return IB_ERROR; > > > -} > > > + IN const boolean_t accum_log_file, > > > + IN const boolean_t src_info, > > > + IN const char *verbosity_file); > > > /* > > > * PARAMETERS > > > * p_log > > > @@ -271,6 +277,16 @@ osm_log_init( > > > * log_file > > > * [in] if not NULL defines the name of the log file. Otherwise > > it > > > is stdout. > > > * > > > +* accum_log_file > > > +* [in] Whether the log file should be accumulated. > > > +* > > > +* src_info > > > +* [in] Set to TRUE directs the log to add filename and line > > > number > > > +* to each log message. > > > +* > > > +* verbosity_file > > > +* [in] Log configuration file location. > > > +* > > > * RETURN VALUES > > > * CL_SUCCESS if the Log object was initialized > > > * successfully. 
> > > @@ -283,26 +299,32 @@ osm_log_init( > > > * osm_log_destroy > > > *********/ > > > > > > -/****f* OpenSM: Log/osm_log_get_level > > > +#define osm_log_get_level(p_log) \ > > > + osm_log_get_level_ext(p_log, __FILE__) > > > + > > > +/****f* OpenSM: Log/osm_log_get_level_ext > > > * NAME > > > -* osm_log_get_level > > > +* osm_log_get_level_ext > > > * > > > * DESCRIPTION > > > -* Returns the current log level. > > > +* Returns the current log level for the file. > > > +* If the file is not specified in the log config file, > > > +* the general verbosity level will be returned. > > > * > > > * SYNOPSIS > > > */ > > > -static inline osm_log_level_t > > > -osm_log_get_level( > > > - IN const osm_log_t* const p_log ) > > > -{ > > > - return( p_log->level ); > > > -} > > > +osm_log_level_t > > > +osm_log_get_level_ext( > > > + IN const osm_log_t* const p_log, > > > + IN const char* const p_filename ); > > > /* > > > * PARAMETERS > > > * p_log > > > * [in] Pointer to the log object. > > > * > > > +* p_filename > > > +* [in] Source code file name. > > > +* > > > * RETURN VALUES > > > * Returns the current log level. > > > * > > > @@ -310,7 +332,7 @@ osm_log_get_level( > > > * > > > * SEE ALSO > > > * Log object, osm_log_construct, > > > -* osm_log_destroy > > > +* osm_log_destroy, osm_log_get_level > > > *********/ > > > > > > /****f* OpenSM: Log/osm_log_set_level > > > @@ -318,7 +340,7 @@ osm_log_get_level( > > > * osm_log_set_level > > > * > > > * DESCRIPTION > > > -* Sets the current log level. > > > +* Sets the current general log level. > > > * > > > * SYNOPSIS > > > */ > > > @@ -338,7 +360,7 @@ osm_log_set_level( > > > * [in] New level to set. > > > * > > > * RETURN VALUES > > > -* Returns the current log level. > > > +* None. > > > * > > > * NOTES > > > * > > > @@ -347,9 +369,12 @@ osm_log_set_level( > > > * osm_log_destroy > > > *********/ > > > > > > -/****f* OpenSM: Log/osm_log_is_active > > > +#define osm_log_is_active(p_log, level) \ > > > + osm_log_is_active_ext(p_log, __FILE__, level) > > > + > > > +/****f* OpenSM: Log/osm_log_is_active_ext > > > * NAME > > > -* osm_log_is_active > > > +* osm_log_is_active_ext > > > * > > > * DESCRIPTION > > > * Returns TRUE if the specified log level would be logged. > > > @@ -357,18 +382,19 @@ osm_log_set_level( > > > * > > > * SYNOPSIS > > > */ > > > -static inline boolean_t > > > -osm_log_is_active( > > > +boolean_t > > > +osm_log_is_active_ext( > > > IN const osm_log_t* const p_log, > > > - IN const osm_log_level_t level ) > > > -{ > > > - return( (p_log->level & level) != 0 ); > > > -} > > > + IN const char* const p_filename, > > > + IN const osm_log_level_t level ); > > > /* > > > * PARAMETERS > > > * p_log > > > * [in] Pointer to the log object. > > > * > > > +* p_filename > > > +* [in] Source code file name. > > > +* > > > * level > > > * [in] Level to check. > > > * > > > @@ -383,17 +409,125 @@ osm_log_is_active( > > > * osm_log_destroy > > > *********/ > > > > > > + > > > +#define osm_log(p_log, verbosity, p_str, args...) \ > > > + osm_log_ext(p_log, verbosity, __FILE__, __LINE__, p_str , ## > > > args) > > > + > > > +/****f* OpenSM: Log/osm_log_ext > > > +* NAME > > > +* osm_log_ext > > > +* > > > +* DESCRIPTION > > > +* Logs the formatted specified message. > > > +* > > > +* SYNOPSIS > > > +*/ > > > void > > > -osm_log( > > > +osm_log_ext( > > > IN osm_log_t* const p_log, > > > IN const osm_log_level_t verbosity, > > > + IN const char *p_filename, > > > + IN int line, > > > IN const char *p_str, ... 
); > > > +/* > > > +* PARAMETERS > > > +* p_log > > > +* [in] Pointer to the log object. > > > +* > > > +* verbosity > > > +* [in] Current message verbosity level > > > + > > > + p_filename > > > + [in] Name of the file that is logging this message > > > + > > > + line > > > + [in] Line number in the file that is logging this message > > > + > > > + p_str > > > + [in] Format string of the message > > > +* > > > +* RETURN VALUES > > > +* None. > > > +* > > > +* NOTES > > > +* > > > +* SEE ALSO > > > +* Log object, osm_log_construct, > > > +* osm_log_destroy > > > +*********/ > > > > > > +#define osm_log_raw(p_log, verbosity, p_buff) \ > > > + osm_log_raw_ext(p_log, verbosity, __FILE__, p_buff) > > > + > > > +/****f* OpenSM: Log/osm_log_raw_ext > > > +* NAME > > > +* osm_log_ext > > > +* > > > +* DESCRIPTION > > > +* Logs the specified message. > > > +* > > > +* SYNOPSIS > > > +*/ > > > void > > > -osm_log_raw( > > > +osm_log_raw_ext( > > > IN osm_log_t* const p_log, > > > IN const osm_log_level_t verbosity, > > > + IN const char * p_filename, > > > IN const char *p_buf ); > > > +/* > > > +* PARAMETERS > > > +* p_log > > > +* [in] Pointer to the log object. > > > +* > > > +* verbosity > > > +* [in] Current message verbosity level > > > + > > > + p_filename > > > + [in] Name of the file that is logging this message > > > + > > > + p_buf > > > + [in] Message string > > > +* > > > +* RETURN VALUES > > > +* None. > > > +* > > > +* NOTES > > > +* > > > +* SEE ALSO > > > +* Log object, osm_log_construct, > > > +* osm_log_destroy > > > +*********/ > > > + > > > + > > > +/****f* OpenSM: Log/osm_log_flush > > > +* NAME > > > +* osm_log_flush > > > +* > > > +* DESCRIPTION > > > +* Flushes the log. > > > +* > > > +* SYNOPSIS > > > +*/ > > > +static inline void > > > +osm_log_flush( > > > + IN osm_log_t* const p_log) > > > +{ > > > + fflush(p_log->out_port); > > > +} > > > +/* > > > +* PARAMETERS > > > +* p_log > > > +* [in] Pointer to the log object. > > > +* > > > +* RETURN VALUES > > > +* None. 
> > > +* > > > +* NOTES > > > +* > > > +* SEE ALSO > > > +* > > > +*********/ > > > + > > > > > > #define DBG_CL_LOCK 0 > > > > > > Index: opensm/osm_subnet.c > > > =================================================================== > > > -- opensm/osm_subnet.c (revision 8614) > > > +++ opensm/osm_subnet.c (working copy) > > > @@ -493,6 +493,8 @@ osm_subn_set_default_opt( > > > p_opt->ucast_dump_file = NULL; > > > p_opt->updn_guid_file = NULL; > > > p_opt->exit_on_fatal = TRUE; > > > + p_opt->src_info = FALSE; > > > + p_opt->verbosity_file = OSM_DEFAULT_VERBOSITY_FILE; > > > subn_set_default_qos_options(&p_opt->qos_options); > > > subn_set_default_qos_options(&p_opt->qos_hca_options); > > > subn_set_default_qos_options(&p_opt->qos_sw0_options); > > > @@ -959,6 +961,13 @@ osm_subn_parse_conf_file( > > > "honor_guid2lid_file", > > > p_key, p_val, &p_opts->honor_guid2lid_file); > > > > > > + __osm_subn_opts_unpack_boolean( > > > + "log_source_info", > > > + p_key, p_val, &p_opts->src_info); > > > + > > > + __osm_subn_opts_unpack_charp( > > > + "verbosity_file", p_key, p_val, &p_opts->verbosity_file); > > > + > > > subn_parse_qos_options("qos", > > > p_key, p_val, &p_opts->qos_options); > > > > > > @@ -1182,7 +1191,11 @@ osm_subn_write_conf_file( > > > "# No multicast routing is performed if TRUE\n" > > > "disable_multicast %s\n\n" > > > "# If TRUE opensm will exit on fatal initialization issues\n" > > > - "exit_on_fatal %s\n\n", > > > + "exit_on_fatal %s\n\n" > > > + "# If TRUE OpenSM will log filename and line numbers\n" > > > + "log_source_info %s\n\n" > > > + "# Verbosity configuration file to be used\n" > > > + "verbosity_file %s\n\n", > > > p_opts->log_flags, > > > p_opts->force_log_flush ? "TRUE" : "FALSE", > > > p_opts->log_file, > > > @@ -1190,7 +1203,9 @@ osm_subn_write_conf_file( > > > p_opts->dump_files_dir, > > > p_opts->no_multicast_option ? "TRUE" : "FALSE", > > > p_opts->disable_multicast ? "TRUE" : "FALSE", > > > - p_opts->exit_on_fatal ? "TRUE" : "FALSE" > > > + p_opts->exit_on_fatal ? "TRUE" : "FALSE", > > > + p_opts->src_info ? "TRUE" : "FALSE", > > > + p_opts->verbosity_file > > > ); > > > > > > fprintf( > > > Index: opensm/osm_opensm.c > > > =================================================================== > > > -- opensm/osm_opensm.c (revision 8614) > > > +++ opensm/osm_opensm.c (working copy) > > > @@ -180,8 +180,10 @@ osm_opensm_init( > > > /* Can't use log macros here, since we're initializing the log. 
> > */ > > > osm_opensm_construct( p_osm ); > > > > > > - status = osm_log_init( &p_osm->log, p_opt->force_log_flush, > > > - p_opt->log_flags, p_opt->log_file, > > > p_opt->accum_log_file ); > > > + status = osm_log_init_ext( &p_osm->log, p_opt->force_log_flush, > > > + p_opt->log_flags, p_opt->log_file, > > > + p_opt->accum_log_file, p_opt->src_info, > > > + p_opt->verbosity_file); > > > if( status != IB_SUCCESS ) > > > return ( status ); > > > > > > Index: opensm/libopensm.map > > > > > =================================================================== > > > -- opensm/libopensm.map (revision 8614) > > > +++ opensm/libopensm.map (working copy) > > > @@ -1,6 +1,11 @@ > > > -OPENSM_1.0 { > > > +OPENSM_2.0 { > > > global: > > > - osm_log; > > > + osm_log_init_ext; > > > + osm_log_ext; > > > + osm_log_raw_ext; > > > + osm_log_get_level_ext; > > > + osm_log_is_active_ext; > > > + osm_log_read_verbosity_file; > > > osm_is_debug; > > > osm_mad_pool_construct; > > > osm_mad_pool_destroy; > > > @@ -39,7 +44,6 @@ OPENSM_1.0 { > > > osm_dump_dr_path; > > > osm_dump_smp_dr_path; > > > osm_dump_pkey_block; > > > - osm_log_raw; > > > osm_get_sm_state_str; > > > osm_get_sm_signal_str; > > > osm_get_disp_msg_str; > > > > Rather than remove osm_log and osm_log_raw, these should be > > deprecated. > > There are other applications outside of OpenSM (like osmtest and > > others) > > that need this. > > > > > @@ -51,5 +55,11 @@ OPENSM_1.0 { > > > osm_get_lsa_str; > > > osm_get_sm_mgr_signal_str; > > > osm_get_sm_mgr_state_str; > > > + st_init_strtable; > > > + st_delete; > > > + st_insert; > > > + st_lookup; > > > + st_foreach; > > > + st_free_table; > > > local: *; > > > }; > > > Index: opensm/osm_log.c > > > =================================================================== > > > -- opensm/osm_log.c (revision 8614) > > > +++ opensm/osm_log.c (working copy) > > > @@ -80,17 +80,365 @@ static char *month_str[] = { > > > }; > > > #endif /* ndef WIN32 */ > > > > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > +#define OSM_VERBOSITY_ALL "all" > > > + > > > +static void > > > +__osm_log_free_verbosity_table( > > > + IN osm_log_t* p_log); > > > +static void > > > +__osm_log_print_verbosity_table( > > > + IN osm_log_t* const p_log); > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > +osm_log_level_t > > > +osm_log_get_level_ext( > > > + IN const osm_log_t* const p_log, > > > + IN const char* const p_filename ) > > > +{ > > > + osm_log_level_t * p_curr_file_level = NULL; > > > + > > > + if (!p_filename || !p_log->table) > > > + return p_log->level; > > > + > > > + if ( st_lookup( p_log->table, > > > + (st_data_t) p_filename, > > > + (st_data_t*) &p_curr_file_level) ) > > > + return *p_curr_file_level; > > > + else > > > + return p_log->level; > > > +} > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > +ib_api_status_t > > > +osm_log_init_ext( > > > + IN osm_log_t* const p_log, > > > + IN const boolean_t flush, > > > + IN const uint8_t log_flags, > > > + IN const char *log_file, > > > + IN const boolean_t accum_log_file, > > > + IN const boolean_t src_info, > > 
> + IN const char *verbosity_file) > > > +{ > > > + p_log->level = log_flags; > > > + p_log->flush = flush; > > > + p_log->src_info = src_info; > > > + p_log->table = NULL; > > > + > > > + if (log_file == NULL || !strcmp(log_file, "-") || > > > + !strcmp(log_file, "stdout")) > > > + { > > > + p_log->out_port = stdout; > > > + } > > > + else if (!strcmp(log_file, "stderr")) > > > + { > > > + p_log->out_port = stderr; > > > + } > > > + else > > > + { > > > + if (accum_log_file) > > > + p_log->out_port = fopen(log_file, "a+"); > > > + else > > > + p_log->out_port = fopen(log_file, "w+"); > > > + > > > + if (!p_log->out_port) > > > + { > > > + if (accum_log_file) > > > + printf("Cannot open %s for appending. Permission denied > > \n", > > > log_file); > > > + else > > > + printf("Cannot open %s for writing. Permission denied\n", > > > log_file); > > > + > > > + return(IB_UNKNOWN_ERROR); > > > + } > > > + } > > > + openlog("OpenSM", LOG_CONS | LOG_PID, LOG_USER); > > > + > > > + if (cl_spinlock_init( &p_log->lock ) != CL_SUCCESS) > > > + return IB_ERROR; > > > + > > > + osm_log_read_verbosity_file(p_log,verbosity_file); > > > + return IB_SUCCESS; > > > +} > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > +void > > > +osm_log_read_verbosity_file( > > > + IN osm_log_t* p_log, > > > + IN const char * const verbosity_file) > > > +{ > > > + FILE *infile; > > > + char line[500]; > > > + struct stat buf; > > > + boolean_t table_empty = TRUE; > > > + char * tmp_str = NULL; > > > + > > > + if (p_log->table) > > > + { > > > + /* > > > + * Free the existing table. > > > + * Note: if the verbosity config file will not be found, > > this > > > will > > > + * effectivly reset the existing verbosity configuration and > > > set > > > + * all the files to the same verbosity level > > > + */ > > > + __osm_log_free_verbosity_table(p_log); > > > + } > > > + > > > + if (!verbosity_file) > > > + return; > > > + > > > + if ( stat(verbosity_file, &buf) != 0 ) > > > + { > > > + /* > > > + * Verbosity configuration file doesn't exist. > > > + */ > > > + if (strcmp(verbosity_file,OSM_DEFAULT_VERBOSITY_FILE) == 0) > > > + { > > > + /* > > > + * Verbosity configuration file wasn't explicitly > > specified. > > > + * No need to issue any error message. > > > + */ > > > + return; > > > + } > > > + else > > > + { > > > + /* > > > + * Verbosity configuration file was explicitly specified. > > > + */ > > > + osm_log(p_log, OSM_LOG_SYS, > > > + "ERROR: Verbosity configuration file (%s) doesn't > > > exist.\n", > > > + verbosity_file); > > > + osm_log(p_log, OSM_LOG_SYS, > > > + " Using general verbosity value.\n"); > > > + return; > > > + } > > > + } > > > + > > > + infile = fopen(verbosity_file, "r"); > > > + if ( infile == NULL ) > > > + { > > > + osm_log(p_log, OSM_LOG_SYS, > > > + "ERROR: Failed opening verbosity configuration file > > > (%s).\n", > > > + verbosity_file); > > > + osm_log(p_log, OSM_LOG_SYS, > > > + " Using general verbosity value.\n"); > > > + return; > > > + } > > > + > > > + p_log->table = st_init_strtable(); > > > + if (p_log->table == NULL) > > > + { > > > + osm_log(p_log, OSM_LOG_SYS, "ERROR: Verbosity table > > > initialization failed.\n"); > > > + return; > > > + } > > > + > > > + /* > > > + * Read the file line by line, parse the lines, and > > > + * add each line to p_log->table. 
> > > + */ > > > + while ( fgets(line, sizeof(line), infile) != NULL ) > > > + { > > > + char * str = line; > > > + char * name = NULL; > > > + char * value = NULL; > > > + osm_log_level_t * p_log_level_value = NULL; > > > + int res; > > > + > > > + name = strtok_r(str," \t\n",&tmp_str); > > > + if (name == NULL || strlen(name) == 0) { > > > + /* > > > + * empty line - ignore it > > > + */ > > > + continue; > > > + } > > > + value = strtok_r(NULL," \t\n",&tmp_str); > > > + if (value == NULL || strlen(value) == 0) > > > + { > > > + /* > > > + * No verbosity value - wrong syntax. > > > + * This line will be ignored. > > > + */ > > > + continue; > > > + } > > > + > > > + /* > > > + * If the conversion will fail, the log_level_value will get > > 0, > > > + * so the only way to check that the syntax is correct is to > > > + * scan value for any non-digit (which we're not doing > > here). > > > + */ > > > + p_log_level_value = malloc (sizeof(osm_log_level_t)); > > > + if (!p_log_level_value) > > > + { > > > + osm_log(p_log, OSM_LOG_SYS, "ERROR: malloc failed.\n"); > > > + p_log->table = NULL; > > > + fclose(infile); > > > + return; > > > + } > > > + *p_log_level_value = strtoul(value, NULL, 0); > > > + > > > + if (strcasecmp(name,OSM_VERBOSITY_ALL) == 0) > > > + { > > > + osm_log_set_level(p_log, *p_log_level_value); > > > + free(p_log_level_value); > > > + } > > > + else > > > + { > > > + res = st_insert( p_log->table, > > > + (st_data_t) strdup(name), > > > + (st_data_t) p_log_level_value); > > > + if (res != 0) > > > + { > > > + /* > > > + * Something is wrong with the verbosity table. > > > + * We won't try to free the table, because there's > > > + * clearly something corrupted there. > > > + */ > > > + osm_log(p_log, OSM_LOG_SYS, "ERROR: Failed adding > > > verbosity table element.\n"); > > > + p_log->table = NULL; > > > + fclose(infile); > > > + return; > > > + } > > > + table_empty = FALSE; > > > + } > > > + > > > + } > > > + > > > + if (table_empty) > > > + __osm_log_free_verbosity_table(p_log); > > > + > > > + fclose(infile); > > > + > > > + __osm_log_print_verbosity_table(p_log); > > > +} > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > +static int > > > +__osm_log_print_verbosity_table_element( > > > + IN st_data_t key, > > > + IN st_data_t val, > > > + IN st_data_t arg) > > > +{ > > > + osm_log( (osm_log_t* const) arg, > > > + OSM_LOG_INFO, > > > + "[verbosity] File: %s, Level: 0x%x\n", > > > + (char *) key, *((osm_log_level_t *) val)); > > > + > > > + return ST_CONTINUE; > > > +} > > > + > > > +static void > > > +__osm_log_print_verbosity_table( > > > + IN osm_log_t* const p_log) > > > +{ > > > + osm_log( p_log, OSM_LOG_INFO, > > > + "[verbosity] Verbosity table loaded\n" ); > > > + osm_log( p_log, OSM_LOG_INFO, > > > + "[verbosity] General level: > > > 0x%x\n",osm_log_get_level_ext(p_log,NULL)); > > > + > > > + if (p_log->table) > > > + { > > > + st_foreach( p_log->table, > > > + __osm_log_print_verbosity_table_element, > > > + (st_data_t) p_log ); > > > + } > > > + osm_log_flush(p_log); > > > +} > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > +static int > > > +__osm_log_free_verbosity_table_element( > > > + IN st_data_t key, > > > + IN st_data_t val, > > > + IN st_data_t 
arg) > > > +{ > > > + free( (char *) key ); > > > + free( (osm_log_level_t *) val ); > > > + return ST_DELETE; > > > +} > > > + > > > +static void > > > +__osm_log_free_verbosity_table( > > > + IN osm_log_t* p_log) > > > +{ > > > + if (!p_log->table) > > > + return; > > > + > > > + st_foreach( p_log->table, > > > + __osm_log_free_verbosity_table_element, > > > + (st_data_t) NULL); > > > + > > > + st_free_table(p_log->table); > > > + p_log->table = NULL; > > > +} > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > +static inline const char * > > > +__osm_log_get_base_name( > > > + IN const char * const p_filename) > > > +{ > > > +#ifdef WIN32 > > > + char dir_separator = '\\'; > > > +#else > > > + char dir_separator = '/'; > > > +#endif > > > + char * tmp_ptr; > > > + > > > + if (!p_filename) > > > + return NULL; > > > + > > > + tmp_ptr = strrchr(p_filename,dir_separator); > > > + > > > + if (!tmp_ptr) > > > + return p_filename; > > > + return tmp_ptr+1; > > > +} > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > +boolean_t > > > +osm_log_is_active_ext( > > > + IN const osm_log_t* const p_log, > > > + IN const char* const p_filename, > > > + IN const osm_log_level_t level ) > > > +{ > > > + osm_log_level_t tmp_lvl; > > > + tmp_lvl = level & > > > + > > > osm_log_get_level_ext(p_log,__osm_log_get_base_name(p_filename)); > > > + return ( tmp_lvl != 0 ); > > > +} > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > static int log_exit_count = 0; > > > > > > void > > > -osm_log( > > > +osm_log_ext( > > > IN osm_log_t* const p_log, > > > IN const osm_log_level_t verbosity, > > > + IN const char *p_filename, > > > + IN int line, > > > IN const char *p_str, ... ) > > > { > > > char buffer[LOG_ENTRY_SIZE_MAX]; > > > va_list args; > > > int ret; > > > + osm_log_level_t file_verbosity; > > > > > > #ifdef WIN32 > > > SYSTEMTIME st; > > > @@ -108,69 +456,89 @@ osm_log( > > > localtime_r(&tim, &result); > > > #endif /* WIN32 */ > > > > > > - /* If this is a call to syslog - always print it */ > > > - if ( verbosity & OSM_LOG_SYS ) > > > + /* > > > + * Extract only the file name out of the full path > > > + */ > > > + p_filename = __osm_log_get_base_name(p_filename); > > > + /* > > > + * Get the verbosity level for this file. > > > + * If the file is not specified in the log config file, > > > + * the general verbosity level will be returned. > > > + */ > > > + file_verbosity = osm_log_get_level_ext(p_log, p_filename); > > > + > > > + if ( ! (verbosity & OSM_LOG_SYS) && > > > + ! (file_verbosity & verbosity) ) > > > { > > > - /* this is a call to the syslog */ > > > + /* > > > + * This is not a syslog message (which is always printed) > > > + * and doesn't have the required verbosity level. 
> > > + */ > > > + return; > > > + } > > > + > > > va_start( args, p_str ); > > > vsprintf( buffer, p_str, args ); > > > va_end(args); > > > - cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); > > > > > > + > > > + if ( verbosity & OSM_LOG_SYS ) > > > + { > > > + /* this is a call to the syslog */ > > > + cl_log_event("OpenSM", LOG_INFO, buffer , NULL, 0); > > > /* SYSLOG should go to stdout too */ > > > if (p_log->out_port != stdout) > > > { > > > - printf("%s\n", buffer); > > > + printf("%s", buffer); > > > fflush( stdout ); > > > } > > > + } > > > + /* SYSLOG also goes to to the log file */ > > > + > > > + cl_spinlock_acquire( &p_log->lock ); > > > > > > - /* send it also to the log file */ > > > #ifdef WIN32 > > > GetLocalTime(&st); > > > - fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> % > > s", > > > + if (p_log->src_info) > > > + { > > > + ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] > > > [%s:%d] -> %s", > > > st.wHour, st.wMinute, st.wSecond, > > > st.wMilliseconds, > > > - pid, buffer); > > > -#else > > > - fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] > > -> > > > %s\n", > > > - (result.tm_mon < 12 ? month_str[result.tm_mon] : > > "???"), > > > - result.tm_mday, result.tm_hour, > > > - result.tm_min, result.tm_sec, > > > - usecs, pid, buffer); > > > - fflush( p_log->out_port ); > > > -#endif > > > + pid, p_filename, line, buffer); > > > } > > > - > > > - /* SYS messages go to the log anyways */ > > > - if (p_log->level & verbosity) > > > + else > > > { > > > - > > > - va_start( args, p_str ); > > > - vsprintf( buffer, p_str, args ); > > > - va_end(args); > > > - > > > - /* regular log to default out_port */ > > > - cl_spinlock_acquire( &p_log->lock ); > > > -#ifdef WIN32 > > > - GetLocalTime(&st); > > > ret = fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] > > -> > > > %s", > > > st.wHour, st.wMinute, st.wSecond, > > > st.wMilliseconds, > > > pid, buffer); > > > - > > > + } > > > #else > > > pid = pthread_self(); > > > tim = time(NULL); > > > + if (p_log->src_info) > > > + { > > > + ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d > > > [%04X] [%s:%d] -> %s", > > > + ((result.tm_mon < 12) && (result.tm_mon >= > > 0) ? > > > + month_str[ result.tm_mon] : "???"), > > > + result.tm_mday, result.tm_hour, > > > + result.tm_min, result.tm_sec, > > > + usecs, pid, p_filename, line, buffer); > > > + } > > > + else > > > + { > > > ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d > > > [%04X] -> %s", > > > ((result.tm_mon < 12) && (result.tm_mon >= > > 0) ? > > > month_str[ result.tm_mon] : "???"), > > > result.tm_mday, result.tm_hour, > > > result.tm_min, result.tm_sec, > > > usecs, pid, buffer); > > > -#endif /* WIN32 */ > > > - > > > + } > > > +#endif > > > /* > > > - Flush log on errors too. > > > + * Flush log on errors and SYSLOGs too. 
> > > */ > > > - if( p_log->flush || (verbosity & OSM_LOG_ERROR) ) > > > + if ( p_log->flush || > > > + (verbosity & OSM_LOG_ERROR) || > > > + (verbosity & OSM_LOG_SYS) ) > > > fflush( p_log->out_port ); > > > > > > cl_spinlock_release( &p_log->lock ); > > > @@ -183,15 +551,30 @@ osm_log( > > > } > > > } > > > } > > > -} > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > > > > void > > > -osm_log_raw( > > > +osm_log_raw_ext( > > > IN osm_log_t* const p_log, > > > IN const osm_log_level_t verbosity, > > > + IN const char * p_filename, > > > IN const char *p_buf ) > > > { > > > - if( p_log->level & verbosity ) > > > + osm_log_level_t file_verbosity; > > > + /* > > > + * Extract only the file name out of the full path > > > + */ > > > + p_filename = __osm_log_get_base_name(p_filename); > > > + /* > > > + * Get the verbosity level for this file. > > > + * If the file is not specified in the log config file, > > > + * the general verbosity level will be returned. > > > + */ > > > + file_verbosity = osm_log_get_level_ext(p_log, p_filename); > > > + > > > + if ( file_verbosity & verbosity ) > > > { > > > cl_spinlock_acquire( &p_log->lock ); > > > printf( "%s", p_buf ); > > > @@ -205,6 +588,9 @@ osm_log_raw( > > > } > > > } > > > > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > boolean_t > > > osm_is_debug(void) > > > { > > > @@ -214,3 +600,7 @@ osm_is_debug(void) > > > return FALSE; > > > #endif /* defined( _DEBUG_ ) */ > > > } > > > + > > > > > +/*************************************************************************** > > > + > > > > > ***************************************************************************/ > > > + > > > Index: opensm/main.c > > > > > =================================================================== > > > -- opensm/main.c (revision 8652) > > > +++ opensm/main.c (working copy) > > > @@ -296,6 +296,33 @@ show_usage(void) > > > " -d3 - Disable multicast support\n" > > > " -d10 - Put OpenSM in testability mode\n" > > > " Without -d, no debug options are enabled\n > > \n" ); > > > + printf( "-S\n" > > > + "--log_source_info\n" > > > + " This option tells SM to add source code > > > filename\n" > > > + " and line number to every log message.\n" > > > + " By default, the SM will not log this > > additional > > > info.\n\n"); > > > + printf( "-b\n" > > > + "--verbosity_file \n" > > > + " This option specifies name of the verbosity > > \n" > > > + " configuration file, which describes verbosity > > > level\n" > > > + " per source code file. 
The file may contain > > zero > > > or\n" > > > + " more lines of the following pattern:\n" > > > + " filename verbosity_level\n" > > > + " where 'filename' is the name of the source > > code > > > file\n" > > > + " that the 'verbosity_level' refers to, and the > > > \n" > > > + " 'verbosity_level' itself should be specified > > as > > > a\n" > > > + " number (decimal or hexadecimal).\n" > > > + " Filename 'all' represents general verbosity > > > level,\n" > > > + " that is used for all the files that are not > > > specified\n" > > > + " in the verbosity file.\n" > > > + " Note: The 'all' file verbosity level will > > > override any\n" > > > + " other general level that was specified by the > > > command\n" > > > + " line arguments.\n" > > > + " By default, the SM will use the following > > > file:\n" > > > + " %s\n" > > > + " Sending a SIGHUP signal to the SM will cause > > it > > > to\n" > > > + " re-read the verbosity configuration file.\n" > > > + "\n\n", OSM_DEFAULT_VERBOSITY_FILE); > > > printf( "-h\n" > > > "--help\n" > > > " Display this usage info then exit.\n\n" ); > > > @@ -527,7 +554,7 @@ main( > > > boolean_t cache_options = FALSE; > > > char *ignore_guids_file_name = NULL; > > > uint32_t val; > > > - const char * const short_option = > > > "i:f:ed:g:l:s:t:a:R:U:P:NQvVhorcyx"; > > > + const char * const short_option = > > > "i:f:ed:g:l:s:t:a:R:U:P:b:SNQvVhorcyx"; > > > > > > /* > > > In the array below, the 2nd parameter specified the number > > > @@ -565,6 +592,8 @@ main( > > > { "cache-options", 0, NULL, 'c'}, > > > { "stay_on_fatal", 0, NULL, 'y'}, > > > { "honor_guid2lid", 0, NULL, 'x'}, > > > + { "log_source_info",0,NULL, 'S'}, > > > + { "verbosity_file",1, NULL, 'b'}, > > > { NULL, 0, NULL, 0 } /* Required at the end of > > > the array */ > > > }; > > > > > > @@ -808,6 +837,16 @@ main( > > > printf (" Honor guid2lid file, if possible\n"); > > > break; > > > > > > + case 'S': > > > + opt.src_info = TRUE; > > > + printf(" Logging source code filename and line number\n"); > > > + break; > > > + > > > + case 'b': > > > + opt.verbosity_file = optarg; > > > + printf(" Verbosity Configuration File: %s\n", optarg); > > > + break; > > > + > > > case 'h': > > > case '?': > > > case ':': > > > @@ -920,9 +959,13 @@ main( > > > > > > if (osm_hup_flag) { > > > osm_hup_flag = 0; > > > - /* a HUP signal should only start a new heavy sweep */ > > > + /* > > > + * A HUP signal should cause OSM to re-read the log > > > + * configuration file and start a new heavy sweep > > > + */ > > > osm.subn.force_immediate_heavy_sweep = TRUE; > > > osm_opensm_sweep( &osm ); > > > + osm_log_read_verbosity_file(&osm.log,opt.verbosity_file); > > > } > > > } > > > } > > > Index: opensm/Makefile.am > > > =================================================================== > > > -- opensm/Makefile.am (revision 8614) > > > +++ opensm/Makefile.am (working copy) > > > @@ -43,7 +43,7 @@ else > > > libopensm_version_script = > > > endif > > > > > > -libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c > > > +libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c st.c > > > libopensm_la_LDFLAGS = -version-info $(opensm_api_version) \ > > > -export-dynamic $(libopensm_version_script) > > > libopensm_la_DEPENDENCIES = $(srcdir)/libopensm.map > > > @@ -90,7 +90,7 @@ opensm_SOURCES = main.c osm_console.c os > > > osm_trap_rcv.c osm_trap_rcv_ctrl.c \ > > > osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \ > > > osm_vl15intf.c osm_vl_arb_rcv.c \ > > > - osm_vl_arb_rcv_ctrl.c st.c > > > + 
osm_vl_arb_rcv_ctrl.c > > > if OSMV_OPENIB > > > opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing > > > -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) > > > -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > > > opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT > > > -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > > > Index: doc/verbosity-config.txt > > > =================================================================== > > > -- doc/verbosity-config.txt (revision 0) > > > +++ doc/verbosity-config.txt (revision 0) > > > @@ -0,0 +1,43 @@ > > > + > > > +This patch adds new verbosity functionality. > > > + > > > +1. Verbosity configuration file > > > +-- > > > + > > > +The user is able to set verbosity level per source code file > > > +by supplying verbosity configuration file using the following > > > +command line arguments: > > > + > > > + -b filename > > > + --verbosity_file filename > > > + > > > +By default, the OSM will use the following > > file: /etc/opensmlog.conf > > > +Verbosity configuration file should contain zero or more lines of > > > +the following pattern: > > > + > > > + filename verbosity_level > > > + > > > +where 'filename' is the name of the source code file that the > > > +'verbosity_level' refers to, and the 'verbosity_level' itself > > > +should be specified as an integer number (decimal or > > hexadecimal). > > > + > > > +One reserved filename is 'all' - it represents general verbosity > > > +level, that is used for all the files that are not specified in > > > +the verbosity configuration file. > > > +If 'all' is not specified, the verbosity level set in the > > > +command line will be used instead. > > > +Note: The 'all' file verbosity level will override any other > > > +general level that was specified by the command line arguments. > > > + > > > +Sending a SIGHUP signal to the OSM will cause it to reload > > > +the verbosity configuration file. > > > + > > > + > > > +2. Logging source code filename and line number > > > +-- > > > + > > > +If command line option -S or --log_source_info is specified, > > > +OSM will add source code filename and line number to every > > > +log message that is written to the log file. > > > +By default, the OSM will not log this additional info. > > > + > > > > > > From johnt1johnt2 at gmail.com Wed Aug 30 07:03:06 2006 From: johnt1johnt2 at gmail.com (john t) Date: Wed, 30 Aug 2006 19:33:06 +0530 Subject: [openib-general] ibv_poll_cq Message-ID: Hi, In one of my multi-threaded application (simple send/recv application written using uverbs), I am repeatedly getting an error code 12 (IB_WC_RETRY_EXC_ERR) from "ibv_poll_cq". Not able to figure out what is going wrong. Cam some one please give a suggestion so that I can investigate on those lines. Also, is there an error handling mechanism in IB, for ex: in the above case what should I do in order to correct the problem. Regards, John T -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at dev.mellanox.co.il Wed Aug 30 07:27:47 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 30 Aug 2006 17:27:47 +0300 Subject: [openib-general] ibv_poll_cq In-Reply-To: References: Message-ID: <44F5A063.4020502@dev.mellanox.co.il> Hi. john t wrote: > Hi, > > In one of my multi-threaded application (simple send/recv application > written using uverbs), I am repeatedly getting an error code 12 > (IB_WC_RETRY_EXC_ERR) from "ibv_poll_cq". Not able to figure out > what is going wrong. 
Cam some one please give a suggestion so that I > can investigate on those lines. > > Also, is there an error handling mechanism in IB, for ex: in the above > case what should I do in order to correct the problem. This completion status means that the remote side of the QP is not sending any response (ack/nack/ anything ...) You can have this completion if one of the following scenarios occurs: * a QP tries to send a message to a remote QP which is not ready (not in at least RTR state) * a QP tries to send a message to a remote QP which is being closed (or in error state) * the QP parameters are not the same as the remote QP parameters (for example: if the PSNs are not configured with good values, the messages may be silently dropped) I suggest to: sync between the 2 sides before starting to work with the QPs sync between the 2 sides before stop to work with the QPs You can increase the number of retry_cnt / timeout attributes in the QP context you should make sure that the timeout value is not 0 (Zero). Dotan From tziporet at dev.mellanox.co.il Wed Aug 30 08:37:53 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 30 Aug 2006 18:37:53 +0300 Subject: [openib-general] libibcm can't open /dev/infiniband/ucm0 In-Reply-To: <000d01c6cb8e$dd406fa0$dcc8180a@amr.corp.intel.com> References: <000d01c6cb8e$dd406fa0$dcc8180a@amr.corp.intel.com> Message-ID: <44F5B0D1.5040802@dev.mellanox.co.il> Sean Hefty wrote: >> So I assume this to be a bug at least in the README file. I'm not the >> kernel expert to tell where else this has to be changed. >> > > Only the README file should need updating. > > We also updated the README of OFED Tziporet From tziporet at mellanox.co.il Wed Aug 30 08:43:36 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 30 Aug 2006 18:43:36 +0300 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: References: Message-ID: <44F5B228.4010107@mellanox.co.il> Hoang-Nam Nguyen wrote: > We incorporated those changes throughout ehca code, which is accessible > from > Roland's git tree: > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git > for-2.6.19 > > Hi Marcus & Hoang-Nam, RC3 is almost closed (going to be out tomorrow). Following this mail I wish to know if we need to update ehca for RC4. Thanks, Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: From HNGUYEN at de.ibm.com Wed Aug 30 09:06:43 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Wed, 30 Aug 2006 18:06:43 +0200 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: <44F5B228.4010107@mellanox.co.il> Message-ID: Hi Tziporet! > RC3 is almost closed (going to be out tomorrow). > Following this mail I wish to know if we need to update ehca for RC4. I'm generating a patch against OFED git tree ehca_branch and could provide you with a patch today if that makes sense to you. In the meanwhile we've made some local changes so that we've to update ehca for RC4 anyway. What do you think? Thanks Nam From halr at voltaire.com Wed Aug 30 09:13:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2006 12:13:41 -0400 Subject: [openib-general] [PATCH] opensm: libibmad: rpc API which supports more than one ports. 
In-Reply-To: <20060830012956.GA12356@sashak.voltaire.com> References: <20060825131734.3786.74359.stgit@sashak.voltaire.com> <1156896546.4509.43618.camel@hal.voltaire.com> <20060830012956.GA12356@sashak.voltaire.com> Message-ID: <1156954416.4504.5998.camel@hal.voltaire.com> Hi Sasha, On Tue, 2006-08-29 at 21:29, Sasha Khapyorsky wrote: > Hi Hal, > > On 20:09 Tue 29 Aug , Hal Rosenstock wrote: > > Hi Sasha, > > > > On Fri, 2006-08-25 at 09:17, Sasha Khapyorsky wrote: > > > This provides RPC like API which may work with several ports. > > > > I think you mean "can work" rather "may work" :-) > > Yes. > > Some limitation we will have from libumad - this tracks already open > ports. I'm not sure why (the same port can be opened from another > process or by forking current). I think this may be the next > improvement there. OK. > > > Signed-off-by: Sasha Khapyorsky > > > --- > > > > > > libibmad/include/infiniband/mad.h | 9 +++ > > > libibmad/src/libibmad.map | 4 + > > > libibmad/src/register.c | 20 +++++-- > > > libibmad/src/rpc.c | 106 +++++++++++++++++++++++++++++++++++-- > > > libibumad/src/umad.c | 4 + > > > > ../doc/libibmad.txt should also be updated appropriately for the new > > routines. > > Sure, I thought to stabilize this API first. OK. > > > 5 files changed, 130 insertions(+), 13 deletions(-) > > > > > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > > > index 45ff572..bd8a80b 100644 > > > --- a/libibmad/include/infiniband/mad.h > > > +++ b/libibmad/include/infiniband/mad.h > > > @@ -660,6 +660,7 @@ uint64_t mad_trid(void); > > > int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); > > > > > > /* register.c */ > > > +int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); > > > int mad_register_client(int mgmt, uint8_t rmpp_version); > > > int mad_register_server(int mgmt, uint8_t rmpp_version, > > > uint32_t method_mask[4], uint32_t class_oui); > > > @@ -704,6 +705,14 @@ void madrpc_lock(void); > > > void madrpc_unlock(void); > > > void madrpc_show_errors(int set); > > > > > > +void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > > > + int num_classes); > > > +void mad_rpc_close_port(void *ibmad_port); > > > +void * mad_rpc(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > > > + void *payload, void *rcvdata); > > > +void * mad_rpc_rmpp(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > > > + ib_rmpp_hdr_t *rmpp, void *data); > > > + > > > /* smp.c */ > > > uint8_t * smp_query(void *buf, ib_portid_t *id, uint attrid, uint mod, > > > uint timeout); > > > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > > > index bf81bd1..78b7ff0 100644 > > > --- a/libibmad/src/libibmad.map > > > +++ b/libibmad/src/libibmad.map > > > @@ -62,6 +62,10 @@ IBMAD_1.0 { > > > > This should be 1.1 > > Ok. > > > > > > ib_resolve_self; > > > ib_resolve_smlid; > > > ibdebug; > > > + mad_rpc_open_port; > > > + mad_rpc_close_port; > > > + mad_rpc; > > > + mad_rpc_rmpp; > > > madrpc; > > > madrpc_def_timeout; > > > madrpc_init; > > > > What about mad_register_port_client ? Should that be included here ? > > It is not used externally - all registrations are done in _open(). So I > don't see this as part of the new "API". Maybe if we will decide to > extend it later we will need to "export" this symbol. OK. 
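For reference, a minimal caller sketch of the entry points exported above
(mad_rpc_open_port / mad_rpc / mad_rpc_close_port, with the signatures as
declared in this patch). The device names, the class list and the error
handling below are only illustrative assumptions, not taken from the patch:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <infiniband/mad.h>

    int main(void)
    {
            /* assumed class list; IB_SA_CLASS is the only class the patch names */
            int classes[2] = { IB_SMI_CLASS, IB_SA_CLASS };
            void *p1, *p2;

            /* each handle gets its own umad fd and its own per-class agents */
            p1 = mad_rpc_open_port("mthca0", 1, classes, 2);
            p2 = mad_rpc_open_port("mthca0", 2, classes, 2);
            if (!p1 || !p2) {
                    fprintf(stderr, "open failed: %s\n", strerror(errno));
                    return 1;
            }

            /*
             * ... fill an ib_rpc_t / ib_portid_t here and call
             * mad_rpc(p1, ...) or mad_rpc_rmpp(p2, ...); the handle selects
             * which port and which registered agent the MAD goes out on ...
             */

            mad_rpc_close_port(p1);
            mad_rpc_close_port(p2);
            return 0;
    }

(Whether the same HCA port can be opened twice from one process still runs
into the libibumad open-port tracking mentioned above.)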
> > > diff --git a/libibmad/src/register.c b/libibmad/src/register.c > > > index 4f44625..52d6989 100644 > > > --- a/libibmad/src/register.c > > > +++ b/libibmad/src/register.c > > > @@ -43,6 +43,7 @@ #include > > > #include > > > #include > > > #include > > > +#include > > > > > > #include > > > #include "mad.h" > > > @@ -118,7 +119,7 @@ mad_agent_class(int agent) > > > } > > > > > > int > > > -mad_register_client(int mgmt, uint8_t rmpp_version) > > > +mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version) > > > { > > > int vers, agent; > > > > > > @@ -126,7 +127,7 @@ mad_register_client(int mgmt, uint8_t rm > > > DEBUG("Unknown class %d mgmt_class", mgmt); > > > return -1; > > > } > > > - if ((agent = umad_register(madrpc_portid(), mgmt, > > > + if ((agent = umad_register(port_id, mgmt, > > > vers, rmpp_version, 0)) < 0) { > > > DEBUG("Can't register agent for class %d", mgmt); > > > return -1; > > > @@ -137,13 +138,22 @@ mad_register_client(int mgmt, uint8_t rm > > > return -1; > > > } > > > > > > - if (register_agent(agent, mgmt) < 0) > > > - return -1; > > > - > > > return agent; > > > } > > > > > > int > > > +mad_register_client(int mgmt, uint8_t rmpp_version) > > > +{ > > > + int agent; > > > + > > > + agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version); > > > + if (agent < 0) > > > + return agent; > > > + > > > + return register_agent(agent, mgmt); > > > +} > > > + > > > +int > > > mad_register_server(int mgmt, uint8_t rmpp_version, > > > uint32_t method_mask[4], uint32_t class_oui) > > > { > > > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > > > index b2d3e77..ac4f361 100644 > > > --- a/libibmad/src/rpc.c > > > +++ b/libibmad/src/rpc.c > > > @@ -48,6 +48,13 @@ #include > > > #include > > > #include "mad.h" > > > > > > +#define MAX_CLASS 256 > > > + > > > +struct ibmad_port { > > > + int port_id; /* file descriptor returned by umad_open() */ > > > + int class_agents[MAX_CLASS]; /* class2agent mapper */ > > > +}; > > > + > > > int ibdebug; > > > > > > static int mad_portid = -1; > > > @@ -105,7 +112,8 @@ madrpc_portid(void) > > > } > > > > > > static int > > > -_do_madrpc(void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) > > > +_do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > > > + int timeout) > > > { > > > uint32_t trid; /* only low 32 bits */ > > > int retries; > > > @@ -133,7 +141,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > > } > > > > > > length = len; > > > - if (umad_send(mad_portid, agentid, sndbuf, length, timeout, 0) < 0) { > > > + if (umad_send(port_id, agentid, sndbuf, length, timeout, 0) < 0) { > > > IBWARN("send failed; %m"); > > > return -1; > > > } > > > @@ -141,7 +149,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > > /* Use same timeout on receive side just in case */ > > > /* send packet is lost somewhere. 
*/ > > > do { > > > - if (umad_recv(mad_portid, rcvbuf, &length, timeout) < 0) { > > > + if (umad_recv(port_id, rcvbuf, &length, timeout) < 0) { > > > IBWARN("recv failed: %m"); > > > return -1; > > > } > > > @@ -164,8 +172,10 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > > } > > > > > > void * > > > -madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) > > > +mad_rpc(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, void *payload, > > > + void *rcvdata) > > > { > > > + struct ibmad_port *p = port_id; > > > int status, len; > > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > > > @@ -175,7 +185,8 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport > > > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > > > return 0; > > > > > > - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), > > > + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > > + p->class_agents[rpc->mgtclass], > > > len, rpc->timeout)) < 0) > > > return 0; > > > > > > @@ -198,8 +209,10 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport > > > } > > > > > > void * > > > -madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > > > +mad_rpc_rmpp(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, > > > + ib_rmpp_hdr_t *rmpp, void *data) > > > { > > > + struct ibmad_port *p = port_id; > > > int status, len; > > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > > > @@ -210,7 +223,8 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * > > > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > > > return 0; > > > > > > - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), > > > + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > > + p->class_agents[rpc->mgtclass], > > > len, rpc->timeout)) < 0) > > > return 0; > > > > > > @@ -249,6 +263,24 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * > > > return data; > > > } > > > > > > +void * > > > +madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) > > > +{ > > > + struct ibmad_port port; > > > + port.port_id = mad_portid; > > > + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); > > > + return mad_rpc(&port, rpc, dport, payload, rcvdata); > > > +} > > > + > > > +void * > > > +madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > > > +{ > > > + struct ibmad_port port; > > > + port.port_id = mad_portid; > > > + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); > > > + return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); > > > +} > > > + > > > static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; > > > > > > void > > > @@ -282,3 +314,63 @@ madrpc_init(char *dev_name, int dev_port > > > IBPANIC("client_register for mgmt %d failed", mgmt); > > > } > > > } > > > + > > > +void * > > > +mad_rpc_open_port(char *dev_name, int dev_port, > > > + int *mgmt_classes, int num_classes) > > > +{ > > > + struct ibmad_port *p; > > > + int port_id; > > > > Should there be some validation on num_classes < MAX_CLASS ? > > Such check is cheap and may be performed (it was not done in > madrpc_init()). Guess that validation is needed in both places. I'll add it subsequent to this. > Without this the function will "work" (will fail), but in longer way > (this will fail to register an agent when MAX_CLASS will be overflowed). Won't it overwrite some structure (scribble on memory) ? 
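To be concrete about what could get scribbled: the structure behind the
handle is the fixed-size class_agents[] table, and the class value is the
index into it. A stand-alone sketch (names invented, not from the patch) of
the indexing in question, with the same range check the patch already
applies to each class:

    #include <errno.h>

    #define MAX_CLASS 256

    struct ibmad_port {
            int port_id;
            int class_agents[MAX_CLASS];
    };

    /*
     * Hypothetical helper: an oversized num_classes only makes the loop run
     * longer (registration eventually fails); a mgmt value outside
     * [0, MAX_CLASS) is what would write past class_agents[], so it has to
     * be rejected before being used as an index.
     */
    int fill_agents(struct ibmad_port *p, const int *mgmt_classes,
                    int num_classes)
    {
            while (num_classes--) {
                    int mgmt = *mgmt_classes++;

                    if (mgmt < 0 || mgmt >= MAX_CLASS) {
                            errno = EINVAL;
                            return -1;
                    }
                    p->class_agents[mgmt] = 1;  /* stand-in for a real agent id */
            }
            return 0;
    }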
> > > + if (umad_init() < 0) { > > > + IBWARN("can't init UMAD library"); > > > + errno = ENODEV; > > > + return NULL; > > > + } > > > + > > > + p = malloc(sizeof(*p)); > > > + if (!p) { > > > + errno = ENOMEM; > > > + return NULL; > > > + } > > > + memset(p, 0, sizeof(*p)); > > > + > > > + if ((port_id = umad_open_port(dev_name, dev_port)) < 0) { > > > + IBWARN("can't open UMAD port (%s:%d)", dev_name, dev_port); > > > + if (!errno) > > > + errno = EIO; > > > + free(p); > > > + return NULL; > > > + } > > > + > > > + while (num_classes--) { > > > + int rmpp_version = 0; > > > + int mgmt = *mgmt_classes++; > > > + int agent; > > > + > > > + if (mgmt == IB_SA_CLASS) > > > + rmpp_version = 1; > > > > There are other classes which can use RMPP. How are they handled ? > > This is copy & paste from madrpc_init(). > This problem is generic for libibmad and I think should be fixed > separately You are right :-( > (maybe in mad_register_port_client()). Perhaps. We'll see. > > > + if (mgmt < 0 || mgmt >= MAX_CLASS || > > > + (agent = mad_register_port_client(port_id, mgmt, > > > + rmpp_version)) < 0) { > > > + IBWARN("client_register for mgmt %d failed", mgmt); > > > + if(!errno) > > > + errno = EINVAL; > > > + umad_close_port(port_id); > > > + free(p); > > > + return NULL; > > > + } > > > + p->class_agents[mgmt] = agent; > > > + } > > > + > > > + p->port_id = port_id; > > > + return p; > > > +} > > > + > > > +void > > > +mad_rpc_close_port(void *port_id) > > > +{ > > > + struct ibmad_port *p = port_id; > > > + umad_close_port(p->port_id); > > > + free(p); > > > +} > > > diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c > > > index a99fb5a..cb9eef6 100644 > > > --- a/libibumad/src/umad.c > > > +++ b/libibumad/src/umad.c > > > @@ -93,12 +93,14 @@ port_alloc(int portid, char *dev, int po > > > > > > if (portid < 0 || portid >= UMAD_MAX_PORTS) { > > > IBWARN("bad umad portid %d", portid); > > > + errno = EINVAL; > > > return 0; > > > } > > > > > > if (port->dev_name[0]) { > > > IBWARN("umad port id %d is already allocated for %s %d", > > > portid, port->dev_name, port->dev_port); > > > + errno = EBUSY; > > > return 0; > > > } > > > > > > @@ -567,7 +569,7 @@ umad_open_port(char *ca_name, int portnu > > > return -EINVAL; > > > > > > if (!(port = port_alloc(umad_id, ca_name, portnum))) > > > - return -EINVAL; > > > + return -errno; > > > > > > snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", > > > UMAD_DEV_DIR , umad_id); > > > > Is the umad.c change really a separate change from the rest ? > > It was done in order to provide the meanfull errno value in case of > mad_rpc_open() failure (not needed with madrpc_init() because it does > exit() if something is wrong) and this can be separated. > > > If so, > > this patch should be broken into two parts and that is the first part. > > Agree. > > > No need to resubmit for this. > > Ok. And for the rest of changes? Yes. -- Hal > Sasha > > > > > -- Hal > > From halr at voltaire.com Wed Aug 30 09:14:09 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2006 12:14:09 -0400 Subject: [openib-general] [PATCH] opensm: libibmad: rpc API which supports more than one ports. In-Reply-To: <20060825131734.3786.74359.stgit@sashak.voltaire.com> References: <20060825131734.3786.74359.stgit@sashak.voltaire.com> Message-ID: <1156954449.4504.6000.camel@hal.voltaire.com> On Fri, 2006-08-25 at 09:17, Sasha Khapyorsky wrote: > This provides RPC like API which may work with several ports. 
> > Signed-off-by: Sasha Khapyorsky > --- > > libibmad/include/infiniband/mad.h | 9 +++ > libibmad/src/libibmad.map | 4 + > libibmad/src/register.c | 20 +++++-- > libibmad/src/rpc.c | 106 +++++++++++++++++++++++++++++++++++-- > libibumad/src/umad.c | 4 + > 5 files changed, 130 insertions(+), 13 deletions(-) Thanks. Applied (to trunk only) as two changes. -- Hal From mst at mellanox.co.il Wed Aug 30 09:26:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Aug 2006 19:26:13 +0300 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: References: <44F5B228.4010107@mellanox.co.il> Message-ID: <20060830162612.GA30879@mellanox.co.il> Quoting r. Hoang-Nam Nguyen : > Subject: Re: [PATCH 02/13] IB/ehca: includes > > Hi Tziporet! > > RC3 is almost closed (going to be out tomorrow). > > Following this mail I wish to know if we need to update ehca for RC4. > I'm generating a patch against OFED git tree ehca_branch and could provide > you with a patch today if that makes sense to you. I don't think we want to touch RC3 at this point. > In the meanwhile we've > made some local changes so that we've to update ehca for RC4 anyway. What > do you think? > Thanks > Nam Ideally I would like a git tree with Linus 2.6.18-rc5 + ehca added. We plan to update RC4 to 2.6.18-rc5 so if you do that we can merge that. we can put ehca updates in if other stuff needn't be changed 'cause of it. I.e. if - module name and CONFIG names are still the same as in ehca_branch - there's no need to touch anything in core/ulps except maybe Makefile/Kconfig -- MST From sashak at voltaire.com Wed Aug 30 09:34:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 30 Aug 2006 19:34:38 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: rpc API which supports more than one ports. In-Reply-To: <1156954416.4504.5998.camel@hal.voltaire.com> References: <20060825131734.3786.74359.stgit@sashak.voltaire.com> <1156896546.4509.43618.camel@hal.voltaire.com> <20060830012956.GA12356@sashak.voltaire.com> <1156954416.4504.5998.camel@hal.voltaire.com> Message-ID: <20060830163438.GX12948@sashak.voltaire.com> On 12:13 Wed 30 Aug , Hal Rosenstock wrote: > Hi Sasha, > > On Tue, 2006-08-29 at 21:29, Sasha Khapyorsky wrote: > > Hi Hal, > > > > On 20:09 Tue 29 Aug , Hal Rosenstock wrote: > > > Hi Sasha, > > > > > > On Fri, 2006-08-25 at 09:17, Sasha Khapyorsky wrote: > > > > This provides RPC like API which may work with several ports. > > > > > > I think you mean "can work" rather "may work" :-) > > > > Yes. > > > > Some limitation we will have from libumad - this tracks already open > > ports. I'm not sure why (the same port can be opened from another > > process or by forking current). I think this may be the next > > improvement there. > > OK. > > > > > Signed-off-by: Sasha Khapyorsky > > > > --- > > > > > > > > libibmad/include/infiniband/mad.h | 9 +++ > > > > libibmad/src/libibmad.map | 4 + > > > > libibmad/src/register.c | 20 +++++-- > > > > libibmad/src/rpc.c | 106 +++++++++++++++++++++++++++++++++++-- > > > > libibumad/src/umad.c | 4 + > > > > > > ../doc/libibmad.txt should also be updated appropriately for the new > > > routines. > > > > Sure, I thought to stabilize this API first. > > OK. 
> > > > > 5 files changed, 130 insertions(+), 13 deletions(-) > > > > > > > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > > > > index 45ff572..bd8a80b 100644 > > > > --- a/libibmad/include/infiniband/mad.h > > > > +++ b/libibmad/include/infiniband/mad.h > > > > @@ -660,6 +660,7 @@ uint64_t mad_trid(void); > > > > int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); > > > > > > > > /* register.c */ > > > > +int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); > > > > int mad_register_client(int mgmt, uint8_t rmpp_version); > > > > int mad_register_server(int mgmt, uint8_t rmpp_version, > > > > uint32_t method_mask[4], uint32_t class_oui); > > > > @@ -704,6 +705,14 @@ void madrpc_lock(void); > > > > void madrpc_unlock(void); > > > > void madrpc_show_errors(int set); > > > > > > > > +void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > > > > + int num_classes); > > > > +void mad_rpc_close_port(void *ibmad_port); > > > > +void * mad_rpc(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > > > > + void *payload, void *rcvdata); > > > > +void * mad_rpc_rmpp(void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > > > > + ib_rmpp_hdr_t *rmpp, void *data); > > > > + > > > > /* smp.c */ > > > > uint8_t * smp_query(void *buf, ib_portid_t *id, uint attrid, uint mod, > > > > uint timeout); > > > > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > > > > index bf81bd1..78b7ff0 100644 > > > > --- a/libibmad/src/libibmad.map > > > > +++ b/libibmad/src/libibmad.map > > > > @@ -62,6 +62,10 @@ IBMAD_1.0 { > > > > > > This should be 1.1 > > > > Ok. > > > > > > > > > ib_resolve_self; > > > > ib_resolve_smlid; > > > > ibdebug; > > > > + mad_rpc_open_port; > > > > + mad_rpc_close_port; > > > > + mad_rpc; > > > > + mad_rpc_rmpp; > > > > madrpc; > > > > madrpc_def_timeout; > > > > madrpc_init; > > > > > > What about mad_register_port_client ? Should that be included here ? > > > > It is not used externally - all registrations are done in _open(). So I > > don't see this as part of the new "API". Maybe if we will decide to > > extend it later we will need to "export" this symbol. > > OK. 
> > > > > diff --git a/libibmad/src/register.c b/libibmad/src/register.c > > > > index 4f44625..52d6989 100644 > > > > --- a/libibmad/src/register.c > > > > +++ b/libibmad/src/register.c > > > > @@ -43,6 +43,7 @@ #include > > > > #include > > > > #include > > > > #include > > > > +#include > > > > > > > > #include > > > > #include "mad.h" > > > > @@ -118,7 +119,7 @@ mad_agent_class(int agent) > > > > } > > > > > > > > int > > > > -mad_register_client(int mgmt, uint8_t rmpp_version) > > > > +mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version) > > > > { > > > > int vers, agent; > > > > > > > > @@ -126,7 +127,7 @@ mad_register_client(int mgmt, uint8_t rm > > > > DEBUG("Unknown class %d mgmt_class", mgmt); > > > > return -1; > > > > } > > > > - if ((agent = umad_register(madrpc_portid(), mgmt, > > > > + if ((agent = umad_register(port_id, mgmt, > > > > vers, rmpp_version, 0)) < 0) { > > > > DEBUG("Can't register agent for class %d", mgmt); > > > > return -1; > > > > @@ -137,13 +138,22 @@ mad_register_client(int mgmt, uint8_t rm > > > > return -1; > > > > } > > > > > > > > - if (register_agent(agent, mgmt) < 0) > > > > - return -1; > > > > - > > > > return agent; > > > > } > > > > > > > > int > > > > +mad_register_client(int mgmt, uint8_t rmpp_version) > > > > +{ > > > > + int agent; > > > > + > > > > + agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version); > > > > + if (agent < 0) > > > > + return agent; > > > > + > > > > + return register_agent(agent, mgmt); > > > > +} > > > > + > > > > +int > > > > mad_register_server(int mgmt, uint8_t rmpp_version, > > > > uint32_t method_mask[4], uint32_t class_oui) > > > > { > > > > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > > > > index b2d3e77..ac4f361 100644 > > > > --- a/libibmad/src/rpc.c > > > > +++ b/libibmad/src/rpc.c > > > > @@ -48,6 +48,13 @@ #include > > > > #include > > > > #include "mad.h" > > > > > > > > +#define MAX_CLASS 256 > > > > + > > > > +struct ibmad_port { > > > > + int port_id; /* file descriptor returned by umad_open() */ > > > > + int class_agents[MAX_CLASS]; /* class2agent mapper */ > > > > +}; > > > > + > > > > int ibdebug; > > > > > > > > static int mad_portid = -1; > > > > @@ -105,7 +112,8 @@ madrpc_portid(void) > > > > } > > > > > > > > static int > > > > -_do_madrpc(void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) > > > > +_do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > > > > + int timeout) > > > > { > > > > uint32_t trid; /* only low 32 bits */ > > > > int retries; > > > > @@ -133,7 +141,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > > > } > > > > > > > > length = len; > > > > - if (umad_send(mad_portid, agentid, sndbuf, length, timeout, 0) < 0) { > > > > + if (umad_send(port_id, agentid, sndbuf, length, timeout, 0) < 0) { > > > > IBWARN("send failed; %m"); > > > > return -1; > > > > } > > > > @@ -141,7 +149,7 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > > > /* Use same timeout on receive side just in case */ > > > > /* send packet is lost somewhere. 
*/ > > > > do { > > > > - if (umad_recv(mad_portid, rcvbuf, &length, timeout) < 0) { > > > > + if (umad_recv(port_id, rcvbuf, &length, timeout) < 0) { > > > > IBWARN("recv failed: %m"); > > > > return -1; > > > > } > > > > @@ -164,8 +172,10 @@ _do_madrpc(void *sndbuf, void *rcvbuf, i > > > > } > > > > > > > > void * > > > > -madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) > > > > +mad_rpc(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, void *payload, > > > > + void *rcvdata) > > > > { > > > > + struct ibmad_port *p = port_id; > > > > int status, len; > > > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > > > > > @@ -175,7 +185,8 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport > > > > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > > > > return 0; > > > > > > > > - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), > > > > + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > > > + p->class_agents[rpc->mgtclass], > > > > len, rpc->timeout)) < 0) > > > > return 0; > > > > > > > > @@ -198,8 +209,10 @@ madrpc(ib_rpc_t *rpc, ib_portid_t *dport > > > > } > > > > > > > > void * > > > > -madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > > > > +mad_rpc_rmpp(void *port_id, ib_rpc_t *rpc, ib_portid_t *dport, > > > > + ib_rmpp_hdr_t *rmpp, void *data) > > > > { > > > > + struct ibmad_port *p = port_id; > > > > int status, len; > > > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > > > > > @@ -210,7 +223,8 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * > > > > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > > > > return 0; > > > > > > > > - if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), > > > > + if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > > > + p->class_agents[rpc->mgtclass], > > > > len, rpc->timeout)) < 0) > > > > return 0; > > > > > > > > @@ -249,6 +263,24 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t * > > > > return data; > > > > } > > > > > > > > +void * > > > > +madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) > > > > +{ > > > > + struct ibmad_port port; > > > > + port.port_id = mad_portid; > > > > + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); > > > > + return mad_rpc(&port, rpc, dport, payload, rcvdata); > > > > +} > > > > + > > > > +void * > > > > +madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > > > > +{ > > > > + struct ibmad_port port; > > > > + port.port_id = mad_portid; > > > > + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); > > > > + return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); > > > > +} > > > > + > > > > static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; > > > > > > > > void > > > > @@ -282,3 +314,63 @@ madrpc_init(char *dev_name, int dev_port > > > > IBPANIC("client_register for mgmt %d failed", mgmt); > > > > } > > > > } > > > > + > > > > +void * > > > > +mad_rpc_open_port(char *dev_name, int dev_port, > > > > + int *mgmt_classes, int num_classes) > > > > +{ > > > > + struct ibmad_port *p; > > > > + int port_id; > > > > > > Should there be some validation on num_classes < MAX_CLASS ? > > > > Such check is cheap and may be performed (it was not done in > > madrpc_init()). > > Guess that validation is needed in both places. I'll add it subsequent > to this. Thanks. > > > Without this the function will "work" (will fail), but in longer way > > (this will fail to register an agent when MAX_CLASS will be overflowed). 
> > Won't it overwrite some structure (scribble on memory) ? Not num_classes itself, it is used just as counter. Bad *mgmt_classes value could, but this one is checked. Sasha > > > > > + if (umad_init() < 0) { > > > > + IBWARN("can't init UMAD library"); > > > > + errno = ENODEV; > > > > + return NULL; > > > > + } > > > > + > > > > + p = malloc(sizeof(*p)); > > > > + if (!p) { > > > > + errno = ENOMEM; > > > > + return NULL; > > > > + } > > > > + memset(p, 0, sizeof(*p)); > > > > + > > > > + if ((port_id = umad_open_port(dev_name, dev_port)) < 0) { > > > > + IBWARN("can't open UMAD port (%s:%d)", dev_name, dev_port); > > > > + if (!errno) > > > > + errno = EIO; > > > > + free(p); > > > > + return NULL; > > > > + } > > > > + > > > > + while (num_classes--) { > > > > + int rmpp_version = 0; > > > > + int mgmt = *mgmt_classes++; > > > > + int agent; > > > > + > > > > + if (mgmt == IB_SA_CLASS) > > > > + rmpp_version = 1; > > > > > > There are other classes which can use RMPP. How are they handled ? > > > > This is copy & paste from madrpc_init(). > > This problem is generic for libibmad and I think should be fixed > > separately > > You are right :-( > > > (maybe in mad_register_port_client()). > > Perhaps. We'll see. > > > > > + if (mgmt < 0 || mgmt >= MAX_CLASS || > > > > + (agent = mad_register_port_client(port_id, mgmt, > > > > + rmpp_version)) < 0) { > > > > + IBWARN("client_register for mgmt %d failed", mgmt); > > > > + if(!errno) > > > > + errno = EINVAL; > > > > + umad_close_port(port_id); > > > > + free(p); > > > > + return NULL; > > > > + } > > > > + p->class_agents[mgmt] = agent; > > > > + } > > > > + > > > > + p->port_id = port_id; > > > > + return p; > > > > +} > > > > + > > > > +void > > > > +mad_rpc_close_port(void *port_id) > > > > +{ > > > > + struct ibmad_port *p = port_id; > > > > + umad_close_port(p->port_id); > > > > + free(p); > > > > +} > > > > diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c > > > > index a99fb5a..cb9eef6 100644 > > > > --- a/libibumad/src/umad.c > > > > +++ b/libibumad/src/umad.c > > > > @@ -93,12 +93,14 @@ port_alloc(int portid, char *dev, int po > > > > > > > > if (portid < 0 || portid >= UMAD_MAX_PORTS) { > > > > IBWARN("bad umad portid %d", portid); > > > > + errno = EINVAL; > > > > return 0; > > > > } > > > > > > > > if (port->dev_name[0]) { > > > > IBWARN("umad port id %d is already allocated for %s %d", > > > > portid, port->dev_name, port->dev_port); > > > > + errno = EBUSY; > > > > return 0; > > > > } > > > > > > > > @@ -567,7 +569,7 @@ umad_open_port(char *ca_name, int portnu > > > > return -EINVAL; > > > > > > > > if (!(port = port_alloc(umad_id, ca_name, portnum))) > > > > - return -EINVAL; > > > > + return -errno; > > > > > > > > snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", > > > > UMAD_DEV_DIR , umad_id); > > > > > > Is the umad.c change really a separate change from the rest ? > > > > It was done in order to provide the meanfull errno value in case of > > mad_rpc_open() failure (not needed with madrpc_init() because it does > > exit() if something is wrong) and this can be separated. > > > > > If so, > > > this patch should be broken into two parts and that is the first part. > > > > Agree. > > > > > No need to resubmit for this. > > > > Ok. And for the rest of changes? > > Yes. 
> > -- Hal > > > Sasha > > > > > > > > -- Hal > > > > From halr at voltaire.com Wed Aug 30 09:33:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2006 12:33:13 -0400 Subject: [openib-general] [PATCH][TRIVIAL] libibmad/src/rpc.c: Validate num_classes Message-ID: <1156955592.4504.6553.camel@hal.voltaire.com> libibmad/src/rpc.c: Validate num_classes Signed-off-by: Hal Rosenstock Index: libibmad/src/rpc.c =================================================================== --- libibmad/src/rpc.c (revision 9192) +++ libibmad/src/rpc.c (working copy) @@ -306,6 +306,9 @@ madrpc_init(char *dev_name, int dev_port if ((mad_portid = umad_open_port(dev_name, dev_port)) < 0) IBPANIC("can't open UMAD port (%s:%d)", dev_name, dev_port); + if (num_classes >= MAX_CLASS) + IBPANIC("too many classes %d requested", num_classes); + while (num_classes--) { int rmpp_version = 0; int mgmt = *mgmt_classes++; @@ -324,6 +327,12 @@ mad_rpc_open_port(char *dev_name, int de struct ibmad_port *p; int port_id; + if (num_classes >= MAX_CLASS) { + IBWARN("too many classes %d requested", num_classes); + errno = EINVAL; + return NULL; + } + if (umad_init() < 0) { IBWARN("can't init UMAD library"); errno = ENODEV; From halr at voltaire.com Wed Aug 30 10:18:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2006 13:18:31 -0400 Subject: [openib-general] [PATCHv2] osm: Dynamic verbosity control per file In-Reply-To: References: Message-ID: <1156958310.4504.7881.camel@hal.voltaire.com> Hi Yevgeny, On Sun, 2006-08-27 at 10:49, Yevgeny Kliteynik wrote: > Hi Hal. > > This is a resubmission of the patch that addresses > the comments that I got on the first version - using > osm-log.conf file instead of opensmlog.conf and osm > man page update. Thanks. Would you rebase this off the latest trunk (and revalidate) ? A number of the patches were rejected and it is too much to do by hand. There were some recent changes in the osm_log area that affect this patch. Also, see a couple of comments embedded below. -- Hal > Yevgeny > > Signed-off-by: Yevgeny Kliteynik [snip...] > Index: opensm/libopensm.map > =================================================================== > --- opensm/libopensm.map (revision 9107) > +++ opensm/libopensm.map (working copy) > @@ -1,6 +1,11 @@ > -OPENSM_1.1 { > +OPENSM_2.0 { Does libopensm.ver need changing too ? > global: > - osm_log; > + osm_log_init_ext; > + osm_log_ext; > + osm_log_raw_ext; > + osm_log_get_level_ext; > + osm_log_is_active_ext; > + osm_log_read_verbosity_file; > osm_is_debug; > osm_mad_pool_construct; > osm_mad_pool_destroy; > @@ -39,7 +44,6 @@ OPENSM_1.1 { > osm_dump_dr_path; > osm_dump_smp_dr_path; > osm_dump_pkey_block; > - osm_log_raw; > osm_get_sm_state_str; > osm_get_sm_signal_str; > osm_get_disp_msg_str; > @@ -51,5 +55,11 @@ OPENSM_1.1 { > osm_get_lsa_str; > osm_get_sm_mgr_signal_str; > osm_get_sm_mgr_state_str; > + st_init_strtable; > + st_delete; > + st_insert; > + st_lookup; > + st_foreach; > + st_free_table; Should the other st_ routines also be added ? st_init_table_with_size st_init_numtable st_init_numtable_with_size st_init_strtable_with_size st_add_direct st_copy st_delete_safe st_cleanup_safe > local: *; > }; [snip...] 
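As a concrete illustration of the per-file verbosity configuration file
discussed in this thread, a possible osm-log.conf could look like the
following (the particular source files and levels are only an example, not
suggested defaults):

    all              0x03
    osm_ucast_mgr.c  0xFF
    osm_vl15intf.c   0x07

With such a file every source file not listed logs at the 'all' level
(0x03, which also overrides the command-line level), the two listed files
get their own levels, and sending SIGHUP to OpenSM makes it re-read the
file, as described in the patch.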
From mshefty at ichips.intel.com Wed Aug 30 10:24:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 30 Aug 2006 10:24:07 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060830045726.GA25478@mellanox.co.il> References: <44F4AB73.2070208@ichips.intel.com> <20060830045726.GA25478@mellanox.co.il> Message-ID: <44F5C9B7.8070408@ichips.intel.com> Michael S. Tsirkin wrote: > And so can RTU, in which case again QP will be in RTR. So it seems > lost CM packets aren't protected by timewait. Maybe we just try to deal with this the best that we can and make the HCA driver responsible for not re-allocating QPs for a duration of local_ack_timeout once they've entered RTS. If connections are made through the IB CM, it seems unlikely that stale packets will float around the subnet longer than it takes to establish a new connection. We just need to be able to detect stale connections, which requires that users use the CM when connecting. > At least in case of mthca, I think we do have the last PSN. > So I guess we could have a special wildcard PSN value that let's low > level driver select it. What would a good value for the PSN be? I'm not sure about this idea now. A random starting PSN is easy. How much better off are we trying to be clever? Btw, have you been able to determine why your patch leads to a crash under stress? It looks to me like it would work. - Sean From rdreier at cisco.com Wed Aug 30 10:35:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 Aug 2006 10:35:30 -0700 Subject: [openib-general] [PATCH v5 1/2] iWARP Connection Manager. In-Reply-To: <20060803210240.16228.18429.stgit@dell3.ogc.int> (Steve Wise's message of "Thu, 03 Aug 2006 16:02:40 -0500") References: <20060803210238.16228.47335.stgit@dell3.ogc.int> <20060803210240.16228.18429.stgit@dell3.ogc.int> Message-ID: OK, getting closer to finishing the merge... anyway, why is iw_cm_private.h in include/rdma where it is visible everywhere? As far as I can tell drivers/infiniband/core/iwcm.c is the only place it's included. So why not just put this stuff in drivers/infiniband/core/iwcm.h and do #include "iwcm.h" Or the file is small enough that maybe it's simpler just to stuff this at the top of iwcm.c and kill the include entirely? - R. > --- /dev/null > +++ b/include/rdma/iw_cm_private.h > @@ -0,0 +1,63 @@ > +#ifndef IW_CM_PRIVATE_H > +#define IW_CM_PRIVATE_H > + > +#include > + > +enum iw_cm_state { > + IW_CM_STATE_IDLE, /* unbound, inactive */ > + IW_CM_STATE_LISTEN, /* listen waiting for connect */ > + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ > + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ > + IW_CM_STATE_ESTABLISHED, /* established */ > + IW_CM_STATE_CLOSING, /* disconnect */ > + IW_CM_STATE_DESTROYING /* object being deleted */ > +}; > + > +struct iwcm_id_private { > + struct iw_cm_id id; > + enum iw_cm_state state; > + unsigned long flags; > + struct ib_qp *qp; > + struct completion destroy_comp; > + wait_queue_head_t connect_wait; > + struct list_head work_list; > + spinlock_t lock; > + atomic_t refcount; > + struct list_head work_free_list; > +}; > +#define IWCM_F_CALLBACK_DESTROY 1 > +#define IWCM_F_CONNECT_WAIT 2 > + > +#endif /* IW_CM_PRIVATE_H */ From tom at opengridcomputing.com Wed Aug 30 10:52:27 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 30 Aug 2006 12:52:27 -0500 Subject: [openib-general] [PATCH v5 1/2] iWARP Connection Manager. 
In-Reply-To: References: <20060803210238.16228.47335.stgit@dell3.ogc.int> <20060803210240.16228.18429.stgit@dell3.ogc.int> Message-ID: <1156960347.8973.16.camel@trinity.ogc.int> On Wed, 2006-08-30 at 10:35 -0700, Roland Dreier wrote: > OK, getting closer to finishing the merge... > > anyway, why is iw_cm_private.h in include/rdma where it is visible > everywhere? As far as I can tell drivers/infiniband/core/iwcm.c is > the only place it's included. So why not just put this stuff in > drivers/infiniband/core/iwcm.h and do The data structures really belong in iwcm.c...but I have a KDB module that dumps IB data structures. So when I was writing the IWCM, I pulled them out where I could see them without include gymnastics. It seems pretty dumb though to have header files called *private.h in a public directory. Putting them in iwcm.h is fine with me... Here's a patch for the KDB code if anyone is interested... KDB module for dumping OpenFabrics stack data types From: Tom Tucker --- kdb/kdbmain.c | 2 kdb/modules/Makefile | 3 kdb/modules/kdbm_openfabrics.c | 372 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 375 insertions(+), 2 deletions(-) diff --git a/kdb/kdbmain.c b/kdb/kdbmain.c index 931b643..b35139a 100644 --- a/kdb/kdbmain.c +++ b/kdb/kdbmain.c @@ -1154,8 +1154,8 @@ kdb_quiet(int reason) * none */ +void kdba_cpu_up(void) {}; extern char kdb_prompt_str[]; - static int kdb_local(kdb_reason_t reason, int error, struct pt_regs *regs, kdb_dbtrap_t db_result) { diff --git a/kdb/modules/Makefile b/kdb/modules/Makefile index ae2ac53..fbf05e1 100644 --- a/kdb/modules/Makefile +++ b/kdb/modules/Makefile @@ -6,7 +6,8 @@ # # Copyright (c) 1999-2006 Silicon Graphics, Inc. All Rights Reserved. # -obj-$(CONFIG_KDB_MODULES) += kdbm_pg.o kdbm_task.o kdbm_vm.o kdbm_sched.o +obj-$(CONFIG_KDB_MODULES) += kdbm_pg.o kdbm_task.o kdbm_vm.o kdbm_sched.o \ + kdbm_openfabrics.o ifdef CONFIG_X86 ifndef CONFIG_X86_64 obj-$(CONFIG_KDB_MODULES) += kdbm_x86.o diff --git a/kdb/modules/kdbm_openfabrics.c b/kdb/modules/kdbm_openfabrics.c new file mode 100644 index 0000000..fdf204b --- /dev/null +++ b/kdb/modules/kdbm_openfabrics.c @@ -0,0 +1,372 @@ +/* + * Copyright (c) 2006 Tom Tucker, Open Grid Computing, Inc. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "../drivers/infiniband/hw/amso1100/c2_provider.h" + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("Debug RDMA"); +MODULE_LICENSE("Dual BSD/GPL"); + +static const char *wc_status_str[] = { + "SUCCESS", + "LOC_LEN_ERR", + "LOC_QP_OP_ERR", + "LOC_EEC_OP_ERR", + "LOC_PROT_ERR", + "WR_FLUSH_ERR", + "MW_BIND_ERR", + "BAD_RESP_ERR", + "LOC_ACCESS_ERR", + "REM_INV_REQ_ERR", + "REM_ACCESS_ERR", + "REM_OP_ERR", + "RETRY_EXC_ERR", + "RNR_RETRY_EXC_ERR", + "LOC_RDD_VIOL_ERR", + "REM_INV_RD_REQ_ERR", + "REM_ABORT_ERR", + "INV_EECN_ERR", + "INV_EEC_STATE_ERR", + "FATAL_ERR", + "RESP_TIMEOUT_ERR", + "GENERAL_ERR" +}; + +static inline const char *wc_status_to_str(int status) +{ + if (status > (sizeof(wc_status_str) / sizeof(wc_status_str[0]))) + return ""; + + return wc_status_str[status]; +} + +static const char *wc_opcode_str[] = { + "SEND", + "RDMA_WRITE", + "RDMA_READ", + "COMP_SWAP", + "FETCH_ADD", + "BIND_MW", + "RECV", + "RECV_RDMA_WITH_IMM", +}; + +static inline const char* wc_opcode_to_str(int op) +{ + + if (op > 129) + return ""; + else if (op >= 128) + op -= 122; + return wc_opcode_str[op]; +} + +static int +print_ib_wc(int argc, const char **argv, const char **envp, + struct pt_regs *regs) +{ + int ret = 0; + + if (argc == 1) { + kdb_machreg_t addr; + int nextarg = 1; + long offset = 0; + struct ib_wc wc; + + ret = kdbgetaddrarg(argc, argv, &nextarg, &addr, + &offset, NULL, regs); + if (ret) + return ret; + + kdb_printf("struct ib_wc [%p]\n", (void*)addr); + ret = kdb_getarea_size((void*)&wc, (unsigned long)addr, + sizeof wc); + if (ret) + return ret; + + kdb_printf(" wr_id : %llx\n", wc.wr_id); + kdb_printf(" status : \"%s\"\n", wc_status_to_str(wc.status)); + kdb_printf(" opcode : %s\n", wc_opcode_to_str(wc.opcode)); + kdb_printf(" vendor_err : %d\n", wc.vendor_err); + kdb_printf(" byte_len : %d\n", wc.byte_len); + kdb_printf(" imm_data : %d\n", wc.imm_data); + kdb_printf(" qp_num : %d\n", wc.qp_num); + kdb_printf(" src_qp : %d\n", wc.src_qp); + kdb_printf(" wc_flags : %x\n", wc.wc_flags); + kdb_printf(" pkey_index : %d\n", wc.pkey_index); + kdb_printf(" slid : %d\n", wc.slid); + kdb_printf(" sl : %d\n", wc.sl); + kdb_printf(" dlid_path_bits : %d\n", wc.dlid_path_bits); + kdb_printf(" port_num : %d\n", wc.port_num); + } else { + /* More than one arg */ + kdb_printf("Specify address of ib_wc to dump\n"); + return KDB_ARGCOUNT; + } + return ret; +} + +static kdb_machreg_t sge_addr; +static kdb_machreg_t addr; +static int sge_no; + +static int +print_ib_sge(int argc, const char **argv, const char **envp, + struct pt_regs *regs) +{ + int ret = 0; + struct ib_sge sge; + + if (argc == 1) { + int nextarg = 1; + long offset = 0; + + sge_no = 0; + ret = kdbgetaddrarg(argc, argv, &nextarg, &sge_addr, + &offset, NULL, regs); + if (ret) + return ret; + + ret = kdb_getarea_size((void*)&sge, (unsigned long)sge_addr, + sizeof(struct ib_sge)); + if (ret) + return ret; + + } + kdb_printf("sge[%d].addr : %llx\n", sge_no, sge.addr); + kdb_printf("sge[%d].length : %d\n", sge_no, sge.length); + kdb_printf("sge[%d].lkey : %08x\n", sge_no, sge.lkey); + sge_no += 1; + sge_addr = (kdb_machreg_t) + ((unsigned long)sge_addr + sizeof(struct ib_sge)); + return ret; +} + +static const char *iwcm_state_str[] = { + "IW_CM_STATE_IDLE", + "IW_CM_STATE_LISTEN", + "IW_CM_STATE_CONN_RECV", + "IW_CM_STATE_CONN_SENT", + "IW_CM_STATE_ESTABLISHED", + "IW_CM_STATE_CLOSING", + "IW_CM_STATE_DESTROYING" +}; + +static inline 
const char *to_iwcm_state_str(int state) +{ + if (state < 0 || + state > sizeof(iwcm_state_str)/sizeof(iwcm_state_str[0])) + return ""; + + + return iwcm_state_str[state]; +} + +static int +print_iw_cm_id(int argc, const char **argv, const char **envp, + struct pt_regs *regs) +{ + int ret = 0; + struct iwcm_id_private id; + struct sockaddr_in *sin; + + if (argc == 1) { + int nextarg = 1; + long offset = 0; + + ret = kdbgetaddrarg(argc, argv, &nextarg, &addr, + &offset, NULL, regs); + if (ret) + return ret; + + ret = kdb_getarea_size((void*)&id, (unsigned long)addr, + sizeof(struct iwcm_id_private)); + if (ret) + return ret; + } + kdb_printf("iw_cm_handler : %p\n", id.id.cm_handler); + kdb_printf("context : %p\n", id.id.context); + sin = (struct sockaddr_in*)&id.id.local_addr; + kdb_printf("local_addr : %d.%d.%d.%d\n", + NIPQUAD(sin->sin_addr.s_addr)); + sin = (struct sockaddr_in*)&id.id.remote_addr; + kdb_printf("remote_addr : %d.%d.%d.%d\n", + NIPQUAD(sin->sin_addr.s_addr)); + kdb_printf("provider_data : %p\n", id.id.provider_data); + kdb_printf("event_handler : %p\n", id.id.event_handler); + kdb_printf("state : %s\n", to_iwcm_state_str(id.state)); + kdb_printf("flags : %lx\n", id.flags); + kdb_printf("qp : %p\n", id.qp); + kdb_printf("refcount : %d\n", atomic_read(&id.refcount)); + + return ret; +} + + +static int +print_ib_cq(int argc, const char **argv, const char **envp, + struct pt_regs *regs) +{ + int ret = 0; + struct ib_cq cq; + + if (argc == 1) { + int nextarg = 1; + long offset = 0; + + ret = kdbgetaddrarg(argc, argv, &nextarg, &addr, + &offset, NULL, regs); + if (ret) + return ret; + + ret = kdb_getarea_size((void*)&cq, (unsigned long)addr, + sizeof(struct ib_cq)); + if (ret) + return ret; + } + + kdb_printf("Completion Queue\n" + "----------------------------------------\n"); + kdb_printf("device : %p\n",cq.device); + kdb_printf("uobject : %p\n",cq.uobject); + kdb_printf("comp_handler : %p\n",cq.comp_handler); + kdb_printf("event_handler : %p\n",cq.event_handler); + kdb_printf("cq_context : %p\n",cq.cq_context); + kdb_printf("cqe : %d\n",cq.cqe); + kdb_printf("usecnt : %d\n",atomic_read(&cq.usecnt)); + + return ret; +} + +static const char *qp_type_str[] = { + "IB_QPT_SMI", + "IB_QPT_GSI", + "IB_QPT_RC", + "IB_QPT_UC", + "IB_QPT_UD", + "IB_QPT_RAW_IPV6", + "IB_QPT_RAW_ETY" +}; + +static inline const char *qp_type_to_str(int state) +{ + if (state < 0 || + state >= sizeof(qp_type_str)/sizeof(qp_type_str[0])) + return ""; + + return qp_type_str[state]; +} + +static int +print_ib_qp(int argc, const char **argv, const char **envp, + struct pt_regs *regs) +{ + int ret = 0; + struct ib_qp qp; + + if (argc == 1) { + int nextarg = 1; + long offset = 0; + + ret = kdbgetaddrarg(argc, argv, &nextarg, &addr, + &offset, NULL, regs); + if (ret) + return ret; + + ret = kdb_getarea_size((void*)&qp, (unsigned long)addr, + sizeof(struct ib_qp)); + if (ret) + return ret; + } + + kdb_printf("Queueu Pair\n" + "----------------------------------------\n"); + kdb_printf("device : %p\n",qp.device); + kdb_printf("pd : %p\n",qp.pd); + kdb_printf("send_cq : %p\n",qp.send_cq); + kdb_printf("recv_cq : %p\n",qp.recv_cq); + kdb_printf("srq : %p\n",qp.srq); + kdb_printf("uobject : %p\n",qp.uobject); + kdb_printf("event_handler : %p\n",qp.event_handler); + kdb_printf("qp_context : %p\n",qp.qp_context); + kdb_printf("qp_num : %d\n",qp.qp_num); + kdb_printf("qp_type : %s\n",qp_type_to_str(qp.qp_type)); + + return ret; +} + +static int +print_ib_mr(int argc, const char **argv, const char **envp, + struct pt_regs 
*regs) +{ + int ret = 0; + struct ib_mr mr; + + if (argc == 1) { + int nextarg = 1; + long offset = 0; + + ret = kdbgetaddrarg(argc, argv, &nextarg, &addr, + &offset, NULL, regs); + if (ret) + return ret; + + ret = kdb_getarea_size((void*)&mr, (unsigned long)addr, + sizeof(struct ib_mr)); + if (ret) + return ret; + } + + kdb_printf("Memory Region\n" + "----------------------------------------\n"); + kdb_printf("device : %p\n", mr.device); + kdb_printf("pd : %p\n", mr.pd); + kdb_printf("uobject : %p\n", mr.uobject); + kdb_printf("lkey : %08x\n", mr.lkey); + kdb_printf("rkey : %08x\n", mr.rkey); + kdb_printf("usecnt : %d\n", atomic_read(&mr.usecnt)); + + return ret; +} + +static int __init rdma_init(void) +{ + kdb_register("ib_wc", + print_ib_wc, "<*ib_wc>", + "Display the specified IB Work Completion", 0); + kdb_register("ib_sge", + print_ib_sge, "<*ib_sge>", + "Display the specified IB Scatter Gather Entry", 0); + kdb_register("ib_qp", + print_ib_qp, "<*ib_qp>", + "Display the specified IB Queue Pair", 0); + kdb_register("ib_cq", + print_ib_cq, "<*ib_cq>", + "Display the specified IB Completion Queue", 0); + kdb_register("ib_mr", + print_ib_mr, "<*ib_mr>", + "Display the specified IB Memory Region", 0); + kdb_register("iw_cm_id", + print_iw_cm_id, "<*iw_cm_id>", + "Display the specified IW CM ID", 0); + return 0; +} + +static void __exit rdma_exit(void) +{ + kdb_unregister("ib_wc"); + kdb_unregister("ib_sge"); +} + +module_init(rdma_init) +module_exit(rdma_exit) > > #include "iwcm.h" > > Or the file is small enough that maybe it's simpler just to stuff this > at the top of iwcm.c and kill the include entirely? > > - R. > > > --- /dev/null > > +++ b/include/rdma/iw_cm_private.h > > @@ -0,0 +1,63 @@ > > +#ifndef IW_CM_PRIVATE_H > > +#define IW_CM_PRIVATE_H > > + > > +#include > > + > > +enum iw_cm_state { > > + IW_CM_STATE_IDLE, /* unbound, inactive */ > > + IW_CM_STATE_LISTEN, /* listen waiting for connect */ > > + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ > > + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ > > + IW_CM_STATE_ESTABLISHED, /* established */ > > + IW_CM_STATE_CLOSING, /* disconnect */ > > + IW_CM_STATE_DESTROYING /* object being deleted */ > > +}; > > + > > +struct iwcm_id_private { > > + struct iw_cm_id id; > > + enum iw_cm_state state; > > + unsigned long flags; > > + struct ib_qp *qp; > > + struct completion destroy_comp; > > + wait_queue_head_t connect_wait; > > + struct list_head work_list; > > + spinlock_t lock; > > + atomic_t refcount; > > + struct list_head work_free_list; > > +}; > > +#define IWCM_F_CALLBACK_DESTROY 1 > > +#define IWCM_F_CONNECT_WAIT 2 > > + > > +#endif /* IW_CM_PRIVATE_H */ > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From mst at mellanox.co.il Wed Aug 30 10:52:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Aug 2006 20:52:16 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <44F5C9B7.8070408@ichips.intel.com> References: <44F4AB73.2070208@ichips.intel.com> <20060830045726.GA25478@mellanox.co.il> <44F5C9B7.8070408@ichips.intel.com> Message-ID: <20060830175216.GB30879@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Michael S. Tsirkin wrote: > > And so can RTU, in which case again QP will be in RTR. 
So it seems > > lost CM packets aren't protected by timewait. > > Maybe we just try to deal with this the best that we can and make the HCA > driver responsible for not re-allocating QPs for a duration of > local_ack_timeout once they've entered RTS. > > If connections are made through the IB CM, it seems unlikely that stale > packets will float around the subnet longer than it takes to establish a new > connection. We just need to be able to detect stale connections, which > requires that users use the CM when connecting. Fair enough. > > At least in case of mthca, I think we do have the last PSN. > > So I guess we could have a special wildcard PSN value that let's low > > level driver select it. What would a good value for the PSN be? > > I'm not sure about this idea now. A random starting PSN is easy. How much > better off are we trying to be clever? Donnu. In setups such as boot over IB, random values might be hard to get. > Btw, have you been able to determine why your patch leads to a crash under > stress? It looks to me like it would work. It exposed a race in SDP. The patch itself does not lead to crashes - I re-attach it here for reference. As we discussed, this needs to be extended to handle DREQ retries properly. --- IB/cm: do not track remote QPN in TimeWait, since QP is not connected. This avoids spurious "stale connection" rejects. Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index f85c97f..e270311 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -679,6 +679,8 @@ static void cm_enter_timewait(struct cm_ { int wait_time; + cm_cleanup_timewait(cm_id_priv->timewait_info); + /* * The cm_id could be destroyed by the user before we exit timewait. * To protect against this, we search for the cm_id after exiting -- MST From HNGUYEN at de.ibm.com Wed Aug 30 11:27:51 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Wed, 30 Aug 2006 20:27:51 +0200 Subject: [openib-general] [PATCH 02/13] IB/ehca: includes In-Reply-To: <200608301143.35320.arnd.bergmann@de.ibm.com> Message-ID: Hi, > There are a few places in the driver where you declare > external variables (mostly ehca_module and ehca_debug_level) > from C files instead of a header. This sometimes leads > to bugs when a type changes and is therefore considered > bad style. Good point. See patch attached below. > Moreover, for some of your more heavily used caches, you may > want to look into using constructor/destructor calls to > speed up allocation. That makes sense. Will look into this for a later patch. Thanks! 
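For reference, a rough, untested sketch of what the constructor idea mentioned above could look like for one of these caches, using the same 2.6.18-era kmem_cache_create() signature that the patch below already passes NULL constructor/destructor arguments to. The function name mr_cache_ctor is made up for illustration and is not part of the patch. Note that a slab constructor runs when a slab page is populated, not on every kmem_cache_alloc(), so only state that stays valid across free/alloc (such as the mrlock spinlock) belongs in it; per-allocation zeroing would still have to happen after the allocation, as the current code does.

static void mr_cache_ctor(void *obj, struct kmem_cache *cache,
                          unsigned long flags)
{
        struct ehca_mr *me = obj;

        /* Only initialize on real construction, not slab verification. */
        if ((flags & (SLAB_CTOR_VERIFY | SLAB_CTOR_CONSTRUCTOR)) ==
            SLAB_CTOR_CONSTRUCTOR)
                spin_lock_init(&me->mrlock);
}

        /* hypothetical replacement for the NULL ctor argument used below */
        mr_cache = kmem_cache_create("ehca_cache_mr",
                                     sizeof(struct ehca_mr), 0,
                                     SLAB_HWCACHE_ALIGN,
                                     mr_cache_ctor, NULL);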
Nam Makefile | 1 ehca_av.c | 29 +++++++--- ehca_classes.h | 27 +++++---- ehca_cq.c | 27 +++++++-- ehca_eq.c | 14 ---- ehca_irq.c | 1 ehca_main.c | 164 ++++++++++++++++++++------------------------------------- ehca_mrmw.c | 45 +++++++++++---- ehca_pd.c | 25 +++++++- ehca_qp.c | 32 +++++++---- ehca_reqs.c | 2 ehca_sqp.c | 2 hcp_if.c | 1 hcp_phyp.h | 4 - ipz_pt_fn.c | 2 15 files changed, 198 insertions(+), 178 deletions(-) diff -Nurp infiniband/drivers/infiniband/hw/ehca/Makefile infiniband_work/drivers/infiniband/hw/ehca/Makefile --- infiniband/drivers/infiniband/hw/ehca/Makefile 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/Makefile 2006-08-30 20:00:17.000000000 +0200 @@ -10,6 +10,7 @@ obj-$(CONFIG_INFINIBAND_EHCA) += ib_ehca.o + ib_ehca-objs = ehca_main.o ehca_hca.o ehca_mcast.o ehca_pd.o ehca_av.o ehca_eq.o \ ehca_cq.o ehca_qp.o ehca_sqp.o ehca_mrmw.o ehca_reqs.o ehca_irq.o \ ehca_uverbs.o ipz_pt_fn.o hcp_if.o hcp_phyp.o diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_av.c infiniband_work/drivers/infiniband/hw/ehca/ehca_av.c --- infiniband/drivers/infiniband/hw/ehca/ehca_av.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_av.c 2006-08-30 20:00:16.000000000 +0200 @@ -48,16 +48,16 @@ #include "ehca_iverbs.h" #include "hcp_if.h" +static struct kmem_cache *av_cache; + struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) { - extern struct ehca_module ehca_module; - extern int ehca_static_rate; int ret; struct ehca_av *av; struct ehca_shca *shca = container_of(pd->device, struct ehca_shca, ib_device); - av = kmem_cache_alloc(ehca_module.cache_av, SLAB_KERNEL); + av = kmem_cache_alloc(av_cache, SLAB_KERNEL); if (!av) { ehca_err(pd->device, "Out of memory pd=%p ah_attr=%p", pd, ah_attr); @@ -128,7 +128,7 @@ struct ib_ah *ehca_create_ah(struct ib_p return &av->ib_ah; create_ah_exit1: - kmem_cache_free(ehca_module.cache_av, av); + kmem_cache_free(av_cache, av); return ERR_PTR(ret); } @@ -238,7 +238,6 @@ int ehca_query_ah(struct ib_ah *ah, stru int ehca_destroy_ah(struct ib_ah *ah) { - extern struct ehca_module ehca_module; struct ehca_pd *my_pd = container_of(ah->pd, struct ehca_pd, ib_pd); u32 cur_pid = current->tgid; @@ -249,8 +248,24 @@ int ehca_destroy_ah(struct ib_ah *ah) return -EINVAL; } - kmem_cache_free(ehca_module.cache_av, - container_of(ah, struct ehca_av, ib_ah)); + kmem_cache_free(av_cache, container_of(ah, struct ehca_av, ib_ah)); + + return 0; +} +int ehca_init_av_cache(void) +{ + av_cache = kmem_cache_create("ehca_cache_av", + sizeof(struct ehca_av), 0, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!av_cache) + return -ENOMEM; return 0; } + +void ehca_cleanup_av_cache(void) +{ + if (av_cache) + kmem_cache_destroy(av_cache); +} diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_classes.h infiniband_work/drivers/infiniband/hw/ehca/ehca_classes.h --- infiniband/drivers/infiniband/hw/ehca/ehca_classes.h 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_classes.h 2006-08-30 20:00:16.000000000 +0200 @@ -63,18 +63,6 @@ struct ehca_av; #include "ehca_irq.h" -struct ehca_module { - struct list_head shca_list; - spinlock_t shca_lock; - struct timer_list timer; - kmem_cache_t *cache_pd; - kmem_cache_t *cache_cq; - kmem_cache_t *cache_qp; - kmem_cache_t *cache_av; - kmem_cache_t *cache_mr; - kmem_cache_t *cache_mw; -}; - struct ehca_eq { u32 length; struct ipz_queue ipz_queue; @@ -274,11 +262,26 @@ int ehca_shca_delete(struct ehca_shca *m struct 
ehca_sport *ehca_sport_new(struct ehca_shca *anchor); +int ehca_init_pd_cache(void); +void ehca_cleanup_pd_cache(void); +int ehca_init_cq_cache(void); +void ehca_cleanup_cq_cache(void); +int ehca_init_qp_cache(void); +void ehca_cleanup_qp_cache(void); +int ehca_init_av_cache(void); +void ehca_cleanup_av_cache(void); +int ehca_init_mrmw_cache(void); +void ehca_cleanup_mrmw_cache(void); + extern spinlock_t ehca_qp_idr_lock; extern spinlock_t ehca_cq_idr_lock; extern struct idr ehca_qp_idr; extern struct idr ehca_cq_idr; +extern int ehca_static_rate; +extern int ehca_port_act_time; +extern int ehca_use_hp_mr; + struct ipzu_queue_resp { u64 queue; /* points to first queue entry */ u32 qe_size; /* queue entry size */ diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_cq.c infiniband_work/drivers/infiniband/hw/ehca/ehca_cq.c --- infiniband/drivers/infiniband/hw/ehca/ehca_cq.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_cq.c 2006-08-30 20:00:17.000000000 +0200 @@ -50,6 +50,8 @@ #include "ehca_irq.h" #include "hcp_if.h" +static struct kmem_cache *cq_cache; + int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp) { unsigned int qp_num = qp->real_qp_num; @@ -115,7 +117,6 @@ struct ib_cq *ehca_create_cq(struct ib_d struct ib_ucontext *context, struct ib_udata *udata) { - extern struct ehca_module ehca_module; static const u32 additional_cqe = 20; struct ib_cq *cq; struct ehca_cq *my_cq; @@ -133,7 +134,7 @@ struct ib_cq *ehca_create_cq(struct ib_d if (cqe >= 0xFFFFFFFF - 64 - additional_cqe) return ERR_PTR(-EINVAL); - my_cq = kmem_cache_alloc(ehca_module.cache_cq, SLAB_KERNEL); + my_cq = kmem_cache_alloc(cq_cache, SLAB_KERNEL); if (!my_cq) { ehca_err(device, "Out of memory for ehca_cq struct device=%p", device); @@ -324,14 +325,13 @@ create_cq_exit2: spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); create_cq_exit1: - kmem_cache_free(ehca_module.cache_cq, my_cq); + kmem_cache_free(cq_cache, my_cq); return cq; } int ehca_destroy_cq(struct ib_cq *cq) { - extern struct ehca_module ehca_module; u64 h_ret; int ret; struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); @@ -387,7 +387,7 @@ int ehca_destroy_cq(struct ib_cq *cq) return ehca2ib_return_code(h_ret); } ipz_queue_dtor(&my_cq->ipz_queue); - kmem_cache_free(ehca_module.cache_cq, my_cq); + kmem_cache_free(cq_cache, my_cq); return 0; } @@ -408,3 +408,20 @@ int ehca_resize_cq(struct ib_cq *cq, int return -EFAULT; } + +int ehca_init_cq_cache(void) +{ + cq_cache = kmem_cache_create("ehca_cache_cq", + sizeof(struct ehca_cq), 0, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!cq_cache) + return -ENOMEM; + return 0; +} + +void ehca_cleanup_cq_cache(void) +{ + if (cq_cache) + kmem_cache_destroy(cq_cache); +} diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_eq.c infiniband_work/drivers/infiniband/hw/ehca/ehca_eq.c --- infiniband/drivers/infiniband/hw/ehca/ehca_eq.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_eq.c 2006-08-30 20:00:16.000000000 +0200 @@ -163,20 +163,6 @@ void *ehca_poll_eq(struct ehca_shca *shc return eqe; } -void ehca_poll_eqs(unsigned long data) -{ - struct ehca_shca *shca; - struct ehca_module *module = (struct ehca_module*)data; - - spin_lock(&module->shca_lock); - list_for_each_entry(shca, &module->shca_list, shca_list) { - if (shca->eq.is_initialized) - ehca_tasklet_eq((unsigned long)(void*)shca); - } - mod_timer(&module->timer, jiffies + HZ); - spin_unlock(&module->shca_lock); -} - int ehca_destroy_eq(struct ehca_shca *shca, 
struct ehca_eq *eq) { unsigned long flags; diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_irq.c infiniband_work/drivers/infiniband/hw/ehca/ehca_irq.c --- infiniband/drivers/infiniband/hw/ehca/ehca_irq.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_irq.c 2006-08-30 20:00:16.000000000 +0200 @@ -427,7 +427,6 @@ void ehca_tasklet_eq(unsigned long data) /* TODO: better structure */ if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, eqe_value)) { - extern struct idr ehca_cq_idr; unsigned long flags; u32 token; struct ehca_cq *cq; diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_main.c infiniband_work/drivers/infiniband/hw/ehca/ehca_main.c --- infiniband/drivers/infiniband/hw/ehca/ehca_main.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_main.c 2006-08-30 20:01:34.000000000 +0200 @@ -4,6 +4,7 @@ * module start stop, hca detection * * Authors: Heiko J Schick + * Hoang-Nam Nguyen * * Copyright (c) 2005 IBM Corporation * @@ -47,7 +48,7 @@ MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); -MODULE_VERSION("SVNEHCA_0014"); +MODULE_VERSION("SVNEHCA_0015"); int ehca_open_aqp1 = 0; int ehca_debug_level = 0; @@ -92,129 +93,69 @@ spinlock_t ehca_cq_idr_lock; DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); -struct ehca_module ehca_module; +static struct list_head shca_list; /* list of all registered ehcas */ +static spinlock_t shca_list_lock; -int ehca_create_slab_caches(struct ehca_module *ehca_module) +static struct timer_list poll_eqs_timer; + +static int ehca_create_slab_caches(void) { int ret; - ehca_module->cache_pd = - kmem_cache_create("ehca_cache_pd", - sizeof(struct ehca_pd), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (!ehca_module->cache_pd) { + ret = ehca_init_pd_cache(); + if (ret) { ehca_gen_err("Cannot create PD SLAB cache."); - ret = -ENOMEM; - goto create_slab_caches1; + return ret; } - ehca_module->cache_cq = - kmem_cache_create("ehca_cache_cq", - sizeof(struct ehca_cq), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (!ehca_module->cache_cq) { + ret = ehca_init_cq_cache(); + if (ret) { ehca_gen_err("Cannot create CQ SLAB cache."); - ret = -ENOMEM; goto create_slab_caches2; } - ehca_module->cache_qp = - kmem_cache_create("ehca_cache_qp", - sizeof(struct ehca_qp), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (!ehca_module->cache_qp) { + ret = ehca_init_qp_cache(); + if (ret) { ehca_gen_err("Cannot create QP SLAB cache."); - ret = -ENOMEM; goto create_slab_caches3; } - ehca_module->cache_av = - kmem_cache_create("ehca_cache_av", - sizeof(struct ehca_av), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (!ehca_module->cache_av) { + ret = ehca_init_av_cache(); + if (ret) { ehca_gen_err("Cannot create AV SLAB cache."); - ret = -ENOMEM; goto create_slab_caches4; } - ehca_module->cache_mw = - kmem_cache_create("ehca_cache_mw", - sizeof(struct ehca_mw), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (!ehca_module->cache_mw) { - ehca_gen_err("Cannot create MW SLAB cache."); - ret = -ENOMEM; + ret = ehca_init_mrmw_cache(); + if (ret) { + ehca_gen_err("Cannot create MR&MW SLAB cache."); goto create_slab_caches5; } - ehca_module->cache_mr = - kmem_cache_create("ehca_cache_mr", - sizeof(struct ehca_mr), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (!ehca_module->cache_mr) { - ehca_gen_err("Cannot create MR SLAB cache."); - ret = -ENOMEM; - goto create_slab_caches6; - } - return 0; -create_slab_caches6: - 
kmem_cache_destroy(ehca_module->cache_mw); - create_slab_caches5: - kmem_cache_destroy(ehca_module->cache_av); + ehca_cleanup_av_cache(); create_slab_caches4: - kmem_cache_destroy(ehca_module->cache_qp); + ehca_cleanup_qp_cache(); create_slab_caches3: - kmem_cache_destroy(ehca_module->cache_cq); + ehca_cleanup_cq_cache(); create_slab_caches2: - kmem_cache_destroy(ehca_module->cache_pd); - -create_slab_caches1: + ehca_cleanup_pd_cache(); return ret; } -int ehca_destroy_slab_caches(struct ehca_module *ehca_module) +static void ehca_destroy_slab_caches(void) { - int ret; - - ret = kmem_cache_destroy(ehca_module->cache_pd); - if (ret) - ehca_gen_err("Cannot destroy PD SLAB cache. ret=%x", ret); - - ret = kmem_cache_destroy(ehca_module->cache_cq); - if (ret) - ehca_gen_err("Cannot destroy CQ SLAB cache. ret=%x", ret); - - ret = kmem_cache_destroy(ehca_module->cache_qp); - if (ret) - ehca_gen_err("Cannot destroy QP SLAB cache. ret=%x", ret); - - ret = kmem_cache_destroy(ehca_module->cache_av); - if (ret) - ehca_gen_err("Cannot destroy AV SLAB cache. ret=%x", ret); - - ret = kmem_cache_destroy(ehca_module->cache_mw); - if (ret) - ehca_gen_err("Cannot destroy MW SLAB cache. ret=%x", ret); - - ret = kmem_cache_destroy(ehca_module->cache_mr); - if (ret) - ehca_gen_err("Cannot destroy MR SLAB cache. ret=%x", ret); - - return 0; + ehca_cleanup_mrmw_cache(); + ehca_cleanup_av_cache(); + ehca_cleanup_qp_cache(); + ehca_cleanup_cq_cache(); + ehca_cleanup_pd_cache(); } #define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) @@ -682,9 +623,9 @@ static int __devinit ehca_probe(struct i ehca_create_device_sysfs(dev); - spin_lock(&ehca_module.shca_lock); - list_add(&shca->shca_list, &ehca_module.shca_list); - spin_unlock(&ehca_module.shca_lock); + spin_lock(&shca_list_lock); + list_add(&shca->shca_list, &shca_list); + spin_unlock(&shca_list_lock); return 0; @@ -767,9 +708,9 @@ static int __devexit ehca_remove(struct ib_dealloc_device(&shca->ib_device); - spin_lock(&ehca_module.shca_lock); + spin_lock(&shca_list_lock); list_del(&shca->shca_list); - spin_unlock(&ehca_module.shca_lock); + spin_unlock(&shca_list_lock); return ret; } @@ -790,26 +731,39 @@ static struct ibmebus_driver ehca_driver .remove = ehca_remove, }; +void ehca_poll_eqs(unsigned long data) +{ + struct ehca_shca *shca; + + spin_lock(&shca_list_lock); + list_for_each_entry(shca, &shca_list, shca_list) { + if (shca->eq.is_initialized) + ehca_tasklet_eq((unsigned long)(void*)shca); + } + mod_timer(&poll_eqs_timer, jiffies + HZ); + spin_unlock(&shca_list_lock); +} + int __init ehca_module_init(void) { int ret; printk(KERN_INFO "eHCA Infiniband Device Driver " - "(Rel.: SVNEHCA_0014)\n"); + "(Rel.: SVNEHCA_0015)\n"); idr_init(&ehca_qp_idr); idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); spin_lock_init(&ehca_cq_idr_lock); - INIT_LIST_HEAD(&ehca_module.shca_list); - spin_lock_init(&ehca_module.shca_lock); + INIT_LIST_HEAD(&shca_list); + spin_lock_init(&shca_list_lock); if ((ret = ehca_create_comp_pool())) { ehca_gen_err("Cannot create comp pool."); return ret; } - if ((ret = ehca_create_slab_caches(&ehca_module))) { + if ((ret = ehca_create_slab_caches())) { ehca_gen_err("Cannot create SLAB caches"); ret = -ENOMEM; goto module_init1; @@ -827,17 +781,16 @@ int __init ehca_module_init(void) ehca_gen_err("WARNING!!!"); ehca_gen_err("It is possible to lose interrupts."); } else { - init_timer(&ehca_module.timer); - ehca_module.timer.function = ehca_poll_eqs; - ehca_module.timer.data = (unsigned long)&ehca_module; - ehca_module.timer.expires = jiffies + HZ; 
- add_timer(&ehca_module.timer); + init_timer(&poll_eqs_timer); + poll_eqs_timer.function = ehca_poll_eqs; + poll_eqs_timer.expires = jiffies + HZ; + add_timer(&poll_eqs_timer); } return 0; module_init2: - ehca_destroy_slab_caches(&ehca_module); + ehca_destroy_slab_caches(); module_init1: ehca_destroy_comp_pool(); @@ -847,13 +800,12 @@ module_init1: void __exit ehca_module_exit(void) { if (ehca_poll_all_eqs == 1) - del_timer_sync(&ehca_module.timer); + del_timer_sync(&poll_eqs_timer); ehca_remove_driver_sysfs(&ehca_driver); ibmebus_unregister_driver(&ehca_driver); - if (ehca_destroy_slab_caches(&ehca_module) != 0) - ehca_gen_err("Cannot destroy SLAB caches"); + ehca_destroy_slab_caches(); ehca_destroy_comp_pool(); diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_mrmw.c infiniband_work/drivers/infiniband/hw/ehca/ehca_mrmw.c --- infiniband/drivers/infiniband/hw/ehca/ehca_mrmw.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_mrmw.c 2006-08-30 20:00:16.000000000 +0200 @@ -46,14 +46,14 @@ #include "hcp_if.h" #include "hipz_hw.h" -extern int ehca_use_hp_mr; +static struct kmem_cache *mr_cache; +static struct kmem_cache *mw_cache; static struct ehca_mr *ehca_mr_new(void) { - extern struct ehca_module ehca_module; struct ehca_mr *me; - me = kmem_cache_alloc(ehca_module.cache_mr, SLAB_KERNEL); + me = kmem_cache_alloc(mr_cache, SLAB_KERNEL); if (me) { memset(me, 0, sizeof(struct ehca_mr)); spin_lock_init(&me->mrlock); @@ -65,17 +65,14 @@ static struct ehca_mr *ehca_mr_new(void) static void ehca_mr_delete(struct ehca_mr *me) { - extern struct ehca_module ehca_module; - - kmem_cache_free(ehca_module.cache_mr, me); + kmem_cache_free(mr_cache, me); } static struct ehca_mw *ehca_mw_new(void) { - extern struct ehca_module ehca_module; struct ehca_mw *me; - me = kmem_cache_alloc(ehca_module.cache_mw, SLAB_KERNEL); + me = kmem_cache_alloc(mw_cache, SLAB_KERNEL); if (me) { memset(me, 0, sizeof(struct ehca_mw)); spin_lock_init(&me->mwlock); @@ -87,9 +84,7 @@ static struct ehca_mw *ehca_mw_new(void) static void ehca_mw_delete(struct ehca_mw *me) { - extern struct ehca_module ehca_module; - - kmem_cache_free(ehca_module.cache_mw, me); + kmem_cache_free(mw_cache, me); } /*----------------------------------------------------------------------*/ @@ -2236,3 +2231,31 @@ void ehca_mr_deletenew(struct ehca_mr *m mr->nr_of_pages = 0; mr->pagearray = NULL; } /* end ehca_mr_deletenew() */ + +int ehca_init_mrmw_cache(void) +{ + mr_cache = kmem_cache_create("ehca_cache_mr", + sizeof(struct ehca_mr), 0, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!mr_cache) + return -ENOMEM; + mw_cache = kmem_cache_create("ehca_cache_mw", + sizeof(struct ehca_mw), 0, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!mw_cache) { + kmem_cache_destroy(mr_cache); + mr_cache = NULL; + return -ENOMEM; + } + return 0; +} + +void ehca_cleanup_mrmw_cache(void) +{ + if (mr_cache) + kmem_cache_destroy(mr_cache); + if (mw_cache) + kmem_cache_destroy(mw_cache); +} diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_pd.c infiniband_work/drivers/infiniband/hw/ehca/ehca_pd.c --- infiniband/drivers/infiniband/hw/ehca/ehca_pd.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_pd.c 2006-08-30 20:00:16.000000000 +0200 @@ -43,13 +43,14 @@ #include "ehca_tools.h" #include "ehca_iverbs.h" +static struct kmem_cache *pd_cache; + struct ib_pd *ehca_alloc_pd(struct ib_device *device, struct ib_ucontext *context, struct ib_udata *udata) { - extern struct ehca_module ehca_module; struct 
ehca_pd *pd; - pd = kmem_cache_alloc(ehca_module.cache_pd, SLAB_KERNEL); + pd = kmem_cache_alloc(pd_cache, SLAB_KERNEL); if (!pd) { ehca_err(device, "device=%p context=%p out of memory", device, context); @@ -79,7 +80,6 @@ struct ib_pd *ehca_alloc_pd(struct ib_de int ehca_dealloc_pd(struct ib_pd *pd) { - extern struct ehca_module ehca_module; u32 cur_pid = current->tgid; struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd); @@ -90,8 +90,25 @@ int ehca_dealloc_pd(struct ib_pd *pd) return -EINVAL; } - kmem_cache_free(ehca_module.cache_pd, + kmem_cache_free(pd_cache, container_of(pd, struct ehca_pd, ib_pd)); return 0; } + +int ehca_init_pd_cache(void) +{ + pd_cache = kmem_cache_create("ehca_cache_pd", + sizeof(struct ehca_pd), 0, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!pd_cache) + return -ENOMEM; + return 0; +} + +void ehca_cleanup_pd_cache(void) +{ + if (pd_cache) + kmem_cache_destroy(pd_cache); +} diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_qp.c infiniband_work/drivers/infiniband/hw/ehca/ehca_qp.c --- infiniband/drivers/infiniband/hw/ehca/ehca_qp.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_qp.c 2006-08-30 20:00:16.000000000 +0200 @@ -51,6 +51,8 @@ #include "hcp_if.h" #include "hipz_fns.h" +static struct kmem_cache *qp_cache; + /* * attributes not supported by query qp */ @@ -387,7 +389,6 @@ struct ib_qp *ehca_create_qp(struct ib_p struct ib_qp_init_attr *init_attr, struct ib_udata *udata) { - extern struct ehca_module ehca_module; static int da_rc_msg_size[]={ 128, 256, 512, 1024, 2048, 4096 }; static int da_ud_sq_msg_size[]={ 128, 384, 896, 1920, 3968 }; struct ehca_qp *my_qp; @@ -449,7 +450,7 @@ struct ib_qp *ehca_create_qp(struct ib_p if (pd->uobject && udata) context = pd->uobject->context; - my_qp = kmem_cache_alloc(ehca_module.cache_qp, SLAB_KERNEL); + my_qp = kmem_cache_alloc(qp_cache, SLAB_KERNEL); if (!my_qp) { ehca_err(pd->device, "pd=%p not enough memory to alloc qp", pd); return ERR_PTR(-ENOMEM); @@ -716,7 +717,7 @@ create_qp_exit1: spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); create_qp_exit0: - kmem_cache_free(ehca_module.cache_qp, my_qp); + kmem_cache_free(qp_cache, my_qp); return ERR_PTR(ret); } @@ -728,7 +729,6 @@ create_qp_exit0: static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, int *bad_wqe_cnt) { - extern int ehca_debug_level; u64 h_ret; struct ipz_queue *squeue; void *bad_send_wqe_p, *bad_send_wqe_v; @@ -797,7 +797,6 @@ static int internal_modify_qp(struct ib_ struct ib_qp_attr *attr, int attr_mask, int smi_reset2init) { - extern int ehca_debug_level; enum ib_qp_state qp_cur_state, qp_new_state; int cnt, qp_attr_idx, ret = 0; enum ib_qp_statetrans statetrans; @@ -807,7 +806,7 @@ static int internal_modify_qp(struct ib_ container_of(ibqp->pd->device, struct ehca_shca, ib_device); u64 update_mask; u64 h_ret; - int bad_wqe_cnt; + int bad_wqe_cnt = 0; int squeue_locked = 0; unsigned long spl_flags = 0; @@ -1253,7 +1251,6 @@ int ehca_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr) { - extern int ehca_debug_level; struct ehca_qp *my_qp = container_of(qp, struct ehca_qp, ib_qp); struct ehca_pd *my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, ib_pd); @@ -1410,7 +1407,6 @@ query_qp_exit1: int ehca_destroy_qp(struct ib_qp *ibqp) { - extern struct ehca_module ehca_module; struct ehca_qp *my_qp = container_of(ibqp, struct ehca_qp, ib_qp); struct ehca_shca *shca = container_of(ibqp->device, struct ehca_shca, 
ib_device); @@ -1488,6 +1484,23 @@ int ehca_destroy_qp(struct ib_qp *ibqp) ipz_queue_dtor(&my_qp->ipz_rqueue); ipz_queue_dtor(&my_qp->ipz_squeue); - kmem_cache_free(ehca_module.cache_qp, my_qp); + kmem_cache_free(qp_cache, my_qp); return 0; } + +int ehca_init_qp_cache(void) +{ + qp_cache = kmem_cache_create("ehca_cache_qp", + sizeof(struct ehca_qp), 0, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!qp_cache) + return -ENOMEM; + return 0; +} + +void ehca_cleanup_qp_cache(void) +{ + if (qp_cache) + kmem_cache_destroy(qp_cache); +} diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_reqs.c infiniband_work/drivers/infiniband/hw/ehca/ehca_reqs.c --- infiniband/drivers/infiniband/hw/ehca/ehca_reqs.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_reqs.c 2006-08-30 20:00:16.000000000 +0200 @@ -49,8 +49,6 @@ #include "hcp_if.h" #include "hipz_fns.h" -extern int ehca_debug_level; - static inline int ehca_write_rwqe(struct ipz_queue *ipz_rqueue, struct ehca_wqe *wqe_p, struct ib_recv_wr *recv_wr) diff -Nurp infiniband/drivers/infiniband/hw/ehca/ehca_sqp.c infiniband_work/drivers/infiniband/hw/ehca/ehca_sqp.c --- infiniband/drivers/infiniband/hw/ehca/ehca_sqp.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ehca_sqp.c 2006-08-30 20:00:16.000000000 +0200 @@ -49,8 +49,6 @@ #include "hcp_if.h" -extern int ehca_port_act_time; - /** * ehca_define_sqp - Defines special queue pair 1 (GSI QP). When special queue * pair is created successfully, the corresponding port gets active. diff -Nurp infiniband/drivers/infiniband/hw/ehca/hcp_if.c infiniband_work/drivers/infiniband/hw/ehca/hcp_if.c --- infiniband/drivers/infiniband/hw/ehca/hcp_if.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/hcp_if.c 2006-08-30 20:00:17.000000000 +0200 @@ -410,7 +410,6 @@ u64 hipz_h_query_port(const struct ipz_a const u8 port_id, struct hipz_query_port *query_port_response_block) { - extern int ehca_debug_level; u64 ret; u64 dummy; u64 r_cb = virt_to_abs(query_port_response_block); diff -Nurp infiniband/drivers/infiniband/hw/ehca/hcp_phyp.h infiniband_work/drivers/infiniband/hw/ehca/hcp_phyp.h --- infiniband/drivers/infiniband/hw/ehca/hcp_phyp.h 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/hcp_phyp.h 2006-08-30 20:00:16.000000000 +0200 @@ -69,13 +69,13 @@ struct h_galpas { static inline u64 hipz_galpa_load(struct h_galpa galpa, u32 offset) { u64 addr = galpa.fw_handle + offset; - return *(u64 *)addr; + return *(volatile u64 __force *)addr; } static inline void hipz_galpa_store(struct h_galpa galpa, u32 offset, u64 value) { u64 addr = galpa.fw_handle + offset; - *(u64 *)addr = value; + *(volatile u64 __force *)addr = value; } int hcp_galpas_ctor(struct h_galpas *galpas, diff -Nurp infiniband/drivers/infiniband/hw/ehca/ipz_pt_fn.c infiniband_work/drivers/infiniband/hw/ehca/ipz_pt_fn.c --- infiniband/drivers/infiniband/hw/ehca/ipz_pt_fn.c 2006-08-30 18:02:01.000000000 +0200 +++ infiniband_work/drivers/infiniband/hw/ehca/ipz_pt_fn.c 2006-08-30 20:00:16.000000000 +0200 @@ -41,8 +41,6 @@ #include "ehca_tools.h" #include "ipz_pt_fn.h" -extern int ehca_hwlevel; - void *ipz_qpageit_get_inc(struct ipz_queue *queue) { void *ret = ipz_qeit_get(queue); From mst at mellanox.co.il Wed Aug 30 11:59:53 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 30 Aug 2006 21:59:53 +0300 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060830175216.GB30879@mellanox.co.il> References: <44F4AB73.2070208@ichips.intel.com> <20060830045726.GA25478@mellanox.co.il> <44F5C9B7.8070408@ichips.intel.com> <20060830175216.GB30879@mellanox.co.il> Message-ID: <20060830185953.GA31724@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > Quoting r. Sean Hefty : > > Subject: Re: [PATCH ] RFC IB/cm do not track remote QPN in timewait state > > > > Michael S. Tsirkin wrote: > > > And so can RTU, in which case again QP will be in RTR. So it seems > > > lost CM packets aren't protected by timewait. > > > > Maybe we just try to deal with this the best that we can and make the HCA > > driver responsible for not re-allocating QPs for a duration of > > local_ack_timeout once they've entered RTS. > > > > If connections are made through the IB CM, it seems unlikely that stale > > packets will float around the subnet longer than it takes to establish a new > > connection. We just need to be able to detect stale connections, which > > requires that users use the CM when connecting. > > Fair enough. To clarify - I don't necessarily agree that stale packets are unlikely, but I do agree that asking low level driver to prevent QPN from being re-allocated for a duration of local_ack_timeout once QP has entered RTS will be sufficient. -- MST From paul.baxter at dsl.pipex.com Wed Aug 30 12:13:15 2006 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Wed, 30 Aug 2006 20:13:15 +0100 Subject: [openib-general] File transfer performance options Message-ID: <007201c6cc68$5a735210$8000a8c0@blorp> We've been testing an application that archives large quantities of data from a Linux system onto a Windows-based server (64bit server 2003 R2). As part of the investigation into relatively modest transfer speeds in the win-linux configuration, we configured a Linux-Linux transfer via IpoIB with NFS layered on top (with ram disks to avoid physical disk issues) [Whilst for a real Linux-Linux configuration I would look for the RDMA over NFS solution, this wouldn't translate to our eventual win-linux inter-operable system.] I was surprised that even on linux-linux I hit a wall of 100MB/s (test notes below). Are others doing better? I was hoping for 150MB/s - 200MB/s Does anyone have any hints on tweaking of an IPoIB/NFS solution to get better throughput for large files (not so concerned about latency). Are there any other inter-operable windows-linux solutions now? (cross-platform NFS over RDMA or SRP initiator/target?) Paul Baxter ------------------- Some testing notes: The windows server remotely inspects the Linux filesystem and does a 'remote read' of large files (typical testing 1-4GB file) Using IPoIB/mthca and Win IB 1.2 - no particular tweaks i.e. 32 kB NFS block size Win-Linux a) Using untweaked Linux NFS and built-in Windows NFS Transfer rate 65MB/s b) Similar but using Samba on Linux and windows file sharing Transfer rate 90 MB/s c) Repeat a) and b) using ram disks rather than physical disks (1GB transfer) Confirmed similar transfer rates ie physical disks not limiting this Presentations on winIB noted that IPoIB has to snoop each packet , so I repeated test c) in a Linux-Linux configuration expecting much better results... NFS performance over Ext2 formatted filesystem ~ 100MB/s (~ 73MB/s on Ext3 with default (journalling on?) 
options) Samba performance ~ 64MB/s Next tried having recipient of large file, copy it to /dev/null rather than to a local file system. Reported transfer at 145MB/s (We've also noted along the way that remote read and remote write From mshefty at ichips.intel.com Wed Aug 30 12:17:58 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 30 Aug 2006 12:17:58 -0700 Subject: [openib-general] CMA oops In-Reply-To: <20060828105755.GB23639@mellanox.co.il> References: <20060828105755.GB23639@mellanox.co.il> Message-ID: <44F5E466.70206@ichips.intel.com> Michael S. Tsirkin wrote: > Apparently, list->prev pointer in CMA id_priv structure is NULL > which causes a crash in list_del. > > I note that rdma_destroy_id tests outside the mutex lock. > Could that be the problem? > The problem is not unfortunately easily reproducible. I think I see one bug, but it doesn't seem like its causing the crash that you saw. It's possible that address resolution can complete at the same time that rdma_destroy_id() is called. The addr_handler() will cause the rdma_cm_id to attach to a device while destroy is running, which can come after the check for id_priv->cma_dev is made. The result is that destroy will not detach from the device, leaving the rdma_cm_id in the device list after its destruction. I'm trying to come up with a fix for this, but I'm not convinced it's the problem that you're seeing. - Sean From ftillier at silverstorm.com Wed Aug 30 12:24:24 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 30 Aug 2006 12:24:24 -0700 Subject: [openib-general] [Openib-windows] File transfer performance options In-Reply-To: <007201c6cc68$5a735210$8000a8c0@blorp> References: <007201c6cc68$5a735210$8000a8c0@blorp> Message-ID: <79ae2f320608301224g26554803pce385a6a703b9425@mail.gmail.com> Hi Paul, On 8/30/06, Paul Baxter wrote: > > Are there any other inter-operable windows-linux solutions now? > (cross-platform NFS over RDMA or SRP initiator/target?) There is an SRP initiator for Windows, but not a target. There is a Linux SRP target reference implementation, but not coded for the OpenFabrics Linux stack. I don't know if anyone is working on porting that. > ------------------- > Some testing notes: > > Using IPoIB/mthca and Win IB 1.2 - no particular tweaks i.e. 32 kB NFS block > size What version of OpenIB does Win IB 1.2 correspond to? If you right-click the property of any of the binaries (say c:\Windows\System32\Drivers\ibbus.sys), what is the file version reported on the "Versions" property page? Thanks, - Fab From Brian.Cain at ge.com Wed Aug 30 12:36:51 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Wed, 30 Aug 2006 15:36:51 -0400 Subject: [openib-general] File transfer performance options In-Reply-To: <007201c6cc68$5a735210$8000a8c0@blorp> Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB033CA974F@CINMLVEM11.e2k.ad.ge.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Paul Baxter > Sent: Wednesday, August 30, 2006 2:13 PM > To: openib-general at openib.org > Cc: openib-windows at openib.org > Subject: [openib-general] File transfer performance options > > We've been testing an application that archives large > quantities of data > from a Linux system onto a Windows-based server (64bit server > 2003 R2). 
> > As part of the investigation into relatively modest transfer > speeds in the > win-linux configuration, we configured a Linux-Linux transfer > via IpoIB with > NFS layered on top (with ram disks to avoid physical disk issues) > > [Whilst for a real Linux-Linux configuration I would look for > the RDMA over > NFS solution, this wouldn't translate to our eventual win-linux > inter-operable system.] > > I was surprised that even on linux-linux I hit a wall of > 100MB/s (test notes > below). Are others doing better? I was hoping for 150MB/s - 200MB/s I can report streaming write results (using SRP, not NFS/IPoIB) of around 380MiB/s. Right now we think that there's a disk or controller bottleneck on the SRP target that's keeping us from getting up near 450-500 MiB/s or so. Both the initiator and target are linux-based. I think I heard of someone here using a Windows initiator and getting streaming write results similar to the 380MiB/s we're getting now. I guess there's quite a few differences in the scenarios we're describing, so it's pretty far from apples to apples. OBTW, in my experience, ext[23] seriously hamper performance. Try XFS or ReiserFS. The numbers above are all for XFS-formatted partitions. Maybe you should make the test notes a little more detailed. Doesn't NFS have a bunch of performance knobs (TCP vs UDP, block sizes, etc)? -Brian From tom at opengridcomputing.com Wed Aug 30 13:06:42 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 30 Aug 2006 15:06:42 -0500 Subject: [openib-general] File transfer performance options In-Reply-To: <007201c6cc68$5a735210$8000a8c0@blorp> References: <007201c6cc68$5a735210$8000a8c0@blorp> Message-ID: <1156968402.8973.35.camel@trinity.ogc.int> Are you familiar with NFSoRDMA? We get 600+MB/s on read and about 150-200MB/s on write with an XFS filesystem on a stripped software raid with four spindles. This is Linux <--> Linux. On Wed, 2006-08-30 at 20:13 +0100, Paul Baxter wrote: > We've been testing an application that archives large quantities of data > from a Linux system onto a Windows-based server (64bit server 2003 R2). > > As part of the investigation into relatively modest transfer speeds in the > win-linux configuration, we configured a Linux-Linux transfer via IpoIB with > NFS layered on top (with ram disks to avoid physical disk issues) > > [Whilst for a real Linux-Linux configuration I would look for the RDMA over > NFS solution, this wouldn't translate to our eventual win-linux > inter-operable system.] > > I was surprised that even on linux-linux I hit a wall of 100MB/s (test notes > below). Are others doing better? I was hoping for 150MB/s - 200MB/s > > Does anyone have any hints on tweaking of an IPoIB/NFS solution to get > better throughput for large files (not so concerned about latency). > > Are there any other inter-operable windows-linux solutions now? > (cross-platform NFS over RDMA or SRP initiator/target?) > > Paul Baxter > > ------------------- > Some testing notes: > > The windows server remotely inspects the Linux filesystem and does a 'remote > read' of large files (typical testing 1-4GB file) > > Using IPoIB/mthca and Win IB 1.2 - no particular tweaks i.e. 
32 kB NFS block > size > > Win-Linux > a) Using untweaked Linux NFS and built-in Windows NFS > Transfer rate 65MB/s > > b) Similar but using Samba on Linux and windows file sharing > Transfer rate 90 MB/s > > c) Repeat a) and b) using ram disks rather than physical disks (1GB > transfer) > Confirmed similar transfer rates ie physical disks not limiting this > > Presentations on winIB noted that IPoIB has to snoop each packet , so I > repeated test c) in a Linux-Linux configuration expecting much better > results... > > NFS performance over Ext2 formatted filesystem ~ 100MB/s (~ 73MB/s on Ext3 > with default (journalling on?) options) > Samba performance ~ 64MB/s > > Next tried having recipient of large file, copy it to /dev/null rather than > to a local file system. Reported transfer at 145MB/s > (We've also noted along the way that remote read and remote write > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Wed Aug 30 13:06:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Aug 2006 23:06:04 +0300 Subject: [openib-general] CMA oops In-Reply-To: <44F5E466.70206@ichips.intel.com> References: <44F5E466.70206@ichips.intel.com> Message-ID: <20060830200604.GA32183@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] CMA oops > > Michael S. Tsirkin wrote: > > Apparently, list->prev pointer in CMA id_priv structure is NULL > > which causes a crash in list_del. > > > > I note that rdma_destroy_id tests outside the mutex lock. > > Could that be the problem? > > The problem is not unfortunately easily reproducible. > > I think I see one bug, but it doesn't seem like its causing the crash that you saw. > > It's possible that address resolution can complete at the same time that > rdma_destroy_id() is called. The addr_handler() will cause the rdma_cm_id to > attach to a device while destroy is running, which can come after the check for > id_priv->cma_dev is made. The result is that destroy will not detach from the > device, leaving the rdma_cm_id in the device list after its destruction. > > I'm trying to come up with a fix for this, but I'm not convinced it's the > problem that you're seeing. Could be what you describe leads to a memory corruption. -- MST From rdreier at cisco.com Wed Aug 30 13:11:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 Aug 2006 13:11:19 -0700 Subject: [openib-general] [PATCH v5 2/2] iWARP Core Changes. 
In-Reply-To: <20060803210242.16228.39306.stgit@dell3.ogc.int> (Steve Wise's message of "Thu, 03 Aug 2006 16:02:42 -0500") References: <20060803210238.16228.47335.stgit@dell3.ogc.int> <20060803210242.16228.39306.stgit@dell3.ogc.int> Message-ID: While merging this, I uninlined rdma_node_get_transport, since I don't think there's any reason to make it inline: add/remove: 1/0 grow/shrink: 7/16 up/down: 65/-146 (-81) function old new delta rdma_node_get_transport - 33 +33 rdma_init_qp_attr 96 109 +13 rdma_resolve_route 612 620 +8 cma_add_one 241 245 +4 rdma_create_qp 302 305 +3 cma_rep_recv 94 96 +2 show_pkey 110 111 +1 rdma_reject 122 123 +1 rdma_accept 410 408 -2 cma_cleanup 52 50 -2 cma_acquire_dev 181 179 -2 rdma_connect 767 764 -3 rdma_listen 831 825 -6 rdma_disconnect 151 145 -6 ib_sa_add_one 455 446 -9 ib_mad_init_device 1393 1384 -9 ipoib_remove_one 160 150 -10 ib_ucm_add_one 377 365 -12 cma_destroy_listen 196 184 -12 cma_cancel_operation 258 246 -12 cm_add_one 446 434 -12 ipoib_add_one 766 753 -13 ib_umad_add_one 985 968 -17 rdma_destroy_id 322 303 -19 diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 06f98e9..8b5dd36 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -79,6 +79,23 @@ enum ib_rate mult_to_ib_rate(int mult) } EXPORT_SYMBOL(mult_to_ib_rate); +enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) +{ + switch (node_type) { + case RDMA_NODE_IB_CA: + case RDMA_NODE_IB_SWITCH: + case RDMA_NODE_IB_ROUTER: + return RDMA_TRANSPORT_IB; + case RDMA_NODE_RNIC: + return RDMA_TRANSPORT_IWARP; + default: + BUG(); + return 0; + } +} +EXPORT_SYMBOL(rdma_node_get_transport); + /* Protection domains */ struct ib_pd *ib_alloc_pd(struct ib_device *device) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 91b7338..905b44e 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -69,21 +69,8 @@ enum rdma_transport_type { RDMA_TRANSPORT_IWARP }; -static inline enum rdma_transport_type -rdma_node_get_transport(enum rdma_node_type node_type) -{ - switch (node_type) { - case RDMA_NODE_IB_CA: - case RDMA_NODE_IB_SWITCH: - case RDMA_NODE_IB_ROUTER: - return RDMA_TRANSPORT_IB; - case RDMA_NODE_RNIC: - return RDMA_TRANSPORT_IWARP; - default: - BUG(); - return 0; - } -} +enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) __attribute_const__; enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, From mshefty at ichips.intel.com Wed Aug 30 13:17:01 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 30 Aug 2006 13:17:01 -0700 Subject: [openib-general] CMA oops In-Reply-To: <20060830200604.GA32183@mellanox.co.il> References: <44F5E466.70206@ichips.intel.com> <20060830200604.GA32183@mellanox.co.il> Message-ID: <44F5F23D.9000901@ichips.intel.com> Michael S. Tsirkin wrote: >>I'm trying to come up with a fix for this, but I'm not convinced it's the >>problem that you're seeing. > > > Could be what you describe leads to a memory corruption. I believe so. If this were the cause of the crash, I would expect to see an issue with list->prev->prev or list->prev->next etc, not list->prev. I haven't been able to determine how list->prev could be NULL, but id_priv->cma_dev be set when cma_attach_to_dev() is called. 
It's true that the test for id_priv->cma_dev in rdma_destroy_id() isn't protected by a lock, but the lock around the call to cma_detach_from_dev() should ensure that cma_attach_to_dev() -- which sets id_priv->cma_dev -- completes before we detach. - Sean From rdreier at cisco.com Wed Aug 30 13:26:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 Aug 2006 13:26:04 -0700 Subject: [openib-general] File transfer performance options In-Reply-To: <1156968402.8973.35.camel@trinity.ogc.int> (Tom Tucker's message of "Wed, 30 Aug 2006 15:06:42 -0500") References: <007201c6cc68$5a735210$8000a8c0@blorp> <1156968402.8973.35.camel@trinity.ogc.int> Message-ID: > Are you familiar with NFSoRDMA? We get 600+MB/s on read and about > 150-200MB/s on write with an XFS filesystem on a stripped software raid > with four spindles. umm: > > [Whilst for a real Linux-Linux configuration I would look for the RDMA over > > NFS solution, this wouldn't translate to our eventual win-linux > > inter-operable system.] ;) From mshefty at ichips.intel.com Wed Aug 30 15:40:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 30 Aug 2006 15:40:41 -0700 Subject: [openib-general] [PATCH ] RFC IB/cm do not track remote QPN in timewait state In-Reply-To: <20060830175216.GB30879@mellanox.co.il> References: <44F4AB73.2070208@ichips.intel.com> <20060830045726.GA25478@mellanox.co.il> <44F5C9B7.8070408@ichips.intel.com> <20060830175216.GB30879@mellanox.co.il> Message-ID: <44F613E9.505@ichips.intel.com> Michael S. Tsirkin wrote: > It exposed a race in SDP. The patch itself does not lead to crashes - > I re-attach it here for reference. > As we discussed, this needs to be extended to handle DREQ retries > properly. I've committed this patch to SVN 9193. The CM should already handle DREQ retries properly; however... The CM timeout for a response can end up being close, or the same as the timewait time. (All of my test apps will result in these values being the same.) If a DREP is lost, the side that received the DREQ may enter and exit timewait before the DREQ is retried. In this situation, the DREQ gets dropped repeatedly. We will want to queue this patch for 2.6.19, if you can point Roland to your git tree. Acked-by: Sean Hefty From mshefty at ichips.intel.com Wed Aug 30 15:53:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 30 Aug 2006 15:53:27 -0700 Subject: [openib-general] [PATCH v5 2/2] iWARP Core Changes. In-Reply-To: References: <20060803210238.16228.47335.stgit@dell3.ogc.int> <20060803210242.16228.39306.stgit@dell3.ogc.int> Message-ID: <44F616E7.9020402@ichips.intel.com> Roland Dreier wrote: > While merging this, I uninlined rdma_node_get_transport, since I don't > think there's any reason to make it inline: I've committed the patch to svn to sync as well. - Sean From aafabbri at cisco.com Wed Aug 30 16:29:45 2006 From: aafabbri at cisco.com (Aaron Fabbri) Date: Wed, 30 Aug 2006 23:29:45 +0000 (UTC) Subject: [openib-general] Interrupt Threshold Rate equivalent in Infiniband NIC References: Message-ID: Roland Dreier cisco.com> writes: > > harish> Hi, The interruptThresholdRate module parameter allows you > harish> to control the maximum number of interrupts/sec for an > harish> e1000 Intel NIC for example. Is there an equivalent > harish> parameter for Infiniband NICs. I am using a Mellanox > harish> Infiniband NIC. Please let me know if you need any more > harish> information. > > There is no such equivalent for IB. > I agree there is no equivalent to a rate limiter. 
I do recall there is (or was) an interrupt timer that you can set when you burn the firmware on the Mellanox HCAs. IIRC, it could be used to limit the interrupt rate, but the way it is implemented it can add latency. If you don't care about latency you could try it out. Ask Mellanox for specifics. Aaron From panda at cse.ohio-state.edu Wed Aug 30 20:58:38 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Wed, 30 Aug 2006 23:58:38 -0400 (EDT) Subject: [openib-general] Announcing the release of MVAPICH2 0.9.5 with SRQ, integrated multi-rail and TotalView support Message-ID: <200608310358.k7V3wcud009682@xi.cse.ohio-state.edu> The MVAPICH team is pleased to announce the availability of MVAPICH2 0.9.5 with the following NEW features: - Shared Receive Queue (SRQ) and Adaptive RDMA support: These features reduce memory usage of the MPI library significantly to provide scalability without any degradation in performance. Performance of applications and memory scalability using SRQ and Adaptive RDMA support can be seen by visiting the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/perf-apps.html - Integrated multi-rail communication support for both two-sided and one-sided operations - Multiple queue pairs per port - Multiple ports per adapter - Multiple adapters - Support for TotalView debugger - Auto-detection of Architecture and InfiniBand adapters More details on all features and supported platforms can be obtained by visiting the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich2_features.html MVAPICH2 0.9.5 continues to deliver excellent performance. Sample performance numbers include: - OpenIB/Gen2 on EM64T with PCI-Ex and IBA-DDR: Two-sided operations: - 2.97 microsec one-way latency (4 bytes) - 1478 MB/sec unidirectional bandwidth - 2658 MB/sec bidirectional bandwidth One-sided operations: - 5.08 microsec Put latency - 1484 MB/sec unidirectional Put bandwidth - 2658 MB/sec bidirectional Put bandwidth - OpenIB/Gen2 on EM64T with PCI-Ex and IBA-DDR (Dual-rail): Two-sided operations: - 3.01 microsec one-way latency (4 bytes) - 2346 MB/sec unidirectional bandwidth - 2779 MB/sec bidirectional bandwidth One-sided operations: - 4.70 microsec Put latency - 2389 MB/sec unidirectional Put bandwidth - 2779 MB/sec bidirectional Put bandwidth - OpenIB/Gen2 on Opteron with PCI-Ex and IBA-DDR: Two-sided operations: - 2.71 microsec one-way latency (4 bytes) - 1411 MB/sec unidirectional bandwidth - 2238 MB/sec bidirectional bandwidth One-sided operations: - 4.28 microsec Put latency - 1411 MB/sec unidirectional Put bandwidth - 2238 MB/sec bidirectional Put bandwidth - Solaris uDAPL/IBTL on Opteron with PCI-Ex and IBA-SDR: Two-sided operations: - 4.81 microsec one-way latency (4 bytes) - 981 MB/sec unidirectional bandwidth - 1903 MB/sec bidirectional bandwidth One-sided operations: - 7.49 microsec Put latency - 981 MB/sec unidirectional Put bandwidth - 1903 MB/sec bidirectional Put bandwidth - OpenIB/Gen2 uDAPL on EM64T with PCI-Ex and IBA-SDR: Two-sided operations: - 3.56 microsec one-way latency (4 bytes) - 964 MB/sec unidirectional bandwidth - 1846 MB/sec bidirectional bandwidth One-sided operations: - 6.85 microsec Put latency - 964 MB/sec unidirectional Put bandwidth - 1846 MB/sec bidirectional Put bandwidth - OpenIB/Gen2 uDAPL on EM64T with PCI-Ex and IBA-DDR: Two-sided operations: - 3.18 microsec one-way latency (4 bytes) - 1484 MB/sec unidirectional bandwidth - 2635 MB/sec bidirectional bandwidth One-sided operations: - 5.41 microsec Put latency 
- 1485 MB/sec unidirectional Put bandwidth - 2635 MB/sec bidirectional Put bandwidth Performance numbers for all other platforms, system configurations and operations can be viewed by visiting the `Performance' section of the project's web page. With the ADI-3-level design, MVAPICH2 0.9.5 delivers similar performance for two-sided operations compared to MVAPICH 0.9.8. A performance comparison between MVAPICH2 0.9.5 and MVAPICH 0.9.8 for sample applications can be seen by visiting the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/perf-apps.html Organizations and users interested in getting the best performance for both two-sided and one-sided operations, and in exploiting the `multi-threading' and `integrated multi-rail' capabilities, may migrate from the MVAPICH code base to the MVAPICH2 code base. For downloading the MVAPICH2 0.9.5 package and accessing the anonymous SVN, please visit the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ A stripped-down version of this release is also available at the OpenIB SVN. All feedback, including bug reports and hints for performance tuning, is welcome. Please post it to the mvapich-discuss mailing list. Thanks, MVAPICH Team at OSU/NBCL ====================================================================== The MVAPICH/MVAPICH2 project is currently supported with funding from the U.S. National Science Foundation, U.S. DOE Office of Science, Mellanox, Intel, Cisco Systems, Sun Microsystems and Linux Networx; and with equipment support from Advanced Clustering, AMD, Apple, Appro, Dell, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm and Sun Microsystems. Other technology partners include Etnus. ======================================================================
From johnt1johnt2 at gmail.com Wed Aug 30 23:26:01 2006 From: johnt1johnt2 at gmail.com (john t) Date: Thu, 31 Aug 2006 11:56:01 +0530 Subject: [openib-general] ibv_poll_cq In-Reply-To: <44F5A063.4020502@dev.mellanox.co.il> References: <44F5A063.4020502@dev.mellanox.co.il> Message-ID: Hi Dotan Is there a way to know if the two QPs (local and remote) are in sync or to wait for them to get in sync and then do the data transfer? I think in my case it is more like one QP is sending the message but the other end (receiver) is not in RTR state at that time (since sender and receiver are implemented as threads, maybe the receiver thread on the other machine is getting scheduled very late). Is there a way where I can specify infinite retry_count/timeout or find out if remote QP is in RTR state (or error state) and only then do the actual data transfer? Regards, John On 8/30/06, Dotan Barak wrote: > > Hi. > > john t wrote: > > Hi, > > > > In one of my multi-threaded applications (simple send/recv application > > written using uverbs), I am repeatedly getting an error code 12 > > (IB_WC_RETRY_EXC_ERR) from "ibv_poll_cq". Not able to figure out > > what is going wrong. Can someone please give a suggestion so that I > > can investigate on those lines. > > > > Also, is there an error handling mechanism in IB, for ex: in the above > > case what should I do in order to correct the problem. > This completion status means that the remote side of the QP is not > sending any response (ack/nack/anything ...) > You can have this completion if one of the following scenarios occurs: > * a QP tries to send a message to a remote QP which is not ready (not in > at least RTR state) > * a QP tries to send a message to a remote QP which is being closed (or > in error state) > * the QP parameters are not the same as the remote QP parameters (for > example: if the PSNs are not configured with good values, > the messages may be silently dropped) > > > I suggest to: > sync between the 2 sides before starting to work with the QPs > sync between the 2 sides before stopping work with the QPs > You can increase the retry_cnt / timeout attribute values in the > QP context > > you should make sure that the timeout value is not 0 (zero). > > > Dotan > -------------- next part -------------- An HTML attachment was scrubbed... URL:
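[A short illustrative aside on the retry_cnt / timeout attributes mentioned in the reply quoted above: in libibverbs these are supplied when an RC QP is moved from RTR to RTS, much as the ibv_rc_pingpong example does. This is only a sketch; the helper name, the my_psn argument and the numeric values are examples, not something taken from this thread.]

#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: move an RC QP (already in RTR) to RTS, setting the ACK timeout
 * and the retry counters that govern IBV_WC_RETRY_EXC_ERR behaviour. */
static int move_qp_to_rts(struct ibv_qp *qp, uint32_t my_psn)
{
        struct ibv_qp_attr attr = {
                .qp_state      = IBV_QPS_RTS,
                .timeout       = 14,    /* local ACK timeout = 4.096 us * 2^14; keep non-zero, as advised above */
                .retry_cnt     = 7,     /* transport retries: 7 is the maximum, not infinite */
                .rnr_retry     = 7,     /* RNR retries: 7 does mean "retry forever" */
                .sq_psn        = my_psn,
                .max_rd_atomic = 1,
        };

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE      |
                             IBV_QP_TIMEOUT    |
                             IBV_QP_RETRY_CNT  |
                             IBV_QP_RNR_RETRY  |
                             IBV_QP_SQ_PSN     |
                             IBV_QP_MAX_QP_RD_ATOMIC);
}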
From dotanb at dev.mellanox.co.il Thu Aug 31 00:12:52 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 31 Aug 2006 10:12:52 +0300 Subject: [openib-general] ibv_poll_cq In-Reply-To: References: <44F5A063.4020502@dev.mellanox.co.il> Message-ID: <44F68BF4.3050600@dev.mellanox.co.il> Hi. john t wrote: > Hi Dotan > > Is there a way to know if the two QPs (local and remote) are in sync > or to wait for them to get in sync and then do the data transfer? > > I think in my case it is more like one QP is sending the message but > the other end (receiver) is not in RTR state at that time (since > sender and receiver are implemented as threads, maybe the receiver thread > on the other machine is getting scheduled very late). > > Is there a way where I can specify infinite retry_count/timeout or > find out if remote QP is in RTR state (or error state) and only then > do the actual data transfer? > Sorry, but the answer is no: there isn't any way for a local QP to know the state of the remote QP. This is exactly the role of the CM: to sync between the two QPs and to move the various attributes between the two sides. How do you connect the two QPs? (Are you using the CM or socket-based communication?) Dotan
From sunillp at gmail.com Thu Aug 31 00:33:15 2006 From: sunillp at gmail.com (Sunil Patil) Date: Thu, 31 Aug 2006 13:03:15 +0530 Subject: [openib-general] ibv_poll_cq In-Reply-To: <44F68BF4.3050600@dev.mellanox.co.il> References: <44F5A063.4020502@dev.mellanox.co.il> <44F68BF4.3050600@dev.mellanox.co.il> Message-ID: <4fb5e0640608310033t2e20d773h35f183b0ad891b52@mail.gmail.com> I am using socket-based communication for exchanging initial information such as lid, qpn, psn, in fact, more or less the same code that is there in the examples. Is there any CM-based example that I can look at? Regards, John On 8/31/06, Dotan Barak wrote: > > Hi. > > john t wrote: > > Hi Dotan > > > > Is there a way to know if the two QPs (local and remote) are in sync > > or to wait for them to get in sync and then do the data transfer? > > > > I think in my case it is more like one QP is sending the message but > > the other end (receiver) is not in RTR state at that time (since > > sender and receiver are implemented as threads, maybe the receiver thread > > on the other machine is getting scheduled very late). > > > > Is there a way where I can specify infinite retry_count/timeout or > > find out if remote QP is in RTR state (or error state) and only then > > do the actual data transfer? > > > Sorry, but the answer is no: there isn't any way for a local QP to know > the state of the remote QP. > This is exactly the role of the CM: to sync between the two QPs and to > move the various attributes between the two sides. > > How do you connect the two QPs? > (Are you using the CM or socket-based communication?) > > Dotan > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From dotanb at dev.mellanox.co.il Thu Aug 31 01:20:22 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 31 Aug 2006 11:20:22 +0300 Subject: [openib-general] ibv_poll_cq In-Reply-To: <4fb5e0640608310033t2e20d773h35f183b0ad891b52@mail.gmail.com> References: <44F5A063.4020502@dev.mellanox.co.il> <44F68BF4.3050600@dev.mellanox.co.il> <4fb5e0640608310033t2e20d773h35f183b0ad891b52@mail.gmail.com> Message-ID: <44F69BC6.1060307@dev.mellanox.co.il> Sunil Patil wrote: > I am using socket-based communication for exchanging initial > information such as lid, qpn, psn, in fact, more or less the same code > that is there in the examples. Is there any CM-based example that I > can look at? > > Regards, > John in: https://openib.org/svn/gen2/trunk/src/userspace/libibcm/examples you can find some libibcm examples. Anyway, if you are using sockets you should sync between the two sides before you use the QPs (sync between them after they are both in at least the RTR state). Dotan
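[As an illustration of the "sync between the two sides" advice above, assuming both peers still hold the connected TCP socket they used to exchange lid/qpn/psn: each side would call something like the following once its own QP has reached RTR (or RTS), so that neither side posts sends before the other is ready. The helper name and the token byte are made up for this sketch.]

#include <unistd.h>

/* Sketch: a symmetric one-byte barrier over an already-connected socket.
 * Both sides write first and then read, so the two calls cannot deadlock. */
static int sync_ready(int sockfd)
{
        char token = 'R';

        if (write(sockfd, &token, 1) != 1)      /* announce "my QP is ready" */
                return -1;
        if (read(sockfd, &token, 1) != 1)       /* wait for the peer's announcement */
                return -1;
        return 0;
}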
From dotanb at dev.mellanox.co.il Thu Aug 31 01:42:58 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 31 Aug 2006 11:42:58 +0300 Subject: [openib-general] [PATCH] libibcm: Need to include stddef.h in cm.c for SLES10 compilations In-Reply-To: <44F473C6.9040204@ichips.intel.com> References: <200608291124.45816.jackm@mellanox.co.il> <44F473C6.9040204@ichips.intel.com> Message-ID: <44F6A112.2050702@dev.mellanox.co.il> Sean Hefty wrote: > Jack Morgenstein wrote: > >> Fix compilation on SLES10: >> cm.c uses offsetof, so it must include stddef.h >> > > Thanks - committed in 9150. > > I checked this libibcm with a multithreaded test (qp_test) and it is working with no problems. thanks Dotan
From cganapathi at novell.com Wed Aug 30 23:09:14 2006 From: cganapathi at novell.com (CH Ganapathi) Date: Thu, 31 Aug 2006 00:09:14 -0600 Subject: [openib-general] [PATCH] IB/perftest: Fix get_median, size of delta, usage(), worst latency Message-ID: <44F6CA63.6C2D.007B.0@novell.com> Hi, 1) When iters (exchanges) is even, delta has an odd no. of elements, and when iters is odd, delta has an even no. of elements. Hence, when (iters - 1) is passed, get_median() uses incorrect indexes to find the median. For example: When iters = 2, get_median returns median = (delta[0] + delta[-1])/2 when it should have been median = delta[0]. When iters = 3, get_median returns median = delta[1] when actually it should have been median = (delta[0] + delta[1])/2. 2) The array delta requires only (iters - 1) elements to be allocated. 3) Worst latency is delta[iters - 2] in read_lat.c, not delta[iters - 3]. 4) usage() in write_bw.c incorrectly states the default number of exchanges as 1000. Thanks, Ganapathi Novell Inc. The following patch includes: o Fix get_median. o Change usage() in write_bw.c to match the actual default of exchanges. o Fix worst latency in read_lat.c. o Allocate only the necessary (iters - 1) elements for delta.
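[To make point (1) above concrete, here is a small standalone version of the corrected median selection, equivalent to the "(n - 1) % 2" test in the patch that follows. cycles_t is assumed to be an unsigned 64-bit counter as in the perftest headers, n is the number of valid samples (iters - 1), and delta[] is assumed to be sorted; this is an illustration, not the perftest source itself.]

#include <stdio.h>

typedef unsigned long long cycles_t;

/* Median of the n valid entries of delta[]: average the two middle samples
 * when n is even, take the single middle sample when n is odd. */
static cycles_t get_median(int n, cycles_t delta[])
{
        if (n % 2 == 0)         /* same as "if ((n - 1) % 2)" for n >= 1 */
                return (delta[n / 2] + delta[n / 2 - 1]) / 2;
        else
                return delta[n / 2];
}

int main(void)
{
        cycles_t one_sample[]  = { 10 };        /* iters = 2 -> n = 1 -> median = 10, no delta[-1] access */
        cycles_t two_samples[] = { 10, 20 };    /* iters = 3 -> n = 2 -> median = (10 + 20) / 2 = 15 */

        printf("%llu %llu\n", get_median(1, one_sample), get_median(2, two_samples));
        return 0;
}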
Signed-off-by: Ganapathi CH Index: userspace/perftest/read_lat.c =================================================================== --- userspace/perftest/read_lat.c (revision 9196) +++ userspace/perftest/read_lat.c (working copy) @@ -568,7 +568,7 @@ */ static inline cycles_t get_median(int n, cycles_t delta[]) { - if (n % 2) + if ((n - 1) % 2) return(delta[n / 2] + delta[n / 2 - 1]) / 2; else return delta[n / 2]; @@ -591,7 +591,7 @@ cycles_t median; unsigned int i; const char* units; - cycles_t *delta = malloc(iters * sizeof *delta); + cycles_t *delta = malloc((iters - 1) * sizeof *delta); if (!delta) { perror("malloc"); @@ -627,7 +627,7 @@ median = get_median(iters - 1, delta); printf("%7d %d %7.2f %7.2f %7.2f\n", size,iters,delta[0] / cycles_to_units , - delta[iters - 3] / cycles_to_units ,median / cycles_to_units ); + delta[iters - 2] / cycles_to_units ,median / cycles_to_units ); free(delta); } Index: userspace/perftest/write_bw.c =================================================================== --- userspace/perftest/write_bw.c (revision 9196) +++ userspace/perftest/write_bw.c (working copy) @@ -509,7 +509,7 @@ printf(" -s, --size= size of message to exchange (default 65536)\n"); printf(" -a, --all Run sizes from 2 till 2^23\n"); printf(" -t, --tx-depth= size of tx queue (default 100)\n"); - printf(" -n, --iters= number of exchanges (at least 2, default 1000)\n"); + printf(" -n, --iters= number of exchanges (at least 2, default 5000)\n"); printf(" -b, --bidirectional measure bidirectional bandwidth (default unidirectional)\n"); printf(" -V, --version display version number\n"); } Index: userspace/perftest/rdma_lat.c =================================================================== --- userspace/perftest/rdma_lat.c (revision 9196) +++ userspace/perftest/rdma_lat.c (working copy) @@ -516,7 +516,7 @@ */ static inline cycles_t get_median(int n, cycles_t delta[]) { - if (n % 2) + if ((n - 1) % 2) return (delta[n / 2] + delta[n / 2 - 1]) / 2; else return delta[n / 2]; @@ -538,7 +538,7 @@ cycles_t median; unsigned int i; const char* units; - cycles_t *delta = malloc(iters * sizeof *delta); + cycles_t *delta = malloc((iters - 1) * sizeof *delta); if (!delta) { perror("malloc"); Index: userspace/perftest/send_lat.c =================================================================== --- userspace/perftest/send_lat.c (revision 9196) +++ userspace/perftest/send_lat.c (working copy) @@ -678,7 +678,7 @@ */ static inline cycles_t get_median(int n, cycles_t delta[]) { - if (n % 2) + if ((n - 1) % 2) return(delta[n / 2] + delta[n / 2 - 1]) / 2; else return delta[n / 2]; @@ -701,7 +701,7 @@ cycles_t median; unsigned int i; const char* units; - cycles_t *delta = malloc(iters * sizeof *delta); + cycles_t *delta = malloc((iters - 1) * sizeof *delta); if (!delta) { perror("malloc"); Index: userspace/perftest/write_lat.c =================================================================== --- userspace/perftest/write_lat.c (revision 9196) +++ userspace/perftest/write_lat.c (working copy) @@ -579,7 +579,7 @@ */ static inline cycles_t get_median(int n, cycles_t delta[]) { - if (n % 2) + if ((n - 1) % 2) return(delta[n / 2] + delta[n / 2 - 1]) / 2; else return delta[n / 2]; @@ -602,7 +602,7 @@ cycles_t median; unsigned int i; const char* units; - cycles_t *delta = malloc(iters * sizeof *delta); + cycles_t *delta = malloc((iters - 1) * sizeof *delta); if (!delta) { perror("malloc"); From mst at mellanox.co.il Thu Aug 31 02:29:57 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 31 Aug 2006 12:29:57 +0300 Subject: [openib-general] [PATCH] IB/perftest: Fix get_median, size of delta, usage(), worst latency In-Reply-To: <44F6CA63.6C2D.007B.0@novell.com> References: <44F6CA63.6C2D.007B.0@novell.com> Message-ID: <20060831092957.GB32087@mellanox.co.il> 3) Worst latency is delta[iters - 2] in read_lat.c, not delta[iters - 3]. >> could you explain this last bit please? -- MST From mst at mellanox.co.il Thu Aug 31 04:13:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 Aug 2006 14:13:54 +0300 Subject: [openib-general] [PATCH] perftest: enhancement to rdma_lat to allow use of RDMA CM In-Reply-To: <20060817053013.GA16205@harry-potter.in.ibm.com> References: <20060817053013.GA16205@harry-potter.in.ibm.com> Message-ID: <20060831111354.GC32087@mellanox.co.il> Quoting r. Pradipta Kumar Banerjee : > Subject: [PATCH] perftest: enhancement to rdma_lat to allow use of RDMA CM > > Hi Michael, > This patch contains changes to the rdma_lat.c to allow use of RDMA CM. > This has been successfully tested with Ammasso iWARP cards, IBM eHCA and mthca IB > cards. > > Summary of changes > > # Added an option (-c|--cma) to enable use of RDMA CM > # Added a new structure (struct pp_data) containing the user parameters as well > as other data required by most of the routines. This makes it convenient to > pass the parameters between various routines. > # Outputs to stdout/stderr are prefixed with the process-id. This helps to > sort the output when multiple servers/clients are run from the same machine > > Signed-off-by: Pradipta Kumar Banerjee Thanks, applied. -- MST From mst at mellanox.co.il Thu Aug 31 04:16:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 Aug 2006 14:16:43 +0300 Subject: [openib-general] [PATCH] IB/perftest: Fix get_median, size of delta, usage(), worst latency In-Reply-To: <44F6CA63.6C2D.007B.0@novell.com> References: <44F6CA63.6C2D.007B.0@novell.com> Message-ID: <20060831111643.GD32087@mellanox.co.il> Quoting r. CH Ganapathi : > o Fix get_median. > o Change usage() in write_bw.c to match the actual default of exchanges. > o Fix worst latency in read_lat.c. > o Allocate only the necessary (iters - 1) elements for delta. > > Signed-off-by: Ganapathi CH Thanks, applied. -- MST From mst at mellanox.co.il Thu Aug 31 06:03:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 Aug 2006 16:03:36 +0300 Subject: [openib-general] [PATCH] IB/cm: do not track remote QPN in timewait state Message-ID: <20060831130335.GA1006@mellanox.co.il> Roland, please queue for 2.6.19. --- IB/cm: fix spurious rejects with bogus stale connection syndrome. CM should not track remote QPN in TimeWait, since QP is not connected. Signed-off-by: Michael S. Tsirkin Acked-by: Sean Hefty diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index f85c97f..e270311 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -679,6 +679,8 @@ static void cm_enter_timewait(struct cm_ { int wait_time; + cm_cleanup_timewait(cm_id_priv->timewait_info); + /* * The cm_id could be destroyed by the user before we exit timewait. * To protect against this, we search for the cm_id after exiting -- MST From mst at mellanox.co.il Thu Aug 31 06:10:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 Aug 2006 16:10:01 +0300 Subject: [openib-general] lockdep warnings Message-ID: <20060831131001.GB1006@mellanox.co.il> Hi, Roland! 
I got a load of lockdep warnings after loading all modules and configuring ipoib. This doesn't usually happen, not sure what I changed this time. I'm a bit too busy this week - could you take a look at the log, please? Attached. -- MST -------------- next part -------------- Bootdata ok (command line is auto BOOT_IMAGE=2.6.18-rc5-gdc7 ro root=806 console=ttyS0,115200n8 console=tty0) Linux version 2.6.18-rc5-gdc709bd1 (root at sw129.yok.mtl.com) (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP Thu Aug 31 15:32:17 IDT 2006 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 0000000000097800 (usable) BIOS-e820: 0000000000097800 - 00000000000a0000 (reserved) BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 00000000bff70000 (usable) BIOS-e820: 00000000bff70000 - 00000000bff78000 (ACPI data) BIOS-e820: 00000000bff78000 - 00000000bff80000 (ACPI NVS) BIOS-e820: 00000000bff80000 - 00000000c0000000 (reserved) BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved) BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved) BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved) DMI present. ACPI: RSDP (v000 PTLTD ) @ 0x00000000000f5da0 ACPI: RSDT (v001 PTLTD RSDT 0x06040000 LTP 0x00000000) @ 0x00000000bff73ce0 ACPI: FADT (v001 INTEL LINDHRST 0x06040000 PTL 0x00000003) @ 0x00000000bff77e2c ACPI: MADT (v001 PTLTD APIC 0x06040000 LTP 0x00000000) @ 0x00000000bff77ea0 ACPI: BOOT (v001 PTLTD $SBFTBL$ 0x06040000 LTP 0x00000001) @ 0x00000000bff77f48 ACPI: SPCR (v001 PTLTD $UCRTBL$ 0x06040000 PTL 0x00000001) @ 0x00000000bff77f70 ACPI: MCFG (v001 PTLTD MCFG 0x06040000 LTP 0x00000000) @ 0x00000000bff77fc0 ACPI: SSDT (v001 PmRef CpuPm 0x00003000 INTL 0x20030224) @ 0x00000000bff73d1c ACPI: DSDT (v001 Intel LINDHRST 0x06040000 MSFT 0x0100000e) @ 0x0000000000000000 No NUMA configuration found Faking a node at 0000000000000000-00000000bff70000 Bootmem setup node 0 0000000000000000-00000000bff70000 On node 0 totalpages: 766620 DMA zone: 1232 pages, LIFO batch:0 DMA32 zone: 765388 pages, LIFO batch:31 ACPI: PM-Timer IO Port: 0x1008 ACPI: Local APIC address 0xfee00000 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) Processor #0 15:4 APIC version 20 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x06] enabled) Processor #6 15:4 APIC version 20 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) Processor #1 15:4 APIC version 20 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x07] enabled) Processor #7 15:4 APIC version 20 ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1]) ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23 ACPI: IOAPIC (id[0x03] address[0xfec10000] gsi_base[24]) IOAPIC[1]: apic_id 3, version 32, address 0xfec10000, GSI 24-47 ACPI: IOAPIC (id[0x04] address[0xfec80000] gsi_base[48]) IOAPIC[2]: apic_id 4, version 32, address 0xfec80000, GSI 48-71 ACPI: IOAPIC (id[0x05] address[0xfec80400] gsi_base[72]) IOAPIC[3]: apic_id 5, version 32, address 0xfec80400, GSI 72-95 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: IRQ0 used by override. ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. 
Setting APIC routing to physical flat Using ACPI (MADT) for SMP configuration information Allocating PCI resources starting at c2000000 (gap: c0000000:20000000) SMP: Allowing 4 CPUs, 0 hotplug CPUs Built 1 zonelists. Total pages: 766620 Kernel command line: auto BOOT_IMAGE=2.6.18-rc5-gdc7 ro root=806 console=ttyS0,115200n8 console=tty0 Initializing CPU#0 PID hash table entries: 4096 (order: 12, 32768 bytes) time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer. time.c: Detected 3400.204 MHz processor. Console: colour VGA+ 80x25 Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar ... MAX_LOCKDEP_SUBCLASSES: 8 ... MAX_LOCK_DEPTH: 30 ... MAX_LOCKDEP_KEYS: 2048 ... CLASSHASH_SIZE: 1024 ... MAX_LOCKDEP_ENTRIES: 8192 ... MAX_LOCKDEP_CHAINS: 8192 ... CHAINHASH_SIZE: 4096 memory used by lock dependency info: 1120 kB per task-struct memory footprint: 1680 bytes Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes) Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes) Checking aperture... Memory: 3059624k/3145152k available (3378k kernel code, 85108k reserved, 2328k data, 216k init) Calibrating delay using timer specific routine.. 6808.44 BogoMIPS (lpj=13616891) Mount-cache hash table entries: 256 CPU: Trace cache: 12K uops, L1 D cache: 16K CPU: L2 cache: 1024K using mwait in idle threads. CPU: Physical Processor ID: 0 CPU: Processor Core ID: 0 CPU0: Thermal monitoring enabled (TM1) lockdep: not fixing up alternatives. ACPI: Core revision 20060707 Using local APIC timer interrupts. result 12500709 Detected 12.500 MHz APIC timer. lockdep: not fixing up alternatives. Booting processor 1/4 APIC 0x6 Initializing CPU#1 Calibrating delay using timer specific routine.. 6800.83 BogoMIPS (lpj=13601676) CPU: Trace cache: 12K uops, L1 D cache: 16K CPU: L2 cache: 1024K CPU: Physical Processor ID: 3 CPU: Processor Core ID: 0 CPU1: Thermal monitoring enabled (TM1) Intel(R) Xeon(TM) CPU 3.40GHz stepping 01 lockdep: not fixing up alternatives. Booting processor 2/4 APIC 0x1 Initializing CPU#2 Calibrating delay using timer specific routine.. 6801.05 BogoMIPS (lpj=13602108) CPU: Trace cache: 12K uops, L1 D cache: 16K CPU: L2 cache: 1024K CPU: Physical Processor ID: 0 CPU: Processor Core ID: 0 CPU2: Thermal monitoring enabled (TM1) Intel(R) Xeon(TM) CPU 3.40GHz stepping 01 lockdep: not fixing up alternatives. Booting processor 3/4 APIC 0x7 Initializing CPU#3 Calibrating delay using timer specific routine.. 6800.84 BogoMIPS (lpj=13601696) CPU: Trace cache: 12K uops, L1 D cache: 16K CPU: L2 cache: 1024K CPU: Physical Processor ID: 3 CPU: Processor Core ID: 0 CPU3: Thermal monitoring enabled (TM1) Intel(R) Xeon(TM) CPU 3.40GHz stepping 01 Brought up 4 CPUs testing NMI watchdog ... OK. migration_cost=2,734 checking if image is initramfs... 
it is Freeing initrd memory: 744k freed NET: Registered protocol family 16 ACPI: bus type pci registered PCI: Using MMCONFIG at e0000000 PCI: No mmconfig possible on device 7:1 ACPI: Interpreter enabled ACPI: Using IOAPIC for interrupt routing ACPI: PCI Root Bridge [PCI0] (0000:00) PCI: Probing PCI hardware (bus 00) PCI quirk: region 1000-107f claimed by ICH4 ACPI/GPIO/TCO PCI quirk: region 1180-11bf claimed by ICH4 GPIO PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.2 PCI: PXH quirk detected, disabling MSI for SHPC device PCI: PXH quirk detected, disabling MSI for SHPC device Boot video device is 0000:07:01.0 PCI: Transparent bridge - 0000:00:1e.0 ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PEX0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PEX0.PXH0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PEX0.PXH1._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PEY0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PEZ0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCIX._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCIB._PRT] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 *5 6 7 10 11 14 15) ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *10 11 14 15) ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 *10 11 14 15) ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 10 *11 14 15) ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11 14 15) *0, disabled. ACPI: PCI Interrupt Link [LNKF] (IRQs 4 5 6 7 10 11 14 15) *0, disabled. ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 *10 11 14 15) ACPI: PCI Interrupt Link [LNKH] (IRQs 4 5 6 7 10 *11 14 15) Intel 82802 RNG detected SCSI subsystem initialized usbcore: registered new driver usbfs usbcore: registered new driver hub PCI: Using ACPI for IRQ routing PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report PCI-GART: No AMD northbridge found. PCI: Bridge: 0000:01:00.0 IO window: 2000-2fff MEM window: d0200000-d02fffff PREFETCH window: disabled. PCI: Bridge: 0000:01:00.2 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:02.0 IO window: 2000-2fff MEM window: d0100000-d02fffff PREFETCH window: disabled. PCI: Bridge: 0000:00:04.0 IO window: disabled. MEM window: d0300000-d03fffff PREFETCH window: d2800000-dfffffff PCI: Bridge: 0000:00:06.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:1c.0 IO window: 3000-3fff MEM window: d0400000-d04fffff PREFETCH window: disabled. 
PCI: Bridge: 0000:00:1e.0 IO window: 4000-4fff MEM window: d0500000-d1ffffff PREFETCH window: c2000000-c20fffff GSI 16 sharing vector 0xA9 and IRQ 16 ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 169 PCI: Setting latency timer of device 0000:00:02.0 to 64 PCI: Setting latency timer of device 0000:01:00.0 to 64 PCI: Setting latency timer of device 0000:01:00.2 to 64 ACPI: PCI Interrupt 0000:00:04.0[A] -> GSI 16 (level, low) -> IRQ 169 PCI: Setting latency timer of device 0000:00:04.0 to 64 ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 16 (level, low) -> IRQ 169 PCI: Setting latency timer of device 0000:00:06.0 to 64 PCI: Setting latency timer of device 0000:00:1e.0 to 64 NET: Registered protocol family 2 IP route cache hash table entries: 131072 (order: 8, 1048576 bytes) TCP established hash table entries: 65536 (order: 9, 3670016 bytes) TCP bind hash table entries: 32768 (order: 8, 1835008 bytes) TCP: Hash tables configured (established 65536 bind 32768) TCP reno registered Simple Boot Flag at 0x39 set to 0x80 Total HugeTLB memory allocated, 0 Installing knfsd (copyright (C) 1996 okir at monad.swb.de). io scheduler noop registered io scheduler deadline registered io scheduler cfq registered (default) PCI: Setting latency timer of device 0000:00:02.0 to 64 Allocate Port Service[0000:00:02.0:pcie00] Allocate Port Service[0000:00:02.0:pcie01] PCI: Setting latency timer of device 0000:00:04.0 to 64 Allocate Port Service[0000:00:04.0:pcie00] Allocate Port Service[0000:00:04.0:pcie01] PCI: Setting latency timer of device 0000:00:06.0 to 64 Allocate Port Service[0000:00:06.0:pcie00] Allocate Port Service[0000:00:06.0:pcie01] ACPI: Power Button (FF) [PWRF] ACPI: Power Button (CM) [PWRB] Real Time Clock Driver v1.12ac Software Watchdog Timer: 0.07 initialized. soft_noboot=0 soft_margin=60 sec (nowayout= 0) Linux agpgart interface v0.101 (c) Dave Jones Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A Floppy drive(s): fd0 is 1.44M FDC 0 is a post-1991 82077 RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize loop: loaded (max 8 devices) Intel(R) PRO/1000 Network Driver - version 7.1.9-k4 Copyright (c) 1999-2006 Intel Corporation. GSI 17 sharing vector 0xB1 and IRQ 17 ACPI: PCI Interrupt 0000:06:01.0[A] -> GSI 24 (level, low) -> IRQ 177 e1000: 0000:06:01.0: e1000_probe: (PCI:66MHz:32-bit) 00:30:48:74:65:02 e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection GSI 18 sharing vector 0xB9 and IRQ 18 ACPI: PCI Interrupt 0000:06:02.0[A] -> GSI 25 (level, low) -> IRQ 185 e1000: 0000:06:02.0: e1000_probe: (PCI:66MHz:32-bit) 00:30:48:74:65:03 e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection e100: Intel(R) PRO/100 Network Driver, 3.5.10-k2-NAPI e100: Copyright(c) 1999-2005 Intel Corporation forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.56. tun: Universal TUN/TAP device driver, 1.6 tun: (C) 1999-2004 Max Krasnyansky netconsole: not configured, aborting Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx ide0: I/O resource 0x1F0-0x1F7 not free. ide0: ports already in use, skipping probe Probing IDE interface ide1... 
hdc: CD-224E, ATAPI CD/DVD-ROM drive ide1 at 0x170-0x177,0x376 on irq 15 hdc: ATAPI 24X CD-ROM drive, 128kB Cache Uniform CD-ROM driver Revision: 3.20 megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006) megaraid: 2.20.4.9 (Release Date: Sun Jul 16 12:27:22 EST 2006) megasas: 00.00.03.01 Sun May 14 22:49:52 PDT 2006 libata version 2.00 loaded. ata_piix 0000:00:1f.2: version 2.00 ata_piix 0000:00:1f.2: MAP [ P0 P1 IDE IDE ] GSI 19 sharing vector 0xC1 and IRQ 19 ACPI: PCI Interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 193 ata: 0x170 IDE port busy PCI: Setting latency timer of device 0000:00:1f.2 to 64 ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x1470 irq 14 scsi0 : ata_piix ata1.00: ATA-6, max UDMA/133, 234441648 sectors: LBA48 ata1.00: ata1: dev 0 multi count 16 ata1.00: configured for UDMA/133 Vendor: ATA Model: WDC WD1200SD-01K Rev: 08.0 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: drive cache: write back SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: drive cache: write back sda: sda1 sda2 sda3 sda4 < sda5 sda6 > sd 0:0:0:0: Attached scsi disk sda Fusion MPT base driver 3.04.01 Copyright (c) 1999-2005 LSI Logic Corporation Fusion MPT SPI Host driver 3.04.01 ieee1394: raw1394: /dev/raw1394 device initialized GSI 20 sharing vector 0xC9 and IRQ 20 ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 23 (level, low) -> IRQ 201 PCI: Setting latency timer of device 0000:00:1d.7 to 64 ehci_hcd 0000:00:1d.7: EHCI Host Controller ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 1 ehci_hcd 0000:00:1d.7: debug port 1 PCI: cache line size of 128 is not supported by device 0000:00:1d.7 ehci_hcd 0000:00:1d.7: irq 201, io mem 0xd0000400 ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004 usb usb1: configuration #1 chosen from 1 choice hub 1-0:1.0: USB hub found hub 1-0:1.0: 4 ports detected ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI) USB Universal Host Controller Interface driver v3.0 ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 169 PCI: Setting latency timer of device 0000:00:1d.0 to 64 uhci_hcd 0000:00:1d.0: UHCI Host Controller uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 2 uhci_hcd 0000:00:1d.0: irq 169, io base 0x00001400 usb usb2: configuration #1 chosen from 1 choice hub 2-0:1.0: USB hub found hub 2-0:1.0: 2 ports detected GSI 21 sharing vector 0xD1 and IRQ 21 ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 209 PCI: Setting latency timer of device 0000:00:1d.1 to 64 uhci_hcd 0000:00:1d.1: UHCI Host Controller uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 3 uhci_hcd 0000:00:1d.1: irq 209, io base 0x00001420 usb usb3: configuration #1 chosen from 1 choice hub 3-0:1.0: USB hub found hub 3-0:1.0: 2 ports detected usbcore: registered new driver usblp drivers/usb/class/usblp.c: v0.13: USB Printer Device Class driver Initializing USB Mass Storage driver... usbcore: registered new driver usb-storage USB Mass Storage support registered. 
usbcore: registered new driver usbhid drivers/usb/input/hid-core.c: v2.6:USB HID core driver serio: i8042 AUX port at 0x60,0x64 irq 12 serio: i8042 KBD port at 0x60,0x64 irq 1 mice: PS/2 mouse device common for all mice device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised: dm-devel at redhat.com Intel 810 + AC97 Audio, version 1.01, 15:31:00 Aug 31 2006 oprofile: using NMI interrupt. TCP bic registered NET: Registered protocol family 1 NET: Registered protocol family 10 IPv6 over IPv4 tunneling driver NET: Registered protocol family 17 ACPI: (supports S0 S1 S4 S5) Freeing unused kernel memory: 216k freed Write protecting the kernel read-only data: 612k EXT3-fs: INFO: recovery required on readonly filesystem. EXT3-fs: write access will be enabled during recovery. kjournald starting. Commit interval 5 seconds EXT3-fs: recovery complete. EXT3-fs: mounted filesystem with ordered data mode. ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) ib_mthca: Initializing 0000:04:00.0 ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 16 (level, low) -> IRQ 169 PCI: Setting latency timer of device 0000:04:00.0 to 64 ADDRCONF(NETDEV_UP): ib1: link is not ready ADDRCONF(NETDEV_UP): ib0: link is not ready EXT3 FS on sda6, internal journal hdc: packet command error: status=0x51 { DriveReady SeekComplete Error } hdc: packet command error: error=0x54 { AbortedCommand LastFailedSense=0x05 } ide: failed opcode was: unknown kjournald starting. Commit interval 5 seconds EXT3 FS on sda1, internal journal EXT3-fs: recovery complete. EXT3-fs: mounted filesystem with ordered data mode. Adding 2048248k swap on /dev/sda5. Priority:-1 extents:1 across:2048248k ADDRCONF(NETDEV_UP): eth0: link is not ready e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready eth0: no IPv6 routers present i2c /dev entries driver hald[6264]: segfault at 00007fff4f041000 rip 0000003711c702da rsp 00007fff4f03f238 error 6 ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready ib0: no IPv6 routers present ====================================================== [ INFO: hard-safe -> hard-unsafe lock order detected ] ------------------------------------------------------ ipoib/3102 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: (&alloc->lock){--..}, at: [] mthca_free+0x24/0x54 [ib_mthca] and this task is already holding: (&priv->lock){.+..}, at: [] __ipoib_reap_ah+0x37/0xd1 [ib_ipoib] which would create a new lock dependency: (&priv->lock){.+..} -> (&alloc->lock){--..} but this new dependency connects a hard-irq-safe lock: (&priv->tx_lock){+...} ... which became hard-irq-safe at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ipoib_ib_completion+0x357/0x422 [ib_ipoib] [] mthca_cq_completion+0x68/0x6e [ib_mthca] [] mthca_eq_int+0x84/0x3d7 [ib_mthca] [] mthca_tavor_interrupt+0x5e/0xdc [ib_mthca] [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 to a hard-irq-unsafe lock: (&alloc->lock){--..} ... which became hard-irq-unsafe at: ... 
[] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] mthca_alloc+0x12/0x7b [ib_mthca] [] mthca_uar_alloc+0x18/0x4d [ib_mthca] [] mthca_init_one+0x7f9/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 other info that might help us debug this: 2 locks held by ipoib/3102: #0: (&priv->tx_lock){+...}, at: [] __ipoib_reap_ah+0x2f/0xd1 [ib_ipoib] #1: (&priv->lock){.+..}, at: [] __ipoib_reap_ah+0x37/0xd1 [ib_ipoib] the hard-irq-safe lock's dependencies: -> (&priv->tx_lock){+...} ops: 453 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] __ipoib_reap_ah+0x2e/0xd1 [ib_ipoib] [] ipoib_reap_ah+0xe/0x35 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ipoib_ib_completion+0x357/0x422 [ib_ipoib] [] mthca_cq_completion+0x68/0x6e [ib_mthca] [] mthca_eq_int+0x84/0x3d7 [ib_mthca] [] mthca_tavor_interrupt+0x5e/0xdc [ib_mthca] [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 } ... key at: [] __key.1+0x0/0xffffffffffff8bfb [ib_ipoib] -> (&priv->lock){.+..} ops: 593 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] ipoib_mcast_start_thread+0x6e/0x87 [ib_ipoib] [] ipoib_ib_dev_up+0x50/0x56 [ib_ipoib] [] ipoib_open+0x66/0x114 [ib_ipoib] [] dev_open+0x37/0x7c [] dev_change_flags+0x5c/0x124 [] devinet_ioctl+0x2a7/0x664 [] inet_ioctl+0x70/0x8f [] sock_ioctl+0x17c/0x19f [] do_ioctl+0x2d/0x78 [] vfs_ioctl+0x26e/0x281 [] sys_ioctl+0x3f/0x63 [] system_call+0x7d/0x83 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] ipoib_mcast_send+0x28/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.0+0x0/0xffffffffffff8c03 [ib_ipoib] -> (&idr_lock){.+..} ops: 125 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] send_mad+0x40/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_join+0x166/0x1f8 [ib_ipoib] [] ipoib_mcast_join_task+0x225/0x2c1 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] send_mad+0x40/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... 
key at: [] __key.1+0x0/0xffffffffffffc712 [ib_sa] -> (query_idr.lock){.+..} ops: 142 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] free_layer+0x1c/0x3f [] idr_pre_get+0x35/0x42 [] send_mad+0x26/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_join+0x166/0x1f8 [ib_ipoib] [] ipoib_mcast_join_task+0x225/0x2c1 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] free_layer+0x1c/0x3f [] idr_pre_get+0x35/0x42 [] send_mad+0x26/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_path_lookup+0x213/0x223 [ib_ipoib] [] ipoib_start_xmit+0x129/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] neigh_connected_output+0xb2/0xd2 [] ip6_output2+0x24f/0x284 [] ip6_output+0x88a/0x89a [] ndisc_send_rs+0x2eb/0x41c [] addrconf_dad_completed+0x8e/0xdb [] addrconf_dad_timer+0x70/0x119 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] query_idr+0x30/0xffffffffffffda5a [ib_sa] ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] alloc_layer+0x18/0x4d [] idr_get_new_above_int+0x33/0x271 [] idr_get_new+0x10/0x31 [] send_mad+0x56/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_join+0x166/0x1f8 [ib_ipoib] [] ipoib_mcast_join_task+0x225/0x2c1 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_sa_cancel_query+0x1b/0x73 [ib_sa] [] wait_for_mcast_join+0x35/0xd2 [ib_ipoib] [] ipoib_mcast_stop_thread+0xa4/0xf5 [ib_ipoib] [] ipoib_mcast_restart_task+0x4b/0x3e7 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&mad_agent_priv->lock){.+..} ops: 752 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_post_send_mad+0x477/0x567 [ib_mad] [] send_mad+0xe3/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_join+0x166/0x1f8 [ib_ipoib] [] ipoib_mcast_join_task+0x225/0x2c1 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_post_send_mad+0x477/0x567 [ib_mad] [] send_mad+0xe3/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... 
key at: [] __key.4+0x0/0xffffffffffffb0e8 [ib_mad] -> (base_lock_keys + cpu){++..} ops: 97119 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] run_timer_softirq+0x43/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] do_IRQ+0x10b/0x118 [] common_interrupt+0x65/0x66 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] __mod_timer+0x93/0xc2 [] __ide_set_handler+0x70/0x7a [] ide_set_handler+0x3a/0x56 [] cdrom_pc_intr+0x200/0x217 [] ide_intr+0x169/0x1de [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] run_timer_softirq+0x43/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] do_IRQ+0x10b/0x118 [] common_interrupt+0x65/0x66 } ... key at: [] base_lock_keys+0x0/0x100 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] lock_timer_base+0x21/0x47 [] try_to_del_timer_sync+0x18/0x5d [] del_timer_sync+0x11/0x1e [] ib_mad_complete_send_wr+0x109/0x1f6 [ib_mad] [] ib_mad_send_done_handler+0x121/0x176 [ib_mad] [] ib_mad_completion_handler+0x560/0x595 [ib_mad] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (base_lock_keys + cpu#2){++..} ops: 64294 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] run_timer_softirq+0x43/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] lock_timer_base+0x21/0x47 [] del_timer+0x1f/0x5c [] scsi_delete_timer+0x12/0x2e [] scsi_done+0xd/0x1e [] ata_scsi_qc_complete+0xc7/0xd9 [] __ata_qc_complete+0x22a/0x237 [] ata_qc_complete+0xcf/0xd5 [] ata_hsm_qc_complete+0x20b/0x21d [] ata_hsm_move+0x632/0x652 [] ata_interrupt+0x174/0x1b9 [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] run_timer_softirq+0x43/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] base_lock_keys+0x8/0x100 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] __mod_timer+0x93/0xc2 [] queue_delayed_work+0x77/0x81 [] wait_for_response+0xeb/0xf4 [ib_mad] [] ib_mad_complete_send_wr+0xbf/0x1f6 [ib_mad] [] ib_mad_send_done_handler+0x121/0x176 [ib_mad] [] ib_mad_completion_handler+0x560/0x595 [ib_mad] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (base_lock_keys + cpu#3){++..} ops: 63714 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] run_timer_softirq+0x43/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] lock_timer_base+0x21/0x47 [] del_timer+0x1f/0x5c [] ide_intr+0x158/0x1de [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] run_timer_softirq+0x43/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] base_lock_keys+0x10/0x100 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] __mod_timer+0x93/0xc2 [] queue_delayed_work+0x77/0x81 [] wait_for_response+0xeb/0xf4 [ib_mad] [] ib_reset_mad_timeout+0x31/0x34 [ib_mad] [] ib_modify_mad+0x147/0x163 [ib_mad] [] ib_cancel_mad+0xa/0xd [ib_mad] [] ib_sa_cancel_query+0x6a/0x73 [ib_sa] [] wait_for_mcast_join+0x35/0xd2 [ib_ipoib] [] ipoib_mcast_stop_thread+0xa4/0xf5 [ib_ipoib] [] ipoib_mcast_restart_task+0x4b/0x3e7 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (base_lock_keys + cpu#4){++..} ops: 62147 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] run_timer_softirq+0x43/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] lock_timer_base+0x21/0x47 [] del_timer+0x1f/0x5c [] scsi_delete_timer+0x12/0x2e [] scsi_done+0xd/0x1e [] ata_scsi_qc_complete+0xc7/0xd9 [] __ata_qc_complete+0x22a/0x237 [] ata_qc_complete+0xcf/0xd5 [] ata_hsm_qc_complete+0x20b/0x21d [] ata_hsm_move+0x632/0x652 [] ata_interrupt+0x174/0x1b9 [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] run_timer_softirq+0x43/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] base_lock_keys+0x18/0x100 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] __mod_timer+0x93/0xc2 [] queue_delayed_work+0x77/0x81 [] wait_for_response+0xeb/0xf4 [ib_mad] [] ib_mad_complete_send_wr+0xbf/0x1f6 [ib_mad] [] ib_mad_send_done_handler+0x121/0x176 [ib_mad] [] ib_mad_completion_handler+0x560/0x595 [ib_mad] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_modify_mad+0x24/0x163 [ib_mad] [] ib_cancel_mad+0xa/0xd [ib_mad] [] ib_sa_cancel_query+0x6a/0x73 [ib_sa] [] wait_for_mcast_join+0x35/0xd2 [ib_ipoib] [] ipoib_mcast_stop_thread+0xa4/0xf5 [ib_ipoib] [] ipoib_mcast_restart_task+0x4b/0x3e7 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (modlist_lock){.+..} ops: 116 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] module_text_address+0x15/0x3b [] __register_kprobe+0x4b/0x316 [] register_kprobe+0xc/0xf [] arch_init_kprobes+0xf/0x12 [] init_kprobes+0x38/0x4b [] init+0x140/0x30f [] child_rip+0x7/0x12 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] is_module_address+0x14/0x80 [] static_obj+0x89/0x92 [] lockdep_init_map+0x60/0xbf [] __spin_lock_init+0x2b/0x4c [] ipoib_mcast_alloc+0x96/0xb2 [ib_ipoib] [] ipoib_mcast_send+0x105/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] modlist_lock+0x18/0x40 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] is_module_address+0x14/0x80 [] static_obj+0x89/0x92 [] lockdep_init_map+0x60/0xbf [] __spin_lock_init+0x2b/0x4c [] ipoib_mcast_alloc+0x96/0xb2 [ib_ipoib] [] ipoib_mcast_restart_task+0x183/0x3e7 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&cwq->lock){++..} ops: 8625 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] __queue_work+0x14/0x66 [] queue_work+0x52/0x5c [] call_usermodehelper_keys+0xe8/0x109 [] kobject_uevent+0x3f8/0x422 [] class_device_add+0x331/0x464 [] class_device_register+0x15/0x1b [] class_device_create+0x13a/0x163 [] vtconsole_class_init+0x87/0xdb [] init+0x140/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] __queue_work+0x14/0x66 [] queue_work+0x52/0x5c [] schedule_work+0x15/0x18 [] schedule_bh+0x21/0x24 [] floppy_interrupt+0x1a7/0x1c2 [] floppy_hardint+0x14/0xc5 [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] __queue_work+0x14/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... 
key at: [] __key.1+0x0/0x8 -> (&q->lock){++..} ops: 108166 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] wait_for_completion+0x2d/0xf3 [] keventd_create_kthread+0x35/0x6a [] kthread_create+0x109/0x18c [] migration_call+0x60/0x453 [] migration_init+0x24/0x4b [] init+0x44/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] __wake_up+0x1f/0x4e [] __queue_work+0x52/0x66 [] queue_work+0x52/0x5c [] schedule_work+0x15/0x18 [] schedule_bh+0x21/0x24 [] floppy_interrupt+0x1a7/0x1c2 [] floppy_hardint+0x14/0xc5 [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] complete+0x18/0x49 [] wakeme_after_rcu+0xc/0xf [] __rcu_process_callbacks+0x154/0x1de [] rcu_process_callbacks+0x22/0x44 [] tasklet_action+0x78/0xc0 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.0+0x0/0x8 -> (&rq->rq_lock_key){++..} ops: 318932 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] init_idle+0x75/0xa1 [] sched_init+0x1b5/0x1bb [] start_kernel+0x7a/0x22e [] _sinittext+0x2ab/0x2b3 [] 0xffffffffffffffff in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] main_timer_handler+0x20b/0x3c0 [] timer_interrupt+0x14/0x2a [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] wake_up_process+0xf/0x12 [] __do_softirq+0xcb/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff810001046a90 -> (&rq->rq_lock_key#2){++..} ops: 134541 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] complete+0x34/0x49 [] wakeme_after_rcu+0xc/0xf [] __rcu_process_callbacks+0x154/0x1de [] rcu_process_callbacks+0x22/0x44 [] tasklet_action+0x78/0xc0 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... 
key at: [] 0xffff81000104ea90 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&rq->rq_lock_key#3){++..} ops: 299350 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] wake_up_process+0xf/0x12 [] __do_softirq+0xcb/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff810001056a90 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&rq->rq_lock_key#3){++..} ops: 299350 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] wake_up_process+0xf/0x12 [] __do_softirq+0xcb/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff810001056a90 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] complete+0x34/0x49 [] kthread+0xbb/0xfc [] child_rip+0x7/0x12 -> (&rq->rq_lock_key#2){++..} ops: 134541 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] complete+0x34/0x49 [] wakeme_after_rcu+0xc/0xf [] __rcu_process_callbacks+0x154/0x1de [] rcu_process_callbacks+0x22/0x44 [] tasklet_action+0x78/0xc0 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... 
key at: [] 0xffff81000104ea90 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&rq->rq_lock_key#3){++..} ops: 299350 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] wake_up_process+0xf/0x12 [] __do_softirq+0xcb/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff810001056a90 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] complete+0x34/0x49 [] migration_thread+0x1c9/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&rq->rq_lock_key#3){++..} ops: 299350 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] wake_up_process+0xf/0x12 [] __do_softirq+0xcb/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff810001056a90 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] double_rq_lock+0x2d/0x33 [] __migrate_task+0x65/0xe2 [] migration_thread+0x1ba/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] complete+0x34/0x49 [] migration_thread+0x1c9/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] complete+0x34/0x49 [] migration_thread+0x1c9/0x223 [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] __wake_up+0x1f/0x4e [] __queue_work+0x52/0x66 [] queue_work+0x52/0x5c [] call_usermodehelper_keys+0xe8/0x109 [] kobject_uevent+0x3f8/0x422 [] class_device_add+0x331/0x464 [] class_device_register+0x15/0x1b [] class_device_create+0x13a/0x163 [] vtconsole_class_init+0x87/0xdb [] init+0x140/0x30f [] child_rip+0x7/0x12 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] __queue_work+0x14/0x66 [] queue_work+0x52/0x5c [] ipoib_mcast_join_complete+0x267/0x2a8 [ib_ipoib] [] ib_sa_mcmember_rec_callback+0x4b/0x57 [ib_sa] [] send_handler+0x50/0xaf [ib_sa] [] timeout_sends+0x199/0x1c0 [ib_mad] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&on_slab_key){++..} ops: 4526 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] cache_alloc_refill+0x84/0x245 [] kmem_cache_alloc+0xb4/0xf4 [] kmem_cache_create+0x52d/0x710 [] kmem_cache_init+0x250/0x49d [] start_kernel+0x1a9/0x22e [] _sinittext+0x2ab/0x2b3 [] 0xffffffffffffffff in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] cache_alloc_refill+0x84/0x245 [] __kmalloc_track_caller+0xf9/0x138 [] __alloc_skb+0x5c/0x125 [] __netdev_alloc_skb+0x16/0x34 [] e1000_clean_rx_irq+0x206/0x4a9 [] e1000_intr+0xac/0xf7 [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] cache_flusharray+0x4b/0x103 [] kfree+0x1ee/0x224 [] free_fdtable_rcu+0x95/0xfa [] __rcu_process_callbacks+0x154/0x1de [] rcu_process_callbacks+0x22/0x44 [] tasklet_action+0x78/0xc0 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] on_slab_key+0x0/0x10 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] cache_alloc_refill+0x84/0x245 [] kmem_cache_zalloc+0xb9/0x117 [] ipoib_mcast_alloc+0x24/0xb2 [ib_ipoib] [] ipoib_mcast_restart_task+0x183/0x3e7 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&list->lock#4){.+..} ops: 10 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] skb_queue_tail+0x1c/0x47 [] ipoib_mcast_send+0x193/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] skb_queue_tail+0x1c/0x47 [] ipoib_mcast_send+0x193/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.0+0x0/0xffffffffffff8be3 [ib_ipoib] ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] skb_queue_tail+0x1c/0x47 [] ipoib_mcast_send+0x193/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 -> (&device->client_data_lock){.+..} ops: 70 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] add_client_context+0x5f/0x97 [ib_core] [] ib_register_device+0x2a7/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_get_client_data+0x1f/0x65 [ib_core] [] ib_sa_mcmember_rec_query+0x31/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.1+0x0/0xffffffffffff665a [ib_core] ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_get_client_data+0x1f/0x65 [ib_core] [] ib_sa_mcmember_rec_query+0x31/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 -> (&tid_lock){.+..} ops: 54 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] init_mad+0x2f/0x67 [ib_sa] [] ib_sa_mcmember_rec_query+0xe0/0x17f [ib_sa] [] ipoib_mcast_join+0x166/0x1f8 [ib_ipoib] [] ipoib_mcast_join_task+0x225/0x2c1 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] init_mad+0x2f/0x67 [ib_sa] [] ib_sa_mcmember_rec_query+0xe0/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.2+0x0/0xffffffffffffc70a [ib_sa] ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] init_mad+0x2f/0x67 [ib_sa] [] ib_sa_mcmember_rec_query+0xe0/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 -> (&sa_dev->port[i].ah_lock){.+..} ops: 58 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] update_sm_ah+0xd6/0x10b [ib_sa] [] ib_sa_add_one+0x19f/0x1ec [ib_sa] [] ib_register_device+0x2b1/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] send_mad+0x9d/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.0+0x0/0xffffffffffffc71a [ib_sa] ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] send_mad+0x9d/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 -> (&mad_queue->lock){.+..} ops: 2586 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_mad_post_receive_mads+0xc0/0x1ac [ib_mad] [] ib_mad_init_device+0x3b9/0x559 [ib_mad] [] ib_register_device+0x2b1/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_send_mad+0xb3/0x170 [ib_mad] [] ib_post_send_mad+0x4d4/0x567 [ib_mad] [] send_mad+0xe3/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.2+0x0/0xffffffffffffb0e0 [ib_mad] -> (&qp->sq.lock){.+..} ops: 156 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] mthca_modify_qp+0x68/0xcd8 [ib_mthca] [] ib_modify_qp+0xc/0xf [ib_core] [] ib_mad_init_device+0x2d7/0x559 [ib_mad] [] ib_register_device+0x2b1/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] mthca_tavor_post_send+0x3c/0x512 [ib_mthca] [] ib_send_mad+0xe8/0x170 [ib_mad] [] ib_post_send_mad+0x4d4/0x567 [ib_mad] [] send_mad+0xe3/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... 
key at: [] __key.2+0x0/0xffffffffffff309e [ib_mthca] -> (&qp->rq.lock){+...} ops: 2463 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] mthca_modify_qp+0x70/0xcd8 [ib_mthca] [] ib_modify_qp+0xc/0xf [ib_core] [] ib_mad_init_device+0x2d7/0x559 [ib_mad] [] ib_register_device+0x2b1/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] mthca_tavor_post_receive+0x38/0x2ef [ib_mthca] [] ipoib_ib_post_receive+0x81/0xfb [ib_ipoib] [] ipoib_ib_completion+0x29d/0x422 [ib_ipoib] [] mthca_cq_completion+0x68/0x6e [ib_mthca] [] mthca_eq_int+0x84/0x3d7 [ib_mthca] [] mthca_tavor_interrupt+0x5e/0xdc [ib_mthca] [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 } ... key at: [] __key.3+0x0/0xffffffffffff3096 [ib_mthca] ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] mthca_modify_qp+0x70/0xcd8 [ib_mthca] [] ib_modify_qp+0xc/0xf [ib_core] [] ib_mad_init_device+0x2d7/0x559 [ib_mad] [] ib_register_device+0x2b1/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] mthca_tavor_post_send+0x3c/0x512 [ib_mthca] [] ib_send_mad+0xe8/0x170 [ib_mad] [] ib_post_send_mad+0x4d4/0x567 [ib_mad] [] send_mad+0xe3/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_join+0x166/0x1f8 [ib_ipoib] [] ipoib_mcast_join_task+0x225/0x2c1 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] ib_send_mad+0xb3/0x170 [ib_mad] [] ib_post_send_mad+0x4d4/0x567 [ib_mad] [] send_mad+0xe3/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 -> (&qp->sq.lock){.+..} ops: 156 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irq+0x27/0x34 [] mthca_modify_qp+0x68/0xcd8 [ib_mthca] [] ib_modify_qp+0xc/0xf [ib_core] [] ib_mad_init_device+0x2d7/0x559 [ib_mad] [] ib_register_device+0x2b1/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] mthca_tavor_post_send+0x3c/0x512 [ib_mthca] [] ib_send_mad+0xe8/0x170 [ib_mad] [] ib_post_send_mad+0x4d4/0x567 [ib_mad] [] send_mad+0xe3/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.2+0x0/0xffffffffffff309e [ib_mthca] -> (&qp->rq.lock){+...} ops: 2463 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] mthca_modify_qp+0x70/0xcd8 [ib_mthca] [] ib_modify_qp+0xc/0xf [ib_core] [] ib_mad_init_device+0x2d7/0x559 [ib_mad] [] ib_register_device+0x2b1/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] mthca_tavor_post_receive+0x38/0x2ef [ib_mthca] [] ipoib_ib_post_receive+0x81/0xfb [ib_ipoib] [] ipoib_ib_completion+0x29d/0x422 [ib_ipoib] [] mthca_cq_completion+0x68/0x6e [ib_mthca] [] mthca_eq_int+0x84/0x3d7 [ib_mthca] [] mthca_tavor_interrupt+0x5e/0xdc [ib_mthca] [] handle_IRQ_event+0x29/0x62 [] __do_IRQ+0xad/0x11d [] do_IRQ+0x106/0x118 [] common_interrupt+0x65/0x66 } ... key at: [] __key.3+0x0/0xffffffffffff3096 [ib_mthca] ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] mthca_modify_qp+0x70/0xcd8 [ib_mthca] [] ib_modify_qp+0xc/0xf [ib_core] [] ib_mad_init_device+0x2d7/0x559 [ib_mad] [] ib_register_device+0x2b1/0x2dd [ib_core] [] mthca_register_device+0x3db/0x436 [ib_mthca] [] mthca_init_one+0xc85/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] mthca_tavor_post_send+0x3c/0x512 [ib_mthca] [] ipoib_send+0x117/0x1d6 [ib_ipoib] [] ipoib_mcast_send+0x3f2/0x40a [ib_ipoib] [] ipoib_path_lookup+0x213/0x223 [ib_ipoib] [] ipoib_start_xmit+0x129/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] neigh_connected_output+0xb2/0xd2 [] ip6_output2+0x24f/0x284 [] ip6_output+0x88a/0x89a [] ndisc_send_ns+0x36c/0x49d [] addrconf_dad_timer+0xf8/0x119 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 -> (query_idr.lock){.+..} ops: 142 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] free_layer+0x1c/0x3f [] idr_pre_get+0x35/0x42 [] send_mad+0x26/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_join+0x166/0x1f8 [ib_ipoib] [] ipoib_mcast_join_task+0x225/0x2c1 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] free_layer+0x1c/0x3f [] idr_pre_get+0x35/0x42 [] send_mad+0x26/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_path_lookup+0x213/0x223 [ib_ipoib] [] ipoib_start_xmit+0x129/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] neigh_connected_output+0xb2/0xd2 [] ip6_output2+0x24f/0x284 [] ip6_output+0x88a/0x89a [] ndisc_send_rs+0x2eb/0x41c [] addrconf_dad_completed+0x8e/0xdb [] addrconf_dad_timer+0x70/0x119 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] query_idr+0x30/0xffffffffffffda5a [ib_sa] ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] free_layer+0x1c/0x3f [] idr_pre_get+0x35/0x42 [] send_mad+0x26/0x142 [ib_sa] [] ib_sa_mcmember_rec_query+0x14a/0x17f [ib_sa] [] ipoib_mcast_send+0x2dd/0x40a [ib_ipoib] [] ipoib_path_lookup+0x213/0x223 [ib_ipoib] [] ipoib_start_xmit+0x129/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] neigh_connected_output+0xb2/0xd2 [] ip6_output2+0x24f/0x284 [] ip6_output+0x88a/0x89a [] ndisc_send_rs+0x2eb/0x41c [] addrconf_dad_completed+0x8e/0xdb [] addrconf_dad_timer+0x70/0x119 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] __ipoib_reap_ah+0x36/0xd1 [ib_ipoib] [] ipoib_reap_ah+0xe/0x35 [ib_ipoib] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&list->lock#4){.+..} ops: 10 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] skb_queue_tail+0x1c/0x47 [] ipoib_mcast_send+0x193/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] skb_queue_tail+0x1c/0x47 [] ipoib_mcast_send+0x193/0x40a [ib_ipoib] [] ipoib_start_xmit+0x205/0x4b2 [ib_ipoib] [] dev_hard_start_xmit+0x1af/0x223 [] __qdisc_run+0xf7/0x1d1 [] dev_queue_xmit+0x12c/0x245 [] mld_sendpack+0x13e/0x263 [] mld_ifc_timer_expire+0x1df/0x218 [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] __key.0+0x0/0xffffffffffff8be3 [ib_ipoib] ... acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock_irqsave+0x2b/0x3c [] skb_dequeue+0x1b/0x5c [] ipoib_mcast_sendonly_join_complete+0xdb/0x10f [ib_ipoib] [] ib_sa_mcmember_rec_callback+0x4b/0x57 [ib_sa] [] recv_handler+0x3e/0x4b [ib_sa] [] ib_mad_completion_handler+0x3b9/0x595 [ib_mad] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 -> (&rq->rq_lock_key#4){++..} ops: 122452 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] migration_call+0x94/0x453 [] notifier_call_chain+0x28/0x3b [] blocking_notifier_call_chain+0x26/0x3d [] cpu_up+0x54/0xf1 [] init+0x88/0x30f [] child_rip+0x7/0x12 in-hardirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] scheduler_tick+0x6b/0x336 [] update_process_times+0x64/0x76 [] smp_local_timer_interrupt+0x27/0x4d [] smp_apic_timer_interrupt+0x54/0x5e [] apic_timer_interrupt+0x6a/0x70 in-softirq-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] default_wake_function+0xc/0xf [] __wake_up_common+0x42/0x62 [] __wake_up+0x35/0x4e [] __queue_work+0x52/0x66 [] delayed_work_timer_fn+0x38/0x3b [] run_timer_softirq+0x168/0x1d5 [] __do_softirq+0x67/0xf2 [] call_softirq+0x1b/0x28 [] do_softirq+0x35/0x9f [] irq_exit+0x56/0x59 [] smp_apic_timer_interrupt+0x59/0x5e [] apic_timer_interrupt+0x6a/0x70 } ... key at: [] 0xffff81000105ea90 ... 
acquired at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] task_rq_lock+0x41/0x74 [] try_to_wake_up+0x32/0x3df [] wake_up_process+0xf/0x12 [] raise_softirq_irqoff+0x52/0x55 [] dev_kfree_skb_any+0x66/0x8e [] ipoib_mcast_sendonly_join_complete+0xe3/0x10f [ib_ipoib] [] ib_sa_mcmember_rec_callback+0x4b/0x57 [ib_sa] [] recv_handler+0x3e/0x4b [ib_sa] [] ib_mad_completion_handler+0x3b9/0x595 [ib_mad] [] run_workqueue+0xa7/0xf2 [] worker_thread+0xfb/0x12f [] kthread+0xcf/0xfc [] child_rip+0x7/0x12 the hard-irq-unsafe lock's dependencies: -> (&alloc->lock){--..} ops: 57 { initial-use at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] mthca_alloc+0x12/0x7b [ib_mthca] [] mthca_uar_alloc+0x18/0x4d [ib_mthca] [] mthca_init_one+0x7f9/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 softirq-on-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] mthca_alloc+0x12/0x7b [ib_mthca] [] mthca_uar_alloc+0x18/0x4d [ib_mthca] [] mthca_init_one+0x7f9/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 hardirq-on-W at: [] lock_acquire+0x7a/0xa1 [] _spin_lock+0x21/0x2e [] mthca_alloc+0x12/0x7b [ib_mthca] [] mthca_uar_alloc+0x18/0x4d [ib_mthca] [] mthca_init_one+0x7f9/0xd89 [ib_mthca] [] pci_device_probe+0xee/0x155 [] driver_probe_device+0x5d/0xbe [] __driver_attach+0x90/0xcf [] bus_for_each_dev+0x49/0x7e [] driver_attach+0x1b/0x1e [] bus_add_driver+0x7b/0x13a [] driver_register+0x8c/0x91 [] __pci_register_driver+0x60/0x84 [] copy_addr+0x16/0x57 [ib_addr] [] sys_init_module+0xb6/0x1ce [] system_call+0x7d/0x83 } ... key at: [] __key.0+0x0/0xffffffffffff30ce [ib_mthca] stack backtrace: Call Trace: [] show_trace+0xb8/0x335 [] dump_stack+0x13/0x15 [] check_usage+0x279/0x28a [] __lock_acquire+0x956/0xb8e [] lock_acquire+0x7b/0xa1 [] _spin_lock+0x22/0x2e [] :ib_mthca:mthca_free+0x24/0x54 [] :ib_mthca:mthca_destroy_ah+0x30/0x55 [] :ib_mthca:mthca_ah_destroy+0x14/0x22 [] :ib_core:ib_destroy_ah+0x13/0x1f [] :ib_ipoib:__ipoib_reap_ah+0x8f/0xd1 [] :ib_ipoib:ipoib_reap_ah+0xf/0x35 [] run_workqueue+0xa8/0xf2 [] worker_thread+0xfc/0x12f [] kthread+0xd0/0xfc [] child_rip+0x8/0x12 DWARF2 unwinder stuck at child_rip+0x8/0x12 Leftover inexact backtrace: [] _spin_unlock_irq+0x29/0x2f [] restore_args+0x0/0x30 [] kthread+0x0/0xfc [] child_rip+0x0/0x12 BUG: soft lockup detected on CPU#0! 
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 Call Trace: [] show_trace+0xb8/0x335 [] dump_stack+0x13/0x15 [] softlockup_tick+0xd8/0xef [] run_local_timers+0x13/0x15 [] update_process_times+0x49/0x76 [] smp_local_timer_interrupt+0x28/0x4d [] smp_apic_timer_interrupt+0x55/0x5e [] apic_timer_interrupt+0x6b/0x70 [] mwait_idle+0x3f/0x54 [] cpu_idle+0x9f/0xc2 [] rest_init+0x2b/0x2d [] start_kernel+0x22c/0x22e [] _sinittext+0x2ac/0x2b3 [] softlockup_tick+0xd8/0xef [] run_local_timers+0x13/0x15 [] update_process_times+0x49/0x76 [] smp_local_timer_interrupt+0x28/0x4d [] smp_apic_timer_interrupt+0x55/0x5e [] mwait_idle+0x0/0x54 [] apic_timer_interrupt+0x6b/0x70 [] mwait_idle+0x3f/0x54 [] cpu_idle+0x9f/0xc2 [] rest_init+0x2b/0x2d [] start_kernel+0x22c/0x22e [] _sinittext+0x2ac/0x2b3 BUG: soft lockup detected on CPU#3! Call Trace: [] show_trace+0xb8/0x335 [] dump_stack+0x13/0x15 [] softlockup_tick+0xd8/0xef [] run_local_timers+0x13/0x15 [] update_process_times+0x49/0x76 [] smp_local_timer_interrupt+0x28/0x4d [] smp_apic_timer_interrupt+0x55/0x5e [] apic_timer_interrupt+0x6b/0x70 [] mwait_idle+0x3f/0x54 [] cpu_idle+0x9f/0xc2 [] start_secondary+0x44b/0x45a [] softlockup_tick+0xd8/0xef [] run_local_timers+0x13/0x15 [] update_process_times+0x49/0x76 [] smp_local_timer_interrupt+0x28/0x4d [] smp_apic_timer_interrupt+0x55/0x5e [] mwait_idle+0x0/0x54 [] apic_timer_interrupt+0x6b/0x70 [] mwait_idle+0x3f/0x54 [] cpu_idle+0x9f/0xc2 [] start_secondary+0x44b/0x45a BUG: soft lockup detected on CPU#2! Call Trace: [] show_trace+0xb8/0x335 [] dump_stack+0x13/0x15 [] softlockup_tick+0xd8/0xef [] run_local_timers+0x13/0x15 [] update_process_times+0x49/0x76 [] smp_local_timer_interrupt+0x28/0x4d [] smp_apic_timer_interrupt+0x55/0x5e [] apic_timer_interrupt+0x6b/0x70 [] mwait_idle+0x3f/0x54 [] cpu_idle+0x9f/0xc2 [] start_secondary+0x44b/0x45a [] softlockup_tick+0xd8/0xef [] run_local_timers+0x13/0x15 [] update_process_times+0x49/0x76 [] smp_local_timer_interrupt+0x28/0x4d [] smp_apic_timer_interrupt+0x55/0x5e [] mwait_idle+0x0/0x54 [] apic_timer_interrupt+0x6b/0x70 [] mwait_idle+0x3f/0x54 [] cpu_idle+0x9f/0xc2 [] start_secondary+0x44b/0x45a ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 From yipeeyipeeyipeeyipee at yahoo.com Thu Aug 31 07:40:02 2006 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Thu, 31 Aug 2006 14:40:02 +0000 (UTC) Subject: [openib-general] single rkey Message-ID: Hi, Is it possible for several memory registrations (using ibv_reg_mr) to have a single rkey? Can I add memory registrations to a previous rkey? thanks, y From dotanb at dev.mellanox.co.il Thu Aug 31 07:54:37 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 31 Aug 2006 17:54:37 +0300 Subject: [openib-general] single rkey In-Reply-To: References: Message-ID: <44F6F82D.80309@dev.mellanox.co.il> yipee wrote: > Hi, > > Is it possible for several memory registrations (using ibv_reg_mr) to have a > single rkey? > Can I add memory registrations to a previous rkey? > > > thanks, > y > I believe that the answer is no. 
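A minimal libibverbs sketch (hypothetical code, assuming a single HCA is present and omitting all error handling) illustrates the point: each ibv_reg_mr() call returns its own struct ibv_mr, each carrying its own lkey/rkey, and there is no verbs call for attaching additional buffers to an existing rkey. The usual workaround is to register one larger region that spans all the buffers you need to expose.

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
	/* Assumes at least one HCA; error checks omitted for brevity. */
	struct ibv_device **list = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(list[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);
	void *buf1 = malloc(4096), *buf2 = malloc(4096);

	/* Two registrations: two independent MRs, two distinct rkeys. */
	struct ibv_mr *mr1 = ibv_reg_mr(pd, buf1, 4096,
					IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
	struct ibv_mr *mr2 = ibv_reg_mr(pd, buf2, 4096,
					IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

	printf("mr1 rkey 0x%x, mr2 rkey 0x%x\n", mr1->rkey, mr2->rkey);

	ibv_dereg_mr(mr2);
	ibv_dereg_mr(mr1);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}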
Dotan From mst at mellanox.co.il Thu Aug 31 08:32:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 Aug 2006 18:32:49 +0300 Subject: [openib-general] [PATCH] IB/srp: destroy and re-create QP and CQ on reconnect Message-ID: <20060831153248.GC1006@mellanox.co.il> Hello, Roland! Please consider the following for 2.6.19. --- From: Ishai Rabinovitz For some reason (it could be a firmware problem) I got a CQ overrun in SRP. Because of that there was a QP FATAL error. Since srp_reconnect_target does not destroy the QP, the QP FATAL state persists after the reconnect. In order to be able to recover from such a situation, I suggest we destroy the CQ and the QP on every reconnect. This also corrects a minor spec non-compliance: when srp_reconnect_target is called, SRP destroys the CM ID and resets the QP, so the new connection is retried with the same QPN, which could theoretically lead to stale packets (for strict spec compliance I think the QPN should not be reused until all stale packets have been flushed out of the network). --- IB/srp: destroy/re-create QP and CQ on each reconnect. This makes SRP more robust in the presence of hardware errors and is closer to the behaviour suggested by the IB spec, reducing the chance of stale packets. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. Tsirkin Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-08-31 12:23:52.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-08-31 12:30:48.000000000 +0300 @@ -495,10 +495,10 @@ static int srp_reconnect_target(struct srp_target_port *target) { struct ib_cm_id *new_cm_id; - struct ib_qp_attr qp_attr; struct srp_request *req, *tmp; - struct ib_wc wc; int ret; + struct ib_cq *old_cq; + struct ib_qp *old_qp; spin_lock_irq(target->scsi_host->host_lock); if (target->state != SRP_TARGET_LIVE) { @@ -522,17 +522,17 @@ ib_destroy_cm_id(target->cm_id); target->cm_id = new_cm_id; - qp_attr.qp_state = IB_QPS_RESET; - ret = ib_modify_qp(target->qp, &qp_attr, IB_QP_STATE); - if (ret) - goto err; - - ret = srp_init_qp(target, target->qp); - if (ret) + old_qp = target->qp; + old_cq = target->cq; + ret = srp_create_target_ib(target); + if (ret) { + target->qp = old_qp; + target->cq = old_cq; goto err; + } - while (ib_poll_cq(target->cq, 1, &wc) > 0) - ; /* nothing */ + ib_destroy_qp(old_qp); + ib_destroy_cq(old_cq); spin_lock_irq(target->scsi_host->host_lock); list_for_each_entry_safe(req, tmp, &target->req_queue, list) -- MST From tziporet at dev.mellanox.co.il Thu Aug 31 09:24:29 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 31 Aug 2006 19:24:29 +0300 Subject: [openib-general] OFED 1.1-rc3 is ready Message-ID: <44F70D3D.80101@dev.mellanox.co.il> Hi, OFED 1.1-RC3 is available on https://openib.org/svn/gen2/branches/1.1/ofed/releases/ File: OFED-1.1-rc3.tgz Please report any issues in bugzilla http://openib.org/bugzilla/ Schedule reminder: ================== Next milestones: RC4 is planned for 7-Sep. It should include critical bug fixes only. Final release will be on 11 or 12 Sep. Owners - please update release notes for RC4. 
Tziporet & Vlad ------------------------------------------------------------------------------------- Release details: ================ Build_id: OFED-1.1-rc3 openib-1.1 (REV=9203) # User space https://openib.org/svn/gen2/branches/1.1/src/userspace Git: ref: refs/heads/ofed_1_1 commit 338e942a4ae10d62f2632e6292f85bb1b15d154c # MPI mpi_osu-0.9.7-mlx2.2.0.tgz openmpi-1.1.1-1.src.rpm mpitests-2.0-0.src.rpm OS support: =========== Novell: - SLES 9.0 SP3 - SLES10 Redhat: - Redhat EL4 up3 - Redhat EL4 up4 kernel.org: - Kernel 2.6.17 Note: Redhat EL4 up2, Fedora C4 and SuSE Pro 10 were dropped from the list. We keep the backport patches for these OSes and make sure OFED compiles and loads properly, but we will not do a full QA cycle. Systems: ======== * x86_64 * x86 * ia64 * ppc64 Main changes from OFED-1.1-rc2: =============================== 1. Added the ehca (IBM) driver. This driver can be compiled only on kernel 2.6.18 3. Open MPI version updated to openmpi-1.1.1-1 4. Core: Huge pages registration is supported 5. IPoIB high availability script supports multicast groups 6. RHEL4 up4 is now supported 7. SDP: fixed connection refused problem; get peer name working 8. libsdp: several bug fixes Limitations and known issues: ============================= 1. SDP: For Mellanox Sinai HCAs one must use the latest FW version (1.1.000). 2. SDP: Scalability issue when many connections are opened 3. SDP: If the RTU packet is lost, the accept call blocks even if the client has connected. 4. The ipath driver is not supported on SLES9 SP3 5. Compilation on kernel 2.6.18-rc5 is failing - to be fixed in RC4 Missing features that should be completed for RC4: ================================================== None From sean.hefty at intel.com Thu Aug 31 09:49:55 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 31 Aug 2006 09:49:55 -0700 Subject: [openib-general] [PATCH] 2.6.19 cma: fix typo Message-ID: <000001c6cd1d$7c7dc240$e598070a@amr.corp.intel.com> Comma should be a semicolon Signed-off-by: Sean Hefty --- Please queue for 2.6.19 diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d6f99d5..bf20410 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -265,7 +265,7 @@ static int cma_acquire_ib_dev(struct rdm union ib_gid gid; int ret = -ENODEV; - ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid), + ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { From sean.hefty at intel.com Thu Aug 31 10:03:33 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 31 Aug 2006 10:03:33 -0700 Subject: [openib-general] [PATCH] cma: protect against adding device during destruction In-Reply-To: <20060830200604.GA32183@mellanox.co.il> Message-ID: <000101c6cd1f$641c14c0$e598070a@amr.corp.intel.com> Can you see if this patch helps any? This closes a window where address resolution can attach an rdma_cm_id to a device during destruction of the rdma_cm_id. This can result in the rdma_cm_id remaining in the device list after its memory has been freed. 
Signed-off-by: Sean Hefty --- Index: cma.c =================================================================== --- cma.c (revision 9192) +++ cma.c (working copy) @@ -283,7 +284,6 @@ static int cma_acquire_ib_dev(struct rdm ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); - mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { ret = ib_find_cached_gid(cma_dev->device, &gid, &id_priv->id.port_num, NULL); @@ -292,7 +292,6 @@ static int cma_acquire_ib_dev(struct rdm break; } } - mutex_unlock(&lock); return ret; } @@ -781,7 +780,9 @@ void rdma_destroy_id(struct rdma_cm_id * state = cma_exch(id_priv, CMA_DESTROYING); cma_cancel_operation(id_priv, state); + mutex_lock(&lock); if (id_priv->cma_dev) { + mutex_unlock(&lock); switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) @@ -793,8 +794,8 @@ void rdma_destroy_id(struct rdma_cm_id * cma_leave_mc_groups(id_priv); mutex_lock(&lock); cma_detach_from_dev(id_priv); - mutex_unlock(&lock); } + mutex_unlock(&lock); cma_release_port(id_priv); cma_deref_id(id_priv); @@ -1511,16 +1512,26 @@ static void addr_handler(int status, str enum rdma_cm_event_type event; atomic_inc(&id_priv->dev_remove); - if (!id_priv->cma_dev && !status) + + /* + * Grab mutex to block rdma_destroy_id() from removing the device while + * we're trying to acquire it. + */ + mutex_lock(&lock); + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) { + mutex_unlock(&lock); + goto out; + } + + if (!status && !id_priv->cma_dev) status = cma_acquire_dev(id_priv); + mutex_unlock(&lock); if (status) { - if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_BOUND)) + if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ADDR_BOUND)) goto out; event = RDMA_CM_EVENT_ADDR_ERROR; } else { - if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) - goto out; memcpy(&id_priv->id.route.addr.src_addr, src_addr, ip_addr_size(src_addr)); event = RDMA_CM_EVENT_ADDR_RESOLVED; @@ -1747,8 +1758,11 @@ int rdma_bind_addr(struct rdma_cm_id *id if (!cma_any_addr(addr)) { ret = rdma_translate_ip(addr, &id->route.addr.dev_addr); - if (!ret) + if (!ret) { + mutex_lock(&lock); ret = cma_acquire_dev(id_priv); + mutex_unlock(&lock); + } if (ret) goto err; } From minich at ornl.gov Thu Aug 31 12:16:42 2006 From: minich at ornl.gov (Makia Minich) Date: Thu, 31 Aug 2006 15:16:42 -0400 Subject: [openib-general] Srp question Message-ID: We are attempting to do some performance testing of the SRP driver (with a DDN target) and are seeing some poor results: ~120MB/s per lun with 1 sgp_dd ~80MB/s per lun with 4 sgp_dd Previously we had attempted the same tests with IBGold and got the following: ~150MB/s per lun with 1 sgp_dd ~600MB/s per lun with 4 sgp_dd To achieve the results in IBGold, we were able to set the srp module option "max_xfer_sectors_per_io=4096", but can't seem to find an equivalent option in the OFED SRP drivers. By default, we found (via stats from the DDN) that we were only seeing reads and writes in the 0-32Kbyte range. Comparing IBGold and OFED, we found that the srp_sg_tablesize defaulted to 256, but in OFED it defaulted to 12. So, changing this (via modprobe.conf) to 256 in OFED, we were able to see reads and writes in the 128Kbyte range (which is what ultimately got us to the performance above). 
I also noticed that there is a max_sects option you can pass to add_target (in the SRP /sys entries) which seemed to be the same idea as srp_sg_tablesize, but this didn't seem to affect anything. So, my question is, what is the right magic to get SRP up to speed? Thanks... -- Makia Minich National Center for Computation Science Oak Ridge National Laboratory From rdreier at cisco.com Thu Aug 31 12:34:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 31 Aug 2006 12:34:08 -0700 Subject: [openib-general] Srp question In-Reply-To: (Makia Minich's message of "Thu, 31 Aug 2006 15:16:42 -0400") References: Message-ID: Makia> We are attempting to do some performance testing of the SRP driver (with a Makia> DDN target) and are seeing some poor results: Makia> ~120MB/s per lun with 1 sgp_dd Makia> ~80MB/s per lun with 4 sgp_dd Makia> Previously we had attempted the same tests with IBGold and got the Makia> following: Makia> ~150MB/s per lun with 1 sgp_dd Makia> ~600MB/s per lun with 4 sgp_dd Were these tests with the same kernels otherwise? If not, there may be unrelated changes to the SCSI stack that affect synthetic benchmarks like this. (I seem to remember a change in the not too distant past that affected the largest IO it is possible to submit through the SG interface). Makia> To achieve the results in IBGold, we were able to set the Makia> srp module option "max_xfer_sectors_per_io=4096", but can't Makia> seem to find an equivalent option in the OFED SRP drivers. When connecting to the target (the echo to the add_target file), you can add ",max_sect=4096" to the string you pass in. Makia> By default, we found (via stats from the DDN) that we were Makia> only seeing reads and writes in the 0-32Kbyte range. Makia> Comparing IBGold and OFED, we found that the Makia> srp_sg_tablesize defaulted to 256, but in OFED it defaulted Makia> to 12. So, changing this (via modprobe.conf) to 256 in Makia> OFED, we were able to see reads and writes in the 128Kbyte Makia> range (which is what ultimately got us to the performance Makia> above). I also noticed that there is a max_sects option Makia> you can pass to add_target (in the SRP /sys entries) which Makia> seemed to be the same idea as srp_sg_tablesize, but this Makia> didn't seem to affect anything. It is "max_sect" not "max_sects" (no final 's'). Anyway, what do you mean that it didn't affect anything? max_sect=4096 should theoretically get you up to 512 KB IOs. - R. From rdreier at cisco.com Thu Aug 31 12:36:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 31 Aug 2006 12:36:45 -0700 Subject: [openib-general] [PATCH] 2.6.19 cma: fix typo In-Reply-To: <000001c6cd1d$7c7dc240$e598070a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 31 Aug 2006 09:49:55 -0700") References: <000001c6cd1d$7c7dc240$e598070a@amr.corp.intel.com> Message-ID: This was already fixed by the iWARP merge patches (which I'll push out shortly). So I'll drop this patch... From mst at mellanox.co.il Thu Aug 31 12:37:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 Aug 2006 22:37:07 +0300 Subject: [openib-general] [PATCH] cma: protect against adding device during destruction In-Reply-To: <000101c6cd1f$641c14c0$e598070a@amr.corp.intel.com> References: <000101c6cd1f$641c14c0$e598070a@amr.corp.intel.com> Message-ID: <20060831193707.GA3859@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH] cma: protect against adding device during destruction > > Can you see if this patch helps any? 
> > This closes a window where address resolution can attach an rdma_cm_id > to a device during destruction of the rdma_cm_id. This can result in > the rdma_cm_id remaining in the device list after its memory has been > freed. > > Signed-off-by: Sean Hefty I'll test some, but the problem hasn't reappeared since. The patch looks right, I'd say push it for 2.6.18. -- MST From tzachid at mellanox.co.il Thu Aug 31 12:45:00 2006 From: tzachid at mellanox.co.il (Tzachi Dar) Date: Thu, 31 Aug 2006 22:45:00 +0300 Subject: [openib-general] [Openib-windows] File transfer performance options Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302CDC979@mtlexch01.mtl.com> There is one thing missing from your mail: whether you want to see the Windows machine as a file server (for example SAMBA, NFS, SRP), or whether you are ready to accept it as a normal server. The big difference is that with the second option the server can run in user mode (for example, an FTP server). When the server application is running in user mode, SDP can be used as a socket provider. This means that theoretically every socket application should run and enjoy the speed of InfiniBand. Currently there are two SDP projects under development, one for Linux and one for Windows, so SDP can be used to let machines of both types connect. Performance that we have measured on the Windows platform, using DDR cards, was more than 1200 MB/sec (of course, this data was served from host memory, not from disks). So, if all you need to do is pass files from one side to the other, I would recommend that you check this option. One note about your experiments: when using RAM disks, there is probably one more copy, from the RAM disk to the application buffer. A real disk has its own DMA engine, while a RAM disk doesn't. Another copy is probably not a problem when you are talking about 100MB/sec, but it would become a problem once you use SDP (I hope). Thanks Tzachi We've been testing an application that archives large quantities of data from a Linux system onto a Windows-based server (64-bit Server 2003 R2). As part of the investigation into relatively modest transfer speeds in the win-linux configuration, we configured a Linux-Linux transfer via IPoIB with NFS layered on top (with RAM disks to avoid physical disk issues) [Whilst for a real Linux-Linux configuration I would look for the NFS over RDMA solution, this wouldn't translate to our eventual win-linux inter-operable system.] I was surprised that even on linux-linux I hit a wall of 100MB/s (test notes below). Are others doing better? I was hoping for 150MB/s - 200MB/s. Does anyone have any hints on tweaking an IPoIB/NFS solution to get better throughput for large files (not so concerned about latency)? Are there any other inter-operable windows-linux solutions now? (cross-platform NFS over RDMA or SRP initiator/target?) 
From minich at ornl.gov Thu Aug 31 12:48:08 2006 From: minich at ornl.gov (Makia Minich) Date: Thu, 31 Aug 2006 15:48:08 -0400 Subject: [openib-general] Srp question In-Reply-To: Message-ID: On 8/31/06 3:34 PM, "Roland Dreier" wrote: > Makia> We are attempting to do some performance testing of the SRP driver > (with a > Makia> DDN target) and are seeing some poor results: > > Makia> ~120MB/s per lun with 1 sgp_dd > Makia> ~80MB/s per lun with 4 sgp_dd > > Makia> Previously we had attempted the same tests with IBGold and got the > Makia> following: > > Makia> ~150MB/s per lun with 1 sgp_dd > Makia> ~600MB/s per lun with 4 sgp_dd > > Were these tests with the same kernels otherwise? If not, there may > be unrelated changes to the SCSI stack that affect synthetic > benchmarks like this. (I seem to remember a change in the not too > distant past that affected the largest IO it is possible to submit > through the SG interface). The kernels in question were 2.6.9.22.0.2 and 2.6.9-34.EL. I'll have to find some changelogs to see if there were changes to the SCSI stack. > Makia> To achieve the results in IBGold, we were able to set the > Makia> srp module option "max_xfer_sectors_per_io=4096", but can't > Makia> seem to find an equivalent option in the OFED SRP drivers. > > When connecting to the target (the echo to the add_target file), you > can add ",max_sect=4096" to the string you pass in. We did add this (pardon my typo of max_sects below). I did find that by appending the max_sect= at the end of the line we were seeing strange behaviour (it seemed that the parser added a newline for no reason) and the only fix was to put it at the beginning of the line. > Makia> By default, we found (via stats from the DDN) that we were > Makia> only seeing reads and writes in the 0-32Kbyte range. > Makia> Comparing IBGold and OFED, we found that the > Makia> srp_sg_tablesize defaulted to 256, but in OFED it defaulted > Makia> to 12. So, changing this (via modprobe.conf) to 256 in > Makia> OFED, we were able to see reads and writes in the 128Kbyte > Makia> range (which is what ultimately got us to the performance > Makia> above). I also noticed that there is a max_sects option > Makia> you can pass to add_target (in the SRP /sys entries) which > Makia> seemed to be the same idea as srp_sg_tablesize, but this > Makia> didn't seem to affect anything. > > It is "max_sect" not "max_sects" (no final 's'). Anyway, what do you > mean that it didn't affect anything? max_sect=4096 should > theoretically get you up to 512 KB IOs. Sorry again about the typo (I should never attempt to work off of memory). With max_sect=4096 and srp_sg_tablesize set to 256, we are now seeing 512KB IOs. The new question is: is there a way to get this to 1M IOs? > - R.
-- Makia Minich National Center for Computation Science Oak Ridge National Laboratory From rdreier at cisco.com Thu Aug 31 17:29:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 31 Aug 2006 17:29:09 -0700 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus to get a fix for a locking bug found by lockdep: Roland Dreier: IB/mthca: Use IRQ safe locks to protect allocation bitmaps drivers/infiniband/hw/mthca/mthca_allocator.c | 15 +++++++++++---- 1 files changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_allocator.c b/drivers/infiniband/hw/mthca/mthca_allocator.c index 25157f5..f930e55 100644 --- a/drivers/infiniband/hw/mthca/mthca_allocator.c +++ b/drivers/infiniband/hw/mthca/mthca_allocator.c @@ -41,9 +41,11 @@ #include "mthca_dev.h" /* Trivial bitmap-based allocator */ u32 mthca_alloc(struct mthca_alloc *alloc) { + unsigned long flags; u32 obj; - spin_lock(&alloc->lock); + spin_lock_irqsave(&alloc->lock, flags); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); if (obj >= alloc->max) { alloc->top = (alloc->top + alloc->max) & alloc->mask; @@ -56,19 +58,24 @@ u32 mthca_alloc(struct mthca_alloc *allo } else obj = -1; - spin_unlock(&alloc->lock); + spin_unlock_irqrestore(&alloc->lock, flags); return obj; } void mthca_free(struct mthca_alloc *alloc, u32 obj) { + unsigned long flags; + obj &= alloc->max - 1; - spin_lock(&alloc->lock); + + spin_lock_irqsave(&alloc->lock, flags); + clear_bit(obj, alloc->table); alloc->last = min(alloc->last, obj); alloc->top = (alloc->top + alloc->max) & alloc->mask; - spin_unlock(&alloc->lock); + + spin_unlock_irqrestore(&alloc->lock, flags); } int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, From rdreier at cisco.com Thu Aug 31 17:29:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 31 Aug 2006 17:29:56 -0700 Subject: [openib-general] lockdep warnings In-Reply-To: <20060831131001.GB1006@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 31 Aug 2006 16:10:01 +0300") References: <20060831131001.GB1006@mellanox.co.il> Message-ID: Michael> Hi, Roland! I got a load of lockdep warnings after Michael> loading all modules and configuring ipoib. This doesn't Michael> usually happen, not sure what I changed this time. I'm a Michael> bit too busy this week - could you take a look at the Michael> log, please? This would only happen on a mem-full HCA. I just asked Linus to pull the following fix. commit 02113bd77e86386d02a9a606cdad53803a6e2794 Author: Roland Dreier Date: Thu Aug 31 16:43:06 2006 -0700 IB/mthca: Use IRQ safe locks to protect allocation bitmaps It is supposed to be OK to call mthca_create_ah() and mthca_destroy_ah() from any context. However, for mem-full HCAs, these functions use the mthca_alloc() and mthca_free() bitmap helpers, and those helpers use non-IRQ-safe spin_lock() internally. Lockdep correctly warns that this could lead to a deadlock. Fix this by changing mthca_alloc() and mthca_free() to use spin_lock_irqsave(). 
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_allocator.c b/drivers/infiniband/hw/mthca/mthca_allocator.c index 25157f5..f930e55 100644 --- a/drivers/infiniband/hw/mthca/mthca_allocator.c +++ b/drivers/infiniband/hw/mthca/mthca_allocator.c @@ -41,9 +41,11 @@ #include "mthca_dev.h" /* Trivial bitmap-based allocator */ u32 mthca_alloc(struct mthca_alloc *alloc) { + unsigned long flags; u32 obj; - spin_lock(&alloc->lock); + spin_lock_irqsave(&alloc->lock, flags); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); if (obj >= alloc->max) { alloc->top = (alloc->top + alloc->max) & alloc->mask; @@ -56,19 +58,24 @@ u32 mthca_alloc(struct mthca_alloc *allo } else obj = -1; - spin_unlock(&alloc->lock); + spin_unlock_irqrestore(&alloc->lock, flags); return obj; } void mthca_free(struct mthca_alloc *alloc, u32 obj) { + unsigned long flags; + obj &= alloc->max - 1; - spin_lock(&alloc->lock); + + spin_lock_irqsave(&alloc->lock, flags); + clear_bit(obj, alloc->table); alloc->last = min(alloc->last, obj); alloc->top = (alloc->top + alloc->max) & alloc->mask; - spin_unlock(&alloc->lock); + + spin_unlock_irqrestore(&alloc->lock, flags); } int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, From rdreier at cisco.com Thu Aug 31 17:32:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 31 Aug 2006 17:32:07 -0700 Subject: [openib-general] Srp question In-Reply-To: (Makia Minich's message of "Thu, 31 Aug 2006 15:48:08 -0400") References: Message-ID: Makia> We did add this (pardon my typo of max_sects below). I did Makia> find that by appending the max_sect= at the end of the line Makia> we were seeing strange behaviour (it seemed that the parser Makia> added a newline for no reason) and the only fix was to put Makia> it at the beginning of the line. Actually it's probably echo adding the newline. You can use "echo -n" to work around this, or just put the max_sect at the beginning of the line. Makia> Sorry again about the type (I should never attempt to work Makia> off of memory). With the max_sect=4096, and the Makia> srp_sg_tablesize to 256, we are now seeing 512KB IOs. The Makia> new question is is there a way to get this to 1M IOs? Don't know... do you get that with the old IBgold SRP initiator? - R. From sweitzen at cisco.com Thu Aug 31 18:30:35 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 31 Aug 2006 18:30:35 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc3 is ready Message-ID: RC3 includes a bunch of binary RPMS, please remove for RC4. Look at the size of the RC3 tarball vs previous ones: $ ls -s | more total 290848 46512 OFED-1.1-rc1.tgz 0 OFED-1.1-rc1.tgz.md5sum 47048 OFED-1.1-rc2.tgz 0 OFED-1.1-rc2.tgz.md5sum 197288 OFED-1.1-rc3.tgz 0 OFED-1.1-rc3.tgz.md5sum Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of > Tziporet Koren > Sent: Thursday, August 31, 2006 9:24 AM > To: EWG > Cc: OPENIB > Subject: [openfabrics-ewg] OFED 1.1-rc3 is ready > > Hi, > > OFED 1.1-RC3 is available on > https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > File: OFED-1.1-rc3.tgz > Please report any issues in bugzilla http://openib.org/bugzilla/ > > Schedule reminder: > ================== > Next milestones: > RC4 is planned for 7-Sep. It should include critical bug fixes only. > Final release will be on 11 or 12 Sep. 
> > Owners - please update release notes for RC4. > > Tziporet & Vlad > -------------------------------------------------------------- > ----------------------- > > Release details: > ================ > Build_id: > OFED-1.1-rc3 > > openib-1.1 (REV=9203) > # User space > https://openib.org/svn/gen2/branches/1.1/src/userspace > Git: > ref: refs/heads/ofed_1_1 > commit 338e942a4ae10d62f2632e6292f85bb1b15d154c > > # MPI > mpi_osu-0.9.7-mlx2.2.0.tgz > openmpi-1.1.1-1.src.rpm > mpitests-2.0-0.src.rpm > > > OS support: > =========== > Novell: > - SLES 9.0 SP3 > - SLES10 > Redhat: > - Redhat EL4 up3 > - Redhat EL4 up4 > kernel.org: > - Kernel 2.6.17 > > Note: Redhat EL4 up2, Fedora C4 and SuSE Pro 10 were dropped > from the list. > We keep the backport patches for these OSes and make sure > OFED compile and > loaded properly but will not do full QA cycle. > > Systems: > ======== > * x86_64 > * x86 > * ia64 > * ppc64 > > Main changes from OFED-1.1-rc2: > =============================== > 1. Added ehca (IBM) driver. This driver can be compiled on > kernel 2.6.18 > only > 3. Open MPI version update to openmpi-1.1.1-1 > 4. Core: Huge pages registration is supported > 5. IPoIB high availability script supports multicast groups > 6. RHEL4 up4 is now supported > 7. SDP: fixed connection refused problem; get peer name working > 8. libsdp: several bug fixes > > Limitations and known issues: > ============================= > 1. SDP: For Mellanox Sinai HCAs one must use latest FW > version (1.1.000). > 2. SDP: Scalability issue when many connections are opened > 3. SDP: If RTU packet is lost Accept call blocks even if > client connected. > 4. ipath driver is not supported on SLES9 SP3 > 5. Compilation on kernel 2.6.18-rc5 is failing - to be fixed in RC4 > > > Missing features that should be completed for RC4: > ================================================== > None > > > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > From cganapathi at novell.com Thu Aug 31 21:35:52 2006 From: cganapathi at novell.com (CH Ganapathi) Date: Thu, 31 Aug 2006 22:35:52 -0600 Subject: [openib-general] [PATCH] IB/perftest: Fix get_median, size of delta, usage(), worst latency Message-ID: <44F80602.6C2D.007B.0@novell.com> >> "Michael S. Tsirkin" 8/31/06 2:59 PM >>> > 3) Worst latency is delta[iters - 2] in read_lat.c, not delta[iters - 3]. >> could you explain this last bit please? Since delta has (iters - 1) elements, its index range is 0 to (iters - 2). After sorting, delta[iters - 2] is the maximum. Regards, Ganapathi. From devesh28 at gmail.com Thu Aug 31 22:02:10 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Fri, 1 Sep 2006 10:32:10 +0530 Subject: [openib-general] single rkey In-Reply-To: References: Message-ID: <309a667c0608312202w5e0d6d8ek2668c1c182363f5c@mail.gmail.com> On 8/31/06, yipee wrote: > Hi, > > Is it possible for several memory registrations (using ibv_reg_mr) to have a > single rkey? > Can I add memory registrations to a previous rkey? No, this is not possible. In a single memory registration call you can register a large buffer, but once it is registered with the NIC you cannot make any modifications to it, and hence multiple registrations cannot share the same R_Key.
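A minimal libibverbs sketch of what Devesh describes (error handling omitted, device choice and buffer sizes arbitrary): every ibv_reg_mr() call returns its own struct ibv_mr, and therefore its own lkey/rkey; there is no verb that folds an additional buffer into an existing rkey.

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
        struct ibv_device **dev_list = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(dev_list[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                     IBV_ACCESS_REMOTE_WRITE;
        void *buf1 = malloc(4096), *buf2 = malloc(4096);

        /* Two registrations on the same PD: each gets its own MR and rkey */
        struct ibv_mr *mr1 = ibv_reg_mr(pd, buf1, 4096, access);
        struct ibv_mr *mr2 = ibv_reg_mr(pd, buf2, 4096, access);

        printf("rkey1=0x%x rkey2=0x%x\n", mr1->rkey, mr2->rkey);

        ibv_dereg_mr(mr2);
        ibv_dereg_mr(mr1);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(dev_list);
        return 0;
}

The only way to cover more memory under one key with these verbs is to register a single larger region up front, which is what Devesh is pointing at.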
> > > thanks, > y > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From tziporet at mellanox.co.il Thu Aug 31 22:17:15 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Fri, 1 Sep 2006 08:17:15 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.1-rc3 is ready Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA780C@mtlexch01.mtl.com> Hi Scott, This was my mistake (I tgz both binary RPMs and not just the source RMPs). I fixed this (removed the binary RPMs). All the rest was not touched. Tziporet -----Original Message----- From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Scott Weitzenkamp (sweitzen) Sent: Friday, September 01, 2006 4:31 AM To: Tziporet Koren; EWG Cc: OPENIB Subject: Re: [openfabrics-ewg] OFED 1.1-rc3 is ready RC3 includes a bunch of binary RPMS, please remove for RC4. Look at the size of the RC3 tarball vs previous ones: $ ls -s | more total 290848 46512 OFED-1.1-rc1.tgz 0 OFED-1.1-rc1.tgz.md5sum 47048 OFED-1.1-rc2.tgz 0 OFED-1.1-rc2.tgz.md5sum 197288 OFED-1.1-rc3.tgz 0 OFED-1.1-rc3.tgz.md5sum Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of > Tziporet Koren > Sent: Thursday, August 31, 2006 9:24 AM > To: EWG > Cc: OPENIB > Subject: [openfabrics-ewg] OFED 1.1-rc3 is ready > > Hi, > > OFED 1.1-RC3 is available on > https://openib.org/svn/gen2/branches/1.1/ofed/releases/ > File: OFED-1.1-rc3.tgz > Please report any issues in bugzilla http://openib.org/bugzilla/ > > Schedule reminder: > ================== > Next milestones: > RC4 is planned for 7-Sep. It should include critical bug fixes only. > Final release will be on 11 or 12 Sep. > > Owners - please update release notes for RC4. > > Tziporet & Vlad > -------------------------------------------------------------- > ----------------------- > > Release details: > ================ > Build_id: > OFED-1.1-rc3 > > openib-1.1 (REV=9203) > # User space > https://openib.org/svn/gen2/branches/1.1/src/userspace > Git: > ref: refs/heads/ofed_1_1 > commit 338e942a4ae10d62f2632e6292f85bb1b15d154c > > # MPI > mpi_osu-0.9.7-mlx2.2.0.tgz > openmpi-1.1.1-1.src.rpm > mpitests-2.0-0.src.rpm > > > OS support: > =========== > Novell: > - SLES 9.0 SP3 > - SLES10 > Redhat: > - Redhat EL4 up3 > - Redhat EL4 up4 > kernel.org: > - Kernel 2.6.17 > > Note: Redhat EL4 up2, Fedora C4 and SuSE Pro 10 were dropped > from the list. > We keep the backport patches for these OSes and make sure > OFED compile and > loaded properly but will not do full QA cycle. > > Systems: > ======== > * x86_64 > * x86 > * ia64 > * ppc64 > > Main changes from OFED-1.1-rc2: > =============================== > 1. Added ehca (IBM) driver. This driver can be compiled on > kernel 2.6.18 > only > 3. Open MPI version update to openmpi-1.1.1-1 > 4. Core: Huge pages registration is supported > 5. IPoIB high availability script supports multicast groups > 6. RHEL4 up4 is now supported > 7. SDP: fixed connection refused problem; get peer name working > 8. libsdp: several bug fixes > > Limitations and known issues: > ============================= > 1. SDP: For Mellanox Sinai HCAs one must use latest FW > version (1.1.000). > 2. 
SDP: Scalability issue when many connections are opened > 3. SDP: If RTU packet is lost Accept call blocks even if > client connected. > 4. ipath driver is not supported on SLES9 SP3 > 5. Compilation on kernel 2.6.18-rc5 is failing - to be fixed in RC4 > > > Missing features that should be completed for RC4: > ================================================== > None > > > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > _______________________________________________ openfabrics-ewg mailing list openfabrics-ewg at openib.org http://openib.org/mailman/listinfo/openfabrics-ewg