From ido at mellanox.co.il Wed Dec 1 07:29:54 2004 From: ido at mellanox.co.il (Ido Bukspan) Date: Wed, 1 Dec 2004 17:29:54 +0200 Subject: [openib-general] Fix for possible oops when calling ipoib_neigh_destructor Message-ID: <91DB792C7985D411BEC300B40080D29C711BC2@mtvex01.mtv.mtl.com> I have implemented code in order to fix the oops problem, and it seems like it works fine. I still don't have a patch but here is the pseudo code I used. Please notice that I used a dynamic list (can be organized in an rb tree as well, as Roland suggested) of the current PATHs which exist in the driver (I use it also for the unicast ARP). I'm on vacation starting tomorrow but if you want I can work on a patch when I will come back (two weeks from now). Thanks -Ido The next function will be called from : ipoib_cleanup_module(void) { spin_lock_irq(&priv->lock); if (list_empty(&priv->path_list)) { // Meaning that the destructor already destroyed all the entries // Nothing to do spin_unlock_irq(&priv->lock); return; } spin_unlock_irq(&priv->lock); while (TRUE) { // We will keep checking until the list is empty. spin_lock_irq(&priv->lock); if (list_empty(&priv->path_list)) { //list empty spin_unlock_irq(&priv->lock); break; } else { fpath = path_list->next // We keep a pointer to the first neighbour for future use list_for_each_entry_safe(path, tpath, &priv->path_list, list) { // We are cloning all the neighbours so no destructor could be called // While we are in the process // I know the following is not well protected but I didn't find something better. path->neighbour = neigh_clone(path->neighbour); if (&path->neighbour->refcnt > 1){ // we check if we are the only one that hold if. // If not we will delete this AH. list_del(&path->list); list_add_tail(&path->list, &remove_list); } else { // If so the kernel already start the destructor process and we are backing off. neigh_release(path->neighbour); } } spin_unlock_irq(&priv->lock); } } // Set null pointer to the destructor function so Future called will be called to NULL. fpath->neighbour->ops->destructor = NULL; // Destroy the AHs list_for_each_entry_safe(path, tpath, &remove_list, list) { ib_address_destroy(path->ah); //In case that the module is reloaded, we erase the path pointer memset(path->neighbour->ha + 24, 0, 8); neigh_release(path->neighbour); // For the cloning kfree(path); } } Ido Bukspan Mellanox Technologies Ltd. Phone : (972)-3-6259500 ,Ext 518. Fax : (972)-3-5614943 mailto:ido at mellanox.co.il http://www.mellanox.com No play No game From halr at voltaire.com Wed Dec 1 07:27:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Dec 2004 10:27:24 -0500 Subject: [Fwd: [openib-general] ib_mthca can't be removed in some error case] Message-ID: <1101914844.6411.606.camel@localhost.localdomain> A few more messages came out and then the module removal ultimately timed out and completed. Not sure I trust the state of things after this though... 
Dec 1 09:35:17 localhost kernel: ib_mthca 0000:02:00.0: HW2SW_CQ failed (-16) Dec 1 09:35:17 localhost kernel: ib_mthca 0000:02:00.0: HW2SW_CQ failed (-16) -- Hal --Forwarded Message-- From: Hal Rosenstock To: Roland Dreier Cc: openib-general at openib.org Subject: [openib-general] ib_mthca can't be removed in some error case Date: 01 Dec 2004 09:34:55 -0500 Hi, When I /sbin/modprobe -r ib_mthca (after having umad and IPoIB up and then removed and a switch and SM come and go), the removal hangs up as follows: 29836 pts/5 D 0:00 /sbin/modprobe -r ib_mthca uninterruptible sleep I get the following error in the system logs: Dec 1 09:31:17 localhost kernel: ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16) Dec 1 09:32:17 localhost kernel: ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16) -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Wed Dec 1 07:53:46 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Dec 2004 07:53:46 -0800 Subject: [openib-general] testing OpenIB on TopSpin hardware ? In-Reply-To: <200411301718.SAA08841@styx.bruyeres.cea.fr> (Philippe Gregoire's message of "Tue, 30 Nov 2004 18:18:19 +0100") References: <200411301718.SAA08841@styx.bruyeres.cea.fr> Message-ID: <52hdn6ovzp.fsf@topspin.com> Philippe> Hello, I would like to test the latest OpenIB software Philippe> on our test platform, specially the SDP part. We have a Philippe> HP-DL380/DL360 (IA32) cluster with 12 nodes connected Philippe> through TopSpin HCA and TopSpin 90 IB switch. I got the Philippe> OpenIB source with svn. What is the latest version , Philippe> gen1 or 1.0 ? 1.0 looks like the version available in Philippe> march, correct ? What is the firmware requirement for Philippe> the HCA and the switch ? The latest OpenIB software is the subversion tree at https://openib.org/svn/gen2/trunk/ However this tree has support for IPoIB only (no SDP or any other ULPs). The gen1 tree does have SDP but has not been touched in some time. Any firmware should work on HCA and switch, but the newer firmware will probably be more stable. - Roland From roland at topspin.com Wed Dec 1 07:54:37 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Dec 2004 07:54:37 -0800 Subject: [Fwd: [openib-general] ib_mthca can't be removed in some error case] In-Reply-To: <1101914844.6411.606.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 01 Dec 2004 10:27:24 -0500") References: <1101914844.6411.606.camel@localhost.localdomain> Message-ID: <52d5xuovya.fsf@topspin.com> Hal> A few more messages came out and then the module removal Hal> ultimately timed out and completed. Not sure I trust the Hal> state of things after this though... Looks like you managed to crash the HCA (firmware commands timed out). - Roland From halr at voltaire.com Wed Dec 1 07:59:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Dec 2004 10:59:56 -0500 Subject: [Fwd: [openib-general] ib_mthca can't be removed in some error case] In-Reply-To: <52d5xuovya.fsf@topspin.com> References: <1101914844.6411.606.camel@localhost.localdomain> <52d5xuovya.fsf@topspin.com> Message-ID: <1101916796.4124.1.camel@localhost.localdomain> On Wed, 2004-12-01 at 10:54, Roland Dreier wrote: > Hal> A few more messages came out and then the module removal > Hal> ultimately timed out and completed. 
Not sure I trust the > Hal> state of things after this though... > > Looks like you managed to crash the HCA (firmware commands timed out). There was a power glitch. Perhaps it reset the HCA (and it is more sensitive to these things) but the system rode this out (and did not reset). That's another possible explanation. -- Hal From roland at topspin.com Wed Dec 1 09:14:50 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Dec 2004 09:14:50 -0800 Subject: [openib-general] Fix for possible oops when calling ipoib_neigh_destructor In-Reply-To: <91DB792C7985D411BEC300B40080D29C711BC2@mtvex01.mtv.mtl.com> (Ido Bukspan's message of "Wed, 1 Dec 2004 17:29:54 +0200") References: <91DB792C7985D411BEC300B40080D29C711BC2@mtvex01.mtv.mtl.com> Message-ID: <521xeaos8l.fsf@topspin.com> Thanks for investigating this. I have a slightly different scheme in mind (which resolves both the unicast ARP and neighbour destructor problems). I'll implement it in the next couple of days (before you get back from vacation, I guess). Thanks, Roland From roland at topspin.com Wed Dec 1 09:32:50 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Dec 2004 09:32:50 -0800 Subject: [openib-general] [PATCH] Move IPoIB to use LockLess TX Message-ID: <52wtw1orel.fsf@topspin.com> This changes IPoIB's locking scheme to use the new NETIF_F_LLTX scheme. It adds about 2-3 % to throughput in my netpipe tests. - R. Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1304) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -204,7 +204,7 @@ kfree(path); } -static int path_rec_start(struct sk_buff *skb, struct net_device *dev) +static void path_rec_start(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); @@ -244,23 +244,23 @@ path->neighbour = skb->dst->neighbour; *to_ipoib_path(skb->dst->neighbour) = path; - return 0; + return; err: kfree(path); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); - - return 0; } -static int path_lookup(struct sk_buff *skb, struct net_device *dev) +static void path_lookup(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(skb->dev); /* Look up path record for unicasts */ - if (skb->dst->neighbour->ha[4] != 0xff) - return path_rec_start(skb, dev); + if (skb->dst->neighbour->ha[4] != 0xff) { + path_rec_start(skb, dev); + return; + } /* Add in the P_Key */ skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; @@ -268,7 +268,6 @@ ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); - return 0; } static void unicast_arp_completion(int status, @@ -336,8 +335,8 @@ * still go through (since we'll get the new path from the SM for * these queries) so we'll never update the neighbour. 
*/ -static int unicast_arp_start(struct sk_buff *skb, struct net_device *dev, - struct ipoib_pseudoheader *phdr) +static void unicast_arp_start(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *tmp_skb; @@ -352,7 +351,7 @@ dev_kfree_skb_any(tmp_skb); if (!skb) { ++priv->stats.tx_dropped; - return 0; + return; } } @@ -381,25 +380,32 @@ ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); } - - return 0; } static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path; + unsigned long flags; + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + if (skb->dst && skb->dst->neighbour) { - if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) - return path_lookup(skb, dev); + if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } path = *to_ipoib_path(skb->dst->neighbour); if (likely(path->ah)) { ipoib_send(dev, skb, path->ah, be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); - return 0; + goto out; } if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) @@ -417,8 +423,7 @@ phdr->hwaddr[9] = priv->pkey & 0xff; ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); - } - else { + } else { /* unicast GID -- ARP reply?? */ /* @@ -429,7 +434,7 @@ if (skb->destructor == unicast_arp_finish) { ipoib_send(dev, skb, *(struct ipoib_ah **) skb->cb, be32_to_cpup((u32 *) phdr->hwaddr)); - return 0; + goto out; } if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { @@ -441,22 +446,25 @@ IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); dev_kfree_skb_any(skb); ++priv->stats.tx_dropped; - return 0; + goto out; } /* put the pseudoheader back on */ skb_push(skb, sizeof *phdr); - return unicast_arp_start(skb, dev, phdr); + unicast_arp_start(skb, dev, phdr); } } - return 0; + goto out; err: ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); - return 0; +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; } struct net_device_stats *ipoib_get_stats(struct net_device *dev) @@ -641,7 +649,7 @@ dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; - dev->features = NETIF_F_VLAN_CHALLENGED; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; /* MTU will be reset when mcast join happens */ dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; @@ -656,6 +664,7 @@ priv->dev = dev; spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); init_MUTEX(&priv->mcast_mutex); init_MUTEX(&priv->vlan_mutex); Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1304) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -130,6 +130,7 @@ struct ipoib_buf *rx_ring; + spinlock_t tx_lock; struct ipoib_buf *tx_ring; unsigned tx_head; unsigned tx_tail; Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1304) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -70,14 +70,15 @@ unsigned long flags; - spin_lock_irqsave(&priv->lock, flags); if (ah->last_send <= priv->tx_tail) { ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); ib_destroy_ah(ah->ah); kfree(ah); - } else + } else { + spin_lock_irqsave(&priv->lock, flags); list_add_tail(&ah->list, &priv->dead_ahs); - 
spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock_irqrestore(&priv->lock, flags); + } } static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, @@ -235,11 +236,11 @@ dev_kfree_skb_any(tx_req->skb); - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) netif_wake_queue(dev); - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock_irqrestore(&priv->tx_lock, flags); if (wc->status != IB_WC_SUCCESS && wc->status != IB_WC_WR_FLUSH_ERR) @@ -338,19 +339,15 @@ ++priv->stats.tx_errors; dev_kfree_skb_any(skb); } else { - unsigned long flags; - dev->trans_start = jiffies; address->last_send = priv->tx_head; ++priv->tx_head; - spin_lock_irqsave(&priv->lock, flags); if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); } - spin_unlock_irqrestore(&priv->lock, flags); } } From mshefty at ichips.intel.com Wed Dec 1 10:45:15 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 01 Dec 2004 10:45:15 -0800 Subject: [openib-general] smpdump and current MAD layer In-Reply-To: <1101854788.6411.373.camel@localhost.localdomain> References: <1101838548.6411.276.camel@localhost.localdomain> <41ACC233.2000201@ichips.intel.com> <1101843199.6411.288.camel@localhost.localdomain> <41ACCE63.5050800@ichips.intel.com> <1101854788.6411.373.camel@localhost.localdomain> Message-ID: <41AE113B.6070902@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-11-30 at 14:47, Sean Hefty wrote: > >>I guess filtering can be done above the MAD layer, so just letting the >>user specify the qp_type may be all that's needed, beyond indicating >>that snooping is desired. If we go this route, we can probably support >>any number of snoopers. > > > Does that mean the snoopers would just be a list based on qp_type (and > we have a list per QP type (SMI, GSI)) ? This was my first thought... Also, looking ahead at future changes that will be going into the code (RMPP, CM), plus a debugging issues that I'm hitting when loading the drivers, I think it makes sense to add in snooping sooner rather than later. This is something that I can start, since it would be useful to me as a debugging tool today. - Sean To: Hal Rosenstock Subject: Re: [openib-general] smpdump and current MAD layer References: <1101838548.6411.276.camel at localhost.localdomain> In-Reply-To: <1101838548.6411.276.camel at localhost.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.31 (www . roaringpenguin . com / mimedefang) X-Spam-Checker-Version: SpamAssassin 2.64 (2004-01-11) on openib.ca.sandia.gov X-Spam-Level: X-Spam-Status: No, hits=0.0 required=5.0 tests=none autolearn=no version=2.64 Cc: openib-general at openib.org X-BeenThere: openib-general at openib.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: OpenIB General Mailing List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 02 Dec 2004 18:04:13 -0000 Hal Rosenstock wrote: > So solicited MAD responses cannot currently be snooped nor can > unsolicited ones for which an agent is registered (Since SMA and PMA are > currently firmware based, the latter is not an issue for the current > implementation). I've gotten a start on adding in the snooping support. 
It was a little more difficult than I first thought to support multiple snoop clients, because of a race condition deregistering clients while snooping. Question on where to perform the snoop callback on the send side: should it be done in the completion handling code, immediately before the MAD is posted to the QP, or somewhere else? I'd like to place the snooping code in as few places as possible, but still be able to snoop locally processed MADs. Ideally a MAD should be snooped exactly once, which requires some extra care when handling QP errors. Snooping in the completion handling allows the MAD layer to own the thread that performs the callback. Calling clients back on the outbound path puts the callback in an arbitrary thread context. Thoughts? - Sean From halr at voltaire.com Thu Dec 2 12:05:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Dec 2004 15:05:39 -0500 Subject: [openib-general] smpdump and current MAD layer In-Reply-To: <41AF559C.7090700@ichips.intel.com> References: <1101838548.6411.276.camel@localhost.localdomain> <41AF559C.7090700@ichips.intel.com> Message-ID: <1102017939.4179.10.camel@localhost.localdomain> On Thu, 2004-12-02 at 12:49, Sean Hefty wrote: > Question on where to perform the snoop callback on the send side: > should it be done in the completion handling code, immediately before > the MAD is posted to the QP, or somewhere else? > > I'd like to place the snooping code in as few places as possible, but > still be able to snoop locally processed MADs. Ideally a MAD should be > snooped exactly once, which requires some extra care when handling QP > errors. Snooping in the completion handling allows the MAD layer to > own the thread that performs the callback. Calling clients back on the > outbound path puts the callback in an arbitrary thread context. > > Thoughts? While there are arguments for snooping when the MAD is posted to the QP, I think it makes more sense to snoop it when it completes. In addition to the thread reason you cite, one could also pass the completion status which would give more information on what really happened (which might be useful). I don't see a need to instrument this more than once on the send side either. -- Hal From halr at voltaire.com Thu Dec 2 12:09:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Dec 2004 15:09:36 -0500 Subject: [openib-general] mthca Page Allocation Failures Message-ID: <1102018176.4179.15.camel@localhost.localdomain> Hi, I'm seeing this a lot more lately. Not sure when this started. It happens on first starting the driver after boot. It seems to occur when there is more going on at system startup (like fixing the disk after a previous freeze). -- Hal Dec 2 14:36:17 localhost kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) Dec 2 14:36:17 localhost kernel: ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:02:00.0) Dec 2 14:36:17 localhost kernel: modprobe: page allocation failure. 
order:6, mode:0xd0 Dec 2 14:36:17 localhost kernel: [] __alloc_pages+0x1c2/0x370 Dec 2 14:36:17 localhost kernel: [] __get_free_pages+0x1f/0x40 Dec 2 14:36:17 localhost kernel: [] dma_alloc_coherent+0xce/0x100 Dec 2 14:36:17 localhost kernel: [] mthca_alloc_sqp+0x65/0x420 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_create_qp+0x16e/0x180 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] ib_create_qp+0x22/0x80 [ib_core] Dec 2 14:36:17 localhost kernel: [] create_mad_qp+0x86/0xd0 [ib_mad] Dec 2 14:36:17 localhost kernel: [] qp_event_handler+0x0/0x30 [ib_mad] Dec 2 14:36:17 localhost kernel: [] ib_get_dma_mr+0x1e/0x50 [ib_core] Dec 2 14:36:17 localhost kernel: [] ib_mad_port_open+0x24d/0x5d0 [ib_mad] Dec 2 14:36:17 localhost kernel: [] ib_mad_init_device+0x3e/0x100 [ib_mad] Dec 2 14:36:17 localhost kernel: [] ib_cache_setup_one+0x12d/0x1d0 [ib_core] Dec 2 14:36:17 localhost kernel: [] ib_mad_init_device+0x0/0x100 [ib_mad] Dec 2 14:36:17 localhost kernel: [] ib_register_device+0x17d/0x1a0 [ib_core] Dec 2 14:36:17 localhost kernel: [] mthca_req_notify_cq+0x0/0x30 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_poll_cq+0x0/0xbb0 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_destroy_cq+0x0/0x30 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_register_device+0x15b/0x1a0 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_init_one+0x523/0x6e0 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] pci_device_probe_static+0x52/0x70 Dec 2 14:36:17 localhost kernel: [] __pci_device_probe+0x3c/0x50 Dec 2 14:36:17 localhost kernel: [] pci_device_probe+0x2c/0x50 Dec 2 14:36:17 localhost kernel: [] bus_match+0x3f/0x70 Dec 2 14:36:17 localhost kernel: [] driver_attach+0x5c/0x90 Dec 2 14:36:17 localhost kernel: [] bus_add_driver+0x91/0xb0 Dec 2 14:36:17 localhost kernel: [] driver_register+0x8c/0x90 Dec 2 14:36:17 localhost kernel: [] pci_register_driver+0x90/0xb0 Dec 2 14:36:17 localhost kernel: [] mthca_init+0xf/0x1a [ib_mthca] Dec 2 14:36:17 localhost kernel: [] sys_init_module+0x289/0x340 Dec 2 14:36:17 localhost kernel: [] sysenter_past_esp+0x52/0x71 Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't create ib_mad QP1 Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't open mthca0 port 2 Dec 2 14:36:17 localhost kernel: ib_agent: Port 1 not found Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't close mthca0 port 1 for agents Dec 2 14:36:17 localhost kernel: ib_mad: Port 1 not found Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't close mthca0 port 1 Dec 2 14:36:17 localhost kernel: ib_agent: Port 2 not found Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't close mthca0 port 2 for agents Dec 2 14:36:17 localhost kernel: ib_mad: Port 2 not found Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't close mthca0 port 2 From roland at topspin.com Thu Dec 2 13:32:34 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 02 Dec 2004 13:32:34 -0800 Subject: [openib-general] Re: mthca Page Allocation Failures In-Reply-To: <1102018176.4179.15.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 02 Dec 2004 15:09:36 -0500") References: <1102018176.4179.15.camel@localhost.localdomain> Message-ID: <52fz2ol72l.fsf@topspin.com> This patch should help. - R. 
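(A rough reading of the failure above, not something stated in the thread: "order:6" means the allocator was asked for 2^6 = 64 contiguous pages, i.e. 256 KB with 4 KB pages. The traceback shows the request coming from dma_alloc_coherent() inside mthca_alloc_sqp(), whose coherent buffer has one slot per send WQE on the special QPs, so a 2048-entry MAD send queue turns into an order-6 contiguous allocation that easily fails on a fragmented system. Dropping IB_MAD_QP_SEND_SIZE to 128, as in the patch below, shrinks that request by a factor of 16, down to a few pages.)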
Index: infiniband/core/mad_priv.h =================================================================== --- infiniband/core/mad_priv.h (revision 1304) +++ infiniband/core/mad_priv.h (working copy) @@ -68,7 +68,7 @@ #define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ /* QP and CQ parameters */ -#define IB_MAD_QP_SEND_SIZE 2048 +#define IB_MAD_QP_SEND_SIZE 128 #define IB_MAD_QP_RECV_SIZE 512 #define IB_MAD_SEND_REQ_MAX_SG 2 #define IB_MAD_RECV_REQ_MAX_SG 1 From mshefty at ichips.intel.com Thu Dec 2 14:48:25 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Dec 2004 14:48:25 -0800 Subject: [openib-general] smpdump and current MAD layer In-Reply-To: <1102017939.4179.10.camel@localhost.localdomain> References: <1101838548.6411.276.camel@localhost.localdomain> <41AF559C.7090700@ichips.intel.com> <1102017939.4179.10.camel@localhost.localdomain> Message-ID: <41AF9BB9.5040402@ichips.intel.com> Hal Rosenstock wrote: >>I'd like to place the snooping code in as few places as possible, but >>still be able to snoop locally processed MADs. Ideally a MAD should be >>snooped exactly once, which requires some extra care when handling QP >>errors. Snooping in the completion handling allows the MAD layer to >>own the thread that performs the callback. Calling clients back on the >>outbound path puts the callback in an arbitrary thread context. > > While there are arguments for snooping when the MAD is posted to the QP, > I think it makes more sense to snoop it when it completes. In addition > to the thread reason you cite, one could also pass the completion status > which would give more information on what really happened (which might > be useful). I don't see a need to instrument this more than once on the > send side either. It would be difficult to snoop all MADs without snooping in multiple locations. For example, local MADs do not generate CQ entries, and are completed from the sender's thread. (I think we may want to change the threading at some point, so it may not matter long term.) It's also not clear if a snooper would want to know if a request received a reply, snoop separate send completions in the case of RMPP, or snoop RMPP ACKs. (For RMPP debugging, I want to snoop at the lowest level.) Right now, I've tried to design the API/code to be fairly flexible about where snooping can occur (since I'm not sure where the best place to do this is), but will likely only snoop in one location (after completions). - Sean From halr at voltaire.com Thu Dec 2 20:08:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Dec 2004 23:08:58 -0500 Subject: [openib-general] [PATCH] Add vendor OUI support to MAD layer Message-ID: <1102046938.14406.45.camel@localhost.localdomain> mad: Add vendor OUI support to MAD layer Also, in ib_register_mad_agent, if registration request supplied, make sure that class supplied in request is consistent with the QP type. On sending, it is the responsibility of the client to put the OUI in the MAD when sending a vendor MAD with OUI (despite this being available in the mad_agent). This seems consistent with the way we have been doing things up to now. (This could be changed if someone strongly objects). Also, I didn't compress ib_mad_mgmt_class_table (to eliminate the vendor with OUI classes. That's 32 pointers per version (I think there are 2 versions in play right now). That's an optimization that could be made but I'm not sure it is worth it. 
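A minimal, hypothetical usage sketch of the registration interface added by the patch below -- the OUI value, handler names, and device/port variables are invented for illustration, and the ib_register_mad_agent() argument order is assumed rather than taken from this thread:

	struct ib_mad_reg_req req;
	struct ib_mad_agent *agent;

	memset(&req, 0, sizeof req);
	req.mgmt_class         = IB_MGMT_CLASS_VENDOR_RANGE2_START; /* 0x30 */
	req.mgmt_class_version = 1;
	req.oui[0] = 0x00; req.oui[1] = 0x02; req.oui[2] = 0xc9;    /* example OUI */
	set_bit(IB_MGMT_METHOD_GET, req.method_mask);

	agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, &req, 0,
				      my_send_handler, my_recv_handler, NULL);

	/* As noted above, when sending it is still up to the client to put
	 * the OUI into the wire format of each outgoing vendor MAD: */
	memcpy(((struct ib_vendor_mad *) mad)->oui, req.oui, 3);
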
Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1308) +++ include/ib_mad.h (working copy) @@ -42,6 +42,8 @@ #define IB_MGMT_CLASS_DEVICE_MGMT 0x06 #define IB_MGMT_CLASS_CM 0x07 #define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30 +#define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F /* Management methods */ #define IB_MGMT_METHOD_GET 0x01 @@ -55,7 +57,6 @@ #define IB_MGMT_METHOD_RESP 0x80 - #define IB_MGMT_MAX_METHODS 128 #define IB_QP0 0 @@ -104,6 +105,14 @@ u8 data[220]; } __attribute__ ((packed)); +struct ib_vendor_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 reserved; + u8 oui[3]; + u8 data[216]; +} __attribute__ ((packed)); + struct ib_mad_agent; struct ib_mad_send_wc; struct ib_mad_recv_wc; @@ -199,12 +208,15 @@ * receive unsolicited MADs, otherwise it should be 0. * @mgmt_class_version: Indicates which version of MADs for the given * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. * @method_mask: The caller will receive unsolicited MADs for any method * where @method_mask = 1. */ struct ib_mad_reg_req { u8 mgmt_class; u8 mgmt_class_version; + u8 oui[3]; DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); }; Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1308) +++ core/mad_priv.h (working copy) @@ -78,6 +78,9 @@ /* Registration table sizes */ #define MAX_MGMT_CLASS 80 #define MAX_MGMT_VERSION 8 +#define MAX_MGMT_OUI 8 +#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 struct ib_mad_list_head { struct list_head list; @@ -140,6 +143,20 @@ struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; }; +struct ib_mad_mgmt_vendor_class { + u8 oui[MAX_MGMT_OUI][3]; + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_OUI]; +}; + +struct ib_mad_mgmt_vendor_class_table { + struct ib_mad_mgmt_vendor_class *vendor_class[MAX_MGMT_VENDOR_RANGE2]; +}; + +struct ib_mad_mgmt_version_table { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; +}; + struct ib_mad_queue { spinlock_t lock; struct list_head list; @@ -165,7 +182,7 @@ struct ib_mr *mr; spinlock_t reg_lock; - struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; + struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; struct list_head agent_list; struct workqueue_struct *wq; struct work_struct work; Index: core/mad.c =================================================================== --- core/mad.c (revision 1308) +++ core/mad.c (working copy) @@ -144,6 +144,47 @@ } } +static int vendor_class_index(u8 mgmt_class) +{ + return mgmt_class - IB_MGMT_CLASS_VENDOR_RANGE2_START; +} + +static int is_vendor_class(u8 mgmt_class) +{ + if ((mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START) || + (mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + return 1; +} + +static int is_vendor_oui(char *oui) +{ + if (oui[0] || oui[1] || oui[2]) + return 1; + return 0; +} + +static int is_vendor_method_in_use( + struct ib_mad_mgmt_vendor_class *vendor_class, + struct ib_mad_reg_req *mad_reg_req) +{ + struct ib_mad_mgmt_method_table *method; + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) { + if (!memcmp(vendor_class->oui[i], mad_reg_req->oui, 3)) { + method = vendor_class->method_table[i]; + if (method) { + if (method_in_use(&method, mad_reg_req)) + return 1; 
+ else + break; + } + } + } + return 0; +} + /* * ib_register_mad_agent - Register to send/receive MADs */ @@ -157,61 +198,71 @@ void *context) { struct ib_mad_port_private *port_priv; - struct ib_mad_agent *ret; + struct ib_mad_agent *ret = ERR_PTR(-EINVAL); struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_reg_req *reg_req = NULL; struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; struct ib_mad_mgmt_method_table *method; int ret2, qpn; unsigned long flags; - u8 mgmt_class; + u8 mgmt_class, vclass; /* Validate parameters */ qpn = get_spl_qp_index(qp_type); - if (qpn == -1) { - ret = ERR_PTR(-EINVAL); + if (qpn == -1) goto error1; - } - if (rmpp_version) { - ret = ERR_PTR(-EINVAL); /* XXX: until RMPP implemented */ - goto error1; - } + if (rmpp_version) + goto error1; /* XXX: until RMPP implemented */ /* Validate MAD registration request if supplied */ if (mad_reg_req) { - if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) { - ret = ERR_PTR(-EINVAL); + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) goto error1; - } - if (!recv_handler) { - ret = ERR_PTR(-EINVAL); + if (!recv_handler) goto error1; - } if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { /* * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only * one in this range currently allowed */ if (mad_reg_req->mgmt_class != - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - ret = ERR_PTR(-EINVAL); + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) goto error1; - } } else if (mad_reg_req->mgmt_class == 0) { /* * Class 0 is reserved in IBA and is used for * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE */ - ret = ERR_PTR(-EINVAL); goto error1; + } else if (is_vendor_class(mad_reg_req->mgmt_class)) { + /* + * If class is in "new" vendor range, + * ensure supplied OUI is not zero + */ + if (!is_vendor_oui(mad_reg_req->oui)) + goto error1; } + /* Make sure class supplied is consistent with QP type */ + if (qp_type == IB_QPT_SMI) { + if ((mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_LID_ROUTED) && + (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } else { + if ((mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } } else { /* No registration request supplied */ - if (!send_handler) { - ret = ERR_PTR(-EINVAL); + if (!send_handler) goto error1; - } } /* Validate device and port */ @@ -258,17 +309,32 @@ * is non overlapping with any existing ones */ if (mad_reg_req) { - class = port_priv->version[mad_reg_req->mgmt_class_version]; - if (class) { - mgmt_class = convert_mgmt_class( - mad_reg_req->mgmt_class); - method = class->method_table[mgmt_class]; - if (method) { - if (method_in_use(&method, mad_reg_req)) { - ret = ERR_PTR(-EINVAL); - goto error3; + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) { + class = port_priv->version[mad_reg_req-> + mgmt_class_version].class; + if (class) { + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, + mad_reg_req)) + goto error3; } } + } else { + /* "New" vendor class range */ + vendor = port_priv->version[mad_reg_req-> + mgmt_class_version].vendor; + if (vendor) { + vclass = vendor_class_index(mgmt_class); + vendor_class = vendor->vendor_class[vclass]; + if (vendor_class) { + if (is_vendor_method_in_use( + vendor_class, + mad_reg_req)) + goto error3; + } + } } } @@ -721,6 +787,40 @@ return 0; } +static int 
check_vendor_class(struct ib_mad_mgmt_vendor_class *vendor_class) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + if (vendor_class->method_table[i]) + return 1; + return 0; +} + +static int find_vendor_oui(struct ib_mad_mgmt_vendor_class *vendor_class, + char *oui) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + /* Is there matching OUI for this vendor class ? */ + if (!memcmp(vendor_class->oui[i], oui, 3)) + return i; + + return -1; +} + +static int check_vendor_table(struct ib_mad_mgmt_vendor_class_table *vendor) +{ + int i; + + for (i = 0; i < MAX_MGMT_VENDOR_RANGE2; i++) + if (vendor->vendor_class[i]) + return 1; + + return 0; +} + static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, struct ib_mad_agent_private *agent) { @@ -734,23 +834,17 @@ } } -static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, - struct ib_mad_agent_private *priv) +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class) { - struct ib_mad_port_private *private; + struct ib_mad_port_private *port_priv; struct ib_mad_mgmt_class_table **class; struct ib_mad_mgmt_method_table **method; - int i, ret; - u8 mgmt_class; - /* Make sure MAD registration request supplied */ - if (!mad_reg_req) - return 0; - - private = priv->qp_info->port_priv; - mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); - class = &private->version[mad_reg_req->mgmt_class_version]; + port_priv = agent_priv->qp_info->port_priv; + class = &port_priv->version[mad_reg_req->mgmt_class_version].class; if (!*class) { /* Allocate management class table for "new" class version */ *class = kmalloc(sizeof **class, GFP_ATOMIC); @@ -760,9 +854,8 @@ ret = -ENOMEM; goto error1; } - /* Clear management class table for this class version */ - memset((*class)->method_table, 0, - sizeof((*class)->method_table)); + /* Clear management class table */ + memset(*class, 0, sizeof(**class)); /* Allocate method table for this management class */ method = &(*class)->method_table[mgmt_class]; if ((ret = allocate_method_table(method))) @@ -782,17 +875,17 @@ /* Finally, add in methods being registered */ for (i = find_first_bit(mad_reg_req->method_mask, - IB_MGMT_MAX_METHODS); + IB_MGMT_MAX_METHODS); i < IB_MGMT_MAX_METHODS; i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, 1+i)) { - (*method)->agent[i] = priv; + (*method)->agent[i] = agent_priv; } return 0; error3: /* Remove any methods for this mad agent */ - remove_methods_mad_agent(*method, priv); + remove_methods_mad_agent(*method, agent_priv); /* Now, check to see if there are any methods in use */ if (!check_method_table(*method)) { /* If not, release management method table */ @@ -808,11 +901,138 @@ return ret; } +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_vendor_class_table **vendor_table; + struct ib_mad_mgmt_vendor_class_table *vendor = NULL; + struct ib_mad_mgmt_vendor_class *vendor_class = NULL; + struct ib_mad_mgmt_method_table **method; + int i, ret = -ENOMEM; + u8 vclass; + + /* "New" vendor (with OUI) class */ + vclass = vendor_class_index(mad_reg_req->mgmt_class); + port_priv = agent_priv->qp_info->port_priv; + vendor_table = &port_priv->version[ + mad_reg_req->mgmt_class_version].vendor; + if (!*vendor_table) { + /* Allocate mgmt vendor class table for "new" class version */ + vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + if (!vendor) { + 
printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class_table\n"); + goto error1; + } + /* Clear management vendor class table */ + memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; + } + if (!(*vendor_table)->vendor_class[vclass]) { + /* Allocate table for this management vendor class */ + vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + if (!vendor_class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class\n"); + goto error2; + } + memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* Is there matching OUI for this vendor class ? */ + if (!memcmp((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3)) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(!*method); + goto check_in_use; + } + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* OUI slot available ? */ + if (!is_vendor_oui((*vendor_table)->vendor_class[ + vclass]->oui[i])) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(*method); + /* Allocate method table for this OUI */ + if ((ret = allocate_method_table(method))) + goto error3; + memcpy((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3); + goto check_in_use; + } + } + printk(KERN_ERR PFX "All OUI slots in use\n"); + goto error3; + +check_in_use: + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error4; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error4: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; +error3: + if (vendor_class) { + (*vendor_table)->vendor_class[vclass] = NULL; + kfree(vendor_class); + } +error2: + if (vendor) { + *vendor_table = NULL; + kfree(vendor); + } +error1: + return ret; +} + +static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *priv) +{ + int ret; + u8 mgmt_class; + + /* Make sure MAD registration request supplied */ + if (!mad_reg_req) + return 0; + + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) + ret = add_nonoui_reg_req(mad_reg_req, priv, mgmt_class); + else + ret = add_oui_reg_req(mad_reg_req, priv); + return ret; +} + static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) { struct ib_mad_port_private *port_priv; struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + int index; u8 mgmt_class; /* @@ -824,12 +1044,10 @@ } port_priv = agent_priv->qp_info->port_priv; - class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; - if (!class) { - printk(KERN_ERR PFX "No class table yet MAD registration " - "request supplied\n"); - goto out; - } + class = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].class; + if (!class) + goto vendor_check; mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); method = 
class->method_table[mgmt_class]; @@ -845,12 +1063,56 @@ if (!check_class_table(class)) { /* If not, release management class table */ kfree(class); - port_priv->version[agent_priv->reg_req-> - mgmt_class_version]= NULL; + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version].class = NULL; } } } +vendor_check: + vendor = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) + goto out; + + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); + vendor_class = vendor->vendor_class[mgmt_class]; + if (vendor_class) { + index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* + * Now, check to see if there are + * any methods still in use + */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + vendor_class->method_table[index] = NULL; + memset(vendor_class->oui[index], 0, 3); + /* Any OUIs left ? */ + if (!check_vendor_class(vendor_class)) { + /* If not, release vendor class table */ + kfree(vendor_class); + vendor->vendor_class[mgmt_class] = NULL; + /* Any other vendor classes left ? */ + if (!check_vendor_table(vendor)) { + kfree(vendor); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version]. + vendor = NULL; + } + } + } + } + } + out: return; } @@ -907,20 +1169,49 @@ } } } else { - struct ib_mad_mgmt_class_table *version; - struct ib_mad_mgmt_method_table *class; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_vendor_mad *vendor_mad; + int index; - /* Routing is based on version, class, and method */ + /* + * Routing is based on version, class, and method + * For "newer" vendor MADs, also based on OUI + */ if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) goto out; - version = port_priv->version[mad->mad_hdr.class_version]; - if (!version) - goto out; - class = version->method_table[convert_mgmt_class( + if (!is_vendor_class(mad->mad_hdr.mgmt_class)) { + class = port_priv->version[ + mad->mad_hdr.class_version].class; + if (!class) + goto out; + method = class->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (method) + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } else { + vendor = port_priv->version[ + mad->mad_hdr.class_version].vendor; + if (!vendor) + goto out; + vendor_class = vendor->vendor_class[vendor_class_index( mad->mad_hdr.mgmt_class)]; - if (class) - mad_agent = class->agent[mad->mad_hdr.method & - ~IB_MGMT_METHOD_RESP]; + if (!vendor_class) + goto out; + /* Find matching OUI */ + vendor_mad = (struct ib_vendor_mad *)mad; + index = find_vendor_oui(vendor_class, vendor_mad->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + } } if (mad_agent) { From halr at voltaire.com Fri Dec 3 06:07:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Dec 2004 09:07:32 -0500 Subject: [openib-general] Re: mthca Page Allocation Failures In-Reply-To: <52fz2ol72l.fsf@topspin.com> References: <1102018176.4179.15.camel@localhost.localdomain> <52fz2ol72l.fsf@topspin.com> Message-ID: <1102082852.4197.0.camel@localhost.localdomain> On Thu, 2004-12-02 
at 16:32, Roland Dreier wrote: > This patch should help. It did indeed. Thanks. Applied. -- Hal From halr at voltaire.com Fri Dec 3 06:09:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Dec 2004 09:09:19 -0500 Subject: [openib-general] smpdump and current MAD layer In-Reply-To: <41AF9BB9.5040402@ichips.intel.com> References: <1101838548.6411.276.camel@localhost.localdomain> <41AF559C.7090700@ichips.intel.com> <1102017939.4179.10.camel@localhost.localdomain> <41AF9BB9.5040402@ichips.intel.com> Message-ID: <1102082958.4197.4.camel@localhost.localdomain> On Thu, 2004-12-02 at 17:48, Sean Hefty wrote: > For example, local MADs do not generate CQ entries, and are > completed from the sender's thread. (I think we may want to change > the threading at some point, so it may not matter long term.) That's been on the TODO list: Roland pointed this out when it was added. -- Hal From roland at topspin.com Fri Dec 3 07:00:45 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 07:00:45 -0800 Subject: [openib-general] [PATCH] Add vendor OUI support to MAD layer In-Reply-To: <1102046938.14406.45.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 02 Dec 2004 23:08:58 -0500") References: <1102046938.14406.45.camel@localhost.localdomain> Message-ID: <52k6rzifz6.fsf@topspin.com> This exposes the OUI support to userspace... I bumped the ABI version because the size of the reg request changed. Index: infiniband/include/ib_user_mad.h =================================================================== --- infiniband/include/ib_user_mad.h (revision 1310) +++ infiniband/include/ib_user_mad.h (working copy) @@ -31,7 +31,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define IB_USER_MAD_ABI_VERSION 1 +#define IB_USER_MAD_ABI_VERSION 2 /* * Make sure that all structs defined in this file remain laid out so @@ -90,6 +90,8 @@ * receive unsolicited MADs, otherwise it should be 0. * @mgmt_class_version - Indicates which version of MADs for the given * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. */ struct ib_user_mad_reg_req { __u32 id; @@ -97,6 +99,7 @@ __u8 qpn; __u8 mgmt_class; __u8 mgmt_class_version; + __u8 oui[3]; }; #define IB_IOCTL_MAGIC 0x1b Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1310) +++ infiniband/core/user_mad.c (working copy) @@ -352,6 +352,7 @@ req.mgmt_class = ureq.mgmt_class; req.mgmt_class_version = ureq.mgmt_class_version; memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, From iod00d at hp.com Fri Dec 3 10:27:11 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 10:27:11 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041123225624.GO10431@esmail.cup.hp.com> References: <20041123225624.GO10431@esmail.cup.hp.com> Message-ID: <20041203182711.GH15286@esmail.cup.hp.com> On Tue, Nov 23, 2004 at 02:56:24PM -0800, Grant Grundler wrote: > So the adventure continues on a different box (rx4640). > (I'll go back to the rx2600 and reflash/reboot the box). > > With tvflash, I was able to upload the hca-cougar image I mentioned > before successfully...at least that's what tvflash asserted. 
So it turns out I had flash the "high profile" firmware to a "low profile" card...it didn't like that. That's why the 3rd BAR was visible but not responding to programming. I've update the source tree and tried again with a high profile card (also reflashed with topspin firmware) and still getting the same error: Linux iowa 2.6.10-rc2 #10 SMP Fri Dec 3 08:06:24 PST 2004 ia64 GNU/Linux iowa:~# modprobe ib_mthca ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:41:00.0) GSI 38 (level, low) -> CPU 1 (0x0100) vector 66 ACPI: PCI interrupt 0000:41:00.0[A] -> GSI 38 (level, low) -> IRQ 66 ib_mthca 0000:41:00.0: Unhandled event 0f(00) on eqn 3 ib_query_gid failed (-16) for mthca0 (index 12) ib_query_port failed (-16) for mthca0 ib_mthca 0000:41:00.0: WRITE_MTT failed (-16) ib_mad: Couldn't create ib_mad CQ ib_mad: Couldn't open mthca0 port 1 ib_agent: Port 1 not found ib_mad: Couldn't close mthca0 port 1 for agents ib_mad: Port 1 not found ib_mad: Couldn't close mthca0 port 1 ib_agent: Port 2 not found ib_mad: Couldn't close mthca0 port 2 for agents ib_mad: Port 2 not found ib_mad: Couldn't close mthca0 port 2 iowa:~# lspci -vs 0000:41:00.0 lspci: -f: Invalid slot number iowa:~# lspci -vs 41:00.0 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technology MT23108 InfiniHost Flags: 66MHz, medium devsel, IRQ 66 Memory at 00000000b0800000 (64-bit, non-prefetchable) [size=1M] Memory at 00000000b0000000 (64-bit, prefetchable) [size=8M] Memory at 00000000a0000000 (64-bit, prefetchable) [size=256M] Capabilities: [40] #11 [001f] Capabilities: [50] Vital Product Data Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [70] PCI-X non-bridge device. I'm fighting other issues right now and haven't been able to work on this (^#%$ tulip driver). If anyone has advice on how to proceed debugging this or needs more info, I can use it. I'm still leary that tvflash didn't work right despite the assertion the flash operation completed: iowa:~# tvflash -i open_hca(0) flash_chip_reset() flash_check_failsafe() Error. String Tag not present (found tag 50 instead) HCA #0: Found MT23108, Cougar, revision A1 Primary image is valid, unknown source (sig 0x0/0x0) Secondary image is valid, unknown source (sig 0x0/0x0) Error. String Tag not present (found tag 50 instead) close_hca() Note that "tvflash -i" worked fine when the original firmware was loaded: HCA #0: Found MT23108, Cougar, revision A1 Primary image is valid, unknown source (sig 0x0/0x0) Secondary image is valid, unknown source (sig 0x0/0x0) Vital Product Data Product Name: PCI-X Dual Port InfiniBand HCA P/N: AB286-60001 E/C: A-4412 S/N: US4417F00350 Freq/Power: PW=15W;PCI 66MHZ;PCI-X 133MHZ Date Code: N/A Checksum: N/A Maybe the firmware image I have is corrupt? grundler at iowa:~$ cksum hca-cougar-a1-250-157.bin 2761115387 932768 hca-cougar-a1-250-157.bin Does tvflash to have some sanity checking (embedded checksums or something) so it wouldn't use corrupted images? I also tried a different image from HP (that also exposes the 3rd BAR). Got the same result as above. Since the 3rd BAR is visible and programmable, I'll assume the firmware downloaded was good in both cases. thanks, grant From roland at topspin.com Fri Dec 3 10:35:17 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 10:35:17 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... 
In-Reply-To: <20041203182711.GH15286@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 10:27:11 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> Message-ID: <52fz2ni61m.fsf@topspin.com> Can you try with the very latest svn bits? It turns out there the readb() I was using in the FW command code path is not allowed by Mellanox HW -- all accesses need to be 32 bits wide. I have no idea if it could cause this problem but in the latest tree I converted it to a readl() just to be safe. By the way, what's the difference between this box and the one that was working before? Just OpenIB SW version? - R. From philippe.gregoire at cea.fr Fri Dec 3 10:44:03 2004 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Fri, 03 Dec 2004 19:44:03 +0100 Subject: [openib-general] IP over IB configuration procedure Message-ID: <200412031844.TAA14129@styx.bruyeres.cea.fr> What are the requires steps to configure IP over IB ? I was able to load the kernel modules but unable to configure the interface ib0 with /sbin/ifup Is /sbin/ifup the correct command to setup the infinband interface or is there any other procedure to follow ? I tried also to configure "manually" the interface but pinging another node does not work. I execute the following commands : # modprobe ib_mthca # modprobe ib_ipoib # lsmod Module Size Used by ib_ipoib 51608 1 ib_sa 12812 1 ib_ipoib ib_mthca 76944 0 ib_mad 28180 2 ib_sa,ib_mthca ib_core 41472 4 ib_ipoib,ib_sa,ib_mthca,ib_mad iptable_filter 4224 0 ip_tables 19600 1 iptable_filter ep 418080 0 rms 24868 0 elan4 256520 1 ep elan3 304400 1 ep elan 41760 3 ep,elan4,elan3 qsnet 91812 5 ep,rms,elan4,elan3,elan # dmesg ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:04:00.0) divert: not allocating divert_blk for non-ethernet device ib0 divert: not allocating divert_blk for non-ethernet device ib1 ]# ifup ib0 Error, some other host already uses address xxx.yyy.zzz.119. Thanks for your help Philippe From roland at topspin.com Fri Dec 3 10:55:34 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 10:55:34 -0800 Subject: [openib-general] IP over IB configuration procedure In-Reply-To: <200412031844.TAA14129@styx.bruyeres.cea.fr> (Philippe Gregoire's message of "Fri, 03 Dec 2004 19:44:03 +0100") References: <200412031844.TAA14129@styx.bruyeres.cea.fr> Message-ID: <52brdbi53t.fsf@topspin.com> Philippe> Is /sbin/ifup the correct command to setup the infinband Philippe> interface or is there any other procedure to follow ? I Philippe> tried also to configure "manually" the interface but Philippe> pinging another node does not work. Not sure what ifup does on your system (it depends on the distribution). However you probably need to configure something like /etc/network/interfaces (the Debian config file for ifup). Something like "ifconfig ib0
" should work, though. Hal's IPoIB FAQ may be useful for debugging: http://article.gmane.org/gmane.linux.drivers.openib/6175 - Roland From iod00d at hp.com Fri Dec 3 10:56:38 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 10:56:38 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52fz2ni61m.fsf@topspin.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> Message-ID: <20041203185638.GA16001@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 10:35:17AM -0800, Roland Dreier wrote: > Can you try with the very latest svn bits? How can I tell which bits I'm currently testing? I did the "svn up" about 2h ago. svn is now telling me: grundler at iowa:/usr/src/linux-ia64-release-2.6.10/drivers/infiniband$ svn up At revision 1311. grundler at iowa:/usr/src/linux-ia64-release-2.6.10/drivers/infiniband$ > It turns out there the > readb() I was using in the FW command code path is not allowed by > Mellanox HW -- all accesses need to be 32 bits wide. I have no idea > if it could cause this problem but in the latest tree I converted it > to a readl() just to be safe. ok. > By the way, what's the difference between this box and the one that > was working before? Just OpenIB SW version? Before we were using the original HP firmware that exposes the virtual bridge but hides the "3rd BAR". I will eventually go back and try with such a card again. But I think being able to use a reflashed card would be just as important for openib.org than using the original HP card. thanks, grant From roland at topspin.com Fri Dec 3 10:59:12 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 10:59:12 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041203185638.GA16001@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 10:56:38 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> Message-ID: <527jnzi4xr.fsf@topspin.com> Grant> How can I tell which bits I'm currently testing? I did the Grant> "svn up" about 2h ago. That's new enough. Here's a debugging patch that might help me figure out what's going on; can you reproduce the problem with this applied? Thanks, Roland Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 1310) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -151,6 +151,8 @@ { u32 doorbell[2]; + printk(KERN_ERR "Set EQ CI %d/%d\n", eqn, ci); + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); doorbell[1] = cpu_to_be32(ci); @@ -270,6 +272,11 @@ break; case MTHCA_EVENT_TYPE_CMD: + { + static int c; + ++c; + printk(KERN_ERR "cmd completion %d\n", c); + } mthca_cmd_event(dev, be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, From halr at voltaire.com Fri Dec 3 11:23:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Dec 2004 14:23:32 -0500 Subject: [openib-general] IP over IB configuration procedure In-Reply-To: <52brdbi53t.fsf@topspin.com> References: <200412031844.TAA14129@styx.bruyeres.cea.fr> <52brdbi53t.fsf@topspin.com> Message-ID: <1102101811.4197.31.camel@localhost.localdomain> On Fri, 2004-12-03 at 13:55, Roland Dreier wrote: > Something like "ifconfig ib0
" should work, though. > Hal's IPoIB FAQ may be useful for debugging: > http://article.gmane.org/gmane.linux.drivers.openib/6175 Should we add how to bring up IPoIB to this writeup ? It currently assumes IPoIB has been brought up and is not working. -- Hal From halr at voltaire.com Fri Dec 3 11:43:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Dec 2004 14:43:20 -0500 Subject: [openib-general] IPoIB FAQ Update Message-ID: <1102103000.4197.45.camel@localhost.localdomain> Here's an update to my initial attempt at an IPoIB FAQ: ping doesn't work between IPoIB nodes. What should I do ? First, verify that the ports are active. This can be done via: cat /sys/class/infiniband/mthca0/ports/1/state This should indicate 4: ACTIVE assuming the HCA is mthca0 and port 1 is the one plugged into the subnet (switch, etc.). If the port is not active, there could be several reasons: 1. You need an SM in your subnet to bring the ports to active. Do you have an SM ? This can be embedded in a switch or some other IB hardware or run on an end node (HCA) although OpenIB (gen2) does not currently support this. 2. If you have an SM in your subnet, there might be a cabling problem where the SM cannot "reach" your end node. If the port is active, indicate the subnet configuration and which SM is being utilized. Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0" show anything on the other nodes when you try to ping or something? There are 2 levels of IPoIB debug which can be enabled when building: IP-over-InfiniBand debugging and IP-over-InfiniBand data path debugging. The latter has performance implications and should only be enabled when all else fails. Enable the first level of IPoIB debug and then: mount -t ipoib_debugfs none /ipoib_debufs/ cat /ipoib_debugfs/ib0_mcg Other things to verify and supply to help isolate the problem: 1. Verify the firmware version via cat /sys/class/infiniband/mthca0/fw_ver For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version 4.5.3 is recommended. 2. Make sure the IB modules are loaded: /sbin/lsmod | grep ib_ should show ib_mthca (HCA driver) as well as ib_ipoib. There are others but those are the two which need to be loaded and any others will follow. 3. Make sure there are no errors in /var/log/messages pertaining to ib_. 4. Indicate the IP configuration via /sbin/ifconfig -a From iod00d at hp.com Fri Dec 3 14:14:13 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 14:14:13 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <527jnzi4xr.fsf@topspin.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> Message-ID: <20041203221413.GC16522@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 10:59:12AM -0800, Roland Dreier wrote: > Grant> How can I tell which bits I'm currently testing? I did the > Grant> "svn up" about 2h ago. > > That's new enough. Here's a debugging patch that might help me figure > out what's going on; can you reproduce the problem with this applied? I tried but it worked with the patch. :^( Last output was: ... 
Set EQ CI 3/55 cmd completion 1208 Set EQ CI 3/56 cmd completion 1209 Set EQ CI 3/57 cmd completion 1210 Set EQ CI 3/58 cmd completion 1211 Set EQ CI 3/59 --- end of output --- iowa:/usr/src/linux-ia64-release-2.6.10# cat /sys/class/infiniband/mthca0/ports/1/state cmd completion 1212 Set EQ CI 3/60 4: ACTIVE iowa:/usr/src/linux-ia64-release-2.6.10# cat /sys/class/infiniband/mthca0/ports/2/state cmd completion 1213 Set EQ CI 3/61 1: DOWN which is correct. Port "A" is connected to a switch and "B" is not. > + printk(KERN_ERR "Set EQ CI %d/%d\n", eqn, ci); > + > doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); > doorbell[1] = cpu_to_be32(ci); This doesn't feel very safe to me. If write ordering is required here, writel() or wmb() is necessary. Let me look over this code and see if the ordering is enforced elsewhere. I'll also play around with removing one printk at a time. thanks, grant From roland at topspin.com Fri Dec 3 14:26:12 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 14:26:12 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041203221413.GC16522@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 14:14:13 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> Message-ID: <52hdn3ggsb.fsf@topspin.com> Grant> I tried but it worked with the patch. :^( Of course it would ... :) >> doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); >> doorbell[1] = cpu_to_be32(ci); Grant> This doesn't feel very safe to me. If write ordering is Grant> required here, writel() or wmb() is necessary. Let me look Grant> over this code and see if the ordering is enforced Grant> elsewhere. I'm not positive but I think it should be OK. doorbell is just a temporary variable that gets passed to mthca_write64(), which essentially does a __raw_writeq(*(u64 *) doorbell). On ia64, __raw_writeq is #defined to be writeq, so ordering should be OK there. And surely ia64 ordering is strong enough that the CPU won't try to do the writeq before the writes to doorbell complete, right? On the other hand, the fact that changing the timing with printks makes things work does make it look like there may be some sort of ordering problem... - R. From iod00d at hp.com Fri Dec 3 14:40:39 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 14:40:39 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52hdn3ggsb.fsf@topspin.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> Message-ID: <20041203224039.GE16522@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 02:26:12PM -0800, Roland Dreier wrote: > Grant> I tried but it worked with the patch. :^( > > Of course it would ... :) :^) > Grant> This doesn't feel very safe to me. If write ordering is > Grant> required here, writel() or wmb() is necessary. Let me look > Grant> over this code and see if the ordering is enforced > Grant> elsewhere. > > I'm not positive but I think it should be OK. doorbell is just a > temporary variable that gets passed to mthca_write64(), which > essentially does a __raw_writeq(*(u64 *) doorbell). 
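As an aside, a user-space sketch (made-up example values, not the driver code) of the byte layout under discussion: the two cpu_to_be32() stores into a temporary array followed by one 64-bit raw write put the same bytes on the wire as combining the halves with an explicit shift. htonl() stands in for cpu_to_be32() outside the kernel.

/*
 * User-space sketch, made-up values: the two 32-bit big-endian
 * stores into doorbell[2] followed by one 64-bit raw write produce
 * the same byte image as combining the halves with a shift.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>		/* htonl() plays the role of cpu_to_be32() */

int main(void)
{
	uint32_t hi = 0x80000003;	/* example value for the first doorbell word  */
	uint32_t lo = 59;		/* example value for the second (CI) word     */
	uint32_t doorbell[2] = { htonl(hi), htonl(lo) };	/* style used today */

	uint64_t v = ((uint64_t) hi << 32) | lo;	/* explicit combination */
	unsigned char combined[8];
	int i;

	for (i = 0; i < 8; ++i)		/* lay v out in big-endian byte order */
		combined[i] = v >> (56 - 8 * i);

	printf("byte images match: %s\n",
	       memcmp(doorbell, combined, sizeof combined) ? "no" : "yes");
	return 0;
}

Either form presents the high word to the device first; the only difference is whether the combination happens through memory or in a register.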
On ia64, > __raw_writeq is #defined to be writeq, so ordering should be OK > there. Yes - I just went through the same code and came to the same conclusion. The key bit here is in __writeq() where it add "volatile". This makes sure all previous writes have completed...but... > And surely ia64 ordering is strong enough that the CPU won't > try to do the writeq before the writes to doorbell complete, right? Yes - I believe it is. > On the other hand, the fact that changing the timing with printks > makes things work does make it look like there may be some sort of > ordering problem... Well, or other race. Ordering was just my first guess. It's likely the race is not even here - but on the completion side of things. I'll play with this for a bit and see were it leads me. thanks, grant From roland at topspin.com Fri Dec 3 14:51:01 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 14:51:01 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041203224039.GE16522@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 14:40:39 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> Message-ID: <52d5xrgfmy.fsf@topspin.com> Grant> Well, or other race. Ordering was just my first guess. Grant> It's likely the race is not even here - but on the Grant> completion side of things. Hmm, I'll think about it too. The problem appears to be an overflow of the FW command completion event queue (that's what the message ib_mthca 0000:41:00.0: Unhandled event 0f(00) on eqn 3 is saying ... event 0f is EQ overflow) It is true that we call mthca_cmd_event() on FW command completion (which frees up a command slot and could let another command start) before we update the consumer index and let the HCA know that there is a free slot in the EQ. Still, it's a little hard to see how we could overflow the EQ (which is created with 128 slots). - R. From roland at topspin.com Fri Dec 3 15:22:25 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 15:22:25 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52d5xrgfmy.fsf@topspin.com> (Roland Dreier's message of "Fri, 03 Dec 2004 14:51:01 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> Message-ID: <528y8fge6m.fsf@topspin.com> OK, I may have figured out the problem. How does this patch work? 
Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 1310) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -219,11 +219,14 @@ struct mthca_eqe *eqe; int disarm_cqn; int work = 0; + int set_ci = 0; while (1) { if (!next_eqe_sw(eq)) break; + set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); work = 1; @@ -274,6 +277,13 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); + /* + * Need to set the CI inside the loop for + * command completion events, because this + * event allows another command to be posted + * and we may overflow the EQ. + */ + set_ci = 1; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -296,9 +306,14 @@ set_eqe_hw(eq, eq->cons_index); eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (work && !set_ci) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } } - if (work) { + if (work && !set_ci) { wmb(); set_eq_ci(dev, eq->eqn, eq->cons_index); } From iod00d at hp.com Fri Dec 3 18:18:19 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 18:18:19 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <528y8fge6m.fsf@topspin.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> Message-ID: <20041204021819.GG16522@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 03:22:25PM -0800, Roland Dreier wrote: > OK, I may have figured out the problem. How does this patch work? I've saved it off as diff-openib-set_ci Applied fine but still getting the "Unhandled event 0f(00) on eqn 3" message. :^( I'm not done staring at the code but will have to give up for today. The little bit that is left of my brain is cooked. BTW, a better name for "set_ci" might be "ci_full" or "stop_ci". (YKWIM) thanks, grant From roland at topspin.com Fri Dec 3 18:23:48 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 18:23:48 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041204021819.GG16522@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 18:18:19 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> Message-ID: <524qj2hkcr.fsf@topspin.com> Grant> I've saved it off as diff-openib-set_ci Applied fine but Grant> still getting the "Unhandled event 0f(00) on eqn 3" Grant> message. :^( Hmm, I guess that wasn't the problem (or I didn't fix it properly). Grant> BTW, a better name for "set_ci" might be "ci_full" or Grant> "stop_ci". (YKWIM) No, set_ci stands for "set consumer index," which is what it does -- each event queue has a producer index (where the HW will write the next event) and a consumer index (the last event the driver has consumed). The HW compares the index to know when the EQ overflowed, so we need to set the CI to tell the HW when we've consumed events. - R. 
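To make the producer/consumer picture concrete, here is a small self-contained sketch, a toy model with invented names rather than the mthca EQ code, of why events get dropped once the hardware's view of the consumer index lags too far behind, and why reporting the CI back promptly matters.

/*
 * Toy model (user space, invented names) of the EQ indices described
 * above.  The "hardware" advances a producer index as it posts events;
 * the driver advances a consumer index as it handles them and then
 * reports the new CI back.  The hardware declares overflow when its
 * producer would catch up with the last CI it was told about.
 */
#include <stdio.h>

/*
 * A real ring would index slots as (idx & (EQ_NENT - 1)), which is
 * why nent must be a power of two, as noted later in the thread.
 */
#define EQ_NENT 128

struct toy_eq {
	unsigned prod;		/* next slot the "HW" will fill      */
	unsigned cons;		/* last slot the driver has consumed */
	unsigned hw_cons;	/* CI last reported to the "HW"      */
};

/* what the hardware does for each new event */
static int eq_post(struct toy_eq *eq)
{
	if (eq->prod - eq->hw_cons >= EQ_NENT)
		return -1;	/* overflow: the "Unhandled event 0f" case */
	++eq->prod;
	return 0;
}

/* what the interrupt handler does */
static void eq_poll(struct toy_eq *eq)
{
	while (eq->cons != eq->prod)
		++eq->cons;	/* handle one event */
	eq->hw_cons = eq->cons;	/* the set_eq_ci() step */
}

int main(void)
{
	struct toy_eq eq = { 0, 0, 0 };
	int i, dropped = 0;

	/* post two rings' worth of events without ever updating the CI */
	for (i = 0; i < 2 * EQ_NENT; ++i)
		if (eq_post(&eq))
			++dropped;
	printf("dropped %d of %d events before the CI update\n", dropped, 2 * EQ_NENT);

	eq_poll(&eq);
	printf("after poll: cons=%u, hardware now knows cons=%u\n", eq.cons, eq.hw_cons);
	return 0;
}

The fixes being discussed amount to making sure the hardware-visible CI is updated (or command slots are only freed afterwards) before the gap can ever reach the ring size.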
From roland at topspin.com Fri Dec 3 18:25:27 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 18:25:27 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <524qj2hkcr.fsf@topspin.com> (Roland Dreier's message of "Fri, 03 Dec 2004 18:23:48 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> Message-ID: <52zn0ug5pk.fsf@topspin.com> Roland> Hmm, I guess that wasn't the problem (or I didn't fix it Roland> properly). The latter (fix was bogus, I made a cut-and-paste error). Here's a better version: Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 1310) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -219,11 +219,14 @@ struct mthca_eqe *eqe; int disarm_cqn; int work = 0; + int set_ci = 0; while (1) { if (!next_eqe_sw(eq)) break; + set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); work = 1; @@ -274,6 +277,13 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); + /* + * Need to set the CI inside the loop for + * command completion events, because this + * event allows another command to be posted + * and we may overflow the EQ. + */ + set_ci = 1; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -296,9 +306,14 @@ set_eqe_hw(eq, eq->cons_index); eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (set_ci) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } } - if (work) { + if (work && !set_ci) { wmb(); set_eq_ci(dev, eq->eqn, eq->cons_index); } From jjengla at sandia.gov Sun Dec 5 20:02:56 2004 From: jjengla at sandia.gov (Josh England) Date: Sun, 05 Dec 2004 20:02:56 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102103000.4197.45.camel@localhost.localdomain> References: <1102103000.4197.45.camel@localhost.localdomain> Message-ID: <1102305776.5890.28.camel@localhost> I'm following the FAQ pretty closely and IPoIB is still not working for me. I'm using the latest from SVN (as of tonight). These are PCIe HCAs with 4.5.3 firmware (x86_64 MST is horribly broken BTW --can't wait for tvflash). Hardware/cabling is fine since things work under VAPI 3.2. All modules are loaded, and syslog doesn't show anything. It seems now (new with 4.5.3 FW) that packets are flowing one direction but not the other. Here's as much debug info as I could dig up. 
ping from n0 to n1: /sys/class/net/ib0/statistics/rx_packets on n1 increases ping from n1 to n0: /sys/class/net/ib0/statistics/rx_packets remains at 0 tcpdump shows nothing on either node n0:/ipoib_debugfs/ib0_mcg shows: GID: ff12:401b:ffff:0:0:0:0:1 created: 4295359818 queuelen: 0 complete: 1 send_only: 0 GID: ff12:401b:ffff:0:0:0:ffff:ffff created: 4295359818 queuelen: 0 complete: 1 send_only: 0 GID: ff12:601b:ffff:0:0:0:0:1 created: 4295359819 queuelen: 0 complete: 1 send_only: 0 GID: ff12:601b:ffff:0:0:0:0:2 created: 4295361295 queuelen: 2 complete: 0 send_only: 1 GID: ff12:601b:ffff:0:0:0:0:16 created: 4295359821 queuelen: 0 complete: 0 send_only: 1 GID: ff12:601b:ffff:0:0:1:ff00:6693 created: 4295359819 queuelen: 0 complete: 1 send_only: 0 n1:/ipoib_debugfs/ib0_mcg shows: GID: ff12:401b:ffff:0:0:0:0:1 created: 4294846326 queuelen: 0 complete: 1 send_only: 0 GID: ff12:401b:ffff:0:0:0:ffff:ffff created: 4294846326 queuelen: 0 complete: 1 send_only: 0 GID: ff12:601b:ffff:0:0:0:0:1 created: 4294846327 queuelen: 0 complete: 1 send_only: 0 GID: ff12:601b:ffff:0:0:0:0:2 created: 4294847749 queuelen: 2 complete: 0 send_only: 1 GID: ff12:601b:ffff:0:0:0:0:16 created: 4294846329 queuelen: 0 complete: 0 send_only: 1 GID: ff12:601b:ffff:0:0:1:ff00:66bf created: 4294846327 queuelen: 0 complete: 1 send_only: 0 ifconfig -a on n0: eth0 Link encap:Ethernet HWaddr 00:30:48:53:69:D8 inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41000 errors:0 dropped:0 overruns:0 frame:0 TX packets:32882 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:21739542 (20.7 Mb) TX bytes:4794259 (4.5 Mb) Base address:0xbc00 Memory:feae0000-feb00000 ib0 Link encap:UNSPEC HWaddr 00-00-00-84-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.0.0.1 Bcast:10.255.255.255 Mask:255.0.0.0 inet6 addr: fe80::206:6a00:a000:6693/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:46 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:2768 (2.7 Kb) ib1 Link encap:UNSPEC HWaddr 00-00-00-85-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:70 errors:0 dropped:0 overruns:0 frame:0 TX packets:70 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:6748 (6.5 Kb) TX bytes:6748 (6.5 Kb) sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) ifconfig -a on n1: eth0 Link encap:Ethernet HWaddr 00:30:48:53:6A:77 inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41747 errors:0 dropped:0 overruns:0 frame:0 TX packets:33587 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:21841556 (20.8 Mb) TX bytes:4897945 (4.6 Mb) Base address:0xbc00 Memory:feaa0000-feac0000 ib0 Link encap:UNSPEC HWaddr 00-00-00-84-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.0.0.2 Bcast:10.255.255.255 Mask:255.0.0.0 inet6 addr: 
fe80::206:6a00:a000:66bf/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:45 errors:0 dropped:0 overruns:0 frame:0 TX packets:69 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:2520 (2.4 Kb) TX bytes:4372 (4.2 Kb) ib1 Link encap:UNSPEC HWaddr 00-00-00-85-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:36 errors:0 dropped:0 overruns:0 frame:0 TX packets:36 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:2940 (2.8 Kb) TX bytes:2940 (2.8 Kb) sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Is there anything else I can try? -JE On Fri, 2004-12-03 at 14:43 -0500, Hal Rosenstock wrote: > Here's an update to my initial attempt at an IPoIB FAQ: > > ping doesn't work between IPoIB nodes. What should I do ? > > First, verify that the ports are active. > > This can be done via: > > cat /sys/class/infiniband/mthca0/ports/1/state > > This should indicate 4: ACTIVE > > assuming the HCA is mthca0 and port 1 is the one plugged into the subnet > (switch, etc.). > > If the port is not active, there could be several reasons: > > 1. You need an SM in your subnet to bring the ports to active. Do you > have an SM ? This can be embedded in a switch or some other IB hardware > or run on an end node (HCA) although OpenIB (gen2) does not currently > support this. > > 2. If you have an SM in your subnet, there might be a cabling problem > where the SM cannot "reach" your end node. > > If the port is active, indicate the subnet configuration and which SM is > being utilized. > > Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0" > show anything on the other nodes when you try to ping or something? > > There are 2 levels of IPoIB debug which can be enabled when building: > IP-over-InfiniBand debugging and IP-over-InfiniBand data path debugging. > The latter has performance implications and should only be enabled when > all else fails. Enable the first level of IPoIB debug and then: > > mount -t ipoib_debugfs none /ipoib_debufs/ > cat /ipoib_debugfs/ib0_mcg > > Other things to verify and supply to help isolate the problem: > > 1. Verify the firmware version via > > cat /sys/class/infiniband/mthca0/fw_ver > > For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version > 4.5.3 is recommended. > > 2. Make sure the IB modules are loaded: > /sbin/lsmod | grep ib_ > should show ib_mthca (HCA driver) as well as ib_ipoib. There are others > but those are the two which need to be loaded and any others will > follow. > > 3. Make sure there are no errors in /var/log/messages pertaining to ib_. > > 4. 
Indicate the IP configuration via > /sbin/ifconfig -a > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Mon Dec 6 04:09:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Dec 2004 07:09:11 -0500 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102305776.5890.28.camel@localhost> References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> Message-ID: <1102334951.4197.940.camel@localhost.localdomain> Hi Josh, On Sun, 2004-12-05 at 23:02, Josh England wrote: Your IPoIB configuration looks fine. > It seems now (new with 4.5.3 FW) that packets are flowing one direction > but not the other. It is possible there are still multicast ARP issues with PCIe with even the 4.5.3 firmware. There is apparently a pre 4.6 version which has a fix in it for this which some others have said they need in other (non OpenIB)environments. You will likely need to contact Mellanox to get this prerelease. -- Hal From bogus@does.not.exist.com Mon Dec 6 07:29:11 2004 From: bogus@does.not.exist.com () Date: Mon, 06 Dec 2004 15:29:11 -0000 Subject: No subject Message-ID: From dledford at redhat.com Mon Dec 6 07:43:12 2004 From: dledford at redhat.com (Doug Ledford) Date: Mon, 06 Dec 2004 10:43:12 -0500 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102334951.4197.940.camel@localhost.localdomain> References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> <1102334951.4197.940.camel@localhost.localdomain> Message-ID: <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> On Mon, 2004-12-06 at 07:09 -0500, Hal Rosenstock wrote: > Hi Josh, > > On Sun, 2004-12-05 at 23:02, Josh England wrote: > > Your IPoIB configuration looks fine. Well, he had identical hardware addresses on n0 and n1 for his ib0 and ib1 interfaces respectively. This would likely keep the linux network stack from actually sending the packet back to the originating host and instead convince it that it's intended for internal consumption I would think. Try putting a different HW MAC address on your different ib? ports and see if that gets the packets flowing. > > It seems now (new with 4.5.3 FW) that packets are flowing one direction > > but not the other. > > It is possible there are still multicast ARP issues with PCIe with even > the 4.5.3 firmware. There is apparently a pre 4.6 version which has a > fix in it for this which some others have said they need in other (non > OpenIB)environments. You will likely need to contact Mellanox to get > this prerelease. > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Doug Ledford Red Hat, Inc. 1801 Varsity Dr. 
Raleigh, NC 27606 From halr at voltaire.com Mon Dec 6 07:40:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Dec 2004 10:40:08 -0500 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> <1102334951.4197.940.camel@localhost.localdomain> <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> Message-ID: <1102347607.4197.1183.camel@localhost.localdomain> On Mon, 2004-12-06 at 10:43, Doug Ledford wrote: > On Mon, 2004-12-06 at 07:09 -0500, Hal Rosenstock wrote: > > Hi Josh, > > > > On Sun, 2004-12-05 at 23:02, Josh England wrote: > > > > Your IPoIB configuration looks fine. > > Well, he had identical hardware addresses on n0 and n1 for his ib0 and > ib1 interfaces respectively. This would likely keep the linux network > stack from actually sending the packet back to the originating host and > instead convince it that it's intended for internal consumption I would > think. Try putting a different HW MAC address on your different ib? > ports and see if that gets the packets flowing. The HW address is truncated in ifconfig as it is 20 bytes which is longer than the 16 I think ifconfig (and arp) support. You can tell the full IB HW addresses from ip neigh. If these were the same (due to duplicate GUIDs being programmed), this would cause a problem as you indicate. (I will add this tidbit into the next IPoIB FAQ). -- Hal From roland at topspin.com Mon Dec 6 07:51:03 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Dec 2004 07:51:03 -0800 Subject: [openib-general] Re: IPoIB still not working In-Reply-To: <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> (Doug Ledford's message of "Mon, 06 Dec 2004 10:43:12 -0500") References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> <1102334951.4197.940.camel@localhost.localdomain> <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> Message-ID: <527jnvfms8.fsf@topspin.com> Doug> Well, he had identical hardware addresses on n0 and n1 for Doug> his ib0 and ib1 interfaces respectively. This would likely Doug> keep the linux network stack from actually sending the Doug> packet back to the originating host and instead convince it Doug> that it's intended for internal consumption I would think. Doug> Try putting a different HW MAC address on your different ib? Doug> ports and see if that gets the packets flowing. Actually ifconfig can only show the first 16 octets of the HW address (and I think the last two bytes are actually wrong, because the SIOGIFHWADDR ioctl that it uses can only return 14 bytes). IPoIB has a 20 byte HW address; the four (or six?) bytes that get cut off are the low-order bytes of the port GID, which is probably where the difference between port GIDs is. To see the real address, you need to do something like "ip addr show dev ib0". 
For example, on my system: # ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) # ip addr show dev ib0 5: ib0: mtu 2044 qdisc noop qlen 128 link/[32] 00:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:01:07:8c:e4:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff Perhaps we need something about this in the growing IPoIB FAQ? - Roland From jjengla at sandia.gov Mon Dec 6 08:07:19 2004 From: jjengla at sandia.gov (Josh England) Date: Mon, 06 Dec 2004 08:07:19 -0800 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102334951.4197.940.camel@localhost.localdomain> References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> <1102334951.4197.940.camel@localhost.localdomain> Message-ID: <1102349239.21826.4.camel@localhost> On Mon, 2004-12-06 at 07:09 -0500, Hal Rosenstock wrote: > > It seems now (new with 4.5.3 FW) that packets are flowing one direction > > but not the other. > > It is possible there are still multicast ARP issues with PCIe with even > the 4.5.3 firmware. There is apparently a pre 4.6 version which has a > fix in it... I'll look into the pre-4.6 FW. Do you know if anyone has gotten openIB (IPoIB) to work over PCIe HCAs? -JE From robert.j.woodruff at intel.com Mon Dec 6 08:18:12 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 6 Dec 2004 08:18:12 -0800 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002F78E42@orsmsx408> >I'll look into the pre-4.6 FW. Do you know if anyone has gotten openIB >(IPoIB) to work over PCIe HCAs? >-JE I am working on setting up a couple of PCI-E systems today. Will let you know what I find. woody From halr at voltaire.com Mon Dec 6 08:41:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Dec 2004 11:41:16 -0500 Subject: [openib-general] Latest IPoIB FAQ Message-ID: <1102351276.4197.1196.camel@localhost.localdomain> Here's an update to my initial attempt at an IPoIB FAQ: ping doesn't work between IPoIB nodes. What should I do ? First, verify that the ports are active. This can be done via: cat /sys/class/infiniband/mthca0/ports/1/state This should indicate 4: ACTIVE assuming the HCA is mthca0 and port 1 is the one plugged into the subnet (switch, etc.). If the port is not active, there could be several reasons: 1. You need an SM in your subnet to bring the ports to active. Do you have an SM ? This can be embedded in a switch or some other IB hardware or run on an end node (HCA) although OpenIB (gen2) does not currently support this. 2. If you have an SM in your subnet, there might be a cabling problem where the SM cannot "reach" your end node. If the port is active, indicate the subnet configuration and which SM is being utilized. Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0" show anything on the other nodes when you try to ping or something? There are 2 levels of IPoIB debug which can be enabled when building: IP-over-InfiniBand debugging and IP-over-InfiniBand data path debugging. The latter has performance implications and should only be enabled when all else fails. 
Enable the first level of IPoIB debug and then: mount -t ipoib_debugfs none /ipoib_debufs/ cat /ipoib_debugfs/ib0_mcg Other things to verify and supply to help isolate the problem: 1. Verify the firmware version via cat /sys/class/infiniband/mthca0/fw_ver For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version 4.5.3 is recommended. 2. Make sure the IB modules are loaded: /sbin/lsmod | grep ib_ should show ib_mthca (HCA driver) as well as ib_ipoib. There are others but those are the two which need to be loaded and any others will follow. 3. Make sure there are no errors in /var/log/messages pertaining to ib_. 4. Indicate the IP configuration via /sbin/ifconfig -a and ip addr show dev ib0 (assuming ib0 is the network interface being configured) This is because ifconfig can only show the first 16 octets of the HW address (and the last two bytes are actually wrong, because the SIOGIFHWADDR ioctl that it uses can only return 14 bytes). IPoIB has a 20 byte HW address; the four (or six?) bytes that get cut off are the low-order bytes of the port GID, which is probably where the difference between port GIDs is. To see the real IB hardware address, you need to do something like "ip addr show dev ib0". For example, # ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) # ip addr show dev ib0 5: ib0: mtu 2044 qdisc noop qlen 128 link/[32] 00:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:01:07:8c:e4:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 5. Use ip neigh show dev ib0 to display ARP table for IB interface ib0 From roland at topspin.com Mon Dec 6 09:13:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Dec 2004 09:13:05 -0800 Subject: [openib-general] Latest IPoIB FAQ In-Reply-To: <1102351276.4197.1196.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 06 Dec 2004 11:41:16 -0500") References: <1102351276.4197.1196.camel@localhost.localdomain> Message-ID: <52u0qze4f2.fsf@topspin.com> This looks good. Matt, it might be worth putting this on the web site, and longer term I think this is yet another reason to set up some sort of Wiki. - R. From iod00d at hp.com Mon Dec 6 09:39:54 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 6 Dec 2004 09:39:54 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52zn0ug5pk.fsf@topspin.com> References: <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> Message-ID: <20041206173954.GF26198@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 06:25:27PM -0800, Roland Dreier wrote: > Roland> Hmm, I guess that wasn't the problem (or I didn't fix it > Roland> properly). > > The latter (fix was bogus, I made a cut-and-paste error). Here's a > better version: Yes - that works. Please commit. I was still trying to sort out how set_ci when I gave up on friday. You explanation helped though I'm familiar with consumer/producer (tg3 has a similar construct) indexes. 
> Index: infiniband/hw/mthca/mthca_eq.c > =================================================================== > --- infiniband/hw/mthca/mthca_eq.c (revision 1310) > +++ infiniband/hw/mthca/mthca_eq.c (working copy) > @@ -219,11 +219,14 @@ > struct mthca_eqe *eqe; > int disarm_cqn; > int work = 0; > + int set_ci = 0; > > while (1) { > if (!next_eqe_sw(eq)) > break; > > + set_ci = 0; > + > eqe = get_eqe(eq, eq->cons_index); > work = 1; > > @@ -274,6 +277,13 @@ > be16_to_cpu(eqe->event.cmd.token), > eqe->event.cmd.status, > be64_to_cpu(eqe->event.cmd.out_param)); > + /* > + * Need to set the CI inside the loop for > + * command completion events, because this > + * event allows another command to be posted > + * and we may overflow the EQ. > + */ The comment inside "case MTHCA_EVENT_TYPE_CMD" now makes sense and it didn't make sense to me on friday...perhaps showing it was indeed time for a break. > + set_ci = 1; > break; > > case MTHCA_EVENT_TYPE_PORT_CHANGE: > @@ -296,9 +306,14 @@ > > set_eqe_hw(eq, eq->cons_index); > eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); This line of code assumes eq->nent is a power of 2. I didn't see anything in mthca_create_eq() that enforces the assumption that nent is a power of 2. It mthca_create_eq() isn't in the perf path, I think it would be good to enforcement the requirement or at least check for it (e.g. WARN_ON). I just don't want to make the performance patch longer. > + > + if (set_ci) { > + wmb(); this wmb() isn't necessary since set_eq_ci() calls mthca_write64() and the latter *must* enforce wmb() for the MMIO write. I'd also like to see doorbell[2] go away and just pass the two u32 values to mthca_write64(). Perhaps combine them explicitly instead depending on load/store model of the stack. While it seems to work, I'm very nervous about this dependency and wonder if it would even work on parisc-linux (ok, I'm the only who cares :^) which has an upward growing stack. > + set_eq_ci(dev, eq->eqn, eq->cons_index); > + } > } > > - if (work) { > + if (work && !set_ci) { > wmb(); ditto. > set_eq_ci(dev, eq->eqn, eq->cons_index); > } thanks, grant From roland at topspin.com Mon Dec 6 10:59:51 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Dec 2004 10:59:51 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041206173954.GF26198@esmail.cup.hp.com> (Grant Grundler's message of "Mon, 6 Dec 2004 09:39:54 -0800") References: <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> Message-ID: <52pt1ndzh4.fsf@topspin.com> Grant> Yes - that works. Please commit. I rewrote things in a way that seems cleaner to me -- what I actually committed is below. Please try one more time and make sure this still fixes the problem. Grant> It mthca_create_eq() isn't in the perf path, I think it Grant> would be good to enforcement the requirement or at least Grant> check for it (e.g. WARN_ON). I just don't want to make the Grant> performance patch longer. Good idea, I'll do this. Grant> this wmb() isn't necessary since set_eq_ci() calls Grant> mthca_write64() and the latter *must* enforce wmb() for the Grant> MMIO write. I'm not sure about that... 
mthca_write64() turns into __raw_writeq, which may not be ordered. Even if it were writeq(), the barrier is after the write (eg on ppc64, the "sync" if after the store operation). We want to make sure that the updates to the ownership bits in the EQ in host memory are complete before doing the MMIO write to update the HCA's consumer pointer; this avoids the (pretty much impossible) race where the HCA writes to the EQ entry and then our ownership update happens later and overwrites the HCA's value. Grant> I'd also like to see doorbell[2] go away and just pass the Grant> two u32 values to mthca_write64(). Perhaps combine them Grant> explicitly instead depending on load/store model of the Grant> stack. That makes sense. I'll try to come up with an API that avoids shifts on a 64-bit arch... Thanks, Roland Index: infiniband/hw/mthca/mthca_cmd.c =================================================================== --- infiniband/hw/mthca/mthca_cmd.c (revision 1310) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -293,6 +293,12 @@ complete(&context->done); } +void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) +{ + while (ncomp--) + up(&dev->cmd.event_sem); +} + static void event_timeout(unsigned long context_ptr) { struct mthca_cmd_context *context = @@ -357,7 +363,6 @@ dev->cmd.free_head = context - dev->cmd.context; spin_unlock(&dev->cmd.context_lock); - up(&dev->cmd.event_sem); return err; } Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 1310) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -219,6 +219,7 @@ struct mthca_eqe *eqe; int disarm_cqn; int work = 0; + int ncmd = 0; while (1) { if (!next_eqe_sw(eq)) @@ -274,6 +275,7 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); + ++ncmd; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -303,6 +305,9 @@ set_eq_ci(dev, eq->eqn, eq->cons_index); } + if (ncmd) + mthca_cmd_complete(dev, ncmd); + eq_req_not(dev, eq->eqn); } Index: infiniband/hw/mthca/mthca_cmd.h =================================================================== --- infiniband/hw/mthca/mthca_cmd.h (revision 1310) +++ infiniband/hw/mthca/mthca_cmd.h (working copy) @@ -203,10 +203,9 @@ int mthca_cmd_use_events(struct mthca_dev *dev); void mthca_cmd_use_polling(struct mthca_dev *dev); -void mthca_cmd_event(struct mthca_dev *dev, - u16 token, - u8 status, - u64 out_param); +void mthca_cmd_event(struct mthca_dev *dev, u16 token, + u8 status, u64 out_param); +void mthca_cmd_complete(struct mthca_dev *dev, int ncomp); int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); From iod00d at hp.com Mon Dec 6 12:16:12 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 6 Dec 2004 12:16:12 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52pt1ndzh4.fsf@topspin.com> References: <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> Message-ID: <20041206201612.GH26198@esmail.cup.hp.com> On Mon, Dec 06, 2004 at 10:59:51AM -0800, Roland Dreier wrote: > Grant> Yes - that works. Please commit. 
> > I rewrote things in a way that seems cleaner to me -- what I actually > committed is below. Please try one more time and make sure this still > fixes the problem. will do in a bit. ... > Grant> this wmb() isn't necessary since set_eq_ci() calls > Grant> mthca_write64() and the latter *must* enforce wmb() for the > Grant> MMIO write. > > I'm not sure about that... mthca_write64() turns into __raw_writeq, > which may not be ordered. Even if it were writeq(), the barrier is > after the write (eg on ppc64, the "sync" if after the store > operation). We want to make sure that the updates to the ownership > bits in the EQ in host memory are complete before doing the MMIO write > to update the HCA's consumer pointer; this avoids the (pretty much > impossible) race where the HCA writes to the EQ entry and then our > ownership update happens later and overwrites the HCA's value. Yeah - ia64 has slight differences in that the "release" semantics used for writeX() (and wmb()) force prior write mem ops to complete before this one does. I had forgotten exactly what the linux API semantics where and re-read Documentation/DocBook/deviceiobook.tmpl. It doesn't seem to address this issue unfortunately. The best description I have of memory ordering is still "IA64 Linux Kernel" by David Mosberger and Stephane Eranian. And that supports exactly what you say above. > Grant> I'd also like to see doorbell[2] go away and just pass the > Grant> two u32 values to mthca_write64(). Perhaps combine them > Grant> explicitly instead depending on load/store model of the > Grant> stack. > > That makes sense. I'll try to come up with an API that avoids shifts > on a 64-bit arch... I'm not sure there is one. And it still might be better to use the shift op. This feels like one of those perf vs portability issue where someone has to decide if it's worth trading off some perf for clearer code. Thinking about it more, I just realized the two 32-bit stores are to a cacheline already resident in L1. Ditto for the successive load to recover the 64-bit value. The mem ops are nearly "free" since it's to a resident, private cacheline. But completely avoiding those stores/loads would be good too and I expect the resulting code will be better to read/understand. And given the use, I don't think the shift unit will limit parallelism on ia64 or matter on other arches. But I might be wrong on that since it's just a guess. thanks, grant From halr at voltaire.com Mon Dec 6 12:25:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Dec 2004 15:25:06 -0500 Subject: [openib-general] Latest IPoIB FAQ In-Reply-To: <52u0qze4f2.fsf@topspin.com> References: <1102351276.4197.1196.camel@localhost.localdomain> <52u0qze4f2.fsf@topspin.com> Message-ID: <1102364706.4197.1220.camel@localhost.localdomain> On Mon, 2004-12-06 at 12:13, Roland Dreier wrote: > This looks good. > > Matt, it might be worth putting this on the web site, and longer term > I think this is yet another reason to set up some sort of Wiki. I would like to confirm the PCIe firmware rev "requirement" in this FAQ. I believe an official release is shortly coming. 
-- Hal From mlleinin at hpcn.ca.sandia.gov Mon Dec 6 21:31:51 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 06 Dec 2004 21:31:51 -0800 Subject: [openib-general] Latest IPoIB FAQ In-Reply-To: <52u0qze4f2.fsf@topspin.com> References: <1102351276.4197.1196.camel@localhost.localdomain> <52u0qze4f2.fsf@topspin.com> Message-ID: <1102397511.4239.323.camel@trinity> I'll post the updated IPoIB FAQ on our webpages. We started looking at Wiki's and then got sidetracked. Is there a preferred wiki? Thanks, - Matt On Mon, 2004-12-06 at 09:13 -0800, Roland Dreier wrote: > This looks good. > > Matt, it might be worth putting this on the web site, and longer term > I think this is yet another reason to set up some sort of Wiki. > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From roland at topspin.com Tue Dec 7 05:25:29 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 07 Dec 2004 05:25:29 -0800 Subject: [openib-general] Latest IPoIB FAQ In-Reply-To: <1102397511.4239.323.camel@trinity> (Matt Leininger's message of "Mon, 06 Dec 2004 21:31:51 -0800") References: <1102351276.4197.1196.camel@localhost.localdomain> <52u0qze4f2.fsf@topspin.com> <1102397511.4239.323.camel@trinity> Message-ID: <52r7m2ckae.fsf@topspin.com> Matt> We started looking at Wiki's and then got sidetracked. Is Matt> there a preferred wiki? I don't have any strong opinions -- probably whatever is easiest to set up will work fine for us. I seem to recall that Zwiki is a Zope-based Wiki that might work well with Plone. (ubuntulinux.org seems to be running Plone and Zwiki for their site) - R. From robert.j.woodruff at intel.com Tue Dec 7 11:33:34 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 11:33:34 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> >I'm following the FAQ pretty closely and IPoIB is still not working for >me. I'm using the latest from SVN (as of tonight). These are PCIe HCAs >with 4.5.3 firmware (x86_64 MST is horribly broken BTW --can't wait for >tvflash). Hardware/cabling is fine since things work under VAPI 3.2. >All modules are loaded, and syslog doesn't show anything. It seems now >(new with 4.5.3 FW) that packets are flowing one direction but not the >other. Here's as much debug info as I could dig up. Ok here is what I found. I installed the openib.org code on my EM64T (x86_64) systems that have PCI-E HCAs. We have a 8 node Mellanox switch. On one node (a 32-bit Xeon node), we are running opensm from the SF project. We have one 32-bit system (Sean's) and 2 EM64T systems connected to the switch that are running the openib.org code. When Sean's 32-bit system loads ipoib and configures the interface, everything seems to work fine. However, when I load ipoib on the 64-bit x86_64 node, the opensm starts complaining that osm_mcr_rcv_join_mgrp ERR IB10 Provided Join State != FullMember so it looks like the x86_64 system is not able to join the multicast group. I suspect some issue with structure definitions, but have not debugged the issue any further. Any ideas ? 
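A cheap way to test the structure-definition theory above is to print sizeof and offsetof for the wire-format record on a 32-bit and a 64-bit build and diff the output. The layout below is an invented stand-in, not the real MCMemberRecord, and the gen2 code actually packs MADs with ib_pack() field tables (see the sa_query.c patch later in this thread), so this only illustrates the technique.

/*
 * Stand-in layout (not the real MCMemberRecord): print size and field
 * offsets, build once with -m32 and once with -m64, and diff the output.
 * Any difference means the wire format depends on the word size.
 */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct example_rec {
	uint8_t  mgid[16];
	uint8_t  port_gid[16];
	uint32_t qkey;
	uint16_t mlid;
	uint8_t  mtu;
	uint8_t  scope_state;
} __attribute__((packed));

#define SHOW(f) printf("%-12s offset %2zu\n", #f, offsetof(struct example_rec, f))

int main(void)
{
	printf("sizeof(example_rec) = %zu\n", sizeof(struct example_rec));
	SHOW(mgid);
	SHOW(port_gid);
	SHOW(qkey);
	SHOW(mlid);
	SHOW(mtu);
	SHOW(scope_state);
	return 0;
}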
From halr at voltaire.com Tue Dec 7 11:59:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 14:59:17 -0500 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> Message-ID: <1102449557.4141.59.camel@localhost.localdomain> On Tue, 2004-12-07 at 14:33, Woodruff, Robert J wrote: > > >I'm following the FAQ pretty closely and IPoIB is still not working for > >me. I'm using the latest from SVN (as of tonight). These are PCIe > HCAs > >with 4.5.3 firmware (x86_64 MST is horribly broken BTW --can't wait for > >tvflash). Hardware/cabling is fine since things work under VAPI 3.2. > >All modules are loaded, and syslog doesn't show anything. It seems now > >(new with 4.5.3 FW) that packets are flowing one direction but not the > >other. Here's as much debug info as I could dig up. > > Ok here is what I found. I installed the openib.org code on my > EM64T (x86_64) systems that have PCI-E HCAs. We have a 8 node > Mellanox switch. On one node (a 32-bit Xeon node), > we are running opensm from the SF project. > We have one 32-bit system (Sean's) and 2 EM64T systems connected to > the switch that are running the openib.org code. > When Sean's 32-bit system loads ipoib and configures the > interface, everything seems to work fine. However, when I load ipoib > on the 64-bit x86_64 node, the opensm starts complaining that > > osm_mcr_rcv_join_mgrp ERR IB10 Provided Join State != FullMember > > so it looks like the x86_64 system is not able to join the multicast > group. > I suspect some issue with structure definitions, but have not debugged > the > issue any further. I run on x86_64 (Opteron) too with a different SM and I do not see that issue. While there is code in OpenIB IPoIB to perform send only joins, it currently joins as full member (the send only join state is commented out to be full member right now due to some SMs lacking this support (which BTW is required if it does support multicast) so I can't explain what OpenSM indicates. Maybe OpenSM is putting out the wrong message. There are two types of "joins" done by OpenIB: 1. with components sufficient to create the multicast group, and 2. with components sufficient to join an already created group. The second style has been done for a long time and is similar to those being used by other stacks. The first is relatively new and OpenSM may not like it or already have this group preconfigured and return an error due to different characteristics or something like that. Is it correct to presume you do not have an IB analyzer to capture the SA packets on the link to/from the x86_64 system ? -- Hal From halr at voltaire.com Tue Dec 7 12:11:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 15:11:11 -0500 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> Message-ID: <1102450271.4141.77.camel@localhost.localdomain> On Tue, 2004-12-07 at 14:33, Woodruff, Robert J wrote: > Ok here is what I found. I installed the openib.org code on my > EM64T (x86_64) systems that have PCI-E HCAs. We have a 8 node > Mellanox switch. On one node (a 32-bit Xeon node), > we are running opensm from the SF project. > We have one 32-bit system (Sean's) and 2 EM64T systems connected to > the switch that are running the openib.org code. 
> When Sean's 32-bit system loads ipoib and configures the > interface, everything seems to work fine. However, when I load ipoib > on the 64-bit x86_64 node, the opensm starts complaining that > > osm_mcr_rcv_join_mgrp ERR IB10 Provided Join State != FullMember > > so it looks like the x86_64 system is not able to join the multicast > group. > I suspect some issue with structure definitions, but have not debugged > the > issue any further. Are the x86_64 machines PCIe or PCIX ? (Mine is PCIX). -- Hal From halr at voltaire.com Tue Dec 7 12:26:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 15:26:24 -0500 Subject: [openib-general] IPoIB FAQ Message-ID: <1102451183.4141.94.camel@localhost.localdomain> is now checked into the tree as https://openib.org/svn/gen2/trunk/src/linux-kernel/docs/ipoib_faq.txt From robert.j.woodruff at intel.com Tue Dec 7 12:26:38 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 12:26:38 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB6B57@orsmsx408> >Is it correct to presume you do not have an IB analyzer to capture the >SA packets on the link to/from the x86_64 system ? >-- Hal No, unfortunately we do not have a 4x IB analyzer. Note that the same openib.org code running on a 32-bit system does not seem to invoke the complaints from opensm. I looked at the opensm code and it is looking for a value of 1 in the join_state. I looked at the ipoib_multicast.c code and the join routine appears to set the value to 1. Somehow the value is lost in transit. Perhaps there is some 32/64 bit issue with the data structures, etc. that allows it to work on a 32 bit platform and fail on x86_64. As I said, I did not debug it any further. It does however not look like the issue is with dropping multicast packets, which was the firmware issue we saw with the PCI-E cards. Rather, it looks like x86_64 systems are having issues communicating with a 32-bit opensm. From robert.j.woodruff at intel.com Tue Dec 7 12:28:39 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 12:28:39 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB6B5F@orsmsx408> >Are the x86_64 machines PCIe or PCIX ? (Mine is PCIX). >-- Hal PCIe From halr at voltaire.com Tue Dec 7 12:33:48 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 15:33:48 -0500 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6B57@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB6B57@orsmsx408> Message-ID: <1102451628.4141.102.camel@localhost.localdomain> On Tue, 2004-12-07 at 15:26, Woodruff, Robert J wrote: > No, unfortunately we do not have a 4x IB analyzer. Do you have 4x to 1x cables ? Then you could use a 1x analyzer for this. > Note that the same openib.org code running > on a 32-bit system does not seem to invoke the complaints from opensm. Yes, but is that SA client local to the node with the OpenSM ? I'm not sure whether this makes a difference or not as I am unaware of how this local communication (SA client -> OpenSM) is accomplished. 
-- Hal From halr at voltaire.com Tue Dec 7 12:35:00 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 15:35:00 -0500 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6B5F@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB6B5F@orsmsx408> Message-ID: <1102451700.4141.105.camel@localhost.localdomain> On Tue, 2004-12-07 at 15:28, Woodruff, Robert J wrote: > > >Are the x86_64 machines PCIe or PCIX ? (Mine is PCIX). > > >-- Hal > > PCIe That's the difference and that's what makes me think this too is a firmware problem. -- Hal From robert.j.woodruff at intel.com Tue Dec 7 12:51:35 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 12:51:35 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB6BC0@orsmsx408> > Do you have 4x to 1x cables ? Then you could use a 1x analyzer for this. Use to have one, but I think it was lost. The 1x analyzer we have is very old and out of date, not sure that it is worth trying to get running. >Yes, but is that SA client local to the node with the OpenSM ? I'm not >sure whether this makes a difference or not as I am unaware of how this >local communication (SA client -> OpenSM) is accomplished. We have one dedicated 32-bit node running the opensm from sourceforge. We have a separate 32-bit node that Sean uses for development, so no the sa_client is not on the same node as the SM. His node does not seem to invoke the errors from opensm, but the x86_86 node does. I'll poke around in the code this afternoon to see if I can get any additional information. woody From roland at topspin.com Tue Dec 7 13:08:21 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 07 Dec 2004 13:08:21 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6BC0@orsmsx408> (Robert J. Woodruff's message of "Tue, 7 Dec 2004 12:51:35 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0002FB6BC0@orsmsx408> Message-ID: <52653dddfe.fsf@topspin.com> Please try applying this patch (which will dump the SA data portion of MCMember MADs). If you could send the output from the 64-bit and 32-bit systems that may help us figure out where the bug is. Thanks, Roland Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1314) +++ infiniband/core/sa_query.c (working copy) @@ -662,6 +662,18 @@ ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), rec, query->sa_query.mad->data); + { + int i; + + for (i = 0; i < 104; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i); + printk(" %02x", query->sa_query.mad->data[i]); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { From robert.j.woodruff at intel.com Tue Dec 7 17:11:37 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 17:11:37 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> Here are some log files. First file, mcast-64.log is the /var/log/messages output from the patch you sent on the 64-bit system. Next log files is the opensm log file osm-64bit.log Next log file is the opensm log file when running the 32-node. osm-32-bit.log In the passing case, ipoib sends 2 MCM messages and opensm has no complaints. 
Search for MCMember Record in osm-32-bit.log In the failing case, ipoib sends 2 MCM messages that look similar with no errors reported. However, in the failing case ipoib continues to send MCM messages that opensm rejects. In the failing case there are a couple of differences: first, the MGID lower 32-bits appear to be 0xffffffff in the passing case and something else when it fails. Second, it appears that perhaps the opensm is rejecting the messages because of a bug where the scope and join fields are reversed when extracted from the mad. In the passing case, since the lower 32 bits of the mgid are 0xffffffff, you never get to the code that checks the join member. Someone who understands opensm should look at this, but Sean, I think it may be wrong. This however does not explain why in the failing case, ipoib continues to try to join the mcast group unless it is having difficulties after trying to join the group and decides to re-try, with the subsequent re-tries to join being failed by opensm. -------------- next part -------------- A non-text attachment was scrubbed... Name: osm-32bit.log Type: application/octet-stream Size: 2781897 bytes Desc: osm-32bit.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: osm-64bit.log Type: application/octet-stream Size: 387066 bytes Desc: osm-64bit.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mcast-64.log Type: application/octet-stream Size: 50359 bytes Desc: mcast-64.log URL: From roland at topspin.com Tue Dec 7 18:21:42 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 07 Dec 2004 18:21:42 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> (Robert J. Woodruff's message of "Tue, 7 Dec 2004 17:11:37 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> Message-ID: <52mzwpbkcp.fsf@topspin.com> Robert> In the failing case, ipoib sends 2 MCM messages that look Robert> similar with no errors reported. However, in the failing Robert> case ipoib continues to send MCM messages that opensm Robert> rejects. In the failing case there are a couple of Robert> differences: first, the MGID lower 32-bits appear to be Robert> 0xffffffff in the passing case and something else when it Robert> fails. Second, it appears that perhaps the opensm is Robert> rejecting the messages because of a bug where the scope Robert> and join fields are reversed when extracted from the Robert> mad. In the passing case, since the lower 32 bits of the Robert> mgid are 0xffffffff, you never get to the code that Robert> checks the join member. Someone who understands opensm Robert> should look at this, but Sean, I think it may be wrong. I think the difference is not 32 bit vs. 64 bit but no IPv6 vs IPv6. It looks like your 32 bit hosts do not have IPv6 support turned on, so IPoIB only joins groups with MGIDs starting ff12:401b. The 64 bit host does have IPv6 and tries to join its solicited-node group (messages about ff12:601b:ffff:0:0:1:ffd2:58f1 in mcast-64.log) and the IPv6 all nodes group (messages about ff12:601b:ffff:0:0:0:0:1 in osm-64bit.log). Since no one has created this group yet, OpenSM looks at the join state field. As you say, there seems to be a bug in OpenSM in how it interprets "ScopeState" (JoinState is the low nibble, and OpenSM dumps the byte as 0x01, so it seems OpenSM is receiving a correct FullMember request).
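For reference, a minimal C sketch of the nibble layout being described -- scope in the high nibble of the MCMemberRecord "ScopeState" byte, JoinState in the low nibble. The helper names are illustrative, not OpenSM's actual code:

#include <stdint.h>

static inline uint8_t mcm_scope(uint8_t scope_state)
{
        return scope_state >> 4;        /* high nibble: scope */
}

static inline uint8_t mcm_join_state(uint8_t scope_state)
{
        return scope_state & 0x0f;      /* low nibble: JoinState */
}

A dumped value of 0x01 therefore decodes to scope 0 and JoinState 0x1 (the FullMember bit); an implementation that swaps the two nibbles would read JoinState as 0 and reject the create.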
The joins of the IPv4 broadcast group (ff12:401b:ffff:0:0:0:ffff:ffff) and IPv4 all nodes group (ff12:401b:ffff:0:0:0:0:1) succeed because presumably OpenSM has already created these groups. Robert> This however does not explain why in the failing case, Robert> ipoib continues to try to join the mcast group unless it Robert> is having difficulties after trying yo join he group and Robert> decides to re-try, with the subsequent re-tries to join Robert> being failed by opensm. IPoIB is dumb -- when it fails to join a multicast group, it just keeps trying. - R. From halr at voltaire.com Tue Dec 7 18:43:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 21:43:51 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52mzwpbkcp.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> <52mzwpbkcp.fsf@topspin.com> Message-ID: <1102473830.4141.111.camel@localhost.localdomain> On Tue, 2004-12-07 at 21:21, Roland Dreier wrote: > The 64 bit host does have IPv6 and tries to join its solicited-node group > (messages about ff12:601b:ffff:0:0:1:ffd2:58f1 in mcast-64.log) and > the IPv6 all nodes group (messages about ff12:601b:ffff:0:0:0:0:1 in > osm-64bit.log). Since no one has created this group yet, OpenSM looks > at the join state field. As you say, there seems to be a bug in > OpenSM in how it interprets "ScopeState" (JoinState is the low nibble, > and OpenSM dumps the byte as 0x01, so it seems OpenSM is receiving a > correct FullMember request). Isn't the join sufficient to create these IPv6 groups ? If so, that would mean a problem with OpenSM in not doing so. -- Hal From halr at voltaire.com Tue Dec 7 18:47:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 21:47:21 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> Message-ID: <1102474041.4141.117.camel@localhost.localdomain> On Tue, 2004-12-07 at 20:11, Woodruff, Robert J wrote: So other than this, does IPoIB seem to be working ? I guess that's at least IPv4. It doesn't seem like IPv6 could be working properly. -- Hal From eitan at mellanox.co.il Tue Dec 7 22:39:06 2004 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 8 Dec 2004 08:39:06 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> Forgive me for not following the entire thread. But I did take a look at the log files: The 64bit version have the following multicast activities: 1. Port 0x0002c9010ad258f1 joining MLID 0xC000 -> success. Note that MLID 0xC000 is predefined (IPoIB). MGID....................0xff12401bffff0000 : 0x00000000ffffffff PortGid.................0xfe80000000000000 : 0x0002c9010ad258f1 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 2. Port 0x0002c9010ad258f1 joining MLID 0xC000. (Again). MGID....................0xff12401bffff0000 : 0x00000000ffffffff PortGid.................0xfe80000000000000 : 0x0002c9010ad258f1 qkey....................0x1B0B0000 Mlid....................0xC000 ScopeState..............0x11 Rate....................0x3 Mtu.....................0x4 -> considered as an update to the scope state. 3. 
Request to join : MGID....................0xff12601bffff0000 : 0x0000000000000016 PortGid.................0xfe80000000000000 : 0x0002c9010ad258f1 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 Results with - ERR 1B10: Provided Join State != FullMember - required for create. You can not create a group if you are not a full member. 4. A sequence of requests arrive to create MGRPs with several MGIDs: MGID 0xff12601bffff0000:0x0000000000000002 MGID 0xff12601bffff0000:0x0000000000000016 MGID 0xff12601bffff0000:0x00000001ffd258f1 All fail due to the same join state issue. Inspecting the 32bit version: I see only one request to join Port 0x0002c90107fc5be1 joining MLID 0xC000 And it succeeds Hope this helps. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Wednesday, December 08, 2004 3:12 AM To: Roland Dreier Cc: openib-general at openib.org Subject: RE: [openib-general] IPoIB still not working Here are some log files. First file, mcast-64.log is the /var/log/messages output from the patch you sent on the 64-bit system. Next log files is the opensm log file osm-64bit.log Next log file is the opensm log file when running the 32-node. osm-32-bit.log In the passing case, ipoib sends 2 MCM messages and opensm has no complaints. Search for MCMember Record in osm-32-bit.log In the failing case, ipoib sends 2 MCM messages that look similar with no errors reported. However, in the failing case ipoib continues to send MCM messages that opensm rejects. In the failing case there are a couple of differences, first the MGID lower 32-bits appear to be 0xffffffff in the passing case and something else when it fails. Second, it appears that perhaps the opensm is rejecting the messages because of a bug where the scope and join fields are reversed when extracted from the mad. In the passing case, since the lower 32 bits of the mgid are 0xfffffffff, you never get to the code that checks the join member. Someone that understands opensm should look at this, but Sean I think it may be wrong. This however does not explain why in the failing case, ipoib continues to try to join the mcast group unless it is having difficulties after trying yo join he group and decides to re-try, with the subsequent re-tries to join being failed by opensm. -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Wed Dec 8 04:21:34 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 08 Dec 2004 04:21:34 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> (Eitan Zahavi's message of "Wed, 8 Dec 2004 08:39:06 +0200") References: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> Message-ID: <52is7daskx.fsf@topspin.com> Eitan> Results with - ERR 1B10: Provided Join State != FullMember Eitan> - required for create. You can not create a group if you Eitan> are not a full member. Right. However, ScopeState is dumped as 0x1, which means bit 0 of JoinState (the FullMember bit) is in fact set, so OpenSM should create the group. 
Thanks, Roland From halr at voltaire.com Wed Dec 8 05:17:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 08:17:18 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> Message-ID: <1102511838.4141.762.camel@localhost.localdomain> Hi Eitan, The component masks are also significant relative to the below. See embedded comments. On Wed, 2004-12-08 at 01:39, Eitan Zahavi wrote: > Forgive me for not following the entire thread. > But I did take a look at the log files: > > The 64bit version have the following multicast activities: > 1. Port 0x0002c9010ad258f1 joining MLID 0xC000 -> success. > Note that MLID 0xC000 is predefined (IPoIB). > MGID....................0xff12401bffff0000 : > 0x00000000ffffffff > PortGid.................0xfe80000000000000 : > 0x0002c9010ad258f1 > qkey....................0x0 > Mlid....................0x0 > ScopeState..............0x1 > Rate....................0x0 > Mtu.....................0x0 This is a join rather than a create. It relies on the group preexisting or is rejected by the SA. > 2. Port 0x0002c9010ad258f1 joining MLID 0xC000. (Again). > MGID....................0xff12401bffff0000 : > 0x00000000ffffffff > PortGid.................0xfe80000000000000 : > 0x0002c9010ad258f1 > qkey....................0x1B0B0000 > Mlid....................0xC000 > ScopeState..............0x11 > Rate....................0x3 > Mtu.....................0x4 This is a request for group creation with sufficient components for this. > -> considered as an update to the scope state. This interpretation is non conformant with o15-0.2.1 (IBA 1.2) which obsoleted 015-0.1.2 (IBA 1.1). It is supposed to be rejected with ERR_REQ_INVALID. > 3. Request to join : > MGID....................0xff12601bffff0000 : > 0x0000000000000016 > PortGid.................0xfe80000000000000 : > 0x0002c9010ad258f1 > qkey....................0x0 > Mlid....................0x0 > ScopeState..............0x1 > Rate....................0x0 > Mtu.....................0x0 > Results with - ERR 1B10: Provided Join State != FullMember - required > for create. > You can not create a group if you are not a full member. JoinState bit 0 is Full Member. Wouldn't 0x1 have bit 0 on making this a full member join ? Anyhow, looking at the components, this appears to be a join rather than create requests (although the component mask is not shown). -- Hal From halr at voltaire.com Wed Dec 8 05:58:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 08:58:36 -0500 Subject: [openib-general] GUID/EUI-64 Issue Message-ID: <1102514316.4141.843.camel@localhost.localdomain> Hi, Did we come to closure on how to handle the GUID/EUI-64 issue ? -- Hal On Thu, 2004-11-11 at 13:11, Roland Dreier wrote: > My only questions are: > > + eui[0] ^= 2; > > I remember some discussion about whether IBTA GUIDs are already > modified EUI-64 or not. Is this the correct transformation or should > we be doing something like "eui[0] |= 2;" (ie assume the universal bit > should always be set in our IPv6 address)? IBTA GUIDs are EUI-64. The only issue I recall was whether the polarity of the U/G bit was consistent with IEEE. This was updated at IBA 1.2. It now says "manufacturer assigns EUI-64 with global scope set. May also assign additional EUI-64 with local scope." 
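As a minimal sketch of the transformation under discussion -- the ^= 2 form from the quoted patch; whether XOR or OR of the universal/local bit is right is exactly the open question above:

#include <stdint.h>
#include <string.h>

/* Map a port GUID (treated as an EUI-64) to an IPv6 interface ID by
 * inverting the universal/local bit, as the quoted patch does. */
static void guid_to_ipv6_iid(const uint8_t guid[8], uint8_t iid[8])
{
        memcpy(iid, guid, 8);
        iid[0] ^= 2;    /* flip the U/L bit of the EUI-64 */
}

The alternative Roland raises would be iid[0] |= 2, i.e. assume the universal bit should always be set in the resulting IPv6 address regardless of what the GUID carries.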
From halr at voltaire.com Wed Dec 8 06:35:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 09:35:28 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52mzwpbkcp.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> <52mzwpbkcp.fsf@topspin.com> Message-ID: <1102516528.4129.17.camel@localhost.localdomain> On Tue, 2004-12-07 at 21:21, Roland Dreier wrote: > Robert> This however does not explain why in the failing case, > Robert> ipoib continues to try to join the mcast group unless it > Robert> is having difficulties after trying to join the group and > Robert> decides to re-try, with the subsequent re-tries to join > Robert> being failed by opensm. > > IPoIB is dumb -- when it fails to join a multicast group, it just > keeps trying. This issue has come up before. The choices seem to be retry forever or retry some number of times and then give up, unless someone can see a better "policy". The retry strategy might also depend on the status code returned from the SA, although the errors may not be sufficiently rich to drive different client behavior; it could be based on the join type and the status code. Some errors to certain joins might indicate that retries might never succeed no matter how many retries are attempted. -- Hal From halr at voltaire.com Wed Dec 8 06:52:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 09:52:08 -0500 Subject: [openib-general] IPv6 All Router Multicast Group Message-ID: <1102517528.4129.34.camel@localhost.localdomain> Hi Roland, With IPv4 and IPv6, I see all the relevant groups attempted to be joined and then, if that fails, created (first component mask 0x10083 and then 0x130c7). The former gets status 0x0600 in the GetResp when the group does not already exist, which causes the retry with the group creation. I see one IPv6 group (all routers) (0xff12:601b:ffff:0:0:0:0:2) that is only attempted with component mask 0x10083 (join) and not 0x130c7 (create). Is this because an IP router would create this group and this node is only trying to join as it is not a router ? (If so, has the creation been tested) ? Thanks. -- Hal From halr at voltaire.com Wed Dec 8 07:59:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 10:59:54 -0500 Subject: [openib-general] User MAD support for cancel MAD Message-ID: <1102521594.4129.42.camel@localhost.localdomain> Hi Roland, It doesn't look to me like there is a way to cancel a MAD from user space. Would this be an additional ioctl to support ? This is needed from an OpenSM perspective. Thanks. -- Hal From robert.j.woodruff at intel.com Wed Dec 8 08:04:46 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 8 Dec 2004 08:04:46 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB746B@orsmsx408> Roland > I think the difference is not 32 bit vs. 64 bit but no IPv6 vs IPv6. Ok, I'll take a look at the opensm code and see if we can make a fix that will allow it to work in the IPv6 case. From mst at mellanox.co.il Wed Dec 8 08:22:22 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Dec 2004 18:22:22 +0200 Subject: [openib-general] User MAD support for cancel MAD In-Reply-To: <1102521594.4129.42.camel@localhost.localdomain> References: <1102521594.4129.42.camel@localhost.localdomain> Message-ID: <20041208162222.GA31925@mellanox.co.il> Hello! Quoting r.
Hal Rosenstock (halr at voltaire.com) "[openib-general] User MAD support for cancel MAD": > Hi Roland, > > It doesn't look to me like there is a way to cancel a MAD from user > space. Would this be an additional ioctl to support ? This is needed > from an OpenSM perspective. Thanks. > > -- Hal Are you talking about MADs that have been linked but not yet posted to the QP? Since it seems opensm has no way to tell this is the state of the MAD (as opposed to being already posted to the QP), why does is need a way to change it? mst From mshefty at ichips.intel.com Wed Dec 8 09:44:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 08 Dec 2004 09:44:46 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <52is7daskx.fsf@topspin.com> References: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> <52is7daskx.fsf@topspin.com> Message-ID: <41B73D8E.7050602@ichips.intel.com> Roland Dreier wrote: > Eitan> Results with - ERR 1B10: Provided Join State != FullMember > Eitan> - required for create. You can not create a group if you > Eitan> are not a full member. > > Right. However, ScopeState is dumped as 0x1, which means bit 0 of > JoinState (the FullMember bit) is in fact set, so OpenSM should create > the group. Here's the relevant code from opensm for extracting the join state and scope information: *p_scope = (uint8_t)(scope_state & 0x0f); tmp_scope_state = scope_state >> 4; *p_state = (uint8_t)(tmp_scope_state &0x0f); So, opensm has the join state and scope fields reversed within the byte. I guess at some point we'll need to go through opensm and verify that it's setting/extracting bit subfields properly. - Sean From mshefty at ichips.intel.com Wed Dec 8 14:57:20 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 08 Dec 2004 14:57:20 -0800 Subject: [openib-general] crash in mthca soon after loading drivers Message-ID: <41B786D0.50701@ichips.intel.com> I'm getting the following bug in mthca when loading the drivers (core, mad, and mthca). The system is attached to a fabric with opensm running on top of the Mellanox gold software stack. I hit this when running with the tip of openib. Any help would be, well, helpful. - Sean Dec 8 14:53:47 mshefty-linux2 kernel: kernel BUG at drivers/infiniband/hw/mthca/mthca_cmd.c:328! 
Dec 8 14:53:47 mshefty-linux2 kernel: invalid operand: 0000 [#1] Dec 8 14:53:47 mshefty-linux2 kernel: SMP Dec 8 14:53:47 mshefty-linux2 kernel: Modules linked in: ib_mthca ib_mad ib_core edd st sr_mod ide_cd cdrom thermal processor fan button battery ac e100 mii e1000 hw_random uhci_hcd usbcore evdev reiserfs aic7xxx sd_mod scsi_mod Dec 8 14:53:47 mshefty-linux2 kernel: CPU: 0 Dec 8 14:53:47 mshefty-linux2 kernel: EIP: 0060:[pg0+948359147/1069220864] Not tainted VLI Dec 8 14:53:47 mshefty-linux2 kernel: EIP: 0060:[] Not tainted VLI Dec 8 14:53:47 mshefty-linux2 kernel: EFLAGS: 00010286 (2.6.9) Dec 8 14:53:47 mshefty-linux2 kernel: EIP is at mthca_cmd_wait+0x19b/0x1b0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: eax: f6b245a0 ebx: f6b24584 ecx: f6b24584 edx: ffffffff Dec 8 14:53:47 mshefty-linux2 kernel: esi: 32b50000 edi: f6b24324 ebp: f6b245a0 esp: f29d7e34 Dec 8 14:53:47 mshefty-linux2 kernel: ds: 007b es: 007b ss: 0068 Dec 8 14:53:47 mshefty-linux2 kernel: Process ib_mad1 (pid: 9900, threadinfo=f29d6000 task=f744a710) Dec 8 14:53:47 mshefty-linux2 kernel: Stack: 00000024 32b50000 00000000 f6b24324 32b50000 00000000 0000ea60 f8cbb058 Dec 8 14:53:47 mshefty-linux2 kernel: f29d7e70 00000000 00000001 00000000 00000024 0000ea60 f29d7edb 32b50100 Dec 8 14:53:47 mshefty-linux2 kernel: 00000000 00000001 00000000 f2b50100 f2b50000 f8cbd265 32b50100 00000000 Dec 8 14:53:47 mshefty-linux2 kernel: Call Trace: Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+948359256/1069220864] mthca_cmd_box+0x58/0x90 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [] mthca_cmd_box+0x58/0x90 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+948367973/1069220864] mthca_MAD_IFC+0x85/0xf0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [] mthca_MAD_IFC+0x85/0xf0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [check_poison_obj+45/432] check_poison_obj+0x2d/0x1b0 Dec 8 14:53:47 mshefty-linux2 kernel: [] check_poison_obj+0x2d/0x1b0 Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+948399983/1069220864] mthca_process_mad+0xcf/0x1c0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [] mthca_process_mad+0xcf/0x1c0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+948399776/1069220864] mthca_process_mad+0x0/0x1c0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [] mthca_process_mad+0x0/0x1c0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+946047120/1069220864] ib_mad_recv_done_handler+0xd0/0x230 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [] ib_mad_recv_done_handler+0xd0/0x230 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+946048724/1069220864] ib_mad_completion_handler+0x94/0xa0 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [] ib_mad_completion_handler+0x94/0xa0 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [remove_wait_queue+12/64] remove_wait_queue+0xc/0x40 Dec 8 14:53:47 mshefty-linux2 kernel: [] remove_wait_queue+0xc/0x40 Dec 8 14:53:47 mshefty-linux2 kernel: [worker_thread+424/560] worker_thread+0x1a8/0x230 Dec 8 14:53:47 mshefty-linux2 kernel: [] worker_thread+0x1a8/0x230 Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+946048576/1069220864] ib_mad_completion_handler+0x0/0xa0 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [] ib_mad_completion_handler+0x0/0xa0 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [default_wake_function+0/16] default_wake_function+0x0/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: [] default_wake_function+0x0/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: [default_wake_function+0/16] default_wake_function+0x0/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: [] default_wake_function+0x0/0x10 
Dec 8 14:53:47 mshefty-linux2 kernel: [worker_thread+0/560] worker_thread+0x0/0x230 Dec 8 14:53:47 mshefty-linux2 kernel: [] worker_thread+0x0/0x230 Dec 8 14:53:47 mshefty-linux2 kernel: [kthread+136/176] kthread+0x88/0xb0 Dec 8 14:53:47 mshefty-linux2 kernel: [] kthread+0x88/0xb0 Dec 8 14:53:47 mshefty-linux2 kernel: [kthread+0/176] kthread+0x0/0xb0 Dec 8 14:53:47 mshefty-linux2 kernel: [] kthread+0x0/0xb0 Dec 8 14:53:47 mshefty-linux2 kernel: [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: [] kernel_thread_helper+0x5/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: Code: 14 d2 89 d0 c1 e0 09 29 d0 89 c2 c1 e2 12 01 d0 f7 d8 89 87 84 02 00 00 89 e8 e8 51 92 64 c7 89 da 83 c4 0c 89 d0 5b 5e 5f 5d c3 <0f> 0b 48 01 40 7b cc f8 e9 c7 fe ff ff 90 8d b4 26 00 00 00 00 From robert.j.woodruff at intel.com Wed Dec 8 15:35:47 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 8 Dec 2004 15:35:47 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FF6BB5@orsmsx408> I found the problem with parsing of ScopeState. It appears that there was a bug in ib_types.h in the sourceforge code base. This was already fixed in the latest Mellanox code base. However, with that fixed, I still get the following error from opensm when ipoib tries to join the multicast group. This is the dump from the latest mellanox opensm. [1102546997:000053256][18007] -> osm_mcmr_rcv_join_mgrp: [ [1102546997:000053278][18007] -> osm_mcmr_rcv_join_mgrp: Dump of incomming record. [1102546997:000053307][18007] -> MCMember Record dump: MGID....................0xff12401bffff0000 : 0x0000000000000016 PortGid.................0xfe80000000000000 : 0x0002c9010ad25b91 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 [1102546997:000053334][18007] -> osm_physp_share_pkey: [ [1102546997:000053358][18007] -> osm_physp_share_pkey: ] [1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. [1102546997:000053413][18007] -> osm_sa_send_error: [ [1102546997:000053434][18007] -> osm_mad_pool_get: [ [1102546997:000053457][18007] -> osm_vendor_get: [ [1102546997:000053481][18007] -> osm_vendor_get: Allocated MAD 0x818a860, size = 256. _ From mshefty at ichips.intel.com Wed Dec 8 17:12:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 08 Dec 2004 17:12:16 -0800 Subject: [openib-general] crash in mthca soon after loading drivers In-Reply-To: <41B786D0.50701@ichips.intel.com> References: <41B786D0.50701@ichips.intel.com> Message-ID: <41B7A670.5070102@ichips.intel.com> Sean Hefty wrote: > I'm getting the following bug in mthca when loading the drivers (core, > mad, and mthca). The system is attached to a fabric with opensm running > on top of the Mellanox gold software stack. I hit this when running > with the tip of openib. Any help would be, well, helpful. > > - Sean > > > Dec 8 14:53:47 mshefty-linux2 kernel: kernel BUG at > drivers/infiniband/hw/mthca/mthca_cmd.c:328! I still need to spend more time investigating this, but looking at mthca_cmd_wait(): if (down_interruptible(&dev->cmd.event_sem)) return -EINTR; spin_lock(&dev->cmd.context_lock); BUG_ON(dev->cmd.free_head < 0); context = &dev->cmd.context[dev->cmd.free_head]; dev->cmd.free_head = context->next; spin_unlock(&dev->cmd.context_lock); ...snip... 
wait_for_completion(&context->done); ***** possible race here ***** ...snip... out: spin_lock(&dev->cmd.context_lock); context->next = dev->cmd.free_head; dev->cmd.free_head = context - dev->cmd.context; spin_unlock(&dev->cmd.context_lock); There appears to be a race here where event_sem can be incremented (in mthca_cmd_complete()), but free_head has not yet been updated. A second call to mthca_cmd_wait could then get the semaphore, but find the list empty, leading to the bug. In my case, max_cmd is set to 1. I need to verify if this is indeed what is happening, and if so what to do to fix it. - Sean From eitan at mellanox.co.il Wed Dec 8 22:33:04 2004 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 9 Dec 2004 08:33:04 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEA87@mtlex01.yok.mtl.com> Hi Woody, ERR 1B11 means as it says: The expected component mask for the join is not sufficient. Quoting from your mail: >> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, >> component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. Seems the component mask being sent is missing bits: 12,13 (starting from 0). These are the 2^12 = SL and 2^13 = FlowLabel. Please see the following spec quote: o15-0.2.2: If SA supports UD multicast, then SA shall create a multicast group if it receives a SubnAdmSet() method for a MCMemberRecord, with the MGID set to 0 and the MCMemberRecord:JoinState.FullMember bit set to 1. The required components in the MCMemberRecord for the group to be created are P_Key, Q_Key, SL, FlowLabel, TClass, JoinState and PortGID (see o15-0.1.3:) with the corresponding bits in the ComponentMask set. All other components may be wildcarded. This results in an implicit join for the port specified by PortGID. In osm_sa-mcmember_record.h you can find: #define REQUIRED_MC_CREATE_COMP_MASK (IB_MCR_COMPMASK_MGID | \ IB_MCR_COMPMASK_PORT_GID | \ IB_MCR_COMPMASK_JOIN_STATE | \ IB_MCR_COMPMASK_QKEY | \ IB_MCR_COMPMASK_TCLASS | \ IB_MCR_COMPMASK_PKEY | \ IB_MCR_COMPMASK_FLOW | \ IB_MCR_COMPMASK_SL) Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Thursday, December 09, 2004 1:36 AM To: Eitan Zahavi; Roland Dreier Cc: openib-general at openib.org Subject: RE: [openib-general] IPoIB still not working I found the problem with parsing of ScopeState. It appears that there was a bug in ib_types.h in the sourceforge code base. This was already fixed in the latest Mellanox code base. However, with that fixed, I still get the following error from opensm when ipoib tries to join the multicast group. This is the dump from the latest mellanox opensm. [1102546997:000053256][18007] -> osm_mcmr_rcv_join_mgrp: [ [1102546997:000053278][18007] -> osm_mcmr_rcv_join_mgrp: Dump of incomming record.
[1102546997:000053307][18007] -> MCMember Record dump: MGID....................0xff12401bffff0000 : 0x0000000000000016 PortGid.................0xfe80000000000000 : 0x0002c9010ad25b91 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 [1102546997:000053334][18007] -> osm_physp_share_pkey: [ [1102546997:000053358][18007] -> osm_physp_share_pkey: ] [1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. [1102546997:000053413][18007] -> osm_sa_send_error: [ [1102546997:000053434][18007] -> osm_mad_pool_get: [ [1102546997:000053457][18007] -> osm_vendor_get: [ [1102546997:000053481][18007] -> osm_vendor_get: Allocated MAD 0x818a860, size = 256. _ -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaharf at voltaire.com Thu Dec 9 03:23:20 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 13:23:20 +0200 Subject: [openib-general] Missining Infiniband class fields Message-ID: Hi guys, I am implementing a user mode access library for the umad. I have encountered few issues: 1. I am using the sys Infiniband class to get the HCA/port attributes, and several fields are missing: For the ports: Physical port state (as in Portinfo) Port GUID (as in Nodeinfo to the port) For the CA: Node type (as in NodeInfo) If no one objects, I would like to add them to the class. If someone else prefers to do it, please tell me. 2. I need an interface for setting/clearing IS_SM bit. If there is such please let me know. If not, I would like to add an additional ioctl to set/clear the IS_SM bit. A ctl file in /dev/.../ports/.../ is anther option. Please tell me what do you think. 3. It would help me very much if I could get an async event on some changes, especially ports state changes. Lid/SM lid changes events will be nice too. I am not familiar with the HCA fw, so I don't know if it does trigger such events. The AnafaII fw does. Anyhow there are several ways to implement the event mechanism, and I would to hear what you think about it. I am still new to 2.6 mechanisms, and I don't know if there is a recommended method to do this. The methods I know are Signals, poll events, pseudo device blocking read, etc. If we are going to use device reads, I want to suggest that we implement circular events queue that each should contain at least the following: . If we use on /dev and /sys I would also use the name of the relevant sys|dev file as the event name, and have its new data as the event data. Such an event queue is managed for each fd that request that, and an event will be cleared once read, or upon queue overflow. I have used such mechanisms before and they were very useful. Again, if there is a standard way to do it, it is OK with me. The IS_SM issue is critical for OpenSM. The missing class fields are important because without them I will have to do local mads and it will considerably complicates the library. The events issue is less critical, but still such a mechanism may help to solve many problems, and to simplify many other mechanisms. Shahar -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shaharf at voltaire.com Thu Dec 9 03:44:08 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 13:44:08 +0200 Subject: [openib-general] Another user mode issue Message-ID: Hi all, Another issue I want to raise concerning the umad access library is some change indicators for the infiniband class objects - ca & ports. It would be very nice if we could have a modification version file for each just object. This will help the safe atomic reading and caching of ca/port attributes. The idea that when you want to read an object attributes you will do the following: 1. read the object version. 2. read all object fields you want 3. read the object version again 4. if the new one is the same as the old one, we are done 5. else the object was changed while reading, so go to step 2. If you have cached the object you can compare the version each time the object is accessed to check if re-read is required. The version should be incremented upon any modification. Of course this is not my idea - this is common method used in lock free programming, and is similar to LL/SC (load linked, store conditional) or C&S (compare and set) schemes. Just as a long shot, it will be very nice to replace the switch state change bit with this version based mechanism. Not only that it can be used to implement atomic read, it can be used to let several distributed SM's to safely scan the network. As a matter of fact it would also solve the dual master synchronization problem we currently have (happens on network merge - two masters sweeps the network, but one clears the change bit before the other see it and it may end such that the SM with the lesser priority waits for the other to take over, while the other don't see it al all...). This issue may be relevant for the distributed SM mentioned in the SOW. Shahar -------------- next part -------------- An HTML attachment was scrubbed... URL: From mucci at cs.utk.edu Thu Dec 9 05:24:43 2004 From: mucci at cs.utk.edu (Philip Mucci) Date: Thu, 09 Dec 2004 14:24:43 +0100 Subject: [openib-general] Should I use umad -or- osm Message-ID: <1102598683.3731.66.camel@muccislaptop.pdc.kth.se> Hi folks, I've been tasked with developing a rough performance tool for IB networks. I've scanned the documentation and looks like the kind of data we're interested in can be obtained from the Mellanox ASICs. My question is a simple one: I've got to send/recv mads to enable and obtain the performance counters from a user space tool...ideally non-root, but we'll work with what we have. My current inclination has been to use the osm_vendor_api.h functions to do this work. However, the late work here done by Hal on the UMAD access layer seems to be appropriate as well. Could someone elaborate on what you think the best (and most maintainable) approach to accomplishing this task might be? Regards, Philip From shaharf at voltaire.com Thu Dec 9 07:52:43 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 17:52:43 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: Hi Philip, I am currently implementing umad access library. It much simpler then the osm_vendor api. I would recommend using umad library and not osm. As a matter of fact the current osm vendor layer does not support openib gen2. I am working on that either. Both the umad access library and the new osm vendor layer that uses it are not finished yet. I guess that I will need at least another week to reach a point where I can release it. 
Even then it will change a lot until I will be finished with it. The question is can you wait a little? I can give you a preliminary version - but if you will use it you will have to modify your code several each time the library interface is changed. On the other hand, I would like to understand exactly what you need, because you are the first "client" of the user mode stuff beside me. Shahar > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Philip Mucci > Sent: Thursday, December 09, 2004 3:25 PM > To: openib-general at openib.org > Subject: [openib-general] Should I use umad -or- osm > > Hi folks, > > I've been tasked with developing a rough performance tool for IB > networks. I've scanned the documentation and looks like the kind of data > we're interested in can be obtained from the Mellanox ASICs. > > My question is a simple one: > > I've got to send/recv mads to enable and obtain the performance counters > from a user space tool...ideally non-root, but we'll work with what we > have. > > My current inclination has been to use the osm_vendor_api.h functions to > do this work. However, the late work here done by Hal on the UMAD access > layer seems to be appropriate as well. > > Could someone elaborate on what you think the best (and most > maintainable) approach to accomplishing this task might be? > > Regards, > > Philip > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From mucci at cs.utk.edu Thu Dec 9 08:43:40 2004 From: mucci at cs.utk.edu (Philip Mucci) Date: Thu, 09 Dec 2004 17:43:40 +0100 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <1102610620.3731.101.camel@muccislaptop.pdc.kth.se> Hi Shahar, Thanks for the info. My needs are very 'simple'. Just to send and receive MADS to the PM agent on each adapter and switch in the network. Ideally, I would like this to work for existing installations based on either OpenIB gen1 or Mellanox Gold. But I think for that, I need to use the current osm_vendor_api.h interface. How much will this interface change with gen2? will it export the same functions? Or will everything change Lastly, will I be able to send/recv these MADS as a non-root user? Thanks again, and the answer to your question is, yes, I can wait. Regards, Philip On Thu, 2004-12-09 at 17:52 +0200, shaharf wrote: > Hi Philip, > > I am currently implementing umad access library. It much simpler > then the osm_vendor api. I would recommend using umad library and not > osm. As a matter of fact the current osm vendor layer does not support > openib gen2. I am working on that either. Both the umad access library > and the new osm vendor layer that uses it are not finished yet. I guess > that I will need at least another week to reach a point where I can > release it. Even then it will change a lot until I will be finished with > it. > The question is can you wait a little? I can give you a preliminary > version - but if you will use it you will have to modify your code > several each time the library interface is changed. > On the other hand, I would like to understand exactly what you need, > because you are the first "client" of the user mode stuff beside me. 
> > Shahar > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Philip Mucci > > Sent: Thursday, December 09, 2004 3:25 PM > > To: openib-general at openib.org > > Subject: [openib-general] Should I use umad -or- osm > > > > Hi folks, > > > > I've been tasked with developing a rough performance tool for IB > > networks. I've scanned the documentation and looks like the kind of > data > > we're interested in can be obtained from the Mellanox ASICs. > > > > My question is a simple one: > > > > I've got to send/recv mads to enable and obtain the performance > counters > > from a user space tool...ideally non-root, but we'll work with what we > > have. > > > > My current inclination has been to use the osm_vendor_api.h functions > to > > do this work. However, the late work here done by Hal on the UMAD > access > > layer seems to be appropriate as well. > > > > Could someone elaborate on what you think the best (and most > > maintainable) approach to accomplishing this task might be? > > > > Regards, > > > > Philip > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib- > > general From mst at mellanox.co.il Thu Dec 9 08:52:39 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 18:52:39 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: <1102610620.3731.101.camel@muccislaptop.pdc.kth.se> References: <1102610620.3731.101.camel@muccislaptop.pdc.kth.se> Message-ID: <20041209165239.GA8493@mellanox.co.il> Quoting r. Philip Mucci (mucci at cs.utk.edu) "RE: [openib-general] Should I use umad -or- osm": > Hi Shahar, > > Thanks for the info. > > My needs are very 'simple'. Just to send and receive MADS to the PM > agent on each adapter and switch in the network. > > Ideally, I would like this to work for existing installations based on > either OpenIB gen1 or Mellanox Gold. But I think for that, I need to use > the current osm_vendor_api.h interface. > > How much will this interface change with gen2? will it export the same > functions? Or will everything change > > Lastly, will I be able to send/recv these MADS as a non-root user? > > Thanks again, and the answer to your question is, yes, I can wait. > > Regards, > > Philip > > > On Thu, 2004-12-09 at 17:52 +0200, shaharf wrote: > > Hi Philip, > > > > I am currently implementing umad access library. It much simpler > > then the osm_vendor api. I would recommend using umad library and not > > osm. As a matter of fact the current osm vendor layer does not support > > openib gen2. I am working on that either. Both the umad access library > > and the new osm vendor layer that uses it are not finished yet. I guess > > that I will need at least another week to reach a point where I can > > release it. Even then it will change a lot until I will be finished with > > it. > > The question is can you wait a little? I can give you a preliminary > > version - but if you will use it you will have to modify your code > > several each time the library interface is changed. > > On the other hand, I would like to understand exactly what you need, > > because you are the first "client" of the user mode stuff beside me. 
> > > > Shahar > > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of Philip Mucci > > > Sent: Thursday, December 09, 2004 3:25 PM > > > To: openib-general at openib.org > > > Subject: [openib-general] Should I use umad -or- osm > > > > > > Hi folks, > > > > > > I've been tasked with developing a rough performance tool for IB > > > networks. I've scanned the documentation and looks like the kind of > > > data > > > we're interested in can be obtained from the Mellanox ASICs. > > > > > > My question is a simple one: > > > > > > I've got to send/recv mads to enable and obtain the performance > > > counters > > > from a user space tool...ideally non-root, but we'll work with what we > > > have. > > > > > > My current inclination has been to use the osm_vendor_api.h functions > > > to > > > do this work. However, the late work here done by Hal on the UMAD > > > access > > > layer seems to be appropriate as well. > > > > > > Could someone elaborate on what you think the best (and most > > > maintainable) approach to accomplishing this task might be? > > > > > > Regards, > > > > > > Philip Its clearly not a great idea to let non-root inject arbitrary MADs into the system ,is it? MST From shaharf at voltaire.com Thu Dec 9 08:55:20 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 18:55:20 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: Philip, I would say that there is not much in common between gen1 and gen2 user mode mad interface. If you want your tools to works above both, then osm vendor layer is your only choice. If you are planning to use gen2 stacks only at some point you can use libumad. If you need just a simple PM portcounter get query, then maybe osm vendor api is a bit too "fat" for you. I would consider using gen1 interface directly. But still, this is your choice. I guess that on the long run, openib gen2 will be the only maintained openib version. Anyone thinks differently? Shahar > From: Philip Mucci [mailto:mucci at cs.utk.edu] > Sent: Thursday, December 09, 2004 6:44 PM > To: shaharf > Cc: openib-general at openib.org > Subject: RE: [openib-general] Should I use umad -or- osm > > Hi Shahar, > > Thanks for the info. > > My needs are very 'simple'. Just to send and receive MADS to the PM > agent on each adapter and switch in the network. > > Ideally, I would like this to work for existing installations based on > either OpenIB gen1 or Mellanox Gold. But I think for that, I need to use > the current osm_vendor_api.h interface. > > How much will this interface change with gen2? will it export the same > functions? Or will everything change > > Lastly, will I be able to send/recv these MADS as a non-root user? > > Thanks again, and the answer to your question is, yes, I can wait. > > Regards, > > Philip > > > On Thu, 2004-12-09 at 17:52 +0200, shaharf wrote: > > Hi Philip, > > > > I am currently implementing umad access library. It much simpler > > then the osm_vendor api. I would recommend using umad library and not > > osm. As a matter of fact the current osm vendor layer does not support > > openib gen2. I am working on that either. Both the umad access library > > and the new osm vendor layer that uses it are not finished yet. I guess > > that I will need at least another week to reach a point where I can > > release it. Even then it will change a lot until I will be finished with > > it. > > The question is can you wait a little? 
I can give you a preliminary > > version - but if you will use it you will have to modify your code > > several each time the library interface is changed. > > On the other hand, I would like to understand exactly what you need, > > because you are the first "client" of the user mode stuff beside me. > > > > Shahar > > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of Philip Mucci > > > Sent: Thursday, December 09, 2004 3:25 PM > > > To: openib-general at openib.org > > > Subject: [openib-general] Should I use umad -or- osm > > > > > > Hi folks, > > > > > > I've been tasked with developing a rough performance tool for IB > > > networks. I've scanned the documentation and looks like the kind of > > data > > > we're interested in can be obtained from the Mellanox ASICs. > > > > > > My question is a simple one: > > > > > > I've got to send/recv mads to enable and obtain the performance > > counters > > > from a user space tool...ideally non-root, but we'll work with what we > > > have. > > > > > > My current inclination has been to use the osm_vendor_api.h functions > > to > > > do this work. However, the late work here done by Hal on the UMAD > > access > > > layer seems to be appropriate as well. > > > > > > Could someone elaborate on what you think the best (and most > > > maintainable) approach to accomplishing this task might be? > > > > > > Regards, > > > > > > Philip > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib- > > > general From shaharf at voltaire.com Thu Dec 9 09:00:54 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 19:00:54 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > Sent: Thursday, December 09, 2004 6:53 PM > To: Philip Mucci > Cc: shaharf; openib-general at openib.org > Subject: Re: [openib-general] Should I use umad -or- osm > > Its clearly not a great idea to let non-root inject arbitrary MADs > into the system ,is it? > > MST I guess that in this stage only root will able to use user mode mads. Later I would consider letting non-root applications use some mads - meaning most of the get/query mads, and some of the set mads. I won't rely on root access for security. There are mkey, qkey and pkey to handle that. Shahar From mst at mellanox.co.il Thu Dec 9 09:03:42 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 19:03:42 +0200 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: References: Message-ID: <20041209170342.GB8493@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "[openib-general] Missining Infiniband class fields": > 2. I need an interface for setting/clearing IS_SM bit. If there is such please > let me know. If not, I would like to add an additional ioctl to set/clear the > IS_SM bit. A ctl file in /dev/?/ports/?/ is anther option. Please tell me what > do you think. Please note that you want to clear this bit automatically if the application set it and then dies. One way to do this is to require that the user keeps some file open while he wants to be the SM, and clear it on file close. 
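A rough sketch of the lifetime rule Michael suggests -- tie IsSM to an open file and clear it in the release hook, so a crashed SM cannot leave the bit set. The set_port_is_sm() helper here is a stand-in, not an existing kernel interface; real code would update the port's IsSM capability through the driver:

#include <linux/fs.h>
#include <linux/module.h>

static int port_is_sm;  /* placeholder for the real PortInfo capability bit */

static void set_port_is_sm(int val)
{
        /* placeholder: the real code would modify the port capability mask */
        port_is_sm = val;
}

static int issm_open(struct inode *inode, struct file *filp)
{
        set_port_is_sm(1);              /* IsSM set while the file is held open */
        return 0;
}

static int issm_release(struct inode *inode, struct file *filp)
{
        set_port_is_sm(0);              /* runs even if the SM process dies */
        return 0;
}

static struct file_operations issm_fops = {
        .owner   = THIS_MODULE,
        .open    = issm_open,
        .release = issm_release,
};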
mst From shaharf at voltaire.com Thu Dec 9 09:04:20 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 19:04:20 +0200 Subject: [openib-general] Missining Infiniband class fields Message-ID: > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > Sent: Thursday, December 09, 2004 7:04 PM > To: shaharf > Cc: openib-general at openib.org > Subject: Re: [openib-general] Missining Infiniband class fields > > Hello! > Quoting r. shaharf (shaharf at voltaire.com) "[openib-general] Missining > Infiniband class fields": > > 2. I need an interface for setting/clearing IS_SM bit. If there is such > please > > let me know. If not, I would like to add an additional ioctl to > set/clear the > > IS_SM bit. A ctl file in /dev/?/ports/?/ is anther option. Please tell > me what > > do you think. > > Please note that you want to clear this bit automatically if the > application set it and then dies. > One way to do this is to require that the user keeps some file > open while he wants to be the SM, and clear it on file close. > > mst Indeed this plays in the favor of ioctl. Shahar From mst at mellanox.co.il Thu Dec 9 09:07:14 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 19:07:14 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <20041209170714.GC8493@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I use umad -or- osm": > > > > -----Original Message----- > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > Sent: Thursday, December 09, 2004 6:53 PM > > To: Philip Mucci > > Cc: shaharf; openib-general at openib.org > > Subject: Re: [openib-general] Should I use umad -or- osm > > > > Its clearly not a great idea to let non-root inject arbitrary MADs > > into the system ,is it? > > > > MST > > I guess that in this stage only root will able to use user mode mads. > Later I would consider letting non-root applications use some mads - > meaning most of the get/query mads, and some of the set mads. I won't > rely on root access for security. There are mkey, qkey and pkey to > handle that. > > Shahar They are trivial to guess, so kernel would have to touch the MAD data somehow? Further, it seems local MADs have the check disabled now? MST From mst at mellanox.co.il Thu Dec 9 09:09:33 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 19:09:33 +0200 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: References: Message-ID: <20041209170933.GD8493@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Missining Infiniband class fields": > > > -----Original Message----- > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > Sent: Thursday, December 09, 2004 7:04 PM > > To: shaharf > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] Missining Infiniband class fields > > > > Hello! > > Quoting r. shaharf (shaharf at voltaire.com) "[openib-general] Missining > > Infiniband class fields": > > > 2. I need an interface for setting/clearing IS_SM bit. If there is > such > > please > > > let me know. If not, I would like to add an additional ioctl to > > set/clear the > > > IS_SM bit. A ctl file in /dev/?/ports/?/ is anther option. Please > tell > > me what > > > do you think. > > > > Please note that you want to clear this bit automatically if the > > application set it and then dies. 
> > One way to do this is to require that the user keeps some file > > open while he wants to be the SM, and clear it on file close. > > > > mst > > Indeed this plays in the favor of ioctl. > > Shahar ioctls have a disadvantage of being half-broken when used by a 32 bti app on the 64 bit OS. Maybe write 1 to some offset to set the SM bit? This is what /proc/bus/pci has for pci configuration access. MST From mucci at cs.utk.edu Thu Dec 9 09:15:57 2004 From: mucci at cs.utk.edu (Philip Mucci) Date: Thu, 09 Dec 2004 18:15:57 +0100 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: <20041209165239.GA8493@mellanox.co.il> References: <1102610620.3731.101.camel@muccislaptop.pdc.kth.se> <20041209165239.GA8493@mellanox.co.il> Message-ID: <1102612558.3731.105.camel@muccislaptop.pdc.kth.se> Arbitrary, no. Performance monitoring, yes. These performance counters are not necessarily different than the resource counters available on other interconnects, memory controllers or CPU's. But as I said, we can live with either if we have to. Phil > Its clearly not a great idea to let non-root inject arbitrary MADs > into the system ,is it? > > MST From shaharf at voltaire.com Thu Dec 9 09:22:59 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 19:22:59 +0200 Subject: [openib-general] Missining Infiniband class fields Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Michael S. Tsirkin > Sent: Thursday, December 09, 2004 7:10 PM > Cc: openib-general at openib.org > Subject: Re: [openib-general] Missining Infiniband class fields > > Please note that you want to clear this bit automatically if the > > > application set it and then dies. > > > One way to do this is to require that the user keeps some file > > > open while he wants to be the SM, and clear it on file close. > > > > > > mst > > > > Indeed this plays in the favor of ioctl. > > > > Shahar > > ioctls have a disadvantage of being half-broken when used > by a 32 bti app on the 64 bit OS. > Maybe write 1 to some offset to set the SM bit? > This is what /proc/bus/pci has for pci configuration access. > MST As it will not be the first IOCTL, another one would not do any harm not already done. Shahar From shaharf at voltaire.com Thu Dec 9 09:27:47 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 19:27:47 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Michael S. Tsirkin > Sent: Thursday, December 09, 2004 7:07 PM > To: openib-general at openib.org > Subject: Re: [openib-general] Should I use umad -or- osm > > > > I guess that in this stage only root will able to use user mode mads. > > Later I would consider letting non-root applications use some mads - > > meaning most of the get/query mads, and some of the set mads. I won't > > rely on root access for security. There are mkey, qkey and pkey to > > handle that. > > > > Shahar > > They are trivial to guess, so kernel would have to touch the MAD > data somehow? > Further, it seems local MADs have the check disabled now? > > MST > _______________________________________________ The Mkey should set according to the system policy. They can be non trivial. 64 bits (changing) keys may be relatively strong. Currently only trivial keys are used so we won't let non root users use mads. 
But this is very weak (NFS style) security. Anyone can have root access on his machine. Shahar From mst at mellanox.co.il Thu Dec 9 10:01:37 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 20:01:37 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <20041209180137.GF8493@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I use umad -or- osm": > > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Michael S. Tsirkin > > Sent: Thursday, December 09, 2004 7:07 PM > > To: openib-general at openib.org > > Subject: Re: [openib-general] Should I use umad -or- osm > > > > > > I guess that in this stage only root will able to use user mode > > > mads. > > > Later I would consider letting non-root applications use some mads - > > > meaning most of the get/query mads, and some of the set mads. I > > > won't > > > rely on root access for security. There are mkey, qkey and pkey to > > > handle that. > > > > > > Shahar > > > > They are trivial to guess, so kernel would have to touch the MAD > > data somehow? > > Further, it seems local MADs have the check disabled now? > > > > MST > > _______________________________________________ > > The Mkey should set according to the system policy. They can be non > trivial. > 64 bits (changing) keys may be relatively strong. Depends on your definition of the "relatively" I guess. > Currently only trivial keys are used so we won't let non root users use > mads. Fine, we are in agreement then. > But this is very weak (NFS style) security. I'm afraid it wont be easy to get beyond that level of security. > Anyone can have root > access on his machine. 1. Why not on the switch then? 2. With "anyone can be root" assumption in mind, anyone can for example, do RDMA to a memory region that is enabled for remote write, since that is protected only by a 32 bit r_key? 3. etc. mst From shaharf at voltaire.com Thu Dec 9 10:16:28 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 20:16:28 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Michael S. Tsirkin > Sent: Thursday, December 09, 2004 8:02 PM > Cc: openib-general at openib.org > Subject: Re: [openib-general] Should I use umad -or- osm > > > > > I guess that in this stage only root will able to use user mode > > > > mads. > > > > Later I would consider letting non-root applications use some mads - > > > > meaning most of the get/query mads, and some of the set mads. I > > > > won't > > > > rely on root access for security. There are mkey, qkey and pkey to > > > > handle that. > > > > > > > > Shahar > > > > > > They are trivial to guess, so kernel would have to touch the MAD > > > data somehow? > > > Further, it seems local MADs have the check disabled now? > > > > > > MST > > > _______________________________________________ > > > > The Mkey should set according to the system policy. They can be non > > trivial. > > 64 bits (changing) keys may be relatively strong. > > Depends on your definition of the "relatively" I guess. > > > Currently only trivial keys are used so we won't let non root users use > > mads. > > Fine, we are in agreement then. > > > But this is very weak (NFS style) security. 
> > I'm afraid it wont be easy to get beyond that level of security. > > > Anyone can have root > > access on his machine. > > 1. Why not on the switch then? > What do you mean? To be able to send/recv mads it is enough to have one host with HCA. No switch can block root user sending mads unless pkey or mkey mechanism is used. > 2. With "anyone can be root" assumption in mind, anyone can for example, > do RDMA to a memory region that is enabled for remote write, > since that is protected only by a 32 bit r_key? > > 3. etc. > This is a real problem. It is true that brute force attacks can break 32 bit keys quite easily, but in practice even breaking 32 bits keys takes some time. To handle these brute force attacks, I would expect the attacked target to bombard the SM with key violations traps. This should trigger SM action to block and neutralize the offending host, hopefully before the RDMA write succeeds. Anyhow, this is not worse then a regular Ethernet HCA that you attack with valid requests to valid ports. > mst Shahar From mst at mellanox.co.il Thu Dec 9 10:29:57 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 20:29:57 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <20041209182957.GA8778@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I use umad -or- osm": > > > Anyone can have root > > > access on his machine. > > > > 1. Why not on the switch then? > > > > What do you mean? To be able to send/recv mads it is enough to have one > host with HCA. No switch can block root user sending mads unless pkey or > mkey mechanism is used. I mean that if a malicious user has control of a switch he can cause even more problems. > > 2. With "anyone can be root" assumption in mind, anyone can for > > example, > > do RDMA to a memory region that is enabled for remote write, > > since that is protected only by a 32 bit r_key? > > > > 3. etc. > > > This is a real problem. It is true that brute force attacks can break 32 > bit keys quite easily, but in practice even breaking 32 bits keys takes > some time. To handle these brute force attacks, I would expect the > attacked target to bombard the SM with key violations traps. This should > trigger SM action to block and neutralize the offending host, hopefully > before the RDMA write succeeds. > Anyhow, this is not worse then a regular Ethernet HCA that you attack > with valid requests to valid ports. Anyway, I'm just trying to say its not easy. mst From mshefty at ichips.intel.com Thu Dec 9 10:33:02 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Dec 2004 10:33:02 -0800 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: <20041209182957.GA8778@mellanox.co.il> References: <20041209182957.GA8778@mellanox.co.il> Message-ID: <41B89A5E.3090104@ichips.intel.com> Michael S. Tsirkin wrote: > Hello! > Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I use umad -or- osm": > >>>>Anyone can have root >>>>access on his machine. >>> >>>1. Why not on the switch then? >>> >> >>What do you mean? To be able to send/recv mads it is enough to have one >>host with HCA. No switch can block root user sending mads unless pkey or >>mkey mechanism is used. > > > I mean that if a malicious user has control of a switch he can > cause even more problems. 
IMO, if you have a malicious user on the fabric, then you've already been compromised, and attacks on your IB network are probably not your greatest worries. - Sean From mshefty at ichips.intel.com Thu Dec 9 10:43:17 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 9 Dec 2004 10:43:17 -0800 Subject: [openib-general] [PATCH] fix race condition in mthca event code Message-ID: <20041209104317.66264ea2.mshefty@ichips.intel.com> This patch fixed my problem hitting the BUG_ON code in mthca_cmd, line 328. It moves releasing the semaphore to after freeing the event entry. - Sean Index: hw/mthca/mthca_cmd.c =================================================================== --- hw/mthca/mthca_cmd.c (revision 1316) +++ hw/mthca/mthca_cmd.c (working copy) @@ -293,12 +293,6 @@ complete(&context->done); } -void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) -{ - while (ncomp--) - up(&dev->cmd.event_sem); -} - static void event_timeout(unsigned long context_ptr) { struct mthca_cmd_context *context = @@ -362,7 +356,7 @@ context->next = dev->cmd.free_head; dev->cmd.free_head = context - dev->cmd.context; spin_unlock(&dev->cmd.context_lock); - + up(&dev->cmd.event_sem); return err; } Index: hw/mthca/mthca_eq.c =================================================================== --- hw/mthca/mthca_eq.c (revision 1316) +++ hw/mthca/mthca_eq.c (working copy) @@ -219,7 +219,6 @@ struct mthca_eqe *eqe; int disarm_cqn; int work = 0; - int ncmd = 0; while (1) { if (!next_eqe_sw(eq)) @@ -275,7 +274,6 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); - ++ncmd; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -314,9 +312,6 @@ set_eq_ci(dev, eq->eqn, eq->cons_index); } - if (ncmd) - mthca_cmd_complete(dev, ncmd); - eq_req_not(dev, eq->eqn); } Index: hw/mthca/mthca_cmd.h =================================================================== --- hw/mthca/mthca_cmd.h (revision 1316) +++ hw/mthca/mthca_cmd.h (working copy) @@ -205,7 +205,6 @@ void mthca_cmd_use_polling(struct mthca_dev *dev); void mthca_cmd_event(struct mthca_dev *dev, u16 token, u8 status, u64 out_param); -void mthca_cmd_complete(struct mthca_dev *dev, int ncomp); int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); From mst at mellanox.co.il Thu Dec 9 10:49:54 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 20:49:54 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: <41B89A5E.3090104@ichips.intel.com> References: <20041209182957.GA8778@mellanox.co.il> <41B89A5E.3090104@ichips.intel.com> Message-ID: <20041209184954.GA8856@mellanox.co.il> Hello! Quoting r. Sean Hefty (mshefty at ichips.intel.com) "Re: [openib-general] Should I use umad -or- osm": > Michael S. Tsirkin wrote: > >Hello! > >Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I > >use umad -or- osm": > > > >>>>Anyone can have root > >>>>access on his machine. > >>> > >>>1. Why not on the switch then? > >>> > >> > >>What do you mean? To be able to send/recv mads it is enough to have one > >>host with HCA. No switch can block root user sending mads unless pkey or > >>mkey mechanism is used. > > > > > >I mean that if a malicious user has control of a switch he can > >cause even more problems. > > IMO, if you have a malicious user on the fabric, then you've already > been compromised, and attacks on your IB network are probably not your > greatest worries. > Thats what I was saying. 
MST From iod00d at hp.com Thu Dec 9 11:00:43 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Dec 2004 11:00:43 -0800 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: References: Message-ID: <20041209190043.GB7178@esmail.cup.hp.com> On Thu, Dec 09, 2004 at 07:22:59PM +0200, shaharf wrote: > As it will not be the first IOCTL, another one would not do any harm not > already done. This is not an reason for adding another one. ioctl's are a real PITA since they tend to port badly and provide opportunity for all sorts of mischief (e.g. bad arguements can crash the box and security holes). grant From hnrose at earthlink.net Thu Dec 9 11:21:49 2004 From: hnrose at earthlink.net (hnrose at earthlink.net) Date: Thu, 9 Dec 2004 14:21:49 -0500 Subject: [openib-general] UD with GRH Message-ID: <305790-220041249192149681@M2W049.mail2web.com> Hi Roland, Just wondering whether UD with GRH has been tested in terms of mthca. I am having difficulties sending UD with GRH (receiving seems OK although I have not verified all the fields as yet) and just want to know whether this should work or not. Thanks. BTW, I'm sending this from my personal email as my Voltaire email has been down for 1.5 days now :-( -- Hal -------------------------------------------------------------------- mail2web - Check your email from the web at http://mail2web.com/ . From hnrose at earthlink.net Thu Dec 9 12:29:28 2004 From: hnrose at earthlink.net (hnrose at earthlink.net) Date: Thu, 9 Dec 2004 15:29:28 -0500 Subject: [openib-general] Re: UD with GRH Message-ID: <185290-220041249202928358@M2W103.mail2web.com> Never mind. This was my problem. It's fixed now. -- Hal -------------------------------------------------------------------- mail2web - Check your email from the web at http://mail2web.com/ . From mshefty at ichips.intel.com Thu Dec 9 13:50:49 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 9 Dec 2004 13:50:49 -0800 Subject: [openib-general] [PATCH] MAD snooping API/implementation Message-ID: <20041209135049.5ee48ef4.mshefty@ichips.intel.com> Here's a patch that adds in the ability to snoop MADs. Currently only send and receive completions are snooped, but the implementation should be general enough to add in snooping to other areas fairly easily. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1316) +++ core/mad.c (working copy) @@ -367,17 +367,129 @@ } EXPORT_SYMBOL(ib_register_mad_agent); -/* - * ib_unregister_mad_agent - Unregisters a client from using MAD services - */ -int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +static inline int is_snooping_sends(int mad_snoop_flags) { - struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_port_private *port_priv; + return (mad_snoop_flags & + (/*IB_MAD_SNOOP_POSTED_SENDS | + IB_MAD_SNOOP_RMPP_SENDS |*/ + IB_MAD_SNOOP_SEND_COMPLETIONS /*| + IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/)); +} + +static inline int is_snooping_recvs(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (IB_MAD_SNOOP_RECVS /*| + IB_MAD_SNOOP_RMPP_RECVS*/)); +} + +static int register_snoop_agent(struct ib_mad_qp_info *qp_info, + struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_snoop_private **new_snoop_table; unsigned long flags; + int i; - mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, - agent); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + /* Check for empty slot in array. 
*/ + for (i = 0; i < qp_info->snoop_table_size; i++) + if (!qp_info->snoop_table[i]) + break; + + if (i == qp_info->snoop_table_size) { + /* Grow table. */ + new_snoop_table = kmalloc(sizeof mad_snoop_priv * + qp_info->snoop_table_size + 1, + GFP_ATOMIC); + if (!new_snoop_table) { + i = -ENOMEM; + goto out; + } + if (qp_info->snoop_table) { + memcpy(new_snoop_table, qp_info->snoop_table, + sizeof mad_snoop_priv * + qp_info->snoop_table_size); + kfree(qp_info->snoop_table); + } + qp_info->snoop_table = new_snoop_table; + qp_info->snoop_table_size++; + } + qp_info->snoop_table[i] = mad_snoop_priv; + atomic_inc(&qp_info->snoop_count); +out: + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + return i; +} + +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_snoop_private *mad_snoop_priv; + int qpn; + + /* Validate parameters */ + if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) || + (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + /* Allocate structures */ + mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + if (!mad_snoop_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + /* Now, fill in the various structures */ + memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); + mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; + mad_snoop_priv->agent.device = device; + mad_snoop_priv->agent.recv_handler = recv_handler; + mad_snoop_priv->agent.snoop_handler = snoop_handler; + mad_snoop_priv->agent.context = context; + mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_snoop_priv->agent.port_num = port_num; + mad_snoop_priv->mad_snoop_flags = mad_snoop_flags; + init_waitqueue_head(&mad_snoop_priv->wait); + mad_snoop_priv->snoop_index = register_snoop_agent( + &port_priv->qp_info[qpn], + mad_snoop_priv); + if (mad_snoop_priv->snoop_index < 0) { + ret = ERR_PTR(mad_snoop_priv->snoop_index); + goto error2; + } + + atomic_set(&mad_snoop_priv->refcount, 1); + return &mad_snoop_priv->agent; + +error2: + kfree(mad_snoop_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_snoop); + +static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; /* Note that we could still be handling received MADs */ @@ -405,6 +517,46 @@ if (mad_agent_priv->reg_req) kfree(mad_agent_priv->reg_req); kfree(mad_agent_priv); +} + +static void unregister_mad_snoop(struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_qp_info *qp_info; + unsigned long flags; + + qp_info = mad_snoop_priv->qp_info; + spin_lock_irqsave(&qp_info->snoop_lock, flags); + qp_info->snoop_table[mad_snoop_priv->snoop_index] = NULL; + atomic_dec(&qp_info->snoop_count); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + + atomic_dec(&mad_snoop_priv->refcount); + wait_event(mad_snoop_priv->wait, + !atomic_read(&mad_snoop_priv->refcount)); + + kfree(mad_snoop_priv); +} + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct 
ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_snoop_private *mad_snoop_priv; + + /* If the TID is zero, the agent can only snoop. */ + if (mad_agent->hi_tid) { + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + unregister_mad_agent(mad_agent_priv); + } else { + mad_snoop_priv = container_of(mad_agent, + struct ib_mad_snoop_private, + agent); + unregister_mad_snoop(mad_snoop_priv); + } return 0; } EXPORT_SYMBOL(ib_unregister_mad_agent); @@ -422,30 +574,82 @@ spin_unlock_irqrestore(&mad_queue->lock, flags); } +static void snoop_send(struct ib_mad_qp_info *qp_info, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, + send_wr, mad_send_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +static void snoop_recv(struct ib_mad_qp_info *qp_info, + struct ib_mad_recv_wc *mad_recv_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.recv_handler(&mad_snoop_priv->agent, + mad_recv_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + /* * Return 0 if SMP is to be sent * Return 1 if SMP was consumed locally (whether or not solicited) * Return < 0 if error */ -static int handle_outgoing_smp(struct ib_mad_agent *mad_agent, +static int handle_outgoing_smp(struct ib_mad_agent_private *mad_agent_priv, struct ib_smp *smp, struct ib_send_wr *send_wr) { int ret; struct ib_mad_private *mad_priv; struct ib_mad_send_wc mad_send_wc; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; - if (!smi_handle_dr_smp_send(smp, - mad_agent->device->node_type, - mad_agent->port_num)) { + if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; } /* Check to post send on QP or process locally */ - ret = smi_check_local_dr_smp(smp, mad_agent->device, - mad_agent->port_num); - if (!ret || !mad_agent->device->process_mad) + ret = smi_check_local_dr_smp(smp, device, port_num); + if (!ret || !device->process_mad) goto out; mad_priv = kmem_cache_alloc(ib_mad_cache, @@ -456,10 +660,9 @@ printk(KERN_ERR PFX "No memory for local response MAD\n"); goto out; } - ret = mad_agent->device->process_mad(mad_agent->device, 0, - mad_agent->port_num, smp->dr_slid, - (struct ib_mad *)smp, - (struct ib_mad 
*)&mad_priv->mad); + ret = device->process_mad(device, 0, port_num, smp->dr_slid, + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); switch (ret) { case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: @@ -468,7 +671,7 @@ * there is a recv handler */ if (solicited_mad(&mad_priv->mad.mad) && - mad_agent->recv_handler) { + mad_agent_priv->agent.recv_handler) { struct ib_wc wc; /* @@ -494,7 +697,12 @@ mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; mad_priv->header.recv_wc.recv_buf = &mad_priv->header.recv_buf; - mad_agent->recv_handler(mad_agent, + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, &mad_priv->header.recv_wc); } else kmem_cache_free(ib_mad_cache, mad_priv); @@ -516,7 +724,11 @@ mad_send_wc.status = IB_WC_SUCCESS; mad_send_wc.vendor_err = 0; mad_send_wc.wr_id = send_wr->wr_id; - mad_agent->send_handler(mad_agent, &mad_send_wc); + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, send_wr, &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); ret = 1; out: return ret; @@ -610,7 +822,7 @@ smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - ret = handle_outgoing_smp(mad_agent, smp, send_wr); + ret = handle_outgoing_smp(mad_agent_priv, smp, send_wr); if (ret < 0) /* error */ goto error2; else if (ret == 1) /* locally consumed */ @@ -1383,6 +1595,9 @@ recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; recv->header.recv_buf.grh = &recv->grh; + if (atomic_read(&qp_info->snoop_count)) + snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); + /* Validate MAD */ if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; @@ -1600,7 +1815,11 @@ /* Restore client wr_id in WC and complete send */ wc->wr_id = mad_send_wr->wr_id; - ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); + if (atomic_read(&qp_info->snoop_count)) + snoop_send(qp_info, &mad_send_wr->send_wr, + (struct ib_mad_send_wc *)wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); if (queued_send_wr) { ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, @@ -2068,6 +2287,10 @@ init_mad_queue(qp_info, &qp_info->send_queue); init_mad_queue(qp_info, &qp_info->recv_queue); INIT_LIST_HEAD(&qp_info->overflow_list); + spin_lock_init(&qp_info->snoop_lock); + qp_info->snoop_table = NULL; + qp_info->snoop_table_size = 0; + atomic_set(&qp_info->snoop_count, 0); } static int create_mad_qp(struct ib_mad_qp_info *qp_info, @@ -2108,6 +2331,8 @@ static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) { ib_destroy_qp(qp_info->qp); + if (qp_info->snoop_table) + kfree(qp_info->snoop_table); } /* Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1316) +++ core/mad_priv.h (working copy) @@ -121,6 +121,15 @@ u8 rmpp_version; }; +struct ib_mad_snoop_private { + struct ib_mad_agent agent; + struct ib_mad_qp_info *qp_info; + int snoop_index; + int mad_snoop_flags; + atomic_t refcount; + wait_queue_head_t wait; +}; + struct ib_mad_send_wr_private { struct ib_mad_list_head mad_list; struct list_head agent_list; @@ -171,6 +180,10 @@ struct ib_mad_queue send_queue; struct ib_mad_queue recv_queue; struct list_head overflow_list; + spinlock_t 
snoop_lock; + struct ib_mad_snoop_private **snoop_table; + int snoop_table_size; + atomic_t snoop_count; }; struct ib_mad_port_private { Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1316) +++ include/ib_mad.h (working copy) @@ -126,13 +126,29 @@ struct ib_mad_send_wc *mad_send_wc); /** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ +typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc); + +/** * ib_mad_recv_handler - callback handler for a received MAD. * @mad_agent: MAD agent requesting the received MAD. * @mad_recv_wc: Received work completion information on the received MAD. * * MADs received in response to a send request operation will be handed to * the user after the send operation completes. All data buffers given - * to the user through this routine are owned by the receiving client. + * to registered agents through this routine are owned by the receiving + * client, except for snooping agents. Clients snooping MADs should not + * modify the data referenced by @mad_recv_wc. */ typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, struct ib_mad_recv_wc *mad_recv_wc); @@ -143,6 +159,7 @@ * @qp: Reference to QP used for sending and receiving MADs. * @recv_handler: Callback handler for a received MAD. * @send_handler: Callback handler for a sent MAD. + * @snoop_handler: Callback handler for snooped sent MADs. * @context: User-specified context associated with this registration. * @hi_tid: Access layer assigned transaction ID for this client. * Unsolicited MADs sent by this client will have the upper 32-bits @@ -154,6 +171,7 @@ struct ib_qp *qp; ib_mad_recv_handler recv_handler; ib_mad_send_handler send_handler; + ib_mad_snoop_handler snoop_handler; void *context; u32 hi_tid; u8 port_num; @@ -247,6 +265,35 @@ ib_mad_recv_handler recv_handler, void *context); +enum ib_mad_snoop_flags { + /*IB_MAD_SNOOP_POSTED_SENDS = 1,*/ + /*IB_MAD_SNOOP_RMPP_SENDS = (1<<1),*/ + IB_MAD_SNOOP_SEND_COMPLETIONS = (1<<2), + /*IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS = (1<<3),*/ + IB_MAD_SNOOP_RECVS = (1<<4) + /*IB_MAD_SNOOP_RMPP_RECVS = (1<<5),*/ + /*IB_MAD_SNOOP_REDIRECTED_QPS = (1<<6)*/ +}; + +/** + * ib_register_mad_snoop - Register to snoop sent and received MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP traffic to snoop. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_snoop_flags: Specifies information where snooping occurs. + * @send_handler: The callback routine invoked for a snooped send. + * @recv_handler: The callback routine invoked for a snooped receive. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context); + /** * ib_unregister_mad_agent - Unregisters a client from using MAD services. 
* @mad_agent: Corresponding MAD registration request to deregister. From mshefty at ichips.intel.com Thu Dec 9 14:24:59 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 9 Dec 2004 14:24:59 -0800 Subject: [openib-general] [PATCH] [RFC] new test directory + test code for MAD snooping Message-ID: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> This patch creates a new subdirectory under infiniband called 'test', and adds a new kernel module called 'madeye' that can be used to snoop and display MADs (until user-mode smpdump and gmpdump programs are created). We need to decide if we want to include this sort of test code in the openib tree, and where it might best go. - Sean Index: Kconfig =================================================================== --- Kconfig (revision 1316) +++ Kconfig (working copy) @@ -11,4 +11,6 @@ source "drivers/infiniband/ulp/ipoib/Kconfig" +source "drivers/infiniband/test/madeye/Kconfig" + endmenu Index: Makefile =================================================================== --- Makefile (revision 1316) +++ Makefile (working copy) @@ -1,3 +1,4 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ +obj-$(CONFIG_INFINIBAND_MADEYE) += test/madeye/ Index: test/madeye/Kconfig =================================================================== --- test/madeye/Kconfig (revision 0) +++ test/madeye/Kconfig (revision 0) @@ -0,0 +1,6 @@ +config INFINIBAND_MADEYE + tristate "MAD debug viewer for InfiniBand" + depends on INFINIBAND + ---help--- + Prints sent and received MADs on QP 0/1 for debugging. + Index: test/madeye/madeye.c =================================================================== --- test/madeye/madeye.c (revision 0) +++ test/madeye/madeye.c (revision 0) @@ -0,0 +1,241 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand MAD viewer"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void madeye_remove_one(struct ib_device *device); +static void madeye_add_one(struct ib_device *device); + +static struct ib_client madeye_client = { + .name = "madeye", + .add = madeye_add_one, + .remove = madeye_remove_one +}; + +struct madeye_port { + struct ib_mad_agent *smi_agent; + struct ib_mad_agent *gsi_agent; +}; + +static char * get_class_name(u8 mgmt_class) +{ + switch(mgmt_class) { + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + return "LID routed SMP"; + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + return "Directed route SMP"; + case IB_MGMT_CLASS_SUBN_ADM: + return "Subnet admin."; + case IB_MGMT_CLASS_PERF_MGMT: + return "Perf. mgmt."; + case IB_MGMT_CLASS_BM: + return "Baseboard mgmt."; + case IB_MGMT_CLASS_DEVICE_MGMT: + return "Device mgmt."; + case IB_MGMT_CLASS_CM: + return "Comm. mgmt."; + case IB_MGMT_CLASS_SNMP: + return "SNMP"; + default: + return "Unknown vendor/application"; + } +} + +static char * get_method_name(u8 method) +{ + switch(method) { + case IB_MGMT_METHOD_GET: + return "Get"; + case IB_MGMT_METHOD_SET: + return "Set"; + case IB_MGMT_METHOD_GET_RESP: + return "Get response"; + case IB_MGMT_METHOD_SEND: + return "Send"; + case IB_MGMT_METHOD_TRAP: + return "Trap"; + case IB_MGMT_METHOD_REPORT: + return "Report"; + case IB_MGMT_METHOD_REPORT_RESP: + return "Report response"; + case IB_MGMT_METHOD_TRAP_REPRESS: + return "Trap repress"; + default: + return "Unknown"; + } +} + +static void print_status_details(u16 status) +{ + if (status & cpu_to_be16(0x0001)) + printk(" busy\n"); + if (status & cpu_to_be16(0x0002)) + printk(" redirection required\n"); + switch((be16_to_cpu(status) & 0x001C) >> 2) { + case 1: + printk(" bad version\n"); + break; + case 2: + printk(" method not supported\n"); + break; + case 3: + printk(" method/attribute combo not supported\n"); + break; + case 7: + printk(" invalid attribute/modifier value\n"); + break; + } +} + +static void print_mad_hdr(struct ib_mad_hdr *mad_hdr) +{ + printk("MAD version....0x%01x\n", mad_hdr->base_version); + printk("Class..........0x%01x (%s)\n", mad_hdr->mgmt_class, + get_class_name(mad_hdr->mgmt_class)); + printk("Class version..0x%01x\n", mad_hdr->class_version); + printk("Method.........0x%01x (%s)\n", mad_hdr->method, + get_method_name(mad_hdr->method)); + printk("Status.........0x%02x\n", mad_hdr->status); + if (mad_hdr->status) + print_status_details(mad_hdr->status); + printk("Class specific.0x%02x\n", mad_hdr->class_specific); + printk("Trans ID.......0x%llx\n", mad_hdr->tid); + printk("Attr ID........0x%02x\n", mad_hdr->attr_id); + printk("Attr modifier..0x%04x\n", mad_hdr->attr_mod); +} + +static void snoop_smi_handler(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + printk("Madeye:sent SMP\n"); + print_mad_hdr(send_wr->wr.ud.mad_hdr); +} + +static void recv_smi_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + printk("Madeye:recv SMP\n"); + print_mad_hdr(&mad_recv_wc->recv_buf->mad->mad_hdr); +} + +static void snoop_gsi_handler(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + printk("Madeye:sent GMP\n"); + print_mad_hdr(send_wr->wr.ud.mad_hdr); +} + +static void recv_gsi_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + printk("Madeye:recv 
GMP\n"); + print_mad_hdr(&mad_recv_wc->recv_buf->mad->mad_hdr); +} + +static void madeye_add_one(struct ib_device *device) +{ + struct madeye_port *port; + int reg_flags; + u8 i, s, e; + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + port = kmalloc(sizeof *port * (e - s + 1), GFP_KERNEL); + if (!port) + goto out; + + reg_flags = IB_MAD_SNOOP_SEND_COMPLETIONS | IB_MAD_SNOOP_RECVS; + for (i = s; i <= e; i++) { + port[i].smi_agent = ib_register_mad_snoop(device, i, + IB_QPT_SMI, + reg_flags, + snoop_smi_handler, + recv_smi_handler, + &port[i]); + port[i].gsi_agent = ib_register_mad_snoop(device, i, + IB_QPT_GSI, + reg_flags, + snoop_gsi_handler, + recv_gsi_handler, + &port[i]); + } + +out: + ib_set_client_data(device, &madeye_client, port); +} + +static void madeye_remove_one(struct ib_device *device) +{ + struct madeye_port *port; + int i, s, e; + + port = (struct madeye_port *) + ib_get_client_data(device, &madeye_client); + if (!port) + return; + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (i = s; i <= e; i++) { + if (!IS_ERR(port[i].smi_agent)) + ib_unregister_mad_agent(port[i].smi_agent); + if (!IS_ERR(port[i].gsi_agent)) + ib_unregister_mad_agent(port[i].gsi_agent); + } + kfree(port); +} + +static int __init ib_madeye_init(void) +{ + return ib_register_client(&madeye_client); +} + +static void __exit ib_madeye_cleanup(void) +{ + ib_unregister_client(&madeye_client); +} + +module_init(ib_madeye_init); +module_exit(ib_madeye_cleanup); Index: test/madeye/Makefile =================================================================== --- test/madeye/Makefile (revision 0) +++ test/madeye/Makefile (revision 0) @@ -0,0 +1,6 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_MADEYE) += ib_madeye.o + +ib_madeye-y := madeye.o \ + From iod00d at hp.com Thu Dec 9 15:41:09 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Dec 2004 15:41:09 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52pt1ndzh4.fsf@topspin.com> References: <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> Message-ID: <20041209234109.GN7178@esmail.cup.hp.com> On Mon, Dec 06, 2004 at 10:59:51AM -0800, Roland Dreier wrote: > Grant> Yes - that works. Please commit. > > I rewrote things in a way that seems cleaner to me -- what I actually > committed is below. Please try one more time and make sure this still > fixes the problem. I've updated to svn 1316 and that has both the problem I originally observed plus it "hangs" when the module is loaded. ... 
> Index: infiniband/hw/mthca/mthca_cmd.c > =================================================================== > --- infiniband/hw/mthca/mthca_cmd.c (revision 1310) > +++ infiniband/hw/mthca/mthca_cmd.c (working copy) > @@ -293,6 +293,12 @@ > complete(&context->done); > } > > +void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) > +{ > + while (ncomp--) > + up(&dev->cmd.event_sem); > +} > + > static void event_timeout(unsigned long context_ptr) > { > struct mthca_cmd_context *context = > @@ -357,7 +363,6 @@ > dev->cmd.free_head = context - dev->cmd.context; > spin_unlock(&dev->cmd.context_lock); > > - up(&dev->cmd.event_sem); > return err; > } I have the feeling the timeout code isn't cleaning up the event_sem and that's causing the hang. trying to understand this now. thanks, grant > > Index: infiniband/hw/mthca/mthca_eq.c > =================================================================== > --- infiniband/hw/mthca/mthca_eq.c (revision 1310) > +++ infiniband/hw/mthca/mthca_eq.c (working copy) > @@ -219,6 +219,7 @@ > struct mthca_eqe *eqe; > int disarm_cqn; > int work = 0; > + int ncmd = 0; > > while (1) { > if (!next_eqe_sw(eq)) > @@ -274,6 +275,7 @@ > be16_to_cpu(eqe->event.cmd.token), > eqe->event.cmd.status, > be64_to_cpu(eqe->event.cmd.out_param)); > + ++ncmd; > break; > > case MTHCA_EVENT_TYPE_PORT_CHANGE: > @@ -303,6 +305,9 @@ > set_eq_ci(dev, eq->eqn, eq->cons_index); > } > > + if (ncmd) > + mthca_cmd_complete(dev, ncmd); > + > eq_req_not(dev, eq->eqn); > } > > Index: infiniband/hw/mthca/mthca_cmd.h > =================================================================== > --- infiniband/hw/mthca/mthca_cmd.h (revision 1310) > +++ infiniband/hw/mthca/mthca_cmd.h (working copy) > @@ -203,10 +203,9 @@ > > int mthca_cmd_use_events(struct mthca_dev *dev); > void mthca_cmd_use_polling(struct mthca_dev *dev); > -void mthca_cmd_event(struct mthca_dev *dev, > - u16 token, > - u8 status, > - u64 out_param); > +void mthca_cmd_event(struct mthca_dev *dev, u16 token, > + u8 status, u64 out_param); > +void mthca_cmd_complete(struct mthca_dev *dev, int ncomp); > > int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); > int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); From mshefty at ichips.intel.com Thu Dec 9 15:48:06 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Dec 2004 15:48:06 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041209234109.GN7178@esmail.cup.hp.com> References: <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> Message-ID: <41B8E436.7050509@ichips.intel.com> Grant Grundler wrote: > On Mon, Dec 06, 2004 at 10:59:51AM -0800, Roland Dreier wrote: > >> Grant> Yes - that works. Please commit. >> >>I rewrote things in a way that seems cleaner to me -- what I actually >>committed is below. Please try one more time and make sure this still >>fixes the problem. > > > I've updated to svn 1316 and that has both the problem I > originally observed plus it "hangs" when the module is loaded. > > ... 
> >>Index: infiniband/hw/mthca/mthca_cmd.c >>=================================================================== >>--- infiniband/hw/mthca/mthca_cmd.c (revision 1310) >>+++ infiniband/hw/mthca/mthca_cmd.c (working copy) >>@@ -293,6 +293,12 @@ >> complete(&context->done); >> } >> >>+void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) >>+{ >>+ while (ncomp--) >>+ up(&dev->cmd.event_sem); >>+} >>+ I had to remove this patch in order to get things working on my system. - Sean From mshefty at ichips.intel.com Thu Dec 9 15:49:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 9 Dec 2004 15:49:16 -0800 Subject: [openib-general] [PATCH] remove add_mad_reg_req function Message-ID: <20041209154916.69d2dc80.mshefty@ichips.intel.com> This patch removes add_mad_reg_req(), which removes redundant checks in the code. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1317) +++ core/mad.c (working copy) @@ -80,8 +80,6 @@ /* Forward declarations */ static int method_in_use(struct ib_mad_mgmt_method_table **method, struct ib_mad_reg_req *mad_reg_req); -static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, - struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, struct ib_mad_private *mad); @@ -321,6 +319,8 @@ goto error3; } } + ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, + mgmt_class); } else { /* "New" vendor class range */ vendor = port_priv->version[mad_reg_req-> @@ -335,13 +335,12 @@ goto error3; } } + ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); + } + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; } - } - - ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); - if (ret2) { - ret = ERR_PTR(ret2); - goto error3; } /* Add mad agent into port's agent list */ @@ -1007,24 +1006,6 @@ return ret; } -static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, - struct ib_mad_agent_private *priv) -{ - int ret; - u8 mgmt_class; - - /* Make sure MAD registration request supplied */ - if (!mad_reg_req) - return 0; - - mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); - if (!is_vendor_class(mgmt_class)) - ret = add_nonoui_reg_req(mad_reg_req, priv, mgmt_class); - else - ret = add_oui_reg_req(mad_reg_req, priv); - return ret; -} - static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) { struct ib_mad_port_private *port_priv; From iod00d at hp.com Thu Dec 9 16:05:36 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Dec 2004 16:05:36 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <41B8E436.7050509@ichips.intel.com> References: <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> <41B8E436.7050509@ichips.intel.com> Message-ID: <20041210000536.GO7178@esmail.cup.hp.com> On Thu, Dec 09, 2004 at 03:48:06PM -0800, Sean Hefty wrote: > >>+void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) > >>+{ > >>+ while (ncomp--) > >>+ up(&dev->cmd.event_sem); > >>+} > >>+ > > I had to remove this patch in order to get things working on my system. Ok. But I see now that the set_ci patch that worked hasn't been committed. 
Roland, Would you consider committing the set_ci patch and backing the ncmd patch out? thanks, grant From iod00d at hp.com Thu Dec 9 17:12:16 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Dec 2004 17:12:16 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041210000536.GO7178@esmail.cup.hp.com> References: <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> <41B8E436.7050509@ichips.intel.com> <20041210000536.GO7178@esmail.cup.hp.com> Message-ID: <20041210011216.GQ7178@esmail.cup.hp.com> On Thu, Dec 09, 2004 at 04:05:36PM -0800, Grant Grundler wrote: > Roland, > Would you consider committing the set_ci patch and backing the ncmd patch out? Roland, Attached is a patch that does three things (LART me for combining them, but I can split them up later if you want): o reverse most of the mthca_cmd_complete() patch o add set_ci logic so consumer index is updated inside the event loop o eliminate "work" variable (nicely simplifies the code). If I understood the code correctly, "work" was always being set if next_eqe_sw(eq) was non-zero. It is so expensive to get to the interrupt handler in the first place that if nothing needs to be done: a) the problem is NOT in the interrupt handler. b) the spurious set_eq_ci() is light weight enough to be in the noise I do understand the wmb()/set_eq_ci() disturb the PCI data flows but we shouldn't be seeing spurious interrupts either. If they are happening often enough to disturb performance, something else is wrong. grant Index: hw/mthca/mthca_cmd.c =================================================================== --- hw/mthca/mthca_cmd.c (revision 1317) +++ hw/mthca/mthca_cmd.c (working copy) @@ -293,12 +293,6 @@ complete(&context->done); } -void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) -{ - while (ncomp--) - up(&dev->cmd.event_sem); -} - static void event_timeout(unsigned long context_ptr) { struct mthca_cmd_context *context = @@ -363,6 +357,7 @@ dev->cmd.free_head = context - dev->cmd.context; spin_unlock(&dev->cmd.context_lock); + up(&dev->cmd.event_sem); return err; } Index: hw/mthca/mthca_eq.c =================================================================== --- hw/mthca/mthca_eq.c (revision 1317) +++ hw/mthca/mthca_eq.c (working copy) @@ -218,15 +218,10 @@ { struct mthca_eqe *eqe; int disarm_cqn; - int work = 0; - int ncmd = 0; - while (1) { - if (!next_eqe_sw(eq)) - break; - + while (next_eqe_sw(eq)) { + int set_ci = 0; eqe = get_eqe(eq, eq->cons_index); - work = 1; switch (eqe->type) { case MTHCA_EVENT_TYPE_COMP: @@ -275,7 +270,11 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); - ++ncmd; + /* cmd_event() may add more commands. + * The card will think the queue has overflowed if + * we don't tell it we've been processing events. + */ + set_ci = 1; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -298,25 +297,25 @@ set_eqe_hw(eq, eq->cons_index); eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); - } - if (work) { - /* - * This barrier makes sure that all updates to - * ownership bits done by set_eqe_hw() hit memory - * before the consumer index is updated. 
set_eq_ci() - * allows the HCA to possibly write more EQ entries, - * and we want to avoid the exceedingly unlikely - * possibility of the HCA writing an entry and then - * having set_eqe_hw() overwrite the owner field. - */ - wmb(); - set_eq_ci(dev, eq->eqn, eq->cons_index); + if (set_ci) { + wmb(); /* see comment below */ + set_eq_ci(dev, eq->eqn, eq->cons_index); + set_ci = 0; + } } - if (ncmd) - mthca_cmd_complete(dev, ncmd); - + /* + * This barrier makes sure that all updates to + * ownership bits done by set_eqe_hw() hit memory + * before the consumer index is updated. set_eq_ci() + * allows the HCA to possibly write more EQ entries, + * and we want to avoid the exceedingly unlikely + * possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. + */ + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); eq_req_not(dev, eq->eqn); } Index: hw/mthca/mthca_cmd.h =================================================================== --- hw/mthca/mthca_cmd.h (revision 1317) +++ hw/mthca/mthca_cmd.h (working copy) @@ -205,7 +205,6 @@ void mthca_cmd_use_polling(struct mthca_dev *dev); void mthca_cmd_event(struct mthca_dev *dev, u16 token, u8 status, u64 out_param); -void mthca_cmd_complete(struct mthca_dev *dev, int ncomp); int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); From mucci at cs.utk.edu Fri Dec 10 05:19:32 2004 From: mucci at cs.utk.edu (Philip Mucci) Date: Fri, 10 Dec 2004 14:19:32 +0100 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <1102684772.3690.31.camel@muccislaptop.pdc.kth.se> Hi Shahar, Thanks again for the information. Yes, the OSM interface sure seems a bit excessive for my needs...you mentioned that the libumad interface will be much simpler? Simpler is better. I definitely would like to have access to your code, that would be great. I'll chat more with you about this off-line. Phil On Thu, 2004-12-09 at 18:55 +0200, shaharf wrote: > Philip, > I would say that there is not much in common between gen1 and > gen2 user mode mad interface. If you want your tools to works above > both, then osm vendor layer is your only choice. If you are planning to > use gen2 stacks only at some point you can use libumad. > > If you need just a simple PM portcounter get query, then maybe osm > vendor api is a bit too "fat" for you. I would consider using gen1 > interface directly. But still, this is your choice. > > I guess that on the long run, openib gen2 will be the only maintained > openib version. Anyone thinks differently? > > Shahar > > > From: Philip Mucci [mailto:mucci at cs.utk.edu] > > Sent: Thursday, December 09, 2004 6:44 PM > > To: shaharf > > Cc: openib-general at openib.org > > Subject: RE: [openib-general] Should I use umad -or- osm > > > > Hi Shahar, > > > > Thanks for the info. > > > > My needs are very 'simple'. Just to send and receive MADS to the PM > > agent on each adapter and switch in the network. > > > > Ideally, I would like this to work for existing installations based on > > either OpenIB gen1 or Mellanox Gold. But I think for that, I need to > use > > the current osm_vendor_api.h interface. > > > > How much will this interface change with gen2? will it export the same > > functions? Or will everything change > > > > Lastly, will I be able to send/recv these MADS as a non-root user? > > > > Thanks again, and the answer to your question is, yes, I can wait. 
> > > > Regards, > > > > Philip > > > > > > On Thu, 2004-12-09 at 17:52 +0200, shaharf wrote: > > > Hi Philip, > > > > > > I am currently implementing umad access library. It much simpler > > > then the osm_vendor api. I would recommend using umad library and > not > > > osm. As a matter of fact the current osm vendor layer does not > support > > > openib gen2. I am working on that either. Both the umad access > library > > > and the new osm vendor layer that uses it are not finished yet. I > guess > > > that I will need at least another week to reach a point where I can > > > release it. Even then it will change a lot until I will be finished > with > > > it. > > > The question is can you wait a little? I can give you a preliminary > > > version - but if you will use it you will have to modify your code > > > several each time the library interface is changed. > > > On the other hand, I would like to understand exactly what you need, > > > because you are the first "client" of the user mode stuff beside me. > > > > > > Shahar > > > > > > > -----Original Message----- > > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > > bounces at openib.org] On Behalf Of Philip Mucci > > > > Sent: Thursday, December 09, 2004 3:25 PM > > > > To: openib-general at openib.org > > > > Subject: [openib-general] Should I use umad -or- osm > > > > > > > > Hi folks, > > > > > > > > I've been tasked with developing a rough performance tool for IB > > > > networks. I've scanned the documentation and looks like the kind > of > > > data > > > > we're interested in can be obtained from the Mellanox ASICs. > > > > > > > > My question is a simple one: > > > > > > > > I've got to send/recv mads to enable and obtain the performance > > > counters > > > > from a user space tool...ideally non-root, but we'll work with > what we > > > > have. > > > > > > > > My current inclination has been to use the osm_vendor_api.h > functions > > > to > > > > do this work. However, the late work here done by Hal on the UMAD > > > access > > > > layer seems to be appropriate as well. > > > > > > > > Could someone elaborate on what you think the best (and most > > > > maintainable) approach to accomplishing this task might be? > > > > > > > > Regards, > > > > > > > > Philip > > > > > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib- > > > > general > From hnrose at earthlink.net Fri Dec 10 06:29:07 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 09:29:07 -0500 Subject: [openib-general] [PATCH] MAD snooping API/implementation Message-ID: <001801c4dec4$9bcee3c0$6501a8c0@comcast.net> Hi Sean, A couple of minor questions about this patch: 1. In ib_mad.h: /** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ I presume snoop clients should also not free the MAD either. If so, should that comment also be added ? 2. Should MAD snooping be exposed to user space too ? Thanks. 
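For reference, a minimal kernel-side sketch of the kind of consumer this API enables - the function and counter names below are made up, and it assumes the ib_register_mad_snoop() signature, snoop flags and handler typedefs exactly as defined in the patch above (plus the usual ib_mad.h and atomic includes) - that simply counts snooped GSI sends and receives:

static atomic_t snooped_sends = ATOMIC_INIT(0);
static atomic_t snooped_recvs = ATOMIC_INIT(0);

/* Send-completion snoop: look at the MAD but never modify or free it. */
static void count_send(struct ib_mad_agent *mad_agent,
                       struct ib_send_wr *send_wr,
                       struct ib_mad_send_wc *mad_send_wc)
{
        atomic_inc(&snooped_sends);
}

/* Receive snoop: same rule, the buffer still belongs to the MAD layer. */
static void count_recv(struct ib_mad_agent *mad_agent,
                       struct ib_mad_recv_wc *mad_recv_wc)
{
        atomic_inc(&snooped_recvs);
}

static struct ib_mad_agent *start_gsi_counters(struct ib_device *device,
                                               u8 port_num)
{
        return ib_register_mad_snoop(device, port_num, IB_QPT_GSI,
                                     IB_MAD_SNOOP_SEND_COMPLETIONS |
                                     IB_MAD_SNOOP_RECVS,
                                     count_send, count_recv, NULL);
}

The agent returned there is later torn down with the existing ib_unregister_mad_agent(), which the patch teaches to recognize snoop-only registrations.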
- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 06:55:22 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 09:55:22 -0500 Subject: [openib-general] [PATCH] MAD snooping API/implementation Message-ID: <003101c4dec8$468281c0$6501a8c0@comcast.net> Thanks :-) Applied. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 07:09:02 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:09:02 -0500 Subject: [openib-general] [PATCH] remove add_mad_reg_req function Message-ID: <003c01c4deca$2f418680$6501a8c0@comcast.net> Thanks. Applied with the minor addition of forward declarations for add_oui/nonoui_reg_req. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 07:11:04 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:11:04 -0500 Subject: [openib-general] [PATCH] remove add_mad_reg_req function Message-ID: <005f01c4deca$77e6f000$6501a8c0@comcast.net> Thanks. Applied with the minor addition of forward declarations for add_oui/nonoui_reg_req. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 07:38:08 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:38:08 -0500 Subject: [openib-general] IPv6 All Router Multicast Group Message-ID: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> Hi again Roland, In looking at the code, this IPv6 group (all routers) (0xff12:601b:ffff:0:0:0:0:2) is going through the send only path in ipoib_multicast.c::ipoib_mcast_send where: spin_lock_irqsave(&priv->lock, flags); mcast = __ipoib_mcast_find(dev, mgid); if (!mcast) { /* Let's create a new send only group now */ ipoib_dbg_mcast(priv, "setting up send only multicast group for " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); mcast = ipoib_mcast_alloc(dev, 0); if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); dev_kfree_skb_any(skb); goto out; } set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); mcast->mcmember.mgid = *mgid; __ipoib_mcast_add(dev, mcast); list_add_tail(&mcast->list, &priv->multicast_list); } if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); else dev_kfree_skb_any(skb); if (mcast->query) ipoib_dbg_mcast(priv, "no address vector, " "but multicast join already started\n"); else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) ipoib_mcast_sendonly_join(mcast); Although the sendonly_join code has been changed to do a full member rather than send only join, it does not fall back to create the group if it does not already exist. One wouldn't expect a send only join to create the group if it didn't already exist. Any idea on why this group is send only ? Don't end nodes need to both send and receive on the all routers group ? -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hnrose at earthlink.net Fri Dec 10 07:41:54 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:41:54 -0500 Subject: [openib-general] IPoIB still not working Message-ID: <008601c4dece$c6e14da0$6501a8c0@comcast.net> Hi, MGID....................0xff12401bffff0000 : 0x0000000000000016 PortGid.................0xfe80000000000000 : 0x0002c9010ad25b91 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 [1102546997:000053334][18007] -> osm_physp_share_pkey: [ [1102546997:000053358][18007] -> osm_physp_share_pkey: ] [1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. This is the IPv4 equivalent of what the previous post on All Routers Multicast Group. For some reason, your node is joining this as send only (which attempts to join rather than create) the underlying IB multicast group for this IP multicast group. I'm not sure how the Linux network stack decides this. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 07:54:01 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:54:01 -0500 Subject: [openib-general] Other Send Only Multicast Groups Message-ID: <000c01c4ded0$78670dc0$6501a8c0@comcast.net> I've also seen the IGMP group (0x16) for IPv6 and/or IPv4 (224.0.0.22) joined as send only (in addition to the all routers (2) group). -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.j.woodruff at intel.com Fri Dec 10 08:50:01 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 10 Dec 2004 08:50:01 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F000302FBBE@orsmsx408> >[1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: >method = SubnAdmSet,scope_state = 0x1, component mask = >0x0000000000010083, expected comp mask = 0x00000000000130c7. >This is the IPv4 equivalent of what the previous post on All Routers Multicast >Group. >For some reason, your node is joining this as send only (which attempts to >join rather than create) the underlying IB multicast group for this IP multicast group. >I'm not sure how the Linux network stack decides this. >-- Hal In comparing the behavior of my EM64T system against Seans 32-bit systems, I see Sean's stack only joining mcast group's c0000 and c0001, the 2 created by openSM. For some reason, my stack also creates an additional 2 groups. The first one C002 appears to get created OK. The next one, it seems to try to do the join without create, as seen above. I will try compare the .config files from Sean and my system to see what are the differences in the network stack configuration, but it seems to be related to how the network stack is configured. woody From sean.hefty at intel.com Fri Dec 10 09:15:55 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 10 Dec 2004 09:15:55 -0800 Subject: [openib-general] [PATCH] MAD snooping API/implementation In-Reply-To: <001801c4dec4$9bcee3c0$6501a8c0@comcast.net> Message-ID: I presume snoop clients should also not free the MAD either. If so, should that comment also be added ? Snooping clients should never touch the MADs. There are a couple of comments to that effect elsewhere in the code, but we can add more if needed. 2. 
Should MAD snooping be exposed to user space too ? I think so. My hope was that a client would be built on top of this API to expose snooping to user-mode, since it requires copying the MAD. I'd also like to eventually take the madeye code and modify it to collect statistics and dynamically allow printing of MAD data. - Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Fri Dec 10 10:41:19 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 10:41:19 -0800 Subject: [openib-general] ipoib "link" failure on ia64 Message-ID: <20041210184119.GF11324@esmail.cup.hp.com> Hi Roland/et al, With the patch I posted last time I was able to ifconfig ib0/ib1. But I had problems with sending packets from one host to another. My goal was to run netperf but that doesn't seem possible yet on ia64. Both systems are running 2.6.10-rc2 + openib-1317 + set_ci patch. Both systems have both ports connected to a Topspin 12port switch. Both showed similar behaviour - following output applies to both. "ionize" is an rx2600 with proto Cougarcub Card flashed with: fw-cougarcub-a1-3.1.0.bin I will update this with fw-cougarcub-a1-3.2.0.bin that I found in a different tar ball. The other system "iowa" is an rx4640 with "HP" Cougar flashed with: hca-cougar-a1-250-157.bin I'll update that to the corresponding 3.2.0.bin as well. ionize:~# cat /sys/class/infiniband/mthca0/ports/?/state 4: ACTIVE 4: ACTIVE ionize:~# modprobe ib_ipoib ionize:~# ifconfig -a ... ionize:~# ifconfig ib0 10.0.0.2 netmask 255.255.255.0 broadcast 10.0.0.255 ionize:~# ifconfig ib1 10.0.1.2 netmask 255.255.255.0 broadcast 10.0.1.255 (Ditto for iowa but with x.x.x.1) I could ping 10.0.1.1 (from ionize) but not 10.0.0.1. Then after trying to run netperf, ping failed for 10.0.1.x subnet too. My initial guess is there are more race conditions in the code. Ideas on where I should start looking? Or other debug info wanted? grant From roland at topspin.com Fri Dec 10 10:52:43 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 10:52:43 -0800 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: <1102521594.4129.42.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 08 Dec 2004 10:59:54 -0500") References: <1102521594.4129.42.camel@localhost.localdomain> Message-ID: <52653aasuc.fsf@topspin.com> Hal> Hi Roland, It doesn't look to me like there is a way to Hal> cancel a MAD from user space. Would this be an additional Hal> ioctl to support ? This is needed from an OpenSM Hal> perspective. I guess it would be another ioctl. However I'm not sure how useful this is for userspace... in the kernel modules like IPoIB want to cancel pending sends to avoid a callback when the module is unloaded. However in userspace, an application can just close the file on exit. - R. From roland at topspin.com Fri Dec 10 10:56:07 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 10:56:07 -0800 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: (shaharf@voltaire.com's message of "Thu, 9 Dec 2004 13:23:20 +0200") References: Message-ID: <521xdyasoo.fsf@topspin.com> shaharf> 2. I need an interface for setting/clearing IS_SM bit. If shaharf> there is such please let me know. If not, I would like to shaharf> add an additional ioctl to set/clear the IS_SM bit. A ctl shaharf> file in /dev/.../ports/.../ is anther option. Please tell shaharf> me what do you think. Don't add an ioctl. 
Let's create an "is_sm" file that sets the bit when opened and clears it when closed. shaharf> 3. It would help me very much if I could get an async shaharf> event on some changes, especially ports state shaharf> changes. Lid/SM lid changes events will be nice too. I am shaharf> not familiar with the HCA fw, so I don't know if it does shaharf> trigger such events. The AnafaII fw does. I would create another device special file like the umad file and use reads to get the events. - Roland From roland at topspin.com Fri Dec 10 10:56:55 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 10:56:55 -0800 Subject: [openib-general] Re: [PATCH] fix race condition in mthca event code In-Reply-To: <20041209104317.66264ea2.mshefty@ichips.intel.com> (Sean Hefty's message of "Thu, 9 Dec 2004 10:43:17 -0800") References: <20041209104317.66264ea2.mshefty@ichips.intel.com> Message-ID: <52wtvq9e2w.fsf@topspin.com> Sean> This patch fixed my problem hitting the BUG_ON code in Sean> mthca_cmd, line 328. It moves releasing the semaphore to Sean> after freeing the event entry. Unfortunately this backs out the fix for Grant's race. I need to think about the proper fix here. Thanks, Roland From roland at topspin.com Fri Dec 10 10:58:05 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 10:58:05 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 10:38:08 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> Message-ID: <52sm6e9e0y.fsf@topspin.com> Hal> Any idea on why this group is send only ? Don't end nodes Hal> need to both send and receive on the all routers group ? IPoIB does a send only join when it gets a multicast packet to send. It will do a full member join when the kernel asks it to add a multicast group to the list of groups to receive from. - Roland From roland at topspin.com Fri Dec 10 11:24:43 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 11:24:43 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041210011216.GQ7178@esmail.cup.hp.com> (Grant Grundler's message of "Thu, 9 Dec 2004 17:12:16 -0800") References: <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> <41B8E436.7050509@ichips.intel.com> <20041210000536.GO7178@esmail.cup.hp.com> <20041210011216.GQ7178@esmail.cup.hp.com> Message-ID: <52fz2e9csk.fsf@topspin.com> Grant> Roland, Attached is a patch that does three things (LART me Grant> for combining them, but I can split them up later if you Grant> want) Thanks, I applied this. Sorry about the brokenness for the past few days... (The only thing I would LART you for would be leaving out the Signed-off-by: line with your patch, but I'm not going to be anal about that yet) - R. From hnrose at earthlink.net Fri Dec 10 11:37:18 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 14:37:18 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> Message-ID: <002601c4deef$a9c18ac0$6501a8c0@comcast.net> Roland Dreier wrote: > IPoIB does a send only join when it gets a multicast packet to send. 
and this packet is directed at a group which is not currently joined. > It will do a full member join when the kernel asks it to add a > multicast group to the list of groups to receive from. My question was a little unclear; I meant: Why does the kernel not request a full join in the case of the all routers and IGMP groups ? What is the differentiator for this ? -- Hal From roland at topspin.com Fri Dec 10 11:43:22 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 11:43:22 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <002601c4deef$a9c18ac0$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 14:37:18 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> <002601c4deef$a9c18ac0$6501a8c0@comcast.net> Message-ID: <524qiu9bxh.fsf@topspin.com> Hal> My question was a little unclear; I meant: Why does the Hal> kernel not request a full join in the case of the all routers Hal> and IGMP groups ? What is the differentiator for this ? Not sure exactly but I would guess only routers (or nodes running routing protocols) would join these groups (as opposed to groups that all nodes have to join). On the other hand I'm not sure who is sending packets to these groups either... - R. From hnrose at earthlink.net Fri Dec 10 11:50:46 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 14:50:46 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net><52sm6e9e0y.fsf@topspin.com><002601c4deef$a9c18ac0$6501a8c0@comcast.net> <524qiu9bxh.fsf@topspin.com> Message-ID: <005e01c4def1$8b673d20$6501a8c0@comcast.net> Roland Dreier wrote: > Hal> My question was a little unclear; I meant: Why does the > Hal> kernel not request a full join in the case of the all routers > Hal> and IGMP groups ? What is the differentiator for this ? > > Not sure exactly but I would guess only routers (or nodes running > routing protocols) would join these groups (as opposed to groups that > all nodes have to join). On the other hand I'm not sure who is > sending packets to these groups either... I would expect end nodes joining the all routers group to at least be listening to router advertisements so just joining send only doesn't make sense. If the node is behaving as a router, I would expect a full join as they would need to both send and receive. In terms of IGMP, if the end node is running multicast, it needs to send and receive as would a multicast router. So a send only join doesn't make sense to me for these groups. To state what sounds obvious, the only time a send only join makes sense would be for some send only application. Something doesn't quite seem right to me here. -- Hal From roland at topspin.com Fri Dec 10 11:59:47 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 11:59:47 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <005e01c4def1$8b673d20$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 14:50:46 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> <002601c4deef$a9c18ac0$6501a8c0@comcast.net> <524qiu9bxh.fsf@topspin.com> <005e01c4def1$8b673d20$6501a8c0@comcast.net> Message-ID: <52zn0l9b64.fsf@topspin.com> Hal> I would expect end nodes joining the all routers group to at Hal> least be listening to router advertisements so just joining Hal> send only doesn't make sense. 
If the node is behaving as a Hal> router, I would expect a full join as they would need to both Hal> send and receive. Doesn't IPv6 autoconfiguration work by having a node send a router solicit message to the all-routers group (without needing to listen to the all-routers group)? - R. From hnrose at earthlink.net Fri Dec 10 12:07:50 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 15:07:50 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net><52sm6e9e0y.fsf@topspin.com><002601c4deef$a9c18ac0$6501a8c0@comcast.net><524qiu9bxh.fsf@topspin.com><005e01c4def1$8b673d20$6501a8c0@comcast.net> <52zn0l9b64.fsf@topspin.com> Message-ID: <007301c4def3$ed472b20$6501a8c0@comcast.net> Roland Dreier wrote: > Hal> I would expect end nodes joining the all routers group to at > Hal> least be listening to router advertisements so just joining > Hal> send only doesn't make sense. If the node is behaving as a > Hal> router, I would expect a full join as they would need to both > Hal> send and receive. > > Doesn't IPv6 autoconfiguration work by having a node send a > router solicit message to the all-routers group (without needing to > listen to the all-routers group)? Routers send router advertisements periodically. If a hosts does not receive any router advertisements in some time period, it will solicit for routers (send a router solicitation message) some number of times. So maybe IPv6 unicast routers don't need to receive on this group (hosts definitely do). For IGMP, I think both hosts and routers need to both send and receive yet we see a send only join for this group. The same thing appears to be occuring in these IPv4 (all routers and IGMP). -- Hal From mshefty at ichips.intel.com Fri Dec 10 12:10:50 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Dec 2004 12:10:50 -0800 Subject: [openib-general] [PATCH] embed ib_mad_recv_buf into ib_mad_recv_wc Message-ID: <20041210121050.7e41f45e.mshefty@ichips.intel.com> This patch replaces a pointer to struct ib_mad_recv_buf in struct ib_mad_recv_wc by embedding the recv_buf directly. It saves having to allocate and dereference a pointer. Patch touches mad.c, user_mad.c, and sa_query.c + header files. 
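In essence (condensed from the ib_mad.h hunk that follows), the work completion now embeds its first receive buffer instead of pointing at it, and consumers switch from mad_recv_wc->recv_buf->mad to mad_recv_wc->recv_buf.mad:

/* After this patch (condensed from the ib_mad.h hunk below): */
struct ib_mad_recv_wc {
        struct ib_wc            *wc;
        struct ib_mad_recv_buf  recv_buf;   /* previously: struct ib_mad_recv_buf *recv_buf */
        int                     mad_len;
};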
- Sean Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1321) +++ include/ib_mad.h (working copy) @@ -215,7 +215,7 @@ */ struct ib_mad_recv_wc { struct ib_wc *wc; - struct ib_mad_recv_buf *recv_buf; + struct ib_mad_recv_buf recv_buf; int mad_len; }; Index: core/user_mad.c =================================================================== --- core/user_mad.c (revision 1321) +++ core/user_mad.c (working copy) @@ -148,7 +148,7 @@ memset(packet, 0, sizeof *packet); - memcpy(packet->mad.data, mad_recv_wc->recv_buf->mad, sizeof packet->mad.data); + memcpy(packet->mad.data, mad_recv_wc->recv_buf.mad, sizeof packet->mad.data); packet->mad.status = 0; packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); Index: core/mad.c =================================================================== --- core/mad.c (revision 1321) +++ core/mad.c (working copy) @@ -80,7 +80,7 @@ /* Forward declarations */ static int method_in_use(struct ib_mad_mgmt_method_table **method, struct ib_mad_reg_req *mad_reg_req); -static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, struct ib_mad_private *mad); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); @@ -696,11 +696,10 @@ mad_priv->header.recv_wc.wc = &wc; mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); - mad_priv->header.recv_buf.grh = NULL; - mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = - &mad_priv->header.recv_buf; + INIT_LIST_HEAD(&mad_priv->header.recv_wc.recv_buf.list); + mad_priv->header.recv_wc.recv_buf.grh = NULL; + mad_priv->header.recv_wc.recv_buf.mad = + &mad_priv->mad.mad; if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) snoop_recv(mad_agent_priv->qp_info, &mad_priv->header.recv_wc, @@ -906,11 +905,12 @@ * Walk receive buffer list associated with this WC * No need to remove them from list of receive buffers */ - list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { /* Free previous receive buffer */ kmem_cache_free(ib_mad_cache, priv); - mad_priv_hdr = container_of(entry, struct ib_mad_private_header, - recv_buf); + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); priv = container_of(mad_priv_hdr, struct ib_mad_private, header); } @@ -1462,7 +1462,7 @@ struct ib_mad_private *recv) { /* Until we have RMPP, all receives are reassembled!... 
*/ - INIT_LIST_HEAD(&recv->header.recv_buf.list); + INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); return recv; } @@ -1553,7 +1553,6 @@ struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; - struct ib_smp *smp; int solicited; response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); @@ -1577,32 +1576,30 @@ /* Setup MAD receive work completion from "normal" work completion */ recv->header.recv_wc.wc = wc; recv->header.recv_wc.mad_len = sizeof(struct ib_mad); - recv->header.recv_wc.recv_buf = &recv->header.recv_buf; - recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; - recv->header.recv_buf.grh = &recv->grh; + recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; + recv->header.recv_wc.recv_buf.grh = &recv->grh; if (atomic_read(&qp_info->snoop_count)) snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); /* Validate MAD */ - if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) + if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) goto out; - if (recv->header.recv_buf.mad->mad_hdr.mgmt_class == + if (recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - smp = (struct ib_smp *)recv->header.recv_buf.mad; - if (!smi_handle_dr_smp_recv(smp, + if (!smi_handle_dr_smp_recv(&recv->mad.smp, port_priv->device->node_type, port_priv->port_num, port_priv->device->phys_port_cnt)) goto out; - if (!smi_check_forward_dr_smp(smp)) + if (!smi_check_forward_dr_smp(&recv->mad.smp)) goto local; - if (!smi_handle_dr_smp_send(smp, + if (!smi_handle_dr_smp_send(&recv->mad.smp, port_priv->device->node_type, port_priv->port_num)) goto out; - if (!smi_check_local_dr_smp(smp, + if (!smi_check_local_dr_smp(&recv->mad.smp, port_priv->device, port_priv->port_num)) goto out; @@ -1625,7 +1622,7 @@ ret = port_priv->device->process_mad(port_priv->device, 0, port_priv->port_num, wc->slid, - recv->header.recv_buf.mad, + &recv->mad.mad, &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_CONSUMED) @@ -1642,9 +1639,8 @@ } /* Determine corresponding MAD agent for incoming receive MAD */ - solicited = solicited_mad(recv->header.recv_buf.mad); - mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, - solicited); + solicited = solicited_mad(&recv->mad.mad); + mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); if (mad_agent) { ib_mad_complete_recv(mad_agent, recv, solicited); /* Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1321) +++ core/mad_priv.h (working copy) @@ -90,7 +90,6 @@ struct ib_mad_private_header { struct ib_mad_list_head mad_list; struct ib_mad_recv_wc recv_wc; - struct ib_mad_recv_buf recv_buf; DECLARE_PCI_UNMAP_ADDR(mapping) } __attribute__ ((packed)); Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 1321) +++ core/sa_query.c (working copy) @@ -728,9 +728,9 @@ if (query) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, - mad_recv_wc->recv_buf->mad->mad_hdr.status ? + mad_recv_wc->recv_buf.mad->mad_hdr.status ? 
-EINVAL : 0, - (struct ib_sa_mad *) mad_recv_wc->recv_buf->mad); + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); else query->callback(query, -EIO, NULL); } From roland at topspin.com Fri Dec 10 12:29:58 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 12:29:58 -0800 Subject: [openib-general] Re: [PATCH] embed ib_mad_recv_buf into ib_mad_recv_wc In-Reply-To: <20041210121050.7e41f45e.mshefty@ichips.intel.com> (Sean Hefty's message of "Fri, 10 Dec 2004 12:10:50 -0800") References: <20041210121050.7e41f45e.mshefty@ichips.intel.com> Message-ID: <52vfb999rt.fsf@topspin.com> This patch seems fine to apply to me. - R. From hnrose at earthlink.net Fri Dec 10 13:10:22 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 16:10:22 -0500 Subject: [openib-general] A Couple of verbs/mthca/firmware questions Message-ID: <004001c4defc$a9f2c100$6501a8c0@comcast.net> Hi, I have a couple of questions relative to verbs/mthca/firmware: 1. On received packets, the LRH is not visible but there are some WC fields set. One of those fields is dlid_path_bits. dlid_path_bits seems to be the same regardless of whether the incoming DR SMP was sent to the actual DLID or the permissive DLID. Is there a way to distinguish these cases ? 2. If a status is set in the MAD header and a post_send is issued on that MAD, should that status be sent in the packet ? Is there any conversion of the status field which might occur ? Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Fri Dec 10 13:22:19 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 13:22:19 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) Message-ID: <20041210212219.GK11324@esmail.cup.hp.com> /me bounces! With openib-1321, ip_ipoib is seems to be working! I couldn't reproduce the problem with ping not working sometimes. Current issue is misaligned accesses in the kernel. Here's a "cleaner" set of data. ionize:/usr/src/linux-ia64-release-2.6.10# modprobe ib_mthca ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0) GSI 60 (level, low) -> CPU 0 (0x0000) vector 67 ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 67 ionize:/usr/src/linux-ia64-release-2.6.10# elilo -v --efiboot ionize:/usr/src/linux-ia64-release-2.6.10# modprobe ib_ipoib ionize:/usr/src/linux-ia64-release-2.6.10# cat /sys/class/infiniband/mthca0/ports/?/state 4: ACTIVE 4: ACTIVE ionize:/usr/src/linux-ia64-release-2.6.10# ifconfig ib0 10.0.0.2 netmask 255.255.255.0 broadcast 10.0.0.255 ionize:/usr/src/linux-ia64-release-2.6.10# ifconfig ib1 10.0.1.2 netmask 255.255.255.0 broadcast 10.0.1.255 ionize:/usr/src/linux-ia64-release-2.6.10# ping 10.0.0.1 PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data. 
kernel unaligned access to 0xe0000002ff5fe05c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fe05c, ip=0xa0000002001be010 kernel unaligned access to 0xe0000002ff5fe05c, ip=0xa0000002001bef10 64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=14.4 ms kernel unaligned access to 0xe0000002ff5fe05c, ip=0xa0000002001bef10 64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.571 ms 64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=0.069 ms 64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=0.067 ms 64 bytes from 10.0.0.1: icmp_seq=5 ttl=64 time=0.069 ms 64 bytes from 10.0.0.1: icmp_seq=6 ttl=64 time=0.068 ms --- 10.0.0.1 ping statistics --- 6 packets transmitted, 6 received, 0% packet loss, time 5001ms rtt min/avg/max/mdev = 0.067/2.551/14.463/5.330 ms ionize:/usr/src/linux-ia64-release-2.6.10# ionize:/usr/src/linux-ia64-release-2.6.10# cd /opt/netperf/ ionize:/opt/netperf# ls netperf snapshot_script tcp_rr_script udp_rr_script netserver tcp_range_script tcp_stream_script udp_stream_script ionize:/opt/netperf# ./snapshot_script 10.0.1.1 Netperf snapshot script started at Fri Dec 10 13:00:02 PST 2004 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 ... misaligned accesses reports are rate limited by the kernel. The above is just the tip of the iceberg. a0000002001bee60 t ipoib_start_xmit [ib_ipoib] a0000002001bf880 t ipoib_get_stats [ib_ipoib] The "netserver" (rx4640) is getting the following: kernel unaligned access to 0xe0000001008b0f5c, ip=0xa000000200152f10 a000000200152e60 t ipoib_start_xmit [ib_ipoib] a000000200153880 t ipoib_get_stats [ib_ipoib] based on IP and offset (0x5c) I'll guess this is the same problem on both sides. Still looking at it. FYA, Starting 32x4 TCP_STREAM tests at Fri Dec 10 13:09:36 PST 2004 ------------------------------------ Testing with the following command line: /opt/netperf/netperf -t TCP_STREAM -l 60 -H 10.0.1.1 -i 10,3 -I 99,5 -- -s 32768 -S 32768 -m 4096 ... Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 262142 262142 4096 60.00 1164.65 Fixing the alignment issue should help here. Then I can start drilling a bit deeper on bottlenecks. hth, grant From iod00d at hp.com Fri Dec 10 12:58:30 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 12:58:30 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... 
In-Reply-To: <52fz2e9csk.fsf@topspin.com> References: <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> <41B8E436.7050509@ichips.intel.com> <20041210000536.GO7178@esmail.cup.hp.com> <20041210011216.GQ7178@esmail.cup.hp.com> <52fz2e9csk.fsf@topspin.com> Message-ID: <20041210205830.GJ11324@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 11:24:43AM -0800, Roland Dreier wrote: > Grant> Roland, Attached is a patch that does three things (LART me > Grant> for combining them, but I can split them up later if you > Grant> want) > > Thanks, I applied this. Sorry about the brokenness for the past few days... > > (The only thing I would LART you for would be leaving out the Signed-off-by: > line with your patch, but I'm not going to be anal about that yet) Yes - I definitely know better - apologies and thanks. Here's one for the archive in case it comes up: Signed-off-by: Grant Grundler (for the add "set_ci"/remove "work"/backout "cmd_complete" patch I submitted yesterday) thanks, grant From hnrose at earthlink.net Fri Dec 10 13:58:53 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 16:58:53 -0500 Subject: [openib-general] [PATCH] embed ib_mad_recv_buf into ib_mad_recv_wc Message-ID: <006001c4df03$729cef80$6501a8c0@comcast.net> Thanks. Applied. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.j.woodruff at intel.com Fri Dec 10 14:07:38 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 10 Dec 2004 14:07:38 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F000306A861@orsmsx408> >This is the IPv4 equivalent of what the previous post on All Routers >Multicast Group. I was able to get rid of the multicast join errors by removing various network services from starting up. Now I see no errors in joining multicast groups, since I am only joining the 2 default ones, but I still cannot ping on EM64T with the PCI-E HCAs. I am running the latest F/W I received from Mellanox 4.6.0-rc4. I cannot back up to the 4.3.5 firmware or any other firmware rev, because the MST tools don't work on my 2.6 based system. I will try running the same S/W on an EM64T system but with a PCI-X card to see if it is the PCI-E card, which I suspect. From hnrose at earthlink.net Fri Dec 10 14:12:06 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 17:12:06 -0500 Subject: [openib-general] IPoIB still not working References: <1AC79F16F5C5284499BB9591B33D6F000306A861@orsmsx408> Message-ID: <007201c4df05$49a50ac0$6501a8c0@comcast.net> Woodruff, Robert J wrote: > I was able to get rid of the multicast join errors by removing various > network services > from starting up. Now I see no errors in joining multicast > groups, since I am only joining the 2 default ones, but I still cannot > ping on EM64T > with the PCI-E HCAs. Those join errors were not on groups which would have anything to do with ping. The only group which matters for that is the broadcast group. You now have the same problem as others have reported with PCI-E HCAs. I believe the 4.6 released firmware (RSN) is supposed to fix this. > I cannot back up to the 4.3.5 firmware or any other firmware rev, > because the MST tools don't work on my 2.6 based system. I thought there was a workaround patch for the invariant sector using tvflash. 
-- Hal From roland at topspin.com Fri Dec 10 14:18:24 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 14:18:24 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) In-Reply-To: <20041210212219.GK11324@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 10 Dec 2004 13:22:19 -0800") References: <20041210212219.GK11324@esmail.cup.hp.com> Message-ID: <52k6rp94r3.fsf@topspin.com> Grant> /me bounces! With openib-1321, ip_ipoib is seems to be working! Cool. (I was wondering whether the earlier problems were because each system had two interfaces on the same broadcast domain and hence maybe responding to ARPs from the wrong interface. I wonder whether /proc/sys/net/ipv4/conf/ibX/arp_filter might have helped...) Grant> Current issue is misaligned accesses in the kernel. Hmm... what's offsetof(struct neighbour, ha) on ia64? (I'll check for myself a little later but I think the problem may be stashing a pointer at neigh->ha + 24) My current tree has a lot of local changes so it's a little hard for me to generate a patch, but does changing the body of to_ipoib_neigh() to the following help? static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) { return (struct ipoib_neigh **) (neigh->ha + 24 - (offsetof(struct neighbour, ha) & 4)); } - Roland From roland at topspin.com Fri Dec 10 15:33:09 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 15:33:09 -0800 Subject: [openib-general] A Couple of verbs/mthca/firmware questions In-Reply-To: <004001c4defc$a9f2c100$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 16:10:22 -0500") References: <004001c4defc$a9f2c100$6501a8c0@comcast.net> Message-ID: <527jnp91ai.fsf@topspin.com> Hal> 1. On received packets, the LRH is not visible but there are Hal> some WC fields set. One of those fields is Hal> dlid_path_bits. dlid_path_bits seems to be the same Hal> regardless of whether the incoming DR SMP was sent to the Hal> actual DLID or the permissive DLID. Is there a way to Hal> distinguish these cases ? Not that I know of... as far as I know, the IB spec doesn't have a way to distinguish this at the verbs level. Hal> 2. If a status is set in the MAD header and a post_send is Hal> issued on that MAD, should that status be sent in the packet Hal> ? Is there any conversion of the status field which might Hal> occur ? I think the payload of the UD packet (ie the full 256 byte MAD packet) should be put on the wire unchanged. - Roland From iod00d at hp.com Fri Dec 10 15:36:54 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 15:36:54 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! 
:^) In-Reply-To: <20041210212219.GK11324@esmail.cup.hp.com> References: <20041210212219.GK11324@esmail.cup.hp.com> Message-ID: <20041210233654.GN11324@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 01:22:19PM -0800, Grant Grundler wrote: > The "netserver" (rx4640) is getting the following: > kernel unaligned access to 0xe0000001008b0f5c, ip=0xa000000200152f10 > > a000000200152e60 t ipoib_start_xmit [ib_ipoib] > a000000200153880 t ipoib_get_stats [ib_ipoib] f10 - e60 == 0xb0 if (!spin_trylock(&priv->tx_lock)) { local_irq_restore(flags); return NETDEV_TX_LOCKED; } 1e3c: 01 40 20 e6 cmp4.eq p9,p8=0,r8 1e40: 1c 48 01 42 00 21 [MFB] mov r41=r33 1e46: 00 00 00 02 80 04 nop.f 0x0 1e4c: 50 09 00 43 (p09) br.cond.dpnt.few 2790 if (skb->dst && skb->dst->neighbour) { 1e50: 0b 50 01 40 00 21 [MMI] mov r42=r32;; 1e56: e0 00 50 30 20 00 ld8 r14=[r20] 1e5c: 00 00 04 00 nop.i 0x0;; 1e60: 11 40 40 1c 01 21 [MIB] adds r8=144,r14 1e66: 80 00 38 12 72 04 cmp.eq p8,p9=0,r14 1e6c: 60 03 00 42 (p08) br.cond.dptk.few 21c0 ;; 1e70: 0a 78 00 10 18 10 [MMI] ld8 r15=[r8];; 1e76: 90 e0 3e 00 42 40 adds r9=92,r15 1e7c: 01 78 2c e4 cmp.eq p10,p11=0,r15 1e80: 11 80 10 1f 00 21 [MIB] adds r16=68,r15 1e86: 00 00 00 02 00 05 nop.i 0x0 1e8c: 40 03 00 42 (p10) br.cond.dptk.few 21c0 ;; if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) { 1e90: 1d 70 00 12 18 10 [MFB] ld8 r14=[r9] 0x1e90 is the faulting insn. Roland, yeah - this is because of the "neigh->ha + 24" issue. I'll play with it. thanks, grant From roland at topspin.com Fri Dec 10 15:41:13 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 15:41:13 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <007301c4def3$ed472b20$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 15:07:50 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> <002601c4deef$a9c18ac0$6501a8c0@comcast.net> <524qiu9bxh.fsf@topspin.com> <005e01c4def1$8b673d20$6501a8c0@comcast.net> <52zn0l9b64.fsf@topspin.com> <007301c4def3$ed472b20$6501a8c0@comcast.net> Message-ID: <523byd90x2.fsf@topspin.com> Hal> Routers send router advertisements periodically. yes. Hal> If a hosts does not receive any router advertisements in some Hal> time period, it will solicit for routers (send a router Hal> solicitation message) some number of times. yes, it will send router solicitation messages to the all-routers group. Hal> So maybe IPv6 unicast routers don't need to receive on this Hal> group (hosts definitely do). I think unicast routers would need to listen to the all-routers group. However, since router advertisements are sent to the all-nodes group, a typical IPv6 end node does not need to listen to the all-routers group (which is why the kernel doesn't join the all-routers group by default). Hal> For IGMP, I think both hosts and routers need to both send Hal> and receive yet we see a send only join for this group. OK, not sure how IGMP works on the host side. Hal> The same thing appears to be occuring in these IPv4 (all Hal> routers and IGMP). The only IPv4 group I see my systems joining is the broadcast group -- maybe you have more config options turned on than I do; I'm running # CONFIG_IP_ADVANCED_ROUTER is not set # CONFIG_IP_MROUTE is not set - Roland From roland at topspin.com Fri Dec 10 15:44:25 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 15:44:25 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! 
:^) In-Reply-To: <20041210233654.GN11324@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 10 Dec 2004 15:36:54 -0800") References: <20041210212219.GK11324@esmail.cup.hp.com> <20041210233654.GN11324@esmail.cup.hp.com> Message-ID: <52y8g57m7a.fsf@topspin.com> Grant> Roland, yeah - this is because of the "neigh->ha + 24" Grant> issue. I'll play with it. Try changing to_ipoib_neigh() (in ipoib.h) to this: static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) { return (struct ipoib_neigh **) (neigh->ha + 24 - (offsetof(struct neighbour, ha) & 4)); } I think this should fix it. I have a big patch for IPoIB coming soon (fixes the neighbour lifetime issues as well as unicast ARP), and this will be part of it. I'd be curious how much this boosts your performance (it would be at least one unaligned trap per packet, so it's probably a big deal). - R. From hnrose at earthlink.net Fri Dec 10 15:48:33 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 18:48:33 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net><52sm6e9e0y.fsf@topspin.com><002601c4deef$a9c18ac0$6501a8c0@comcast.net><524qiu9bxh.fsf@topspin.com><005e01c4def1$8b673d20$6501a8c0@comcast.net><52zn0l9b64.fsf@topspin.com><007301c4def3$ed472b20$6501a8c0@comcast.net> <523byd90x2.fsf@topspin.com> Message-ID: <002e01c4df12$c293fba0$6501a8c0@comcast.net> Roland Dreier wrote: > Hal> So maybe IPv6 unicast routers don't need to receive on this > Hal> group (hosts definitely do). > > I think unicast routers would need to listen to the all-routers > group. Do they need to know the other routers ? Isn't that what the routing protocols (RIP, OSPF, ...) are about ? > However, since router advertisements are sent to the all-nodes > group, a typical IPv6 end node does not need to listen to the > all-routers group (which is why the kernel doesn't join the > all-routers group by default). So an end node doesn't find available routers via this group ? What you are saying is consistent with the observations for this and the router would create the group and there is no one to send to if the group isn't there (send only join). > Hal> For IGMP, I think both hosts and routers need to both send > Hal> and receive yet we see a send only join for this group. > > OK, not sure how IGMP works on the host side. A host needs to tell the multicast router when it is joining and leaving a group so the router knows when to join the multicast tree for that group or prune the tree. > Hal> The same thing appears to be occuring in these IPv4 (all > Hal> routers and IGMP). 
> > The only IPv4 group I see my systems joining is the broadcast group -- > maybe you have more config options turned on than I do; I'm running > > # CONFIG_IP_ADVANCED_ROUTER is not set > # CONFIG_IP_MROUTE is not set I have the same for those options with also CONFIG_IP_MULTICAST = y -- Hal From roland at topspin.com Fri Dec 10 16:23:46 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 16:23:46 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <002e01c4df12$c293fba0$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 18:48:33 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> <002601c4deef$a9c18ac0$6501a8c0@comcast.net> <524qiu9bxh.fsf@topspin.com> <005e01c4def1$8b673d20$6501a8c0@comcast.net> <52zn0l9b64.fsf@topspin.com> <007301c4def3$ed472b20$6501a8c0@comcast.net> <523byd90x2.fsf@topspin.com> <002e01c4df12$c293fba0$6501a8c0@comcast.net> Message-ID: <52u0qt7kdp.fsf@topspin.com> Hal> So an end node doesn't find available routers via this group Hal> ? What you are saying is consistent with the observations Hal> for this and the router would create the group and there is Hal> no one to send to if the group isn't there (send only join). Yes, see RFC 2461 -- router advertisements are sent to the all-nodes group (although to be precise, solicited advertisements MAY be unicast to the requestor). Hal> A host needs to tell the multicast router when it is joining Hal> and leaving a group so the router knows when to join the Hal> multicast tree for that group or prune the tree. If the host does not need to receive any multicast messages, then only the multicast routers would need to join the group. - R. From robert.j.woodruff at intel.com Fri Dec 10 16:28:41 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 10 Dec 2004 16:28:41 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F000306AAC3@orsmsx408> >You now have the same problem as others have reported with PCI-E HCAs. >I believe the 4.6 released firmware (RSN) is supposed to fix this. I tried replacing the PCI-E card with a PCI-X card without changing any software and ping works just fine with the PCI-X card on my EM64T system. As I mentioned before, I was running 5.4.6-rc4 in the PCI-E card and it still fails, so there is some other issue with the PCI-E card that needs to be investigated. woody From iod00d at hp.com Fri Dec 10 16:42:09 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 16:42:09 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) In-Reply-To: <52k6rp94r3.fsf@topspin.com> References: <20041210212219.GK11324@esmail.cup.hp.com> <52k6rp94r3.fsf@topspin.com> Message-ID: <20041211004209.GO11324@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 02:18:24PM -0800, Roland Dreier wrote: > Grant> /me bounces! With openib-1321, ip_ipoib is seems to be working! > > Cool. > > (I was wondering whether the earlier problems were because each system > had two interfaces on the same broadcast domain and hence maybe > responding to ARPs from the wrong interface. Yes, you are, as usual :^), probably right. The initial broadcast mask was 10.255.255.255 because I didn't specify one when I ran ifconfig (with params) for the first time. I checked what I had done by running ifconfig (no params) and realized the error. I ran ifconfig (with params) again for both ib0/1 and this time specified the broadcast address and netmask. 
It's likely it didn't recover from that. But I expect this issue is not ia64 specific and anyone should be able to reproduce it. > I wonder whether > /proc/sys/net/ipv4/conf/ibX/arp_filter might have helped...) sorry - I don't understand networking protocols well enough to know what you are alluding to here. But if you are already aware of the issue and fixing it... > Grant> Current issue is misaligned accesses in the kernel. > > Hmm... what's offsetof(struct neighbour, ha) on ia64? By hand, I counted 68. It should be in the asm I posted earlier. > (I'll check for > myself a little later but I think the problem may be stashing a > pointer at neigh->ha + 24) yes, I'm pretty sure it is. > My current tree has a lot of local changes so it's a little hard for > me to generate a patch, but does changing the body of to_ipoib_neigh() > to the following help? > > static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) > { > return (struct ipoib_neigh **) (neigh->ha + 24 - > (offsetof(struct neighbour, ha) & 4)); > } Sorry, I don't have this function in my tree...that's probably part of the changes you want to commit. I think it's called to_ipoib_path() in the exist tree: static inline struct ipoib_path **to_ipoib_path(struct neighbour *neigh) { return (struct ipoib_path **) (neigh->ha + 24); } I've add the same bit to it that you have above and that does avoid the misaligned access. Trying to unload the module didn't go smoothly either: ionize:/opt/netperf# ifconfig ib0 down ionize:/opt/netperf# ifconfig ib1 down ionize:/opt/netperf# rmmod ib_ipoib ib1: ib_dealloc_pd failed Not sure what caused that hiccup but I was able to unload everything else and reload the new modules just fine. In another email you commented: > I'd be curious how much this boosts your performance (it would be at > least one unaligned trap per packet, so it's probably a big deal). That's what I expected too. But not for this particular test: Starting 56x4 TCP_STREAM tests at Fri Dec 10 15:58:12 PST 2004 /opt/netperf/netperf -t TCP_STREAM -l 60 -H 10.0.1.1 -i 10,3 -I 99,5 -- -s 57344 -S 57344 -m 4096 TCP STREAM TEST to 10.0.1.1 : +/-2.5% @ 99% conf. !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 5.6% !!! Local CPU util : 0.0% !!! Remote CPU util : 0.0% Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 262142 262142 4096 60.00 1232.26 That's something like ~10% improvment. I ran the same test again but limited it to three passes *AND* under control of pfmon...about the same result (1221.12 Mbps) CPU0 16948574 UC_LOADS_RETIRED CPU1 13211 UC_LOADS_RETIRED *sigh*...forgot to grab IRQ counts. Again, but with one iteration, (60 seconds): 67: 149648661 0 IO-SAPIC-level ib_mthca pfmon -e uc_loads_retired -k --system-wide -- /opt/netperf/netperf -t TCP_STREAM -l 60 -H 10.0.1.1 -i 1,1 -- -s 57344 -S 57344 -m 4096 ... 114688 114688 4096 60.00 1170.88 CPU0 5656170 UC_LOADS_RETIRED CPU1 4466 UC_LOADS_RETIRED ionize:/opt/netperf# cat /proc/interrupts | fgrep mthca 67: 152474464 0 IO-SAPIC-level ib_mthca 5656170/(152474464-149648661) ~= 2 two uncached reads per Interrupt. That's what e1000 driver is doing today. No wonder we aren't much faster. I was expecting zero uncached reads from IB in the interrupt path. 
Fix that and we'll get back ~8 seconds of CPU time for a 60 second test. 47096 interrupts/second. We should be able to do 2x that on this box at least. (1.5GHz Madison) I'll dig up the other trivial things with pfmon. BTW, if anyone has another favorite trivial test or netperf parameters, I'd be happy to collect pfmon, q-syscollect, prospect, or oprofile output. thanks, grant From hnrose at earthlink.net Fri Dec 10 16:55:25 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 19:55:25 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net><52sm6e9e0y.fsf@topspin.com><002601c4deef$a9c18ac0$6501a8c0@comcast.net><524qiu9bxh.fsf@topspin.com><005e01c4def1$8b673d20$6501a8c0@comcast.net><52zn0l9b64.fsf@topspin.com><007301c4def3$ed472b20$6501a8c0@comcast.net><523byd90x2.fsf@topspin.com><002e01c4df12$c293fba0$6501a8c0@comcast.net> <52u0qt7kdp.fsf@topspin.com> Message-ID: <001901c4df1c$19ac0a00$6501a8c0@comcast.net> Roland Dreier wrote: > Hal> A host needs to tell the multicast router when it is joining > Hal> and leaving a group so the router knows when to join the > Hal> multicast tree for that group or prune the tree. > > If the host does not need to receive any multicast messages, then only > the multicast routers would need to join the group. In IGMP, the host needs to both listen (for queries) and send (reports, leave). The inverse is true for mrouters. It seems like they both need to join the group if they are running IGMP of some version. IGMPv1 does not support leave. -- Hal From roland at topspin.com Fri Dec 10 18:23:36 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 18:23:36 -0800 Subject: [openib-general] [PATCH] IPoIB neighbour fixes Message-ID: <52mzwl7etz.fsf@topspin.com> I just committed this patch, which fixes both the "path mismatch for unicast ARP" and "neighbour destructor after rmmod" issues. I've tested rmmod'ing ipoib with traffic running with this patch, and it survives fine. I now have everything that I wanted to get done, and I'm planning on submitting patches to lkml again on Monday. - R. 
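Before the diff, it may help to restate the neighbour bookkeeping trick the patch relies on, condensed from earlier messages in this thread. The struct below is abbreviated and illustrative, not the exact patch code; the helper is the one Roland posted, with the offsetof() adjustment that keeps the stashed pointer naturally aligned and avoids the ia64 unaligned-access traps Grant reported.

/* Illustrative sketch only -- field list abbreviated from the patch. */
struct ipoib_neigh {
        struct ipoib_ah    *ah;         /* set once the path record resolves */
        struct sk_buff_head queue;      /* packets queued while resolving */
        struct neighbour   *neighbour;
        struct list_head    list;       /* entry on the path's neigh_list */
};

/* The IPoIB hardware address uses ha[0..19] (QPN + GID); the private
 * pointer is stashed in the spare bytes after it. Subtracting
 * (offsetof(struct neighbour, ha) & 4) keeps that slot 8-byte aligned
 * no matter where ha falls within struct neighbour.
 */
static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
{
        return (struct ipoib_neigh **) (neigh->ha + 24 -
                                        (offsetof(struct neighbour, ha) & 4));
}

The pointer is set on the first transmit (*to_ipoib_neigh(skb->dst->neighbour) = neigh;) and cleared, together with the neighbour's ops->destructor, when paths are flushed, which is what lets the module unload safely while neighbours are still alive.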
Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1320) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -149,22 +149,116 @@ return 0; } +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, gid->raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &priv->path_list, list) + __path_free(dev, path); + + spin_unlock_irqrestore(&priv->lock, flags); +} + static void path_rec_completion(int status, struct ib_sa_path_rec *pathrec, void *path_ptr) { struct ipoib_path *path = path_ptr; struct ipoib_dev_priv *priv = netdev_priv(path->dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; struct sk_buff *skb; - struct ipoib_ah *ah; + unsigned long flags; ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); - if (status != IB_WC_SUCCESS) - goto err; - - { + if (status == IB_WC_SUCCESS) { struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), .sl = pathrec->sl, @@ -177,215 +271,216 @@ ah = ipoib_create_ah(path->dev, priv->pd, &av); } - if (!ah) - goto err; + spin_lock_irqsave(&priv->lock, flags); path->ah = ah; - ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", - ah, pathrec->dlid, pathrec->sl); + if (ah) { + path->pathrec = *pathrec; - while ((skb = __skb_dequeue(&path->queue))) { + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + skb_queue_head_init(&skqueue); + + while ((skb = 
__skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + neigh->ah = path->ah; + kref_get(&path->ah->ref); + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { skb->dev = path->dev; if (dev_queue_xmit(skb)) ipoib_warn(priv, "dev_queue_xmit failed " "to requeue packet\n"); } - - return; - -err: - while ((skb = __skb_dequeue(&path->queue))) - dev_kfree_skb(skb); - - if (path->neighbour) - *to_ipoib_path(path->neighbour) = NULL; - - kfree(path); } -static void path_rec_start(struct sk_buff *skb, struct net_device *dev) +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); - struct ib_sa_path_rec rec = { - .numb_path = 1 - }; - struct ib_sa_query *query; + struct ipoib_path *path; + path = kmalloc(sizeof *path, GFP_ATOMIC); if (!path) - goto err; + return NULL; - path->ah = NULL; path->dev = dev; + path->pathrec.dlid = 0; + skb_queue_head_init(&path->queue); - __skb_queue_tail(&path->queue, skb); - path->neighbour = NULL; - rec.sgid = priv->local_gid; - memcpy(rec.dgid.raw, skb->dst->neighbour->ha + 4, 16); - rec.pkey = cpu_to_be16(priv->pkey); + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); - /* - * XXX there's a race here if path record completion runs - * before we get to finish up. Add a lock to path struct? - */ - if (ib_sa_path_rec_get(priv->ca, priv->port, &rec, - IB_SA_PATH_REC_DGID | - IB_SA_PATH_REC_SGID | - IB_SA_PATH_REC_NUMB_PATH | - IB_SA_PATH_REC_PKEY, - 1000, GFP_ATOMIC, - path_rec_completion, - path, &query) < 0) { - ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); - goto err; - } + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; - path->neighbour = skb->dst->neighbour; - *to_ipoib_path(skb->dst->neighbour) = path; - return; + __path_add(dev, path); -err: - kfree(path); - ++priv->stats.tx_dropped; - dev_kfree_skb_any(skb); + return path; } -static void path_lookup(struct sk_buff *skb, struct net_device *dev) +static int path_rec_start(struct net_device *dev, + struct ipoib_path *path) { - struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); - /* Look up path record for unicasts */ - if (skb->dst->neighbour->ha[4] != 0xff) { - path_rec_start(skb, dev); - return; + path->query_id = + ib_sa_path_rec_get(priv->ca, priv->port, + &path->pathrec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &path->query); + if (path->query_id < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + path->query = NULL; + return path->query_id; } - /* Add in the P_Key */ - skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; - skb->dst->neighbour->ha[9] = priv->pkey & 0xff; - ipoib_mcast_send(dev, - (union ib_gid *) (skb->dst->neighbour->ha + 4), - skb); + return 0; } -static void unicast_arp_completion(int status, - struct ib_sa_path_rec *pathrec, - void *skb_ptr) +static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) { - struct sk_buff *skb = skb_ptr; - 
struct ipoib_dev_priv *priv = netdev_priv(skb->dev); - struct ipoib_ah *ah; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + struct ipoib_neigh *neigh; - ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", - status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + return; + } - if (status) - goto err; + skb_queue_head_init(&neigh->queue); + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; - { - struct ib_ah_attr av = { - .dlid = be16_to_cpu(pathrec->dlid), - .sl = pathrec->sl, - .src_path_bits = 0, - .static_rate = 0, - .ah_flags = 0, - .port_num = priv->port - }; + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); - ah = ipoib_create_ah(skb->dev, priv->pd, &av); + path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) + goto err; } - if (!ah) - goto err; + list_add_tail(&neigh->list, &path->neigh_list); - *(struct ipoib_ah **) skb->cb = ah; + if (path->pathrec.dlid) { + neigh->ah = path->ah; + kref_get(&path->ah->ref); - if (dev_queue_xmit(skb)) - ipoib_warn(priv, "dev_queue_xmit failed " - "to requeue ARP packet\n"); + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + } else if (!path->query) { + neigh->ah = NULL; + __skb_queue_tail(&neigh->queue, skb); + if (path_rec_start(dev, path)) + goto err; + } + spin_unlock(&priv->lock); return; err: - dev_kfree_skb(skb); + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + list_del(&neigh->list); + kfree(neigh); + neigh->neighbour->ops->destructor = NULL; + + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + spin_unlock(&priv->lock); } -static void unicast_arp_finish(struct sk_buff *skb) +static void path_lookup(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(skb->dev); - struct ipoib_ah *ah = *(struct ipoib_ah **) skb->cb; - unsigned long flags; - if (ah) { - spin_lock_irqsave(&priv->lock, flags); - list_add_tail(&ah->list, &priv->dead_ahs); - spin_unlock_irqrestore(&priv->lock, flags); + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) { + neigh_add_path(skb, dev); + return; } + + /* Add in the P_Key for multicasts */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); } -/* - * For unicast packets with no skb->dst->neighbour (unicast ARPs are - * the main example), we fire off a path record query for each packet. - * This is pretty bad for scalability (since this is going to hammer - * the SM on a big fabric) but it's the best I can think of for now. - * - * Also we might have a problem if a path changes, because ARPs will - * still go through (since we'll get the new path from the SM for - * these queries) so we'll never update the neighbour. 
- */ -static void unicast_arp_start(struct sk_buff *skb, struct net_device *dev, - struct ipoib_pseudoheader *phdr) +static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct sk_buff *tmp_skb; - struct ib_sa_path_rec rec = { - .numb_path = 1 - }; - struct ib_sa_query *query; + struct ipoib_path *path; - if (skb->destructor) { - tmp_skb = skb; - skb = skb_clone(tmp_skb, GFP_ATOMIC); - dev_kfree_skb_any(tmp_skb); - if (!skb) { + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { ++priv->stats.tx_dropped; - return; + dev_kfree_skb_any(skb); } + + spin_unlock(&priv->lock); + return; } - skb->dev = dev; - skb->destructor = unicast_arp_finish; - memset(skb->cb, 0, sizeof skb->cb); + ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); - rec.sgid = priv->local_gid; - memcpy(rec.dgid.raw, phdr->hwaddr + 4, 16); - rec.pkey = cpu_to_be16(priv->pkey); + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); - /* - * XXX We need to keep a record of the skb and TID somewhere - * so that we can cancel the request if the device goes down - * before it finishes. - */ - if (ib_sa_path_rec_get(priv->ca, priv->port, &rec, - IB_SA_PATH_REC_DGID | - IB_SA_PATH_REC_SGID | - IB_SA_PATH_REC_NUMB_PATH | - IB_SA_PATH_REC_PKEY, - 1000, GFP_ATOMIC, - unicast_arp_completion, - skb, &query) < 0) { - ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); - ++priv->stats.tx_dropped; - dev_kfree_skb_any(skb); - } + spin_unlock(&priv->lock); } static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_path *path; + struct ipoib_neigh *neigh; unsigned long flags; local_irq_save(flags); @@ -395,21 +490,21 @@ } if (skb->dst && skb->dst->neighbour) { - if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { path_lookup(skb, dev); goto out; } - path = *to_ipoib_path(skb->dst->neighbour); + neigh = *to_ipoib_neigh(skb->dst->neighbour); - if (likely(path->ah)) { - ipoib_send(dev, skb, path->ah, + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); goto out; } - if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) - __skb_queue_tail(&path->queue, skb); + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) + __skb_queue_tail(&neigh->queue, skb); else goto err; } else { @@ -418,25 +513,14 @@ skb_pull(skb, sizeof *phdr); if (phdr->hwaddr[4] == 0xff) { - /* Add in the P_Key */ + /* Add in the P_Key for multicast*/ phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; phdr->hwaddr[9] = priv->pkey & 0xff; ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); } else { - /* unicast GID -- ARP reply?? */ + /* unicast GID -- should be ARP reply */ - /* - * If destructor is unicast_arp_finish, we've - * already been through the path lookup and - * now we can just send the packet. 
- */ - if (skb->destructor == unicast_arp_finish) { - ipoib_send(dev, skb, *(struct ipoib_ah **) skb->cb, - be32_to_cpup((u32 *) phdr->hwaddr)); - goto out; - } - if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " IPOIB_GID_FMT "\n", @@ -449,9 +533,8 @@ goto out; } - /* put the pseudoheader back on */ - skb_push(skb, sizeof *phdr); - unicast_arp_start(skb, dev, phdr); + /* put the pseudoheader back on and try to send */ + unicast_arp_send(skb, dev, phdr); } } @@ -516,19 +599,28 @@ schedule_work(&priv->restart_task); } -static void ipoib_neigh_destructor(struct neighbour *neigh) +static void ipoib_neigh_destructor(struct neighbour *n) { - struct ipoib_path *path = *to_ipoib_path(neigh); + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; - ipoib_dbg(netdev_priv(neigh->dev), + ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", - be32_to_cpup((__be32 *) neigh->ha), - IPOIB_GID_ARG(*((union ib_gid *) (neigh->ha + 4)))); + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); - if (path && path->ah) { - ipoib_put_ah(path->ah); - kfree(path); + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); } + + spin_unlock_irqrestore(&priv->lock, flags); } static int ipoib_neigh_setup(struct neighbour *neigh) @@ -669,6 +761,7 @@ init_MUTEX(&priv->mcast_mutex); init_MUTEX(&priv->vlan_mutex); + INIT_LIST_HEAD(&priv->path_list); INIT_LIST_HEAD(&priv->child_intfs); INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 1320) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -60,6 +60,8 @@ unsigned long flags; unsigned char logcount; + struct list_head neigh_list; + struct sk_buff_head pkt_queue; struct net_device *dev; @@ -77,11 +79,25 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast) { struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + if (mcast->ah) ipoib_put_ah(mcast->ah); @@ -114,6 +130,7 @@ mcast->logcount = 0; INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); skb_queue_head_init(&mcast->pkt_queue); mcast->ah = NULL; @@ -671,24 +688,25 @@ } out: - spin_unlock_irqrestore(&priv->lock, flags); if (mcast && mcast->ah) { if (skb->dst && skb->dst->neighbour && - !*to_ipoib_path(skb->dst->neighbour)) { - struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); - if (path) { + if (neigh) { kref_get(&mcast->ah->ref); - path->ah = mcast->ah; - path->dev = dev; - path->neighbour = skb->dst->neighbour; - *to_ipoib_path(skb->dst->neighbour) = path; + neigh->ah = mcast->ah; + neigh->neighbour = 
skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); } } ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } + + spin_unlock_irqrestore(&priv->lock, flags); } void ipoib_mcast_dev_flush(struct net_device *dev) Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1320) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -93,6 +93,12 @@ DECLARE_PCI_UNMAP_ADDR(mapping) }; +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). + */ struct ipoib_dev_priv { spinlock_t lock; @@ -103,6 +109,9 @@ struct semaphore mcast_mutex; struct semaphore vlan_mutex; + struct rb_root path_tree; + struct list_head path_list; + struct ipoib_mcast *broadcast; struct list_head multicast_list; struct rb_root multicast_tree; @@ -162,16 +171,34 @@ }; struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { struct ipoib_ah *ah; struct sk_buff_head queue; - struct net_device *dev; struct neighbour *neighbour; + + struct list_head list; }; -static inline struct ipoib_path **to_ipoib_path(struct neighbour *neigh) +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) { - return (struct ipoib_path **) (neigh->ha + 24); + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); } extern struct workqueue_struct *ipoib_workqueue; @@ -194,6 +221,7 @@ struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(void *dev_ptr); +void ipoib_flush_paths(struct net_device *dev); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1320) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -445,6 +445,8 @@ /* Delete broadcast and local addresses since they will be recreated */ ipoib_mcast_dev_down(dev); + ipoib_flush_paths(dev); + return 0; } From hnrose at earthlink.net Fri Dec 10 20:33:39 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 23:33:39 -0500 Subject: Fw: [openib-general] IPv6 All Router Multicast Group Message-ID: <004c01c4df3a$98e72980$6501a8c0@comcast.net> Hal Rosenstock wrote: > Roland Dreier wrote: >> Hal> A host needs to tell the multicast router when it is joining >> Hal> and leaving a group so the router knows when to join the >> Hal> multicast tree for that group or prune the tree. >> >> If the host does not need to receive any multicast messages, then >> only the multicast routers would need to join the group. > > In IGMP, the host needs to both listen (for queries) and send > (reports, leave). > The inverse is true for mrouters. It seems like they both need to > join the group > if they are running IGMP of some version. IGMPv1 does not support > leave. In thinking about this further, it depends on whether the messages are multicast or not and in what direction(s). 
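(An illustrative aside, not part of the original mail: the host side of this IGMP exchange can be sketched with the standard BSD sockets API. The group 239.1.1.1 and port 9999 below are arbitrary examples. Joining a group is what causes the kernel to emit the IGMP membership report on the host's behalf; dropping the membership, or closing the socket, is what can trigger the leave.)

/* Minimal multicast receiver sketch (illustration only). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	struct ip_mreq mreq;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&addr, 0, sizeof addr);
	addr.sin_family      = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port        = htons(9999);	/* arbitrary example port */
	bind(fd, (struct sockaddr *) &addr, sizeof addr);

	/* Joining the group: the kernel sends the IGMP membership report. */
	mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");	/* arbitrary example group */
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
	if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq) < 0)
		perror("IP_ADD_MEMBERSHIP");

	/* ... recvfrom() multicast datagrams here ... */

	/* Dropping membership (or close()) lets the kernel send a leave if needed. */
	setsockopt(fd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof mreq);
	close(fd);
	return 0;
}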
The router-to-host general query is multicast (which means a host would need to receive as well as send) but this is on the all systems multicast address (224.0.0.1). Specific queries are sent on the group being queried. Routers send to these groups and also listen to the all routers group (224.0.0.2) for IGMPv2 (leaves are sent on this group from hosts) and to all other groups (which is where the reports come back). Only Version 3 Reports are sent with an IP destination address of 224.0.0.22, to which all IGMPv3-capable multicast routers listen. So hosts send only on this group. -- Hal From iod00d at hp.com Fri Dec 10 22:21:55 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 22:21:55 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) In-Reply-To: <20041211004209.GO11324@esmail.cup.hp.com> References: <20041210212219.GK11324@esmail.cup.hp.com> <52k6rp94r3.fsf@topspin.com> <20041211004209.GO11324@esmail.cup.hp.com> Message-ID: <20041211062155.GC13957@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 04:42:09PM -0800, Grant Grundler wrote: > I'll dig up the other trivial things with pfmon. pfmon 3.1 hasn't been released as I'd hoped, and thus I may not be able to collect Data EAR samples as planned. I did a q-syscollect run (full output): http://gsyprf3.external.hp.com/apache2-default/openib/q-1321-tcp_stream-0.txt Here are the top offenders:

Flat profile of CPU_CYCLES in kernel-cpu0.hist#0:
Each histogram sample counts as 1.00034m seconds

 % time   self  cumul  calls  self/call  tot/call  name
  20.48   8.10   8.10  41.0k       198u      198u  default_idle
  15.02   5.94  14.05  1.83M      3.25u     5.06u  mthca_interrupt
  14.91   5.90  19.94  17.2M       342n      342n  _spin_unlock_irqrestore
   4.05   1.60  21.55  12.1M       132n      149n  ipt_do_table
   3.20   1.27  22.81  7.17M       177n      177n  do_csum
   2.67   1.05  23.87  7.83M       135n      135n  __copy_user
   2.26   0.89  24.76  33.9M      26.4n     36.6n  local_bh_enable
   2.15   0.85  25.61  2.81M       303n      778n  tcp_transmit_skb
   2.01   0.79  26.41  1.45M       545n     3.98u  tcp_sendmsg
   ...

hrm...don't understand the 20% idle. This is a dual CPU system and (this version of) netperf is not multi-threaded. The top 3 only add up to about 50%. I guess I need to see what's being inlined into mthca_interrupt and try to break that down into smaller bits. thanks, grant From iod00d at hp.com Fri Dec 10 22:35:09 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 22:35:09 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) In-Reply-To: <20041211062155.GC13957@esmail.cup.hp.com> References: <20041210212219.GK11324@esmail.cup.hp.com> <52k6rp94r3.fsf@topspin.com> <20041211004209.GO11324@esmail.cup.hp.com> <20041211062155.GC13957@esmail.cup.hp.com> Message-ID: <20041211063509.GA14155@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 10:21:55PM -0800, Grant Grundler wrote: > I did a q-syscollect run (full output): > http://gsyprf3.external.hp.com/apache2-default/openib/q-1321-tcp_stream-0.txt Sorry that should be: http://gsyprf3.external.hp.com/openib/q-1321-tcp_stream-0.txt (misconfigured apache2 server - fixed it) grant From roland at topspin.com Sat Dec 11 06:37:37 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 11 Dec 2004 06:37:37 -0800 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: <521xdyasoo.fsf@topspin.com> (Roland Dreier's message of "Fri, 10 Dec 2004 10:56:07 -0800") References: <521xdyasoo.fsf@topspin.com> Message-ID: <52acsk7vf2.fsf@topspin.com> Roland> I would create another device special file like the umad Roland> file and use reads to get the events.
I thought about this a little more, and I'm not sure this is the right approach any more. I would look at using the kobject_uevent mechanism to pass these events to userspace (add a KOBJ_IB_ASYNC action and generate events using the IB device's class_device kobject). - R. From shaharf at voltaire.com Sun Dec 12 02:43:04 2004 From: shaharf at voltaire.com (shaharf) Date: Sun, 12 Dec 2004 12:43:04 +0200 Subject: [openib-general] Re: User MAD support for cancel MAD Message-ID: > I guess it would be another ioctl. However I'm not sure how useful > this is for userspace... in the kernel modules like IPoIB want to > cancel pending sends to avoid a callback when the module is unloaded. > However in userspace, an application can just close the file on exit. > > - R. OpenSM attempts to cancel mads in several scenarios. For example, the SM issues SwitchInfo to all switches. One of them returns with a port change. This means that all other SwitchInfo are not relevant anymore - a full "heavy" sweep must be performed anyway. While it is not critical to be able to cancel these mads, it may help free kernel resources in the case when these mads time out, for example when some switches are behind a switch that is disconnected. Another scenario is when a discovery is done and an error is received and another sweep is forced. The pending mads should be canceled. Again, this is not critical, but if the instrumentation is already there, it would be nice to use it. The interface issue is not so trivial. An IOCTL may do, but the problem is the parameter of the IOCTL. The most straightforward way is to specify a TID to cancel. This requires the usermode to avoid sending the same mad until it times out. While this is reasonable, I would not force such a limitation, after all there are plenty of retries mechanisms and we should not force the applications to use only one sort. For example, if you have a retries mechanism that work over IB_MGT that just retires the mad after X msecs, I think you want to let it work also above openib. This means that either you allow no timeout/no kernel matching semantics (do we have one?) or let the user cancel *all* mads with the same TID. I think that both should be implemented. Shahar From shaharf at voltaire.com Sun Dec 12 10:22:08 2004 From: shaharf at voltaire.com (shaharf) Date: Sun, 12 Dec 2004 20:22:08 +0200 Subject: [openib-general] Missining Infiniband class fields Message-ID: > > Roland> I would create another device special file like the umad > Roland> file and use reads to get the events. > > I thought about this a little more, and I'm not sure this is the right > approach any more. I would look at using the kobject_uevent mechanism > to pass these events to userspace (add a KOBJ_IB_ASYNC action and > generate events using the IB device's class_device kobject). > > - R. Either way is good enough for me. Just tell me what the interface is and what structures and events are passed. Shahar From mshefty at ichips.intel.com Mon Dec 13 09:56:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 09:56:55 -0800 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: References: Message-ID: <41BDD7E7.6050600@ichips.intel.com> shaharf wrote: > The interface issue is not so trivial. An IOCTL may do, but the problem > is the parameter of the IOCTL. The most straightforward way is to > specify a TID to cancel. This requires the usermode to avoid sending the > same mad until it times out.
While this is reasonable, I would not > force such a limitation, Why would a client send the same MAD before it times out? I think that enforcing such a limitation is perfectly acceptable. We could argue that cancel MAD should work based on TID, management class, and SGID/SLID, rather than simply the TID. > For example, if you have a retries mechanism that work over IB_MGT > that just retires the mad after X msecs, I think you want to let it work > also above openib. This means that either you allow no timeout/no kernel > matching semantics Clients are required to use the access layer timeout mechanism to properly match responses with requests. If a timeout is not specified, the access layer will drop the response, since it cannot uniquely identify which client to hand the response to. This is done to avoid making multiple data copies of a receive MAD in order to hand it to all clients. - Sean From roland at topspin.com Mon Dec 13 10:09:16 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:16 -0800 Subject: [openib-general] [PATCH][v3][0/21] Initial submission of InfiniBand patches for review Message-ID: <20041213109.xPBcb5yOtGKuT24L@topspin.com> The following series of patches is the latest version of the OpenIB InfiniBand drivers. We believe that this version is suitable for merging when 2.6.11 opens (or into -mm immediately), although of course we are willing to go through as many more iterations as required to fix any remaining issues. We appreciate all of the excellent feedback we received for our previous posting, and we believe we have addressed all of the problems that were identified. We did not intentionally ignore any issues -- if we did not address some of your comments, please rest assured that it was an error on our part. Based on the discussion on cleaning up kernel headers, we have left our .h files under drivers/infiniband/include. None of these .h files are used outside of drivers/infiniband, so it seems that it is better not to add them to the global include/ namespace. Thanks, Roland Dreier OpenIB Alliance www.openib.org From roland at topspin.com Mon Dec 13 10:09:17 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:17 -0800 Subject: [openib-general] [PATCH][v3][1/21] Add core InfiniBand support (public headers) In-Reply-To: <20041213109.xPBcb5yOtGKuT24L@topspin.com> Message-ID: <20041213109.NN2Neh1z1VRsHIIH@topspin.com> Add public headers for core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_cache.h 2004-12-13 09:44:39.409919591 -0800 @@ -0,0 +1,42 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ib_cache.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef _IB_CACHE_H +#define _IB_CACHE_H + +#include + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid); +int ib_cached_pkey_get(struct ib_device *device_handle, + u8 port, + int index, + u16 *pkey); +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index); + +#endif /* _IB_CACHE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_fmr_pool.h 2004-12-13 09:44:39.435915761 -0800 @@ -0,0 +1,81 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ib_fmr_pool.h 1295 2004-11-25 19:26:33Z roland $ + */ + +#if !defined(IB_FMR_POOL_H) +#define IB_FMR_POOL_H + +#include + +struct ib_fmr_pool; + +/** + * struct ib_fmr_pool_param - Parameters for creating FMR pool + * @max_pages_per_fmr:Maximum number of pages per map request. + * @access:Access flags for FMRs in pool. + * @pool_size:Number of FMRs to allocate for pool. + * @dirty_watermark:Flush is triggered when @dirty_watermark dirty + * FMRs are present. + * @flush_function:Callback called when unmapped FMRs are flushed and + * more FMRs are possibly available for mapping + * @flush_arg:Context passed to user's flush function. + * @cache:If set, FMRs may be reused after unmapping for identical map + * requests. 
+ */ +struct ib_fmr_pool_param { + int max_pages_per_fmr; + enum ib_access_flags access; + int pool_size; + int dirty_watermark; + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + unsigned cache:1; +}; + +struct ib_pool_fmr { + struct ib_fmr *fmr; + struct ib_fmr_pool *pool; + struct list_head list; + struct hlist_node cache_node; + int ref_count; + int remap_count; + u64 io_virtual_address; + int page_list_len; + u64 page_list[0]; +}; + +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr); + +#endif /* IB_FMR_POOL_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_pack.h 2004-12-13 09:44:39.467911048 -0800 @@ -0,0 +1,234 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ib_pack.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef IB_PACK_H +#define IB_PACK_H + +#include + +enum { + IB_LRH_BYTES = 8, + IB_GRH_BYTES = 40, + IB_BTH_BYTES = 12, + IB_DETH_BYTES = 8 +}; + +struct ib_field { + size_t struct_offset_bytes; + size_t struct_size_bytes; + int offset_words; + int offset_bits; + int size_bits; + char *field_name; +}; + +#define RESERVED \ + .field_name = "reserved" + +/* + * This macro cleans up the definitions of constants for BTH opcodes. + * It is used to define constants such as IB_OPCODE_UD_SEND_ONLY, + * which becomes IB_OPCODE_UD + IB_OPCODE_SEND_ONLY, and this gives + * the correct value. + * + * In short, user code should use the constants defined using the + * macro rather than worrying about adding together other constants. 
+*/ +#define IB_OPCODE(transport, op) \ + IB_OPCODE_ ## transport ## _ ## op = \ + IB_OPCODE_ ## transport + IB_OPCODE_ ## op + +enum { + /* transport types -- just used to define real constants */ + IB_OPCODE_RC = 0x00, + IB_OPCODE_UC = 0x20, + IB_OPCODE_RD = 0x40, + IB_OPCODE_UD = 0x60, + + /* operations -- just used to define real constants */ + IB_OPCODE_SEND_FIRST = 0x00, + IB_OPCODE_SEND_MIDDLE = 0x01, + IB_OPCODE_SEND_LAST = 0x02, + IB_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + IB_OPCODE_SEND_ONLY = 0x04, + IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + IB_OPCODE_RDMA_WRITE_FIRST = 0x06, + IB_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + IB_OPCODE_RDMA_WRITE_LAST = 0x08, + IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + IB_OPCODE_RDMA_WRITE_ONLY = 0x0a, + IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + IB_OPCODE_RDMA_READ_REQUEST = 0x0c, + IB_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + IB_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + IB_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + IB_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + IB_OPCODE_ACKNOWLEDGE = 0x11, + IB_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + IB_OPCODE_COMPARE_SWAP = 0x13, + IB_OPCODE_FETCH_ADD = 0x14, + + /* real constants follow -- see comment about above IB_OPCODE() + macro for more details */ + + /* RC */ + IB_OPCODE(RC, SEND_FIRST), + IB_OPCODE(RC, SEND_MIDDLE), + IB_OPCODE(RC, SEND_LAST), + IB_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, SEND_ONLY), + IB_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_FIRST), + IB_OPCODE(RC, RDMA_WRITE_MIDDLE), + IB_OPCODE(RC, RDMA_WRITE_LAST), + IB_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_ONLY), + IB_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_READ_REQUEST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RC, ACKNOWLEDGE), + IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RC, COMPARE_SWAP), + IB_OPCODE(RC, FETCH_ADD), + + /* UC */ + IB_OPCODE(UC, SEND_FIRST), + IB_OPCODE(UC, SEND_MIDDLE), + IB_OPCODE(UC, SEND_LAST), + IB_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, SEND_ONLY), + IB_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_FIRST), + IB_OPCODE(UC, RDMA_WRITE_MIDDLE), + IB_OPCODE(UC, RDMA_WRITE_LAST), + IB_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_ONLY), + IB_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + + /* RD */ + IB_OPCODE(RD, SEND_FIRST), + IB_OPCODE(RD, SEND_MIDDLE), + IB_OPCODE(RD, SEND_LAST), + IB_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, SEND_ONLY), + IB_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_FIRST), + IB_OPCODE(RD, RDMA_WRITE_MIDDLE), + IB_OPCODE(RD, RDMA_WRITE_LAST), + IB_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_ONLY), + IB_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_READ_REQUEST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RD, ACKNOWLEDGE), + IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RD, COMPARE_SWAP), + IB_OPCODE(RD, FETCH_ADD), + + /* UD */ + IB_OPCODE(UD, SEND_ONLY), + IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) +}; + +enum { + IB_LNH_RAW = 0, + IB_LNH_IP = 1, + IB_LNH_IBA_LOCAL = 2, + IB_LNH_IBA_GLOBAL = 3 +}; + +struct ib_unpacked_lrh { + u8 virtual_lane; + u8 link_version; + u8 service_level; + u8 
link_next_header; + __be16 destination_lid; + __be16 packet_length; + __be16 source_lid; +}; + +struct ib_unpacked_grh { + u8 ip_version; + u8 traffic_class; + __be32 flow_label; + __be16 payload_length; + u8 next_header; + u8 hop_limit; + union ib_gid source_gid; + union ib_gid destination_gid; +}; + +struct ib_unpacked_bth { + u8 opcode; + u8 solicited_event; + u8 mig_req; + u8 pad_count; + u8 transport_header_version; + __be16 pkey; + __be32 destination_qpn; + u8 ack_req; + __be32 psn; +}; + +struct ib_unpacked_deth { + __be32 qkey; + __be32 source_qpn; +}; + +struct ib_ud_header { + struct ib_unpacked_lrh lrh; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf); + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure); + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header); + +#endif /* IB_PACK_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_verbs.h 2004-12-13 09:44:39.495906923 -0800 @@ -0,0 +1,1225 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id: ib_verbs.h 1304 2004-11-30 19:01:25Z sean.hefty $ + */ + +#if !defined(IB_VERBS_H) +#define IB_VERBS_H + +#include +#include +#include + +union ib_gid { + u8 raw[16]; + struct { + u64 subnet_prefix; + u64 interface_id; + } global; +}; + +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + +enum ib_device_cap_flags { + IB_DEVICE_RESIZE_MAX_WR = 1, + IB_DEVICE_BAD_PKEY_CNTR = (1<<1), + IB_DEVICE_BAD_QKEY_CNTR = (1<<2), + IB_DEVICE_RAW_MULTI = (1<<3), + IB_DEVICE_AUTO_PATH_MIG = (1<<4), + IB_DEVICE_CHANGE_PHY_PORT = (1<<5), + IB_DEVICE_UD_AV_PORT_ENFORCE = (1<<6), + IB_DEVICE_CURR_QP_STATE_MOD = (1<<7), + IB_DEVICE_SHUTDOWN_PORT = (1<<8), + IB_DEVICE_INIT_TYPE = (1<<9), + IB_DEVICE_PORT_ACTIVE_EVENT = (1<<10), + IB_DEVICE_SYS_IMAGE_GUID = (1<<11), + IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), + IB_DEVICE_SRQ_RESIZE = (1<<13), + IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_RQ_SIG_TYPE = (1<<15) +}; + +enum ib_atomic_cap { + IB_ATOMIC_NONE, + IB_ATOMIC_HCA, + IB_ATOMIC_GLOB +}; + +struct ib_device_attr { + u64 fw_ver; + u64 node_guid; + u64 sys_image_guid; + u64 max_mr_size; + u64 page_size_cap; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + int max_qp; + int max_qp_wr; + int device_cap_flags; + int max_sge; + int max_sge_rd; + int max_cq; + int max_cqe; + int max_mr; + int max_pd; + int max_qp_rd_atom; + int max_ee_rd_atom; + int max_res_rd_atom; + int max_qp_init_rd_atom; + int max_ee_init_rd_atom; + enum ib_atomic_cap atomic_cap; + int max_ee; + int max_rdd; + int max_mw; + int max_raw_ipv6_qp; + int max_raw_ethy_qp; + int max_mcast_grp; + int max_mcast_qp_attach; + int max_total_mcast_qp_attach; + int max_ah; + int max_fmr; + int max_map_per_fmr; + int max_srq; + int max_srq_wr; + int max_srq_sge; + u16 max_pkeys; + u8 local_ca_ack_delay; +}; + +enum ib_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +enum ib_static_rate { + IB_STATIC_RATE_FULL = 0, + IB_STATIC_RATE_12X_TO_4X = 2, + IB_STATIC_RATE_4X_TO_1X = 3, + IB_STATIC_RATE_12X_TO_1X = 11 +}; + +enum ib_port_state { + IB_PORT_NOP = 0, + IB_PORT_DOWN = 1, + IB_PORT_INIT = 2, + IB_PORT_ARMED = 3, + IB_PORT_ACTIVE = 4, + IB_PORT_ACTIVE_DEFER = 5 +}; + +enum ib_port_cap_flags { + IB_PORT_SM = (1<<31), + IB_PORT_NOTICE_SUP = (1<<30), + IB_PORT_TRAP_SUP = (1<<29), + IB_PORT_AUTO_MIGR_SUP = (1<<27), + IB_PORT_SL_MAP_SUP = (1<<26), + IB_PORT_MKEY_NVRAM = (1<<25), + IB_PORT_PKEY_NVRAM = (1<<24), + IB_PORT_LED_INFO_SUP = (1<<23), + IB_PORT_SM_DISABLED = (1<<22), + IB_PORT_SYS_IMAGE_GUID_SUP = (1<<21), + IB_PORT_PKEY_SW_EXT_PORT_TRAP_SUP = (1<<20), + IB_PORT_CM_SUP = (1<<16), + IB_PORT_SNMP_TUNNEL_SUP = (1<<15), + IB_PORT_REINIT_SUP = (1<<14), + IB_PORT_DEVICE_MGMT_SUP = (1<<13), + IB_PORT_VENDOR_CLASS_SUP = (1<<12), + IB_PORT_DR_NOTICE_SUP = (1<<11), + IB_PORT_PORT_NOTICE_SUP = (1<<10), + IB_PORT_BOOT_MGMT_SUP = (1<<9) +}; + +struct ib_port_attr { + enum ib_port_state state; + enum ib_mtu max_mtu; + enum ib_mtu active_mtu; + int gid_tbl_len; + u32 port_cap_flags; + u32 max_msg_sz; + u32 bad_pkey_cntr; + u32 qkey_viol_cntr; + u16 pkey_tbl_len; + u16 lid; + u16 sm_lid; + u8 lmc; + u8 max_vl_num; + u8 sm_sl; + u8 subnet_timeout; + u8 init_type_reply; +}; + +enum ib_device_modify_flags { + 
IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 +}; + +struct ib_device_modify { + u64 sys_image_guid; +}; + +enum ib_port_modify_flags { + IB_PORT_SHUTDOWN = 1, + IB_PORT_INIT_TYPE = (1<<2), + IB_PORT_RESET_QKEY_CNTR = (1<<3) +}; + +struct ib_port_modify { + u32 set_port_cap_mask; + u32 clr_port_cap_mask; + u8 init_type; +}; + +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + +struct ib_global_route { + union ib_gid dgid; + u32 flow_label; + u8 sgid_index; + u8 hop_limit; + u8 traffic_class; +}; + +enum { + IB_MULTICAST_QPN = 0xffffff +}; + +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + +struct ib_ah_attr { + struct ib_global_route grh; + u16 dlid; + u8 sl; + u8 src_path_bits; + u8 static_rate; + u8 ah_flags; + u8 port_num; +}; + +enum ib_wc_status { + IB_WC_SUCCESS, + IB_WC_LOC_LEN_ERR, + IB_WC_LOC_QP_OP_ERR, + IB_WC_LOC_EEC_OP_ERR, + IB_WC_LOC_PROT_ERR, + IB_WC_WR_FLUSH_ERR, + IB_WC_MW_BIND_ERR, + IB_WC_BAD_RESP_ERR, + IB_WC_LOC_ACCESS_ERR, + IB_WC_REM_INV_REQ_ERR, + IB_WC_REM_ACCESS_ERR, + IB_WC_REM_OP_ERR, + IB_WC_RETRY_EXC_ERR, + IB_WC_RNR_RETRY_EXC_ERR, + IB_WC_LOC_RDD_VIOL_ERR, + IB_WC_REM_INV_RD_REQ_ERR, + IB_WC_REM_ABORT_ERR, + IB_WC_INV_EECN_ERR, + IB_WC_INV_EEC_STATE_ERR, + IB_WC_FATAL_ERR, + IB_WC_RESP_TIMEOUT_ERR, + IB_WC_GENERAL_ERR +}; + +enum ib_wc_opcode { + IB_WC_SEND, + IB_WC_RDMA_WRITE, + IB_WC_RDMA_READ, + IB_WC_COMP_SWAP, + IB_WC_FETCH_ADD, + IB_WC_BIND_MW, +/* + * Set value of IB_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & IB_WC_RECV). + */ + IB_WC_RECV = 1 << 7, + IB_WC_RECV_RDMA_WITH_IMM +}; + +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + +struct ib_wc { + u64 wr_id; + enum ib_wc_status status; + enum ib_wc_opcode opcode; + u32 vendor_err; + u32 byte_len; + __be32 imm_data; + u32 src_qp; + int wc_flags; + u16 pkey_index; + u16 slid; + u8 sl; + u8 dlid_path_bits; + u8 port_num; /* valid only for DR SMPs on switches */ +}; + +enum ib_cq_notify { + IB_CQ_SOLICITED, + IB_CQ_NEXT_COMP +}; + +struct ib_qp_cap { + u32 max_send_wr; + u32 max_recv_wr; + u32 max_send_sge; + u32 max_recv_sge; + u32 max_inline_data; +}; + +enum ib_sig_type { + IB_SIGNAL_ALL_WR, + IB_SIGNAL_REQ_WR +}; + +enum ib_qp_type { + /* + * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries + * here (and in that order) since the MAD layer uses them as + * indices into a 2-entry table. 
+ */ + IB_QPT_SMI, + IB_QPT_GSI, + + IB_QPT_RC, + IB_QPT_UC, + IB_QPT_UD, + IB_QPT_RAW_IPV6, + IB_QPT_RAW_ETY +}; + +struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + struct ib_qp_cap cap; + enum ib_sig_type sq_sig_type; + enum ib_sig_type rq_sig_type; + enum ib_qp_type qp_type; + u8 port_num; /* special QP types only */ +}; + +enum ib_rnr_timeout { + IB_RNR_TIMER_655_36 = 0, + IB_RNR_TIMER_000_01 = 1, + IB_RNR_TIMER_000_02 = 2, + IB_RNR_TIMER_000_03 = 3, + IB_RNR_TIMER_000_04 = 4, + IB_RNR_TIMER_000_06 = 5, + IB_RNR_TIMER_000_08 = 6, + IB_RNR_TIMER_000_12 = 7, + IB_RNR_TIMER_000_16 = 8, + IB_RNR_TIMER_000_24 = 9, + IB_RNR_TIMER_000_32 = 10, + IB_RNR_TIMER_000_48 = 11, + IB_RNR_TIMER_000_64 = 12, + IB_RNR_TIMER_000_96 = 13, + IB_RNR_TIMER_001_28 = 14, + IB_RNR_TIMER_001_92 = 15, + IB_RNR_TIMER_002_56 = 16, + IB_RNR_TIMER_003_84 = 17, + IB_RNR_TIMER_005_12 = 18, + IB_RNR_TIMER_007_68 = 19, + IB_RNR_TIMER_010_24 = 20, + IB_RNR_TIMER_015_36 = 21, + IB_RNR_TIMER_020_48 = 22, + IB_RNR_TIMER_030_72 = 23, + IB_RNR_TIMER_040_96 = 24, + IB_RNR_TIMER_061_44 = 25, + IB_RNR_TIMER_081_92 = 26, + IB_RNR_TIMER_122_88 = 27, + IB_RNR_TIMER_163_84 = 28, + IB_RNR_TIMER_245_76 = 29, + IB_RNR_TIMER_327_68 = 30, + IB_RNR_TIMER_491_52 = 31 +}; + +enum ib_qp_attr_mask { + IB_QP_STATE = 1, + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), + IB_QP_ACCESS_FLAGS = (1<<3), + IB_QP_PKEY_INDEX = (1<<4), + IB_QP_PORT = (1<<5), + IB_QP_QKEY = (1<<6), + IB_QP_AV = (1<<7), + IB_QP_PATH_MTU = (1<<8), + IB_QP_TIMEOUT = (1<<9), + IB_QP_RETRY_CNT = (1<<10), + IB_QP_RNR_RETRY = (1<<11), + IB_QP_RQ_PSN = (1<<12), + IB_QP_MAX_QP_RD_ATOMIC = (1<<13), + IB_QP_ALT_PATH = (1<<14), + IB_QP_MIN_RNR_TIMER = (1<<15), + IB_QP_SQ_PSN = (1<<16), + IB_QP_MAX_DEST_RD_ATOMIC = (1<<17), + IB_QP_PATH_MIG_STATE = (1<<18), + IB_QP_CAP = (1<<19), + IB_QP_DEST_QPN = (1<<20) +}; + +enum ib_qp_state { + IB_QPS_RESET, + IB_QPS_INIT, + IB_QPS_RTR, + IB_QPS_RTS, + IB_QPS_SQD, + IB_QPS_SQE, + IB_QPS_ERR +}; + +enum ib_mig_state { + IB_MIG_MIGRATED, + IB_MIG_REARM, + IB_MIG_ARMED +}; + +struct ib_qp_attr { + enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; + enum ib_mtu path_mtu; + enum ib_mig_state path_mig_state; + u32 qkey; + u32 rq_psn; + u32 sq_psn; + u32 dest_qp_num; + int qp_access_flags; + struct ib_qp_cap cap; + struct ib_ah_attr ah_attr; + struct ib_ah_attr alt_ah_attr; + u16 pkey_index; + u16 alt_pkey_index; + u8 en_sqd_async_notify; + u8 sq_draining; + u8 max_rd_atomic; + u8 max_dest_rd_atomic; + u8 min_rnr_timer; + u8 port_num; + u8 timeout; + u8 retry_cnt; + u8 rnr_retry; + u8 alt_port_num; + u8 alt_timeout; +}; + +enum ib_wr_opcode { + IB_WR_RDMA_WRITE, + IB_WR_RDMA_WRITE_WITH_IMM, + IB_WR_SEND, + IB_WR_SEND_WITH_IMM, + IB_WR_RDMA_READ, + IB_WR_ATOMIC_CMP_AND_SWP, + IB_WR_ATOMIC_FETCH_AND_ADD +}; + +enum ib_send_flags { + IB_SEND_FENCE = 1, + IB_SEND_SIGNALED = (1<<1), + IB_SEND_SOLICITED = (1<<2), + IB_SEND_INLINE = (1<<3) +}; + +enum ib_recv_flags { + IB_RECV_SIGNALED = 1 +}; + +struct ib_sge { + u64 addr; + u32 length; + u32 lkey; +}; + +struct ib_send_wr { + struct ib_send_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + enum ib_wr_opcode opcode; + int send_flags; + u32 imm_data; + union { + struct { + u64 remote_addr; + u32 rkey; + } rdma; + struct { + u64 remote_addr; + u64 compare_add; + u64 swap; + u32 rkey; + } atomic; + struct { + struct ib_ah *ah; + struct ib_mad_hdr 
*mad_hdr; + u32 remote_qpn; + u32 remote_qkey; + int timeout_ms; /* valid for MADs only */ + u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ + } ud; + } wr; +}; + +struct ib_recv_wr { + struct ib_recv_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + int recv_flags; +}; + +enum ib_access_flags { + IB_ACCESS_LOCAL_WRITE = 1, + IB_ACCESS_REMOTE_WRITE = (1<<1), + IB_ACCESS_REMOTE_READ = (1<<2), + IB_ACCESS_REMOTE_ATOMIC = (1<<3), + IB_ACCESS_MW_BIND = (1<<4) +}; + +struct ib_phys_buf { + u64 addr; + u64 size; +}; + +struct ib_mr_attr { + struct ib_pd *pd; + u64 device_virt_addr; + u64 size; + int mr_access_flags; + u32 lkey; + u32 rkey; +}; + +enum ib_mr_rereg_flags { + IB_MR_REREG_TRANS = 1, + IB_MR_REREG_PD = (1<<1), + IB_MR_REREG_ACCESS = (1<<2) +}; + +struct ib_mw_bind { + struct ib_mr *mr; + u64 wr_id; + u64 addr; + u32 length; + int send_flags; + int mw_access_flags; +}; + +struct ib_fmr_attr { + int max_pages; + int max_maps; + u8 page_size; +}; + +struct ib_pd { + struct ib_device *device; + atomic_t usecnt; /* count all resources */ +}; + +struct ib_ah { + struct ib_device *device; + struct ib_pd *pd; +}; + +typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); + +struct ib_cq { + struct ib_device *device; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ +}; + +struct ib_srq { + struct ib_device *device; + struct ib_pd *pd; + void *srq_context; + atomic_t usecnt; +}; + +struct ib_qp { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + u32 qp_num; +}; + +struct ib_mr { + struct ib_device *device; + struct ib_pd *pd; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ +}; + +struct ib_mw { + struct ib_device *device; + struct ib_pd *pd; + u32 rkey; +}; + +struct ib_fmr { + struct ib_device *device; + struct ib_pd *pd; + struct list_head list; + u32 lkey; + u32 rkey; +}; + +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + +#define IB_DEVICE_NAME_MAX 64 + +struct ib_cache { + rwlock_t lock; + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct ib_gid_cache **gid_cache; +}; + +struct ib_device { + struct device *dma_device; + + char name[IB_DEVICE_NAME_MAX]; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + + struct ib_cache cache; + + u32 flags; + + int (*query_device)(struct ib_device *device, + struct ib_device_attr *device_attr); + int (*query_port)(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr); + int (*query_gid)(struct ib_device *device, + u8 port_num, int index, + union ib_gid *gid); + int (*query_pkey)(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + int (*modify_device)(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + int (*modify_port)(struct 
ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + struct ib_pd * (*alloc_pd)(struct ib_device *device); + int (*dealloc_pd)(struct ib_pd *pd); + struct ib_ah * (*create_ah)(struct ib_pd *pd, + struct ib_ah_attr *ah_attr); + int (*modify_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*query_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*destroy_ah)(struct ib_ah *ah); + struct ib_qp * (*create_qp)(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + int (*modify_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + int (*query_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int (*destroy_qp)(struct ib_qp *qp); + int (*post_send)(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + int (*post_recv)(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + struct ib_cq * (*create_cq)(struct ib_device *device, + int cqe); + int (*destroy_cq)(struct ib_cq *cq); + int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*poll_cq)(struct ib_cq *cq, int num_entries, + struct ib_wc *wc); + int (*peek_cq)(struct ib_cq *cq, int wc_cnt); + int (*req_notify_cq)(struct ib_cq *cq, + enum ib_cq_notify cq_notify); + int (*req_ncomp_notif)(struct ib_cq *cq, + int wc_cnt); + struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, + int mr_access_flags); + struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + int (*query_mr)(struct ib_mr *mr, + struct ib_mr_attr *mr_attr); + int (*dereg_mr)(struct ib_mr *mr); + int (*rereg_phys_mr)(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + struct ib_mw * (*alloc_mw)(struct ib_pd *pd); + int (*bind_mw)(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + int (*dealloc_mw)(struct ib_mw *mw); + struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + int (*map_phys_fmr)(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova); + int (*unmap_fmr)(struct list_head *fmr_list); + int (*dealloc_fmr)(struct ib_fmr *fmr); + int (*attach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*detach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); + + struct class_device class_dev; + struct kobject ports_parent; + struct list_head port_list; + + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + + u8 node_type; + u8 phys_port_cnt; +}; + +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); + +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int 
ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr); + +int ib_query_port(struct ib_device *device, + u8 port_num, struct ib_port_attr *port_attr); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + +/** + * ib_alloc_pd - Allocates an unused protection domain. + * @device: The device on which to allocate the protection domain. + * + * A protection domain object provides an association between QPs, shared + * receive queues, address handles, memory regions, and memory windows. + */ +struct ib_pd *ib_alloc_pd(struct ib_device *device); + +/** + * ib_dealloc_pd - Deallocates a protection domain. + * @pd: The protection domain to deallocate. + */ +int ib_dealloc_pd(struct ib_pd *pd); + +/** + * ib_create_ah - Creates an address handle for the given address vector. + * @pd: The protection domain associated with the address handle. + * @ah_attr: The attributes of the address vector. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. + */ +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); + +/** + * ib_modify_ah - Modifies the address vector associated with an address + * handle. + * @ah: The address handle to modify. + * @ah_attr: The new address vector attributes to associate with the + * address handle. + */ +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_query_ah - Queries the address vector associated with an address + * handle. + * @ah: The address handle to query. + * @ah_attr: The address vector attributes associated with the address + * handle. + */ +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_destroy_ah - Destroys an address handle. + * @ah: The address handle to destroy. + */ +int ib_destroy_ah(struct ib_ah *ah); + +/** + * ib_create_qp - Creates a QP associated with the specified protection + * domain. + * @pd: The protection domain associated with the QP. + * @qp_init_attr: A list of initial attributes required to create the QP. + */ +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_modify_qp - Modifies the attributes for the specified QP and then + * transitions the QP to the given state. + * @qp: The QP to modify. + * @qp_attr: On input, specifies the QP attributes to modify. On output, + * the current values of selected QP attributes are returned. + * @qp_attr_mask: A bit-mask used to specify which attributes of the QP + * are being modified. + */ +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + +/** + * ib_query_qp - Returns the attribute list and current values for the + * specified QP. + * @qp: The QP to query. + * @qp_attr: The attributes of the specified QP. + * @qp_attr_mask: A bit-mask used to select specific attributes to query. + * @qp_init_attr: Additional attributes of the selected QP. 
+ * + * The qp_attr_mask may be used to limit the query to gathering only the + * selected attributes. + */ +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_destroy_qp - Destroys the specified QP. + * @qp: The QP to destroy. + */ +int ib_destroy_qp(struct ib_qp *qp); + +/** + * ib_post_send - Posts a list of work requests to the send queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @send_wr: A list of work requests to post on the send queue. + * @bad_send_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + return qp->device->post_send(qp, send_wr, bad_send_wr); +} + +/** + * ib_post_recv - Posts a list of work requests to the receive queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @recv_wr: A list of work requests to post on the receive queue. + * @bad_recv_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return qp->device->post_recv(qp, recv_wr, bad_recv_wr); +} + +/** + * ib_create_cq - Creates a CQ on the specified device. + * @device: The device on which to create the CQ. + * @comp_handler: A user-specified callback that is invoked when a + * completion event occurs on the CQ. + * @event_handler: A user-specified callback that is invoked when an + * asynchronous event not associated with a completion occurs on the CQ. + * @cq_context: Context associated with the CQ returned to the user via + * the associated completion and event handlers. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe); + +/** + * ib_resize_cq - Modifies the capacity of the CQ. + * @cq: The CQ to resize. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +int ib_resize_cq(struct ib_cq *cq, int cqe); + +/** + * ib_destroy_cq - Destroys the specified CQ. + * @cq: The CQ to destroy. + */ +int ib_destroy_cq(struct ib_cq *cq); + +/** + * ib_poll_cq - poll a CQ for completion(s) + * @cq:the CQ being polled + * @num_entries:maximum number of completions to return + * @wc:array of at least @num_entries &struct ib_wc where completions + * will be returned + * + * Poll a CQ for (possibly multiple) completions. If the return value + * is < 0, an error occurred. If the return value is >= 0, it is the + * number of completions returned. If the return value is + * non-negative and < num_entries, then the CQ was emptied. + */ +static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, + struct ib_wc *wc) +{ + return cq->device->poll_cq(cq, num_entries, wc); +} + +/** + * ib_peek_cq - Returns the number of unreaped completions currently + * on the specified CQ. + * @cq: The CQ to peek. + * @wc_cnt: A minimum number of unreaped completions to check for. 
+ * + * If the number of unreaped completions is greater than or equal to wc_cnt, + * this function returns wc_cnt, otherwise, it returns the actual number of + * unreaped completions. + */ +int ib_peek_cq(struct ib_cq *cq, int wc_cnt); + +/** + * ib_req_notify_cq - Request completion notification on a CQ. + * @cq: The CQ to generate an event for. + * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will + * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP, + * notification will occur on the next completion. + */ +static inline int ib_req_notify_cq(struct ib_cq *cq, + enum ib_cq_notify cq_notify) +{ + return cq->device->req_notify_cq(cq, cq_notify); +} + +/** + * ib_req_ncomp_notif - Request completion notification when there are + * at least the specified number of unreaped completions on the CQ. + * @cq: The CQ to generate an event for. + * @wc_cnt: The number of unreaped completions that should be on the + * CQ before an event is generated. + */ +static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) +{ + return cq->device->req_ncomp_notif ? + cq->device->req_ncomp_notif(cq, wc_cnt) : + -ENOSYS; +} + +/** + * ib_get_dma_mr - Returns a memory region for system memory that is + * usable for DMA. + * @pd: The protection domain associated with the memory region. + * @mr_access_flags: Specifies the memory access rights. + */ +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +/** + * ib_reg_phys_mr - Prepares a virtually addressed memory region for use + * by an HCA. + * @pd: The protection domain associated assigned to the registered region. + * @phys_buf_array: Specifies a list of physical buffers to use in the + * memory region. + * @num_phys_buf: Specifies the size of the phys_buf_array. + * @mr_access_flags: Specifies the memory access rights. + * @iova_start: The offset of the region's starting I/O virtual address. + */ +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_rereg_phys_mr - Modifies the attributes of an existing memory region. + * Conceptually, this call performs the functions deregister memory region + * followed by register physical memory region. Where possible, + * resources are reused instead of deallocated and reallocated. + * @mr: The memory region to modify. + * @mr_rereg_mask: A bit-mask used to indicate which of the following + * properties of the memory region are being modified. + * @pd: If %IB_MR_REREG_PD is set in mr_rereg_mask, this field specifies + * the new protection domain to associated with the memory region, + * otherwise, this parameter is ignored. + * @phys_buf_array: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies a list of physical buffers to use in the new + * translation, otherwise, this parameter is ignored. + * @num_phys_buf: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies the size of the phys_buf_array, otherwise, this + * parameter is ignored. + * @mr_access_flags: If %IB_MR_REREG_ACCESS is set in mr_rereg_mask, this + * field specifies the new memory access rights, otherwise, this + * parameter is ignored. + * @iova_start: The offset of the region's starting I/O virtual address. 
+ */ +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_query_mr - Retrieves information about a specific memory region. + * @mr: The memory region to retrieve information about. + * @mr_attr: The attributes of the specified memory region. + */ +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +/** + * ib_dereg_mr - Deregisters a memory region and removes it from the + * HCA translation table. + * @mr: The memory region to deregister. + */ +int ib_dereg_mr(struct ib_mr *mr); + +/** + * ib_alloc_mw - Allocates a memory window. + * @pd: The protection domain associated with the memory window. + */ +struct ib_mw *ib_alloc_mw(struct ib_pd *pd); + +/** + * ib_bind_mw - Posts a work request to the send queue of the specified + * QP, which binds the memory window to the given address range and + * remote access attributes. + * @qp: QP to post the bind work request on. + * @mw: The memory window to bind. + * @mw_bind: Specifies information about the memory window, including + * its address range, remote access rights, and associated memory region. + */ +static inline int ib_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + /* XXX reference counting in corresponding MR? */ + return mw->device->bind_mw ? + mw->device->bind_mw(qp, mw, mw_bind) : + -ENOSYS; +} + +/** + * ib_dealloc_mw - Deallocates a memory window. + * @mw: The memory window to deallocate. + */ +int ib_dealloc_mw(struct ib_mw *mw); + +/** + * ib_alloc_fmr - Allocates a unmapped fast memory region. + * @pd: The protection domain associated with the unmapped region. + * @mr_access_flags: Specifies the memory access rights. + * @fmr_attr: Attributes of the unmapped region. + * + * A fast memory region must be mapped before it can be used as part of + * a work request. + */ +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +/** + * ib_map_phys_fmr - Maps a list of physical pages to a fast memory region. + * @fmr: The fast memory region to associate with the pages. + * @page_list: An array of physical pages to map to the fast memory region. + * @list_len: The number of pages in page_list. + * @iova: The I/O virtual address to use with the mapped region. + */ +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova) +{ + return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova); +} + +/** + * ib_unmap_fmr - Removes the mapping from a list of fast memory regions. + * @fmr_list: A linked list of fast memory regions to unmap. + */ +int ib_unmap_fmr(struct list_head *fmr_list); + +/** + * ib_dealloc_fmr - Deallocates a fast memory region. + * @fmr: The fast memory region to deallocate. + */ +int ib_dealloc_fmr(struct ib_fmr *fmr); + +/** + * ib_attach_mcast - Attaches the specified QP to a multicast group. + * @qp: QP to attach to the multicast group. The QP must be type + * IB_QPT_UD. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + * + * In order to send and receive multicast packets, subnet + * administration must have created the multicast group and configured + * the fabric appropriately. The port associated with the specified + * QP must also be a member of the multicast group. 
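+ *
+ * A minimal illustrative call sequence (the GID and LID values below are
+ * placeholders; real parameters must come from the MCMemberRecord that
+ * subnet administration returns for the group):
+ *
+ *	union ib_gid mgid;
+ *	u16 mlid = 0xc001;
+ *
+ *	memcpy(mgid.raw, mcmember_mgid, 16);
+ *	if (ib_attach_mcast(qp, &mgid, mlid))
+ *		printk(KERN_WARNING "multicast attach failed\n");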
+ */ +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +/** + * ib_detach_mcast - Detaches the specified QP from a multicast group. + * @qp: QP to detach from the multicast group. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + */ +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +#endif /* IB_VERBS_H */ From roland at topspin.com Mon Dec 13 10:09:20 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:20 -0800 Subject: [openib-general] [PATCH][v3][2/21] Add core InfiniBand support In-Reply-To: <20041213109.NN2Neh1z1VRsHIIH@topspin.com> Message-ID: <20041213109.B80JuEFdg6Nma7kr@topspin.com> Add implementation of core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-13 09:44:40.230798660 -0800 @@ -0,0 +1,11 @@ +menu "InfiniBand support" + +config INFINIBAND + tristate "InfiniBand support" + default n + ---help--- + Core support for InfiniBand (IB). Make sure to also select + any protocols you wish to use as well as drivers for your + InfiniBand hardware. + +endmenu --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Makefile 2004-12-13 09:44:40.278791590 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND) += core/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-13 09:44:40.305787613 -0800 @@ -0,0 +1,13 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND) += \ + ib_core.o + +ib_core-objs := \ + packer.o \ + ud_header.o \ + verbs.o \ + sysfs.o \ + device.o \ + fmr_pool.o \ + cache.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/cache.c 2004-12-13 09:44:40.502758596 -0800 @@ -0,0 +1,317 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: cache.c 1293 2004-11-25 16:04:07Z roland $ +*/ + +#include +#include +#include +#include + +#include "core_priv.h" + +struct ib_pkey_cache { + int table_len; + u16 table[0]; +}; + +struct ib_gid_cache { + int table_len; + union ib_gid table[0]; +}; + +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; + +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} + +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 
0 : device->phys_port_cnt; +} + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid) +{ + struct ib_gid_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.gid_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_gid_get); + +int ib_cached_pkey_get(struct ib_device *device, + u8 port, + int index, + u16 *pkey) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_get); + +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int i; + int ret = -ENOENT; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; + } + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_find); + +static void ib_cache_update(struct ib_device *device, + u8 port) +{ + struct ib_port_attr *tprops = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; + int i; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return; + + ret = ib_query_port(device, port, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", + ret, device->name); + goto err; + } + + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; + + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + for (i = 0; i < gid_cache->table_len; ++i) { + ret = ib_query_gid(device, port, i, gid_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + write_lock_irq(&device->cache.lock); + + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; + + device->cache.pkey_cache[port - start_port(device)] = pkey_cache; + device->cache.gid_cache [port - start_port(device)] = gid_cache; + + write_unlock_irq(&device->cache.lock); + + kfree(old_pkey_cache); + 
kfree(old_gid_cache); + kfree(tprops); + return; + +err: + kfree(pkey_cache); + kfree(gid_cache); + kfree(tprops); +} + +static void ib_cache_task(void *work_ptr) +{ + struct ib_update_work *work = work_ptr; + + ib_cache_update(work->device, work->port_num); + kfree(work); +} + +static void ib_cache_event(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct ib_update_work *work; + + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (work) { + INIT_WORK(&work->work, ib_cache_task, work); + work->device = event->device; + work->port_num = event->element.port_num; + schedule_work(&work->work); + } + } +} + +void ib_cache_setup_one(struct ib_device *device) +{ + int p; + + rwlock_init(&device->cache.lock); + + device->cache.pkey_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + device->cache.gid_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache) { + printk(KERN_WARNING "Couldn't allocate cache " + "for %s\n", device->name); + goto err; + } + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + device->cache.pkey_cache[p] = NULL; + device->cache.gid_cache [p] = NULL; + ib_cache_update(device, p + start_port(device)); + } + + INIT_IB_EVENT_HANDLER(&device->cache.event_handler, + device, ib_cache_event); + if (ib_register_event_handler(&device->cache.event_handler)) + goto err_cache; + + return; + +err_cache: + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + +err: + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +void ib_cache_cleanup_one(struct ib_device *device) +{ + int p; + + ib_unregister_event_handler(&device->cache.event_handler); + flush_scheduled_work(); + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +struct ib_client cache_client = { + .name = "cache", + .add = ib_cache_setup_one, + .remove = ib_cache_cleanup_one +}; + +int __init ib_cache_setup(void) +{ + return ib_register_client(&cache_client); +} + +void __exit ib_cache_cleanup(void) +{ + ib_unregister_client(&cache_client); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/core_priv.h 2004-12-13 09:44:40.527754913 -0800 @@ -0,0 +1,41 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: core_priv.h 1288 2004-11-24 01:12:39Z roland $ +*/ + +#ifndef _CORE_PRIV_H +#define _CORE_PRIV_H + +#include +#include + +#include + +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); + +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + +int ib_cache_setup(void); +void ib_cache_cleanup(void); + +#endif /* _CORE_PRIV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/device.c 2004-12-13 09:44:40.448766550 -0800 @@ -0,0 +1,603 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: device.c 1304 2004-11-30 19:01:25Z sean.hefty $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("core kernel InfiniBand API"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + +static LIST_HEAD(device_list); +static LIST_HEAD(client_list); + +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. 
+ */ +static DECLARE_MUTEX(device_sem); + +static int ib_device_check_mandatory(struct ib_device *device) +{ +#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } + static const struct { + size_t offset; + char *name; + } mandatory_table[] = { + IB_MANDATORY_FUNC(query_device), + IB_MANDATORY_FUNC(query_port), + IB_MANDATORY_FUNC(query_pkey), + IB_MANDATORY_FUNC(query_gid), + IB_MANDATORY_FUNC(alloc_pd), + IB_MANDATORY_FUNC(dealloc_pd), + IB_MANDATORY_FUNC(create_ah), + IB_MANDATORY_FUNC(destroy_ah), + IB_MANDATORY_FUNC(create_qp), + IB_MANDATORY_FUNC(modify_qp), + IB_MANDATORY_FUNC(destroy_qp), + IB_MANDATORY_FUNC(post_send), + IB_MANDATORY_FUNC(post_recv), + IB_MANDATORY_FUNC(create_cq), + IB_MANDATORY_FUNC(destroy_cq), + IB_MANDATORY_FUNC(poll_cq), + IB_MANDATORY_FUNC(req_notify_cq), + IB_MANDATORY_FUNC(get_dma_mr), + IB_MANDATORY_FUNC(dereg_mr) + }; + int i; + + for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { + if (!*(void **) ((void *) device + mandatory_table[i].offset)) { + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); + return -EINVAL; + } + } + + return 0; +} + +static struct ib_device *__ib_device_get_by_name(const char *name) +{ + struct ib_device *device; + + list_for_each_entry(device, &device_list, core_list) + if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) + return device; + + return NULL; +} + + +static int alloc_name(char *name) +{ + long *inuse; + char buf[IB_DEVICE_NAME_MAX]; + struct ib_device *device; + int i; + + inuse = (long *) get_zeroed_page(GFP_KERNEL); + if (!inuse) + return -ENOMEM; + + list_for_each_entry(device, &device_list, core_list) { + if (!sscanf(device->name, name, &i)) + continue; + if (i < 0 || i >= PAGE_SIZE * 8) + continue; + snprintf(buf, sizeof buf, name, i); + if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) + set_bit(i, inuse); + } + + i = find_first_zero_bit(inuse, PAGE_SIZE * 8); + free_page((unsigned long) inuse); + snprintf(buf, sizeof buf, name, i); + + if (__ib_device_get_by_name(buf)) + return -ENFILE; + + strlcpy(name, buf, IB_DEVICE_NAME_MAX); + return 0; +} + +/** + * ib_alloc_device - allocate an IB device struct + * @size:size of structure to allocate + * + * Low-level drivers should use ib_alloc_device() to allocate &struct + * ib_device. @size is the size of the structure to be allocated, + * including any private data used by the low-level driver. + * ib_dealloc_device() must be used to free structures allocated with + * ib_alloc_device(). + */ +struct ib_device *ib_alloc_device(size_t size) +{ + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +/** + * ib_dealloc_device - free an IB device struct + * @device:structure to free + * + * Free a structure allocated with ib_alloc_device(). 
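+ *
+ * Sketch of the usual low-level driver pattern ("struct my_hca" is a
+ * hypothetical driver-private structure whose first member is a
+ * struct ib_device named ib_dev, which is what makes the cast valid):
+ *
+ *	struct my_hca *hca;
+ *
+ *	hca = (struct my_hca *) ib_alloc_device(sizeof *hca);
+ *	if (!hca)
+ *		return -ENOMEM;
+ *	...
+ *	ib_dealloc_device(&hca->ib_dev);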
+ */ +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_unregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + +/** + * ib_register_device - Register an IB device with IB core + * @device:Device to register + * + * Low-level drivers use ib_register_device() to register their + * devices with the IB core. All registered clients will receive a + * callback for each device that is added. @device must be allocated + * with ib_alloc_device(). + */ +int ib_register_device(struct ib_device *device) +{ + int ret; + + down(&device_sem); + + if (strchr(device->name, '%')) { + ret = alloc_name(device->name); + if (ret) + goto out; + } + + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + + INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); + + ret = ib_device_register_sysfs(device); + if (ret) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out; + } + + list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + + { + struct ib_client *client; + + list_for_each_entry(client, &client_list, list) + if (client->add && !add_client_context(device, client)) + client->add(device); + } + + out: + up(&device_sem); + return ret; +} +EXPORT_SYMBOL(ib_register_device); + +/** + * ib_unregister_device - Unregister an IB device + * @device:Device to unregister + * + * Unregister an IB device. All clients will receive a remove callback. + */ +void ib_unregister_device(struct ib_device *device) +{ + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + device->reg_state = IB_DEV_UNREGISTERED; +} +EXPORT_SYMBOL(ib_unregister_device); + +/** + * ib_register_client - Register an IB client + * @client:Client to register + * + * Upper level users of the IB drivers can use ib_register_client() to + * register callbacks for IB device addition and removal. When an IB + * device is added, each registered client's add method will be called + * (in the order the clients were registered), and when a device is + * removed, each client's remove method will be called (in the reverse + * order that clients were registered). 
In addition, when + * ib_register_client() is called, the client will receive an add + * callback for all devices already registered. + */ +int ib_register_client(struct ib_client *client) +{ + struct ib_device *device; + + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add && !add_client_context(device, client)) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +/** + * ib_unregister_client - Unregister an IB client + * @client:Client to unregister + * + * Upper level users use ib_unregister_client() to remove their client + * registration. When ib_unregister_client() is called, the client + * will receive a remove callback for each IB device still registered. + */ +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + } + list_del(&client->list); + + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +/** + * ib_get_client_data - Get IB client context + * @device:Device to get context for + * @client:Client to get context for + * + * ib_get_client_data() returns client context set with + * ib_set_client_data(). + */ +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +/** + * ib_set_client_data - Get IB client context + * @device:Device to set context for + * @client:Client to set context for + * @data:Context to set + * + * ib_set_client_data() sets client context that can be retrieved with + * ib_get_client_data(). + */ +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + goto out; + } + + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); + +out: + spin_unlock_irqrestore(&device->client_data_lock, flags); +} +EXPORT_SYMBOL(ib_set_client_data); + +/** + * ib_register_event_handler - Register an IB event handler + * @event_handler:Handler to register + * + * ib_register_event_handler() registers an event handler that will be + * called back when asynchronous IB events occur (as defined in + * chapter 11 of the InfiniBand Architecture Specification). This + * callback may occur in interrupt context. 
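+ *
+ * Sketch of a client registering a handler (my_handler is a
+ * struct ib_event_handler owned by the caller; the callback must not
+ * sleep, since it may be invoked from interrupt context):
+ *
+ *	static void my_event_callback(struct ib_event_handler *handler,
+ *				      struct ib_event *event)
+ *	{
+ *		handle event->event for event->element.port_num here
+ *	}
+ *
+ *	INIT_IB_EVENT_HANDLER(&my_handler, device, my_event_callback);
+ *	ib_register_event_handler(&my_handler);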
+ */ +int ib_register_event_handler (struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +/** + * ib_unregister_event_handler - Unregister an event handler + * @event_handler:Handler to unregister + * + * Unregister an event handler registered with + * ib_register_event_handler(). + */ +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +/** + * ib_dispatch_event - Dispatch an asynchronous event + * @event:Event to dispatch + * + * Low-level drivers must call ib_dispatch_event() to dispatch the + * event to all registered event handlers when an asynchronous event + * occurs. + */ +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + +/** + * ib_query_device - Query IB device attributes + * @device:Device to query + * @device_attr:Device attributes + * + * ib_query_device() returns the attributes of a device through the + * @device_attr pointer. + */ +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr) +{ + return device->query_device(device, device_attr); +} +EXPORT_SYMBOL(ib_query_device); + +/** + * ib_query_port - Query IB port attributes + * @device:Device to query + * @port_num:Port number to query + * @port_attr:Port attributes + * + * ib_query_port() returns the attributes of a port through the + * @port_attr pointer. + */ +int ib_query_port(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr) +{ + return device->query_port(device, port_num, port_attr); +} +EXPORT_SYMBOL(ib_query_port); + +/** + * ib_query_gid - Get GID table entry + * @device:Device to query + * @port_num:Port number to query + * @index:GID table index to query + * @gid:Returned GID + * + * ib_query_gid() fetches the specified GID table entry. + */ +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid) +{ + return device->query_gid(device, port_num, index, gid); +} +EXPORT_SYMBOL(ib_query_gid); + +/** + * ib_query_pkey - Get P_Key table entry + * @device:Device to query + * @port_num:Port number to query + * @index:P_Key table index to query + * @pkey:Returned P_Key + * + * ib_query_pkey() fetches the specified P_Key table entry. 
+ */ +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey) +{ + return device->query_pkey(device, port_num, index, pkey); +} +EXPORT_SYMBOL(ib_query_pkey); + +/** + * ib_modify_device - Change IB device attributes + * @device:Device to modify + * @device_modify_mask:Mask of attributes to change + * @device_modify:New attribute values + * + * ib_modify_device() changes a device's attributes as specified by + * the @device_modify_mask and @device_modify structure. + */ +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device(device, device_modify_mask, + device_modify); +} +EXPORT_SYMBOL(ib_modify_device); + +/** + * ib_modify_port - Modifies the attributes for the specified port. + * @device: The device to modify. + * @port_num: The number of the port to modify. + * @port_modify_mask: Mask used to specify which attributes of the port + * to change. + * @port_modify: New attribute values for the port. + * + * ib_modify_port() changes a port's attributes as specified by the + * @port_modify_mask and @port_modify structure. + */ +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port(device, port_num, port_modify_mask, + port_modify); +} +EXPORT_SYMBOL(ib_modify_port); + +static int __init ib_core_init(void) +{ + int ret; + + ret = ib_sysfs_setup(); + if (ret) + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + + return ret; +} + +static void __exit ib_core_cleanup(void) +{ + ib_cache_cleanup(); + ib_sysfs_cleanup(); +} + +module_init(ib_core_init); +module_exit(ib_core_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/fmr_pool.c 2004-12-13 09:44:40.476762425 -0800 @@ -0,0 +1,496 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: fmr_pool.c 1294 2004-11-25 19:17:57Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +enum { + IB_FMR_MAX_REMAPS = 32, + + IB_FMR_HASH_BITS = 8, + IB_FMR_HASH_SIZE = 1 << IB_FMR_HASH_BITS, + IB_FMR_HASH_MASK = IB_FMR_HASH_SIZE - 1 +}; + +/* + * If an FMR is not in use, then the list member will point to either + * its pool's free_list (if the FMR can be mapped again; that is, + * remap_count < IB_FMR_MAX_REMAPS) or its pool's dirty_list (if the + * FMR needs to be unmapped before being remapped). 
In either of + * these cases it is a bug if the ref_count is not 0. In other words, + * if ref_count is > 0, then the list member must not be linked into + * either free_list or dirty_list. + * + * The cache_node member is used to link the FMR into a cache bucket + * (if caching is enabled). This is independent of the reference + * count of the FMR. When a valid FMR is released, its ref_count is + * decremented, and if ref_count reaches 0, the FMR is placed in + * either free_list or dirty_list as appropriate. However, it is not + * removed from the cache and may be "revived" if a call to + * ib_fmr_register_physical() occurs before the FMR is remapped. In + * this case we just increment the ref_count and remove the FMR from + * free_list/dirty_list. + * + * Before we remap an FMR from free_list, we remove it from the cache + * (to prevent another user from obtaining a stale FMR). When an FMR + * is released, we add it to the tail of the free list, so that our + * cache eviction policy is "least recently used." + * + * All manipulation of ref_count, list and cache_node is protected by + * pool_lock to maintain consistency. + */ + +struct ib_fmr_pool { + spinlock_t pool_lock; + + int pool_size; + int max_pages; + int dirty_watermark; + int dirty_len; + struct list_head free_list; + struct list_head dirty_list; + struct hlist_head *cache_bucket; + + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + + struct task_struct *thread; + + atomic_t req_ser; + atomic_t flush_ser; + + wait_queue_head_t force_wait; +}; + +static inline u32 ib_fmr_hash(u64 first_page) +{ + return jhash_2words((u32) first_page, + (u32) (first_page >> 32), + 0); +} + +/* Caller must hold pool_lock */ +static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool, + u64 *page_list, + int page_list_len, + u64 io_virtual_address) +{ + struct hlist_head *bucket; + struct ib_pool_fmr *fmr; + struct hlist_node *pos; + + if (!pool->cache_bucket) + return NULL; + + bucket = pool->cache_bucket + ib_fmr_hash(*page_list); + + hlist_for_each_entry(fmr, pos, bucket, cache_node) + if (io_virtual_address == fmr->io_virtual_address && + page_list_len == fmr->page_list_len && + !memcmp(page_list, fmr->page_list, + page_list_len * sizeof *page_list)) + return fmr; + + return NULL; +} + +static void ib_fmr_batch_release(struct ib_fmr_pool *pool) +{ + int ret; + struct ib_pool_fmr *fmr; + LIST_HEAD(unmap_list); + LIST_HEAD(fmr_list); + + spin_lock_irq(&pool->pool_lock); + + list_for_each_entry(fmr, &pool->dirty_list, list) { + hlist_del_init(&fmr->cache_node); + fmr->remap_count = 0; + list_add_tail(&fmr->fmr->list, &fmr_list); + +#ifdef DEBUG + if (fmr->ref_count !=0) { + printk(KERN_WARNING "Unmapping FMR 0x%08x with ref count %d", + fmr, fmr->ref_count); + } +#endif + } + + list_splice(&pool->dirty_list, &unmap_list); + INIT_LIST_HEAD(&pool->dirty_list); + pool->dirty_len = 0; + + spin_unlock_irq(&pool->pool_lock); + + if (list_empty(&unmap_list)) { + return; + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "ib_unmap_fmr returned %d", ret); + + spin_lock_irq(&pool->pool_lock); + list_splice(&unmap_list, &pool->free_list); + spin_unlock_irq(&pool->pool_lock); +} + +static int ib_fmr_cleanup_thread(void *pool_ptr) +{ + struct ib_fmr_pool *pool = pool_ptr; + + do { + if (pool->dirty_len >= pool->dirty_watermark || + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) < 0) { + ib_fmr_batch_release(pool); + + atomic_inc(&pool->flush_ser); + 
wake_up_interruptible(&pool->force_wait); + + if (pool->flush_function) + pool->flush_function(pool, pool->flush_arg); + } + + set_current_state(TASK_INTERRUPTIBLE); + if (pool->dirty_len < pool->dirty_watermark && + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) >= 0 && + !kthread_should_stop()) + schedule(); + __set_current_state(TASK_RUNNING); + } while (!kthread_should_stop()); + + return 0; +} + +/** + * ib_create_fmr_pool - Create an FMR pool + * @pd:Protection domain for FMRs + * @params:FMR pool parameters + * + * Create a pool of FMRs. Return value is pointer to new pool or + * error code if creation failed. + */ +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params) +{ + struct ib_device *device; + struct ib_fmr_pool *pool; + int i; + int ret; + + if (!params) + return ERR_PTR(-EINVAL); + + device = pd->device; + if (!device->alloc_fmr || !device->dealloc_fmr || + !device->map_phys_fmr || !device->unmap_fmr) { + printk(KERN_WARNING "Device %s does not support fast memory regions", + device->name); + return ERR_PTR(-ENOSYS); + } + + pool = kmalloc(sizeof *pool, GFP_KERNEL); + if (!pool) { + printk(KERN_WARNING "couldn't allocate pool struct"); + return ERR_PTR(-ENOMEM); + } + + pool->cache_bucket = NULL; + + pool->flush_function = params->flush_function; + pool->flush_arg = params->flush_arg; + + INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->dirty_list); + + if (params->cache) { + pool->cache_bucket = + kmalloc(IB_FMR_HASH_SIZE * sizeof *pool->cache_bucket, + GFP_KERNEL); + if (!pool->cache_bucket) { + printk(KERN_WARNING "Failed to allocate cache in pool"); + ret = -ENOMEM; + goto out_free_pool; + } + + for (i = 0; i < IB_FMR_HASH_SIZE; ++i) + INIT_HLIST_HEAD(pool->cache_bucket + i); + } + + pool->pool_size = 0; + pool->max_pages = params->max_pages_per_fmr; + pool->dirty_watermark = params->dirty_watermark; + pool->dirty_len = 0; + spin_lock_init(&pool->pool_lock); + atomic_set(&pool->req_ser, 0); + atomic_set(&pool->flush_ser, 0); + init_waitqueue_head(&pool->force_wait); + + pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); + if (IS_ERR(pool->thread)) { + printk(KERN_WARNING "couldn't start cleanup thread"); + ret = PTR_ERR(pool->thread); + goto out_free_pool; + } + + { + struct ib_pool_fmr *fmr; + struct ib_fmr_attr attr = { + .max_pages = params->max_pages_per_fmr, + .max_maps = IB_FMR_MAX_REMAPS, + .page_size = PAGE_SHIFT + }; + + for (i = 0; i < params->pool_size; ++i) { + fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), + GFP_KERNEL); + if (!fmr) { + printk(KERN_WARNING "failed to allocate fmr struct " + "for FMR %d", i); + goto out_fail; + } + + fmr->pool = pool; + fmr->remap_count = 0; + fmr->ref_count = 0; + INIT_HLIST_NODE(&fmr->cache_node); + + fmr->fmr = ib_alloc_fmr(pd, params->access, &attr); + if (IS_ERR(fmr->fmr)) { + printk(KERN_WARNING "fmr_create failed for FMR %d", i); + kfree(fmr); + goto out_fail; + } + + list_add_tail(&fmr->list, &pool->free_list); + ++pool->pool_size; + } + } + + return pool; + + out_free_pool: + kfree(pool->cache_bucket); + kfree(pool); + + return ERR_PTR(ret); + + out_fail: + ib_destroy_fmr_pool(pool); + + return ERR_PTR(-ENOMEM); +} +EXPORT_SYMBOL(ib_create_fmr_pool); + +/** + * ib_destroy_fmr_pool - Free FMR pool + * @pool:FMR pool to free + * + * Destroy an FMR pool and free all associated resources. 
+ */ +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool) +{ + struct ib_pool_fmr *fmr; + struct ib_pool_fmr *tmp; + int i; + + kthread_stop(pool->thread); + ib_fmr_batch_release(pool); + + i = 0; + list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) { + ib_dealloc_fmr(fmr->fmr); + list_del(&fmr->list); + kfree(fmr); + ++i; + } + + if (i < pool->pool_size) + printk(KERN_WARNING "pool still has %d regions registered", + pool->pool_size - i); + + kfree(pool->cache_bucket); + kfree(pool); + + return 0; +} +EXPORT_SYMBOL(ib_destroy_fmr_pool); + +/** + * ib_flush_fmr_pool - Invalidate all unmapped FMRs + * @pool:FMR pool to flush + * + * Ensure that all unmapped FMRs are fully invalidated. + */ +int ib_flush_fmr_pool(struct ib_fmr_pool *pool) +{ + int serial; + + atomic_inc(&pool->req_ser); + /* + * It's OK if someone else bumps req_ser again here -- we'll + * just wait a little longer. + */ + serial = atomic_read(&pool->req_ser); + + wake_up_process(pool->thread); + + if (wait_event_interruptible(pool->force_wait, + atomic_read(&pool->flush_ser) - + atomic_read(&pool->req_ser) >= 0)) + return -EINTR; + + return 0; +} +EXPORT_SYMBOL(ib_flush_fmr_pool); + +/** + * ib_fmr_pool_map_phys - + * @pool:FMR pool to allocate FMR from + * @page_list:List of pages to map + * @list_len:Number of pages in @page_list + * @io_virtual_address:I/O virtual address for new FMR + * + * Map an FMR from an FMR pool. + */ +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address) +{ + struct ib_fmr_pool *pool = pool_handle; + struct ib_pool_fmr *fmr; + unsigned long flags; + int result; + + if (list_len < 1 || list_len > pool->max_pages) + return ERR_PTR(-EINVAL); + + spin_lock_irqsave(&pool->pool_lock, flags); + fmr = ib_fmr_cache_lookup(pool, + page_list, + list_len, + *io_virtual_address); + if (fmr) { + /* found in cache */ + ++fmr->ref_count; + if (fmr->ref_count == 1) { + list_del(&fmr->list); + } + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return fmr; + } + + if (list_empty(&pool->free_list)) { + spin_unlock_irqrestore(&pool->pool_lock, flags); + return ERR_PTR(-EAGAIN); + } + + fmr = list_entry(pool->free_list.next, struct ib_pool_fmr, list); + list_del(&fmr->list); + hlist_del_init(&fmr->cache_node); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + result = ib_map_phys_fmr(fmr->fmr, page_list, list_len, + *io_virtual_address); + + if (result) { + spin_lock_irqsave(&pool->pool_lock, flags); + list_add(&fmr->list, &pool->free_list); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + printk(KERN_WARNING "fmr_map returns %d", + result); + + return ERR_PTR(result); + } + + ++fmr->remap_count; + fmr->ref_count = 1; + + if (pool->cache_bucket) { + fmr->io_virtual_address = *io_virtual_address; + fmr->page_list_len = list_len; + memcpy(fmr->page_list, page_list, list_len * sizeof(*page_list)); + + spin_lock_irqsave(&pool->pool_lock, flags); + hlist_add_head(&fmr->cache_node, + pool->cache_bucket + ib_fmr_hash(fmr->page_list[0])); + spin_unlock_irqrestore(&pool->pool_lock, flags); + } + + return fmr; +} +EXPORT_SYMBOL(ib_fmr_pool_map_phys); + +/** + * ib_fmr_pool_unmap - Unmap FMR + * @fmr:FMR to unmap + * + * Unmap an FMR. The FMR mapping may remain valid until the FMR is + * reused (or until ib_flush_fmr_pool() is called). 
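+ *
+ * A typical map/unmap cycle looks roughly like this (pages[] holds the
+ * caller-supplied physical addresses to map and iova is the requested
+ * I/O virtual address):
+ *
+ *	pool_fmr = ib_fmr_pool_map_phys(pool, pages, npages, &iova);
+ *	if (IS_ERR(pool_fmr))
+ *		return PTR_ERR(pool_fmr);
+ *	... post work requests that reference the mapped region ...
+ *	ib_fmr_pool_unmap(pool_fmr);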
+ */ +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr) +{ + struct ib_fmr_pool *pool; + unsigned long flags; + + pool = fmr->pool; + + spin_lock_irqsave(&pool->pool_lock, flags); + + --fmr->ref_count; + if (!fmr->ref_count) { + if (fmr->remap_count < IB_FMR_MAX_REMAPS) { + list_add_tail(&fmr->list, &pool->free_list); + } else { + list_add_tail(&fmr->list, &pool->dirty_list); + ++pool->dirty_len; + wake_up_process(pool->thread); + } + } + +#ifdef DEBUG + if (fmr->ref_count < 0) + printk(KERN_WARNING "FMR %p has ref count %d < 0", + fmr, fmr->ref_count); +#endif + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_fmr_pool_unmap); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/packer.c 2004-12-13 09:44:40.335783194 -0800 @@ -0,0 +1,190 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: packer.c 1292 2004-11-25 05:24:40Z roland $ + */ + +#include + +static u64 value_read(int offset, int size, void *structure) +{ + switch (size) { + case 1: return *(u8 *) (structure + offset); + case 2: return be16_to_cpup((__be16 *) (structure + offset)); + case 4: return be32_to_cpup((__be32 *) (structure + offset)); + case 8: return be64_to_cpup((__be64 *) (structure + offset)); + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + return 0; + } +} + +/** + * ib_pack - Pack a structure into a buffer + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @structure:Structure to pack from + * @buf:Buffer to pack into + * + * ib_pack() packs a list of structure fields into a buffer, + * controlled by the array of fields in @desc. 
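+ *
+ * Each entry in @desc describes one field: where it lives in the host
+ * structure and where it lives in the wire-format buffer. An
+ * illustrative entry (struct my_hdr and the offsets are made up for
+ * the example) might look like:
+ *
+ *	{ .struct_offset_bytes = offsetof(struct my_hdr, dlid),
+ *	  .struct_size_bytes   = 2,
+ *	  .offset_words        = 1,
+ *	  .offset_bits         = 16,
+ *	  .size_bits           = 16,
+ *	  .field_name          = "DLID" }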
+ */ +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + __be32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be32(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be32 *) buf + desc[i].offset_words; + *addr = (*addr & ~mask) | (cpu_to_be32(val) & mask); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + __be64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be64(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be64 *) ((__be32 *) buf + desc[i].offset_words); + *addr = (*addr & ~mask) | (cpu_to_be64(val) & mask); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + if (desc[i].struct_size_bytes) + memcpy(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + structure + desc[i].struct_offset_bytes, + desc[i].size_bits / 8); + else + memset(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + 0, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_pack); + +static void value_write(int offset, int size, u64 val, void *structure) +{ + switch (size * 8) { + case 8: *( u8 *) (structure + offset) = val; break; + case 16: *(__be16 *) (structure + offset) = cpu_to_be16(val); break; + case 32: *(__be32 *) (structure + offset) = cpu_to_be32(val); break; + case 64: *(__be64 *) (structure + offset) = cpu_to_be64(val); break; + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + } +} + +/** + * ib_unpack - Unpack a buffer into a structure + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @buf:Buffer to unpack from + * @structure:Structure to unpack into + * + * ib_pack() unpacks a list of structure fields from a buffer, + * controlled by the array of fields in @desc. 
+ */ +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (!desc[i].struct_size_bytes) + continue; + + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + u32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be32 *) buf + desc[i].offset_words; + val = (be32_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + u64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be64 *) buf + desc[i].offset_words; + val = (be64_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + memcpy(structure + desc[i].struct_offset_bytes, + buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sysfs.c 2004-12-13 09:44:40.421770527 -0800 @@ -0,0 +1,684 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: sysfs.c 1257 2004-11-17 23:12:18Z roland $ +*/ + +#include "core_priv.h" + +#include + +struct ib_port { + struct kobject kobj; + struct ib_device *ibdev; + struct attribute_group gid_group; + struct attribute **gid_attr; + struct attribute_group pkey_group; + struct attribute **pkey_attr; + u8 port_num; +}; + +struct port_attribute { + struct attribute attr; + ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf); + ssize_t (*store)(struct ib_port *, struct port_attribute *, + const char *buf, size_t count); +}; + +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) + +#define PORT_ATTR_RO(_name) \ +struct port_attribute port_attr_##_name = __ATTR_RO(_name) + +struct port_table_attribute { + struct port_attribute attr; + int index; +}; + +static ssize_t port_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + + if (!port_attr->show) + return 0; + + return port_attr->show(p, port_attr, buf); +} + +static struct sysfs_ops port_sysfs_ops = { + .show = port_attr_show +}; + +static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + static const char *state_name[] = { + [IB_PORT_NOP] = "NOP", + [IB_PORT_DOWN] = "DOWN", + [IB_PORT_INIT] = "INIT", + [IB_PORT_ARMED] = "ARMED", + [IB_PORT_ACTIVE] = "ACTIVE", + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" + }; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ?
+ state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +#define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + 
+ in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + if (!in_mad || !out_mad) { + ret = -ENOMEM; + goto out; + } + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + + in_mad->data[41] = p->port_num; /* PortSelect field */ + + if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff, + in_mad, out_mad) & + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { + ret = -EINVAL; + goto out; + } + + switch (width) { + case 4: + ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >> + (offset % 4)) & 0xf); + break; + case 8: + ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]); + break; + case 16: + ret = sprintf(buf, "%u\n", + be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8))); + break; + case 32: + ret = sprintf(buf, "%u\n", + be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8))); + break; + default: + ret = 0; + } + +out: + kfree(in_mad); + kfree(out_mad); + + return ret; +} + +static PORT_PMA_ATTR(symbol_error , 0, 16, 32); +static PORT_PMA_ATTR(link_error_recovery , 1, 8, 48); +static PORT_PMA_ATTR(link_downed , 2, 8, 56); +static PORT_PMA_ATTR(port_rcv_errors , 3, 16, 64); +static PORT_PMA_ATTR(port_rcv_remote_physical_errors, 4, 16, 80); +static PORT_PMA_ATTR(port_rcv_switch_relay_errors , 5, 16, 96); +static PORT_PMA_ATTR(port_xmit_discards , 6, 16, 112); +static PORT_PMA_ATTR(port_xmit_constraint_errors , 7, 8, 128); +static PORT_PMA_ATTR(port_rcv_constraint_errors , 8, 8, 136); +static PORT_PMA_ATTR(local_link_integrity_errors , 9, 4, 152); +static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10, 4, 156); +static PORT_PMA_ATTR(VL15_dropped , 11, 16, 176); +static PORT_PMA_ATTR(port_xmit_data , 12, 32, 192); +static PORT_PMA_ATTR(port_rcv_data , 13, 32, 224); +static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); +static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); + +static struct attribute *pma_attrs[] = { + &port_pma_attr_symbol_error.attr.attr, + &port_pma_attr_link_error_recovery.attr.attr, + &port_pma_attr_link_downed.attr.attr, + &port_pma_attr_port_rcv_errors.attr.attr, + &port_pma_attr_port_rcv_remote_physical_errors.attr.attr, + &port_pma_attr_port_rcv_switch_relay_errors.attr.attr, + &port_pma_attr_port_xmit_discards.attr.attr, + &port_pma_attr_port_xmit_constraint_errors.attr.attr, + &port_pma_attr_port_rcv_constraint_errors.attr.attr, + &port_pma_attr_local_link_integrity_errors.attr.attr, + &port_pma_attr_excessive_buffer_overrun_errors.attr.attr, + &port_pma_attr_VL15_dropped.attr.attr, + &port_pma_attr_port_xmit_data.attr.attr, + &port_pma_attr_port_rcv_data.attr.attr, + &port_pma_attr_port_xmit_packets.attr.attr, + &port_pma_attr_port_rcv_packets.attr.attr, + NULL +}; + +static struct attribute_group pma_group = { + .name = "counters", + .attrs = pma_attrs +}; + +static void ib_port_release(struct kobject *kobj) +{ + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + struct attribute *a; + int i; + + for (i = 0; (a = p->gid_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + for (i = 0; (a = p->pkey_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + kfree(p->gid_attr); + kfree(p); +} + +static struct kobj_type port_type = { + .release = ib_port_release, + .sysfs_ops = &port_sysfs_ops, + 
.default_attrs = port_default_attrs +}; + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *cdev, char **envp, + int num_envp, char *buf, int size) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + int i = 0, len = 0; + + if (add_hotplug_env_var(envp, num_envp, &i, buf, size, &len, + "NAME=%s", dev->name)) + return -ENOMEM; + + /* + * It might be nice to pass the node GUID to hotplug, but + * right now the only way to get it is to query the device + * provider, and this can crash during device removal because + * we are will be running after driver removal has started. + * We could add a node_guid field to struct ib_device, or we + * could just let the hotplug script read the node GUID from + * sysfs when devices are added. + */ + + envp[i] = NULL; + return 0; +} + +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = sysfs_create_group(&p->kobj, &pma_group); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + 
kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + int ret; + int i; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + INIT_LIST_HEAD(&device->port_list); + + ret = class_device_register(class_dev); + if (ret) + goto err; + + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; +} + +void ib_device_unregister_sysfs(struct ib_device *device) +{ + struct kobject *p, *t; + struct ib_port *port; + 
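+	/*
+	 * Undo everything ib_device_register_sysfs() set up: remove the
+	 * "counters", "gids" and "pkeys" groups from each port and
+	 * unregister its kobject, then drop the "ports" directory and
+	 * finally the class device itself.
+	 */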
+ list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/ud_header.c 2004-12-13 09:44:40.360779512 -0800 @@ -0,0 +1,354 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ud_header.c 1292 2004-11-25 05:24:40Z roland $ + */ + +#include + +#include + +#define STRUCT_FIELD(header, field) \ + .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ + .struct_size_bytes = sizeof ((struct ib_unpacked_ ## header *) 0)->field, \ + .field_name = #header ":" #field + +static const struct ib_field lrh_table[] = { + { STRUCT_FIELD(lrh, virtual_lane), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, link_version), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, service_level), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 4 }, + { RESERVED, + .offset_words = 0, + .offset_bits = 12, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, link_next_header), + .offset_words = 0, + .offset_bits = 14, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, destination_lid), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 5 }, + { STRUCT_FIELD(lrh, packet_length), + .offset_words = 1, + .offset_bits = 5, + .size_bits = 11 }, + { STRUCT_FIELD(lrh, source_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 } +}; + +static const struct ib_field grh_table[] = { + { STRUCT_FIELD(grh, ip_version), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(grh, traffic_class), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 8 }, + { STRUCT_FIELD(grh, flow_label), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 20 }, + { STRUCT_FIELD(grh, payload_length), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(grh, next_header), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 8 }, + { STRUCT_FIELD(grh, hop_limit), + .offset_words = 1, + .offset_bits = 24, + .size_bits = 8 }, + { STRUCT_FIELD(grh, source_gid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { 
STRUCT_FIELD(grh, destination_gid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 } +}; + +static const struct ib_field bth_table[] = { + { STRUCT_FIELD(bth, opcode), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, solicited_event), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 1 }, + { STRUCT_FIELD(bth, mig_req), + .offset_words = 0, + .offset_bits = 9, + .size_bits = 1 }, + { STRUCT_FIELD(bth, pad_count), + .offset_words = 0, + .offset_bits = 10, + .size_bits = 2 }, + { STRUCT_FIELD(bth, transport_header_version), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 4 }, + { STRUCT_FIELD(bth, pkey), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, destination_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 }, + { STRUCT_FIELD(bth, ack_req), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 2, + .offset_bits = 1, + .size_bits = 7 }, + { STRUCT_FIELD(bth, psn), + .offset_words = 2, + .offset_bits = 8, + .size_bits = 24 } +}; + +static const struct ib_field deth_table[] = { + { STRUCT_FIELD(deth, qkey), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(deth, source_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 } +}; + +/** + * ib_ud_header_init - Initialize UD header structure + * @payload_bytes:Length of packet payload + * @grh_present:GRH flag (if non-zero, GRH will be included) + * @header:Structure to initialize + * + * ib_ud_header_init() initializes the lrh.link_version, lrh.link_next_header, + * lrh.packet_length, grh.ip_version, grh.payload_length, + * grh.next_header, bth.opcode, bth.pad_count and + * bth.transport_header_version fields of a &struct ib_ud_header given + * the payload length and whether a GRH will be included. + */ +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) { + header_len += IB_GRH_BYTES; + } + + header->lrh.link_version = 0; + header->lrh.link_next_header = + grh_present ? IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL; + header->lrh.packet_length = (IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) / 4; /* round up */ + + header->grh_present = grh_present; + if (grh_present) { + header->lrh.packet_length += IB_GRH_BYTES / 4; + + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + cpu_to_be16s(&header->lrh.packet_length); + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_ud_header_init); + +/** + * ib_ud_header_pack - Pack UD header struct into wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() packs the UD header structure @header into wire + * format in the buffer @buf. 
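+ *
+ * A minimal usage sketch, with payload_len standing in for the UD
+ * payload size in bytes (error handling omitted; the caller still has
+ * to fill in the address fields of hdr.lrh, hdr.bth and hdr.deth
+ * between initializing and packing):
+ *
+ *	struct ib_ud_header hdr;
+ *	u8 wire[IB_LRH_BYTES + IB_BTH_BYTES + IB_DETH_BYTES];
+ *	int len;
+ *
+ *	ib_ud_header_init(payload_len, 0, &hdr);
+ *	len = ib_ud_header_pack(&hdr, wire);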
+ */ +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(lrh_table, ARRAY_SIZE(lrh_table), + &header->lrh, buf); + len += IB_LRH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(ib_ud_header_pack); + +/** + * ib_ud_header_unpack - Unpack UD header struct from wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() unpacks the UD header structure @header from wire + * format in the buffer @buf. + */ +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header) +{ + ib_unpack(lrh_table, ARRAY_SIZE(lrh_table), + buf, &header->lrh); + buf += IB_LRH_BYTES; + + if (header->lrh.link_version != 0) { + printk(KERN_WARNING "Invalid LRH.link_version %d\n", + header->lrh.link_version); + return -EINVAL; + } + + switch (header->lrh.link_next_header) { + case IB_LNH_IBA_LOCAL: + header->grh_present = 0; + break; + + case IB_LNH_IBA_GLOBAL: + header->grh_present = 1; + ib_unpack(grh_table, ARRAY_SIZE(grh_table), + buf, &header->grh); + buf += IB_GRH_BYTES; + + if (header->grh.ip_version != 6) { + printk(KERN_WARNING "Invalid GRH.ip_version %d\n", + header->grh.ip_version); + return -EINVAL; + } + if (header->grh.next_header != 0x1b) { + printk(KERN_WARNING "Invalid GRH.next_header 0x%02x\n", + header->grh.next_header); + return -EINVAL; + } + break; + + default: + printk(KERN_WARNING "Invalid LRH.link_next_header %d\n", + header->lrh.link_next_header); + return -EINVAL; + } + + ib_unpack(bth_table, ARRAY_SIZE(bth_table), + buf, &header->bth); + buf += IB_BTH_BYTES; + + switch (header->bth.opcode) { + case IB_OPCODE_UD_SEND_ONLY: + header->immediate_present = 0; + break; + case IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE: + header->immediate_present = 1; + break; + default: + printk(KERN_WARNING "Invalid BTH.opcode 0x%02x\n", + header->bth.opcode); + return -EINVAL; + } + + if (header->bth.transport_header_version != 0) { + printk(KERN_WARNING "Invalid BTH.transport_header_version %d\n", + header->bth.transport_header_version); + return -EINVAL; + } + + ib_unpack(deth_table, ARRAY_SIZE(deth_table), + buf, &header->deth); + buf += IB_DETH_BYTES; + + if (header->immediate_present) + memcpy(&header->immediate_data, buf, sizeof header->immediate_data); + + return 0; +} +EXPORT_SYMBOL(ib_ud_header_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/verbs.c 2004-12-13 09:44:40.392774798 -0800 @@ -0,0 +1,420 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include +#include + +#include + +/* Protection domains */ + +struct ib_pd *ib_alloc_pd(struct ib_device *device) +{ + struct ib_pd *pd; + + pd = device->alloc_pd(device); + + if (!IS_ERR(pd)) { + pd->device = device; + atomic_set(&pd->usecnt, 0); + } + + return pd; +} +EXPORT_SYMBOL(ib_alloc_pd); + +int ib_dealloc_pd(struct ib_pd *pd) +{ + if (atomic_read(&pd->usecnt)) + return -EBUSY; + + return pd->device->dealloc_pd(pd); +} +EXPORT_SYMBOL(ib_dealloc_pd); + +/* Address handles */ + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct ib_ah *ah; + + ah = pd->device->create_ah(pd, ah_attr); + + if (!IS_ERR(ah)) { + ah->device = pd->device; + ah->pd = pd; + atomic_inc(&pd->usecnt); + } + + return ah; +} +EXPORT_SYMBOL(ib_create_ah); + +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_ah); + +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? + ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_ah); + +int ib_destroy_ah(struct ib_ah *ah) +{ + struct ib_pd *pd; + int ret; + + pd = ah->pd; + ret = ah->device->destroy_ah(ah); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_destroy_ah); + +/* Queue pairs */ + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct ib_qp *qp; + + qp = pd->device->create_qp(pd, qp_init_attr); + + if (!IS_ERR(qp)) { + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; + atomic_inc(&pd->usecnt); + atomic_inc(&qp_init_attr->send_cq->usecnt); + atomic_inc(&qp_init_attr->recv_cq->usecnt); + if (qp_init_attr->srq) + atomic_inc(&qp_init_attr->srq->usecnt); + } + + return qp; +} +EXPORT_SYMBOL(ib_create_qp); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask) +{ + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); +} +EXPORT_SYMBOL(ib_modify_qp); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? 
+ qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_qp); + +int ib_destroy_qp(struct ib_qp *qp) +{ + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_srq *srq; + int ret; + + pd = qp->pd; + scq = qp->send_cq; + rcq = qp->recv_cq; + srq = qp->srq; + + ret = qp->device->destroy_qp(qp); + if (!ret) { + atomic_dec(&pd->usecnt); + atomic_dec(&scq->usecnt); + atomic_dec(&rcq->usecnt); + if (srq) + atomic_dec(&srq->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_destroy_qp); + +/* Completion queues */ + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe) +{ + struct ib_cq *cq; + + cq = device->create_cq(device, cqe); + + if (!IS_ERR(cq)) { + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->cq_context = cq_context; + atomic_set(&cq->usecnt, 0); + } + + return cq; +} +EXPORT_SYMBOL(ib_create_cq); + +int ib_destroy_cq(struct ib_cq *cq) +{ + if (atomic_read(&cq->usecnt)) + return -EBUSY; + + return cq->device->destroy_cq(cq); +} +EXPORT_SYMBOL(ib_destroy_cq); + +int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + int ret; + + if (!cq->device->resize_cq) + return -ENOSYS; + + ret = cq->device->resize_cq(cq, &cqe); + if (!ret) + cq->cqe = cqe; + + return ret; +} +EXPORT_SYMBOL(ib_resize_cq); + +/* Memory regions */ + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *mr; + + mr = pd->device->get_dma_mr(pd, mr_access_flags); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_get_dma_mr); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *mr; + + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_reg_phys_mr); + +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_pd *old_pd; + int ret; + + if (!mr->device->rereg_phys_mr) + return -ENOSYS; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + old_pd = mr->pd; + + ret = mr->device->rereg_phys_mr(mr, mr_rereg_mask, pd, + phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!ret && (mr_rereg_mask & IB_MR_REREG_PD)) { + atomic_dec(&old_pd->usecnt); + atomic_inc(&pd->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_rereg_phys_mr); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? 
+ mr->device->query_mr(mr, mr_attr) : -ENOSYS; +} +EXPORT_SYMBOL(ib_query_mr); + +int ib_dereg_mr(struct ib_mr *mr) +{ + struct ib_pd *pd; + int ret; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + pd = mr->pd; + ret = mr->device->dereg_mr(mr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dereg_mr); + +/* Memory windows */ + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *mw; + + if (!pd->device->alloc_mw) + return ERR_PTR(-ENOSYS); + + mw = pd->device->alloc_mw(pd); + if (!IS_ERR(mw)) { + mw->device = pd->device; + mw->pd = pd; + atomic_inc(&pd->usecnt); + } + + return mw; +} +EXPORT_SYMBOL(ib_alloc_mw); + +int ib_dealloc_mw(struct ib_mw *mw) +{ + struct ib_pd *pd; + int ret; + + pd = mw->pd; + ret = mw->device->dealloc_mw(mw); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_mw); + +/* "Fast" memory regions */ + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *fmr; + + if (!pd->device->alloc_fmr) + return ERR_PTR(-ENOSYS); + + fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr); + if (!IS_ERR(fmr)) { + fmr->device = pd->device; + fmr->pd = pd; + atomic_inc(&pd->usecnt); + } + + return fmr; +} +EXPORT_SYMBOL(ib_alloc_fmr); + +int ib_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + + if (list_empty(fmr_list)) + return 0; + + fmr = list_entry(fmr_list->next, struct ib_fmr, list); + return fmr->device->unmap_fmr(fmr_list); +} +EXPORT_SYMBOL(ib_unmap_fmr); + +int ib_dealloc_fmr(struct ib_fmr *fmr) +{ + struct ib_pd *pd; + int ret; + + pd = fmr->pd; + ret = fmr->device->dealloc_fmr(fmr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_fmr); + +/* Multicast groups */ + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_attach_mcast); + +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->detach_mcast ? + qp->device->detach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_detach_mcast); From roland at topspin.com Mon Dec 13 10:09:23 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:23 -0800 Subject: [openib-general] [PATCH][v3][3/21] Hook up drivers/infiniband In-Reply-To: <20041213109.B80JuEFdg6Nma7kr@topspin.com> Message-ID: <20041213109.GUz9r6Ey44wtDeiY@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. 
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/Kconfig 2004-12-11 15:16:34.000000000 -0800 +++ linux-bk/drivers/Kconfig 2004-12-13 09:44:41.933547814 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu --- linux-bk.orig/drivers/Makefile 2004-12-11 15:16:49.000000000 -0800 +++ linux-bk/drivers/Makefile 2004-12-13 09:44:41.953544868 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Mon Dec 13 10:09:23 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:23 -0800 Subject: [openib-general] [PATCH][v3][4/21] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <20041213109.GUz9r6Ey44wtDeiY@topspin.com> Message-ID: <20041213109.kvbhOIc6xDgg0Bag@topspin.com> Add public headers for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_mad.h 2004-12-13 09:44:42.628445443 -0800 @@ -0,0 +1,393 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30 +#define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 cpu_to_be32(1) +#define IB_QP1_QKEY 0x80010000 + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +} __attribute__ ((packed)); + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +} __attribute__ ((packed)); + +struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +} __attribute__ ((packed)); + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +} __attribute__ ((packed)); + +struct ib_vendor_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 reserved; + u8 oui[3]; + u8 data[216]; +} __attribute__ ((packed)); + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent: MAD agent that sent the MAD. + * @mad_send_wc: Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ +typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent: MAD agent requesting the received MAD. + * @mad_recv_wc: Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. All data buffers given + * to registered agents through this routine are owned by the receiving + * client, except for snooping agents. Clients snooping MADs should not + * modify the data referenced by @mad_recv_wc. 
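+ *
+ * A minimal handler sketch (my_recv_handler and handle_mad stand in
+ * for client code); the buffers must be handed back to the access
+ * layer with ib_free_recv_mad() once the client is done with them:
+ *
+ *	static void my_recv_handler(struct ib_mad_agent *agent,
+ *				    struct ib_mad_recv_wc *wc)
+ *	{
+ *		handle_mad(wc->recv_buf.mad, agent->context);
+ *		ib_free_recv_mad(wc);
+ *	}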
+ */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device: Reference to device registration is on. + * @qp: Reference to QP used for sending and receiving MADs. + * @recv_handler: Callback handler for a received MAD. + * @send_handler: Callback handler for a sent MAD. + * @snoop_handler: Callback handler for snooped sent MADs. + * @context: User-specified context associated with this registration. + * @hi_tid: Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + * @port_num: Port number on which QP is registered + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + ib_mad_snoop_handler snoop_handler; + void *context; + u32 hi_tid; + u8 port_num; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id: Work request identifier associated with the send MAD request. + * @status: Completion status. + * @vendor_err: Optional vendor error information returned with a failed + * request. + */ +struct ib_mad_send_wc { + u64 wr_id; + enum ib_wc_status status; + u32 vendor_err; +}; + +/** + * ib_mad_recv_buf - received MAD buffer information. + * @list: Reference to next data buffer for a received RMPP MAD. + * @grh: References a data buffer containing the global route header. + * The data refereced by this buffer is only valid if the GRH is + * valid. + * @mad: References the start of the received MAD. + */ +struct ib_mad_recv_buf { + struct list_head list; + struct ib_grh *grh; + struct ib_mad *mad; +}; + +/** + * ib_mad_recv_wc - received MAD information. + * @wc: Completion information for the received data. + * @recv_buf: Specifies the location of the received data buffer(s). + * @mad_len: The length of the received MAD, without duplicated headers. + * + * For received response, the wr_id field of the wc is set to the wr_id + * for the corresponding send request. + */ +struct ib_mad_recv_wc { + struct ib_wc *wc; + struct ib_mad_recv_buf recv_buf; + int mad_len; +}; + +/** + * ib_mad_reg_req - MAD registration request + * @mgmt_class: Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version: Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + * @method_mask: The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + */ +struct ib_mad_reg_req { + u8 mgmt_class; + u8 mgmt_class_version; + u8 oui[3]; + DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); +}; + +/** + * ib_register_mad_agent - Register to send/receive MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP to access. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_reg_req: Specifies which unsolicited MADs should be received + * by the caller. This parameter may be NULL if the caller only + * wishes to receive solicited responses. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. 
+ * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +enum ib_mad_snoop_flags { + /*IB_MAD_SNOOP_POSTED_SENDS = 1,*/ + /*IB_MAD_SNOOP_RMPP_SENDS = (1<<1),*/ + IB_MAD_SNOOP_SEND_COMPLETIONS = (1<<2), + /*IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS = (1<<3),*/ + IB_MAD_SNOOP_RECVS = (1<<4) + /*IB_MAD_SNOOP_RMPP_RECVS = (1<<5),*/ + /*IB_MAD_SNOOP_REDIRECTED_QPS = (1<<6)*/ +}; + +/** + * ib_register_mad_snoop - Register to snoop sent and received MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP traffic to snoop. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_snoop_flags: Specifies information where snooping occurs. + * @send_handler: The callback routine invoked for a snooped send. + * @recv_handler: The callback routine invoked for a snooped receive. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_unregister_mad_agent - Unregisters a client from using MAD services. + * @mad_agent: Corresponding MAD registration request to deregister. + * + * After invoking this routine, MAD services are no longer usable by the + * client on the associated QP. + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); + +/** + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client. + * @mad_agent: Specifies the associated registration to post the send to. + * @send_wr: Specifies the information needed to send the MAD(s). + * @bad_send_wr: Specifies the MAD on which an error was encountered. + * + * Sent MADs are not guaranteed to complete in the order that they were posted. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc: Work completion information for a received MAD. + * @buf: User-provided data buffer to receive the coalesced buffers. The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc: Work completion information for a received MAD. + * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_cancel_mad - Cancels an outstanding send MAD operation. 
+ * @mad_agent: Specifies the registration associated with sent MAD. + * @wr_id: Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp: Reference to a QP that requires MAD services. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_mad_post_send. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent: Specifies the registered MAD service using the redirected QP. + * @wc: References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request. + */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_smi.h 2004-12-13 09:44:42.653441760 -0800 @@ -0,0 +1,67 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION cpu_to_be16(0x8000) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + +#endif /* IB_SMI_H */ From roland at topspin.com Mon Dec 13 10:09:27 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:27 -0800 Subject: [openib-general] [PATCH][v3][6/21] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041213109.3tK6alRLJABxH4bu@topspin.com> Message-ID: <20041213109.cVS0twN822l4xQbR@topspin.com> Add support for sending queries to the SA (Subnet Administration). In particular the PathRecord and MCMember (multicast group member) used by the IP-over-InfiniBand driver are implemented. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-13 09:44:42.966395657 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-13 09:44:43.579305364 -0800 @@ -2,7 +2,8 @@ obj-$(CONFIG_INFINIBAND) += \ ib_core.o \ - ib_mad.o + ib_mad.o \ + ib_sa.o ib_core-objs := \ packer.o \ @@ -17,3 +18,5 @@ mad.o \ smi.o \ agent.o + +ib_sa-objs := sa_query.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-12-13 09:44:43.603301828 -0800 @@ -0,0 +1,855 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); +MODULE_LICENSE("Dual BSD/GPL"); + +/* + * These two structures must be packed because they have 64-bit fields + * that are only 32-bit aligned. 64-bit architectures will lay them + * out wrong otherwise. 
(And unfortunately they are sent on the wire + * so we can't change the layout) + */ +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + +struct ib_sa_sm_ah { + struct ib_ah *ah; + struct kref ref; +}; + +struct ib_sa_port { + struct ib_mad_agent *agent; + struct ib_mr *mr; + struct ib_sa_sm_ah *sm_ah; + struct work_struct update_task; + spinlock_t ah_lock; + u8 port_num; +}; + +struct ib_sa_device { + int start_port, end_port; + struct ib_event_handler event_handler; + struct ib_sa_port port[0]; +}; + +struct ib_sa_query { + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); + void (*release)(struct ib_sa_query *); + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_sm_ah *sm_ah; + DECLARE_PCI_UNMAP_ADDR(mapping) + int id; +}; + +struct ib_sa_path_query { + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +struct ib_sa_mcmember_query { + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +static void ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device); + +static struct ib_client sa_client = { + .name = "sa", + .add = ib_sa_add_one, + .remove = ib_sa_remove_one +}; + +static spinlock_t idr_lock; +static DEFINE_IDR(query_idr); + +static spinlock_t tid_lock; +static u32 tid; + +enum { + IB_SA_ATTR_CLASS_PORTINFO = 0x01, + IB_SA_ATTR_NOTICE = 0x02, + IB_SA_ATTR_INFORM_INFO = 0x03, + IB_SA_ATTR_NODE_REC = 0x11, + IB_SA_ATTR_PORT_INFO_REC = 0x12, + IB_SA_ATTR_SL2VL_REC = 0x13, + IB_SA_ATTR_SWITCH_REC = 0x14, + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, + IB_SA_ATTR_MCAST_FDB_REC = 0x17, + IB_SA_ATTR_SM_INFO_REC = 0x18, + IB_SA_ATTR_LINK_REC = 0x20, + IB_SA_ATTR_GUID_INFO_REC = 0x30, + IB_SA_ATTR_SERVICE_REC = 0x31, + IB_SA_ATTR_PARTITION_REC = 0x33, + IB_SA_ATTR_RANGE_REC = 0x34, + IB_SA_ATTR_PATH_REC = 0x35, + IB_SA_ATTR_VL_ARB_REC = 0x36, + IB_SA_ATTR_MC_GROUP_REC = 0x37, + IB_SA_ATTR_MC_MEMBER_REC = 0x38, + IB_SA_ATTR_TRACE_REC = 0x39, + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b +}; + +#define PATH_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ + .field_name = "sa_path_rec:" #field + +static const struct ib_field path_rec_table[] = { + { RESERVED, + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 32 }, + { PATH_REC_FIELD(dgid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(sgid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(dlid), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { PATH_REC_FIELD(slid), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 16 }, + { PATH_REC_FIELD(raw_traffic), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 11, + .offset_bits = 1, + .size_bits = 3 }, + { PATH_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { PATH_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { 
PATH_REC_FIELD(traffic_class), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 8 }, + { PATH_REC_FIELD(reversible), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { PATH_REC_FIELD(numb_path), + .offset_words = 12, + .offset_bits = 9, + .size_bits = 7 }, + { PATH_REC_FIELD(pkey), + .offset_words = 12, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 13, + .offset_bits = 0, + .size_bits = 12 }, + { PATH_REC_FIELD(sl), + .offset_words = 13, + .offset_bits = 12, + .size_bits = 4 }, + { PATH_REC_FIELD(mtu_selector), + .offset_words = 13, + .offset_bits = 16, + .size_bits = 2 }, + { PATH_REC_FIELD(mtu), + .offset_words = 13, + .offset_bits = 18, + .size_bits = 6 }, + { PATH_REC_FIELD(rate_selector), + .offset_words = 13, + .offset_bits = 24, + .size_bits = 2 }, + { PATH_REC_FIELD(rate), + .offset_words = 13, + .offset_bits = 26, + .size_bits = 6 }, + { PATH_REC_FIELD(packet_life_time_selector), + .offset_words = 14, + .offset_bits = 0, + .size_bits = 2 }, + { PATH_REC_FIELD(packet_life_time), + .offset_words = 14, + .offset_bits = 2, + .size_bits = 6 }, + { PATH_REC_FIELD(preference), + .offset_words = 14, + .offset_bits = 8, + .size_bits = 8 }, + { RESERVED, + .offset_words = 14, + .offset_bits = 16, + .size_bits = 48 }, +}; + +#define MCMEMBER_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_mcmember_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_mcmember_rec *) 0)->field, \ + .field_name = "sa_mcmember_rec:" #field + +static const struct ib_field mcmember_rec_table[] = { + { MCMEMBER_REC_FIELD(mgid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(port_gid), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(qkey), + .offset_words = 8, + .offset_bits = 0, + .size_bits = 32 }, + { MCMEMBER_REC_FIELD(mlid), + .offset_words = 9, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(mtu_selector), + .offset_words = 9, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(mtu), + .offset_words = 9, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(traffic_class), + .offset_words = 9, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(pkey), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(rate_selector), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(rate), + .offset_words = 10, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(packet_life_time_selector), + .offset_words = 10, + .offset_bits = 24, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(packet_life_time), + .offset_words = 10, + .offset_bits = 26, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(sl), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { MCMEMBER_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(scope), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(join_state), + .offset_words = 12, + .offset_bits = 4, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(proxy_join), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { RESERVED, + .offset_words = 12, + .offset_bits = 9, + .size_bits = 23 }, +}; + +static void free_sm_ah(struct kref *kref) +{ + struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); + + 
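+	/* last reference dropped: destroy the SM's address handle and free the wrapper */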
ib_destroy_ah(sm_ah->ah); + kfree(sm_ah); +} + +static void update_sm_ah(void *port_ptr) +{ + struct ib_sa_port *port = port_ptr; + struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + + if (ib_query_port(port->agent->device, port->port_num, &port_attr)) { + printk(KERN_WARNING "Couldn't query port\n"); + return; + } + + new_ah = kmalloc(sizeof *new_ah, GFP_KERNEL); + if (!new_ah) { + printk(KERN_WARNING "Couldn't allocate new SM AH\n"); + return; + } + + kref_init(&new_ah->ref); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + new_ah->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(new_ah->ah)) { + printk(KERN_WARNING "Couldn't create new SM AH\n"); + kfree(new_ah); + return; + } + + spin_lock_irq(&port->ah_lock); + old_ah = port->sm_ah; + port->sm_ah = new_ah; + spin_unlock_irq(&port->ah_lock); + + if (old_ah) + kref_put(&old_ah->ref, free_sm_ah); +} + +static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_sa_device *sa_dev = + ib_get_client_data(event->device, &sa_client); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } +} + +/** + * ib_sa_cancel_query - try to cancel an SA query + * @id:ID of query to cancel + * @query:query pointer to cancel + * + * Try to cancel an SA query. If the id and query don't match up or + * the query has already completed, nothing is done. Otherwise the + * query is canceled and will complete with a status of -EINTR. 
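+ *
+ * The @id is the non-negative value returned by the query call and
+ * @query is the handle that call stored through its final argument.
+ * A sketch (my_callback, my_context, rec, comp_mask and timeout_ms
+ * stand in for caller-supplied values):
+ *
+ *	struct ib_sa_query *query;
+ *	int id;
+ *
+ *	id = ib_sa_path_rec_get(device, port_num, &rec, comp_mask,
+ *				timeout_ms, GFP_KERNEL, my_callback,
+ *				my_context, &query);
+ *
+ * and later, to give up on the outstanding query:
+ *
+ *	if (id >= 0)
+ *		ib_sa_cancel_query(id, query);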
+ */ +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + unsigned long flags; + struct ib_mad_agent *agent; + + spin_lock_irqsave(&idr_lock, flags); + if (idr_find(&query_idr, id) != query) { + spin_unlock_irqrestore(&idr_lock, flags); + return; + } + agent = query->port->agent; + spin_unlock_irqrestore(&idr_lock, flags); + + ib_cancel_mad(agent, id); +} +EXPORT_SYMBOL(ib_sa_cancel_query); + +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +{ + unsigned long flags; + + memset(mad, 0, sizeof *mad); + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + + spin_lock_irqsave(&tid_lock, flags); + mad->mad_hdr.tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | tid++); + spin_unlock_irqrestore(&tid_lock, flags); +} + +static int send_mad(struct ib_sa_query *query, int timeout_ms) +{ + struct ib_sa_port *port = query->port; + unsigned long flags; + int ret; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .mad_hdr = &query->mad->mad_hdr, + .remote_qpn = 1, + .remote_qkey = IB_QP1_QKEY, + .timeout_ms = timeout_ms + } + } + }; + +retry: + if (!idr_pre_get(&query_idr, GFP_ATOMIC)) + return -ENOMEM; + spin_lock_irqsave(&idr_lock, flags); + ret = idr_get_new(&query_idr, query, &query->id); + spin_unlock_irqrestore(&idr_lock, flags); + if (ret == -EAGAIN) + goto retry; + if (ret) + return ret; + + wr.wr_id = query->id; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + wr.wr.ud.ah = port->sm_ah->ah; + spin_unlock_irqrestore(&port->ah_lock, flags); + + gather_list.addr = dma_map_single(port->agent->device->dma_device, + query->mad, + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + gather_list.length = sizeof (struct ib_sa_mad); + gather_list.lkey = port->mr->lkey; + pci_unmap_addr_set(query, mapping, gather_list.addr); + + ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(port->agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, query->id); + spin_unlock_irqrestore(&idr_lock, flags); + } + + return ret; +} + +static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_path_query *query = + container_of(sa_query, struct ib_sa_path_query, sa_query); + + if (mad) { + struct ib_sa_path_rec rec; + + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); +} + +/** + * ib_sa_path_rec_get - Start a Path get query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to 
cancel query
+ *
+ * Send a Path Record Get query to the SA to look up a path. The
+ * callback function will be called when the query completes (or
+ * fails); status is 0 for a successful response, -EINTR if the query
+ * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error
+ * occurred sending the query. The resp parameter of the callback is
+ * only valid if status is 0.
+ *
+ * If the return value of ib_sa_path_rec_get() is negative, it is an
+ * error code. Otherwise it is a query ID that can be used to cancel
+ * the query.
+ */
+int ib_sa_path_rec_get(struct ib_device *device, u8 port_num,
+		       struct ib_sa_path_rec *rec,
+		       ib_sa_comp_mask comp_mask,
+		       int timeout_ms, int gfp_mask,
+		       void (*callback)(int status,
+					struct ib_sa_path_rec *resp,
+					void *context),
+		       void *context,
+		       struct ib_sa_query **sa_query)
+{
+	struct ib_sa_path_query *query;
+	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
+	struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port];
+	struct ib_mad_agent *agent = port->agent;
+	int ret;
+
+	query = kmalloc(sizeof *query, gfp_mask);
+	if (!query)
+		return -ENOMEM;
+	query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask);
+	if (!query->sa_query.mad) {
+		kfree(query);
+		return -ENOMEM;
+	}
+
+	query->callback = callback;
+	query->context = context;
+
+	init_mad(query->sa_query.mad, agent);
+
+	query->sa_query.callback = ib_sa_path_rec_callback;
+	query->sa_query.release = ib_sa_path_rec_release;
+	query->sa_query.port = port;
+	query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET;
+	query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC);
+	query->sa_query.mad->sa_hdr.comp_mask = comp_mask;
+
+	ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table),
+		rec, query->sa_query.mad->data);
+
+	*sa_query = &query->sa_query;
+	ret = send_mad(&query->sa_query, timeout_ms);
+	if (ret) {
+		*sa_query = NULL;
+		kfree(query->sa_query.mad);
+		kfree(query);
+	}
+
+	return ret ?
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_mcmember_query *query = + container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + + if (mad) { + struct ib_sa_mcmember_rec rec; + + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); +} + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_mcmember_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_mcmember_rec_callback; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = method; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (!query) + return; + + switch (mad_send_wc->status) { + case IB_WC_SUCCESS: + /* No callback -- already got recv */ + break; + case IB_WC_RESP_TIMEOUT_ERR: + query->callback(query, -ETIMEDOUT, NULL); + break; + case IB_WC_WR_FLUSH_ERR: + query->callback(query, -EINTR, NULL); + break; + default: + query->callback(query, -EIO, NULL); + break; + } + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + + query->release(query); + + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (query) { + if (mad_recv_wc->wc->status == IB_WC_SUCCESS) + query->callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + else + query->callback(query, -EIO, NULL); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void ib_sa_add_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + sa_dev = kmalloc(sizeof *sa_dev + + (e - s + 1) * sizeof (struct ib_sa_port), + GFP_KERNEL); + if (!sa_dev) + return; + + sa_dev->start_port = s; + sa_dev->end_port = e; + + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); + + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + NULL, 0, send_handler, + recv_handler, sa_dev); + if (IS_ERR(sa_dev->port[i].agent)) + goto err; + + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + goto err; + } + + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); + } + + /* + * We register our event handler after everything is set up, + * and then update our cached info after the event handler is + * registered to avoid any problems if a port changes state + * during our initialization. 
+ */ + + INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); + if (ib_register_event_handler(&sa_dev->event_handler)) + goto err; + + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); + + ib_set_client_data(device, &sa_client, sa_dev); + + return; + +err: + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + + kfree(sa_dev); + + return; +} + +static void ib_sa_remove_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i; + + if (!sa_dev) + return; + + ib_unregister_event_handler(&sa_dev->event_handler); + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } + + kfree(sa_dev); +} + +static int __init ib_sa_init(void) +{ + int ret; + + spin_lock_init(&idr_lock); + spin_lock_init(&tid_lock); + + get_random_bytes(&tid, sizeof tid); + + ret = ib_register_client(&sa_client); + if (ret) + printk(KERN_ERR "Couldn't register ib_sa client\n"); + + return ret; +} + +static void __exit ib_sa_cleanup(void) +{ + ib_unregister_client(&sa_client); +} + +module_init(ib_sa_init); +module_exit(ib_sa_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_sa.h 2004-12-13 09:44:43.630297851 -0800 @@ -0,0 +1,269 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +#include +#include + +enum { + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + + IB_SA_METHOD_DELETE = 0x15 +}; + +enum ib_sa_selector { + IB_SA_GTE = 0, + IB_SA_LTE = 1, + IB_SA_EQ = 2, + /* + * The meaning of "best" depends on the attribute: for + * example, for MTU best will return the largest available + * MTU, while for packet life time, best will return the + * smallest available life time. + */ + IB_SA_BEST = 3 +}; + +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +/* + * Structures for SA records are named "struct ib_sa_xxx_rec." No + * attempt is made to pack structures to match the physical layout of + * SA records in SA MADs; all packing and unpacking is handled by the + * SA query code. + * + * For a record with structure ib_sa_xxx_rec, the naming convention + * for the component mask value for field yyy is IB_SA_XXX_REC_YYY (we + * never use different abbreviations or otherwise change the spelling + * of xxx/yyy between ib_sa_xxx_rec.yyy and IB_SA_XXX_REC_YYY). 
+ * + * Reserved rows are indicated with comments to help maintainability. + */ + +/* reserved: 0 */ +/* reserved: 1 */ +#define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) +#define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) +#define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) +#define IB_SA_PATH_REC_SLID IB_SA_COMP_MASK( 5) +#define IB_SA_PATH_REC_RAW_TRAFFIC IB_SA_COMP_MASK( 6) +/* reserved: 7 */ +#define IB_SA_PATH_REC_FLOW_LABEL IB_SA_COMP_MASK( 8) +#define IB_SA_PATH_REC_HOP_LIMIT IB_SA_COMP_MASK( 9) +#define IB_SA_PATH_REC_TRAFFIC_CLASS IB_SA_COMP_MASK(10) +#define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) +#define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) +#define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) +/* reserved: 14 */ +#define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) +#define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) +#define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) +#define IB_SA_PATH_REC_RATE_SELECTOR IB_SA_COMP_MASK(18) +#define IB_SA_PATH_REC_RATE IB_SA_COMP_MASK(19) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(20) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(21) +#define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ib_gid dgid; + union ib_gid sgid; + u16 dlid; + u16 slid; + int raw_traffic; + /* reserved */ + u32 flow_label; + u8 hop_limit; + u8 traffic_class; + int reversible; + u8 numb_path; + u16 pkey; + /* reserved */ + u8 sl; + u8 mtu_selector; + enum ib_mtu mtu; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 preference; +}; + +#define IB_SA_MCMEMBER_REC_MGID IB_SA_COMP_MASK( 0) +#define IB_SA_MCMEMBER_REC_PORT_GID IB_SA_COMP_MASK( 1) +#define IB_SA_MCMEMBER_REC_QKEY IB_SA_COMP_MASK( 2) +#define IB_SA_MCMEMBER_REC_MLID IB_SA_COMP_MASK( 3) +#define IB_SA_MCMEMBER_REC_MTU_SELECTOR IB_SA_COMP_MASK( 4) +#define IB_SA_MCMEMBER_REC_MTU IB_SA_COMP_MASK( 5) +#define IB_SA_MCMEMBER_REC_TRAFFIC_CLASS IB_SA_COMP_MASK( 6) +#define IB_SA_MCMEMBER_REC_PKEY IB_SA_COMP_MASK( 7) +#define IB_SA_MCMEMBER_REC_RATE_SELECTOR IB_SA_COMP_MASK( 8) +#define IB_SA_MCMEMBER_REC_RATE IB_SA_COMP_MASK( 9) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(10) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(11) +#define IB_SA_MCMEMBER_REC_SL IB_SA_COMP_MASK(12) +#define IB_SA_MCMEMBER_REC_FLOW_LABEL IB_SA_COMP_MASK(13) +#define IB_SA_MCMEMBER_REC_HOP_LIMIT IB_SA_COMP_MASK(14) +#define IB_SA_MCMEMBER_REC_SCOPE IB_SA_COMP_MASK(15) +#define IB_SA_MCMEMBER_REC_JOIN_STATE IB_SA_COMP_MASK(16) +#define IB_SA_MCMEMBER_REC_PROXY_JOIN IB_SA_COMP_MASK(17) + +struct ib_sa_mcmember_rec { + union ib_gid mgid; + union ib_gid port_gid; + u32 qkey; + u16 mlid; + u8 mtu_selector; + enum ib_mtu mtu; + u8 traffic_class; + u16 pkey; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 sl; + u32 flow_label; + u8 hop_limit; + u8 scope; + u8 join_state; + int proxy_join; +}; + +struct ib_sa_query; + +void ib_sa_cancel_query(int id, struct ib_sa_query *query); + +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, 
+			     void (*callback)(int status,
+					      struct ib_sa_mcmember_rec *resp,
+					      void *context),
+			     void *context,
+			     struct ib_sa_query **query);
+
+/**
+ * ib_sa_mcmember_rec_set - Start an MCMember set query
+ * @device:device to send query on
+ * @port_num: port number to send query on
+ * @rec:MCMember Record to send in query
+ * @comp_mask:component mask to send in query
+ * @timeout_ms:time to wait for response
+ * @gfp_mask:GFP mask to use for internal allocations
+ * @callback:function called when query completes, times out or is
+ * canceled
+ * @context:opaque user context passed to callback
+ * @sa_query:query context, used to cancel query
+ *
+ * Send an MCMember Set query to the SA (eg to join a multicast
+ * group). The callback function will be called when the query
+ * completes (or fails); status is 0 for a successful response, -EINTR
+ * if the query is canceled, -ETIMEDOUT if the query timed out, or
+ * -EIO if an error occurred sending the query. The resp parameter of
+ * the callback is only valid if status is 0.
+ *
+ * If the return value of ib_sa_mcmember_rec_set() is negative, it is
+ * an error code. Otherwise it is a query ID that can be used to
+ * cancel the query.
+ */
+static inline int
+ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num,
+		       struct ib_sa_mcmember_rec *rec,
+		       ib_sa_comp_mask comp_mask,
+		       int timeout_ms, int gfp_mask,
+		       void (*callback)(int status,
+					struct ib_sa_mcmember_rec *resp,
+					void *context),
+		       void *context,
+		       struct ib_sa_query **query)
+{
+	return ib_sa_mcmember_rec_query(device, port_num,
+					IB_MGMT_METHOD_SET,
+					rec, comp_mask,
+					timeout_ms, gfp_mask, callback,
+					context, query);
+}
+
+/**
+ * ib_sa_mcmember_rec_delete - Start an MCMember delete query
+ * @device:device to send query on
+ * @port_num: port number to send query on
+ * @rec:MCMember Record to send in query
+ * @comp_mask:component mask to send in query
+ * @timeout_ms:time to wait for response
+ * @gfp_mask:GFP mask to use for internal allocations
+ * @callback:function called when query completes, times out or is
+ * canceled
+ * @context:opaque user context passed to callback
+ * @sa_query:query context, used to cancel query
+ *
+ * Send an MCMember Delete query to the SA (eg to leave a multicast
+ * group). The callback function will be called when the query
+ * completes (or fails); status is 0 for a successful response, -EINTR
+ * if the query is canceled, -ETIMEDOUT if the query timed out, or
+ * -EIO if an error occurred sending the query. The resp parameter of
+ * the callback is only valid if status is 0.
+ *
+ * If the return value of ib_sa_mcmember_rec_delete() is negative, it
+ * is an error code. Otherwise it is a query ID that can be used to
+ * cancel the query.
+ */ +static inline int +ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + + +#endif /* IB_SA_H */ From roland at topspin.com Mon Dec 13 10:09:25 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:25 -0800 Subject: [openib-general] [PATCH][v3][5/21] Add InfiniBand MAD (management datagram) support In-Reply-To: <20041213109.kvbhOIc6xDgg0Bag@topspin.com> Message-ID: <20041213109.3tK6alRLJABxH4bu@topspin.com> Add support for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. This is required for an SM (subnet manager) to discover and configure the host, since the SM's query MADs must be passed to the local SMA (subnet management agent). In addition, this support is used by upper level protocols to send queries to and receive responses from the SM. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-13 09:44:40.305787613 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-13 09:44:42.966395657 -0800 @@ -1,7 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include obj-$(CONFIG_INFINIBAND) += \ - ib_core.o + ib_core.o \ + ib_mad.o ib_core-objs := \ packer.o \ @@ -11,3 +12,8 @@ device.o \ fmr_pool.o \ cache.o + +ib_mad-objs := \ + mad.o \ + smi.o \ + agent.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.c 2004-12-13 09:44:43.016388292 -0800 @@ -0,0 +1,386 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+*/ + +#include + +#include + +#include + +#include "smi.h" +#include "agent_priv.h" +#include "mad_priv.h" + + +spinlock_t ib_agent_port_list_lock; +static LIST_HEAD(ib_agent_port_list); + +extern kmem_cache_t *ib_mad_cache; + + +/* + * Caller must hold ib_agent_port_list_lock + */ +static inline struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->dr_smp_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if ((entry->dr_smp_agent == mad_agent) || + (entry->lr_smp_agent == mad_agent) || + (entry->perf_mgmt_agent == mad_agent)) + return entry; + } + } + return NULL; +} + +static inline struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + entry = __ib_get_agent_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return entry; +} + +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, + struct ib_mad_private *mad_priv, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_agent_send_wr *agent_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); + if (!agent_send_wr) + goto out; + agent_send_wr->mad = mad_priv; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad_priv->mad, + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad_priv->mad); + gather_list.lkey = (*port_priv->mr).lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + ah_attr.ah_flags = 0; /* No GRH */ + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? 
*/ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(ah_attr.grh.dgid)); + } + } + + agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(agent_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(agent_send_wr); + goto out; + } + + send_wr.wr.ud.ah = agent_send_wr->ah; + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + } else { /* for SMPs */ + send_wr.wr.ud.pkey_index = 0; + send_wr.wr.ud.remote_qkey = 0; + } + send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)agent_send_wr; + + pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); + } else { + list_add_tail(&agent_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); + return 1; + } + + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad.mad.mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; + } + + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); +} + +static void agent_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_agent_port_private *port_priv; + struct ib_agent_send_wr *agent_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_agent_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(agent_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(agent_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); + kfree(agent_send_wr); +} + +int ib_agent_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct 
ib_agent_port_private *port_priv; + struct ib_mad_reg_req reg_req; + unsigned long flags; + + /* First, check if port already open for SMI */ + port_priv = ib_get_agent_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + /* Obtain MAD agent for directed route SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; + reg_req.mgmt_class_version = 1; + + port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + + if (IS_ERR(port_priv->dr_smp_agent)) { + ret = PTR_ERR(port_priv->dr_smp_agent); + goto error2; + } + + /* Obtain MAD agent for LID routed SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->lr_smp_agent)) { + ret = PTR_ERR(port_priv->lr_smp_agent); + goto error3; + } + + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->perf_mgmt_agent)) { + ret = PTR_ERR(port_priv->perf_mgmt_agent); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_agent_port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return 0; + +error5: + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); +error4: + ib_unregister_mad_agent(port_priv->lr_smp_agent); +error3: + ib_unregister_mad_agent(port_priv->dr_smp_agent); +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_agent_port_close(struct ib_device *device, int port_num) +{ + struct ib_agent_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + port_priv = __ib_get_agent_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + ib_dereg_mr(port_priv->mr); + + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); + ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->dr_smp_agent); + kfree(port_priv); + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.h 2004-12-13 09:44:43.070380338 -0800 @@ -0,0 +1,42 @@ +/* + This software is available to you under a choice of one of two + licenses. 
You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern spinlock_t ib_agent_port_list_lock; + +extern int ib_agent_port_open(struct ib_device *device, + int port_num); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent_priv.h 2004-12-13 09:44:43.096376508 -0800 @@ -0,0 +1,51 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+*/ + +#ifndef __IB_AGENT_PRIV_H__ +#define __IB_AGENT_PRIV_H__ + +#include + +#define SPFX "ib_agent: " + +struct ib_agent_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_agent_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *dr_smp_agent; /* DR SM class */ + struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ + struct ib_mr *mr; +}; + +#endif /* __IB_AGENT_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad.c 2004-12-13 09:44:42.990392121 -0800 @@ -0,0 +1,2593 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * Maintained by: vtrmaint1 at voltaire.com + * + * This program is intended for the purpose of Infiniband + * protocol stack for Linux Servers. + * + * This software program is free software and you are free to modifyi + * and/or redistribute it under a choice of one of the following two + * licenses: + * + * 1) under either the GNU General Public License (GPL) Version 2, June 1991, + * a copy of which is in the file LICENSE_GPL_V2.txt in the root directory. + * This GPL license is also available from the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA, or on the + * web at http://www.fsf.org/copyleft/gpl.html + * + * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, on the web at + * http://www.opensource.org/licenses/bsd-license.php. + * + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * + * + * To obtain a copy of these licenses, the source code to this software or + * for other questions, you may write to Voltaire, Inc., + * Attention: Voltaire openSource maintainer, + * Voltaire, Inc. 54 Middlesex Turnpike Bedford, MA 01730 or + * by Email: vtrmaint1 at voltaire.com + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +#include +#include + +#include + +#include "mad_priv.h" +#include "smi.h" +#include "agent.h" + + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("kernel IB MAD API"); +MODULE_AUTHOR("Hal Rosenstock"); +MODULE_AUTHOR("Sean Hefty"); + + +kmem_cache_t *ib_mad_cache; +static struct list_head ib_mad_port_list; +static u32 ib_mad_client_id = 0; + +/* Port list lock */ +static spinlock_t ib_mad_port_list_lock; + + +/* Forward declarations */ +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); +static void timeout_sends(void *data); +static int solicited_mad(struct ib_mad *mad); +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class); +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv); + +/* + * Returns a ib_mad_port_private structure or NULL for a device/port + * Assumes ib_mad_port_list_lock is being held + */ +static inline struct ib_mad_port_private * +__ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + + list_for_each_entry(entry, &ib_mad_port_list, port_list) { + if (entry->device == device && entry->port_num == port_num) + return entry; + } + return NULL; +} + +/* + * Wrapper function to return a ib_mad_port_private structure or NULL + * for a device/port + */ +static inline struct ib_mad_port_private * +ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + entry = __ib_get_mad_port(device, port_num); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + return entry; +} + +static inline u8 convert_mgmt_class(u8 mgmt_class) +{ + /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ + return mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? 
+ 0 : mgmt_class; +} + +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + +static int vendor_class_index(u8 mgmt_class) +{ + return mgmt_class - IB_MGMT_CLASS_VENDOR_RANGE2_START; +} + +static int is_vendor_class(u8 mgmt_class) +{ + if ((mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START) || + (mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + return 1; +} + +static int is_vendor_oui(char *oui) +{ + if (oui[0] || oui[1] || oui[2]) + return 1; + return 0; +} + +static int is_vendor_method_in_use( + struct ib_mad_mgmt_vendor_class *vendor_class, + struct ib_mad_reg_req *mad_reg_req) +{ + struct ib_mad_mgmt_method_table *method; + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) { + if (!memcmp(vendor_class->oui[i], mad_reg_req->oui, 3)) { + method = vendor_class->method_table[i]; + if (method) { + if (method_in_use(&method, mad_reg_req)) + return 1; + else + break; + } + } + } + return 0; +} + +/* + * ib_register_mad_agent - Register to send/receive MADs + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret = ERR_PTR(-EINVAL); + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_reg_req *reg_req = NULL; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_mad_mgmt_method_table *method; + int ret2, qpn; + unsigned long flags; + u8 mgmt_class, vclass; + + /* Validate parameters */ + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) + goto error1; + + if (rmpp_version) + goto error1; /* XXX: until RMPP implemented */ + + /* Validate MAD registration request if supplied */ + if (mad_reg_req) { + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) + goto error1; + if (!recv_handler) + goto error1; + if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { + /* + * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only + * one in this range currently allowed + */ + if (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + goto error1; + } else if (mad_reg_req->mgmt_class == 0) { + /* + * Class 0 is reserved in IBA and is used for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ + goto error1; + } else if (is_vendor_class(mad_reg_req->mgmt_class)) { + /* + * If class is in "new" vendor range, + * ensure supplied OUI is not zero + */ + if (!is_vendor_oui(mad_reg_req->oui)) + goto error1; + } + /* Make sure class supplied is consistent with QP type */ + if (qp_type == IB_QPT_SMI) { + if ((mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_LID_ROUTED) && + (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } else { + if ((mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } + } else { + /* No registration request supplied */ + if (!send_handler) + goto error1; + } + + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + /* Allocate structures */ + mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + if (!mad_agent_priv) { + ret = ERR_PTR(-ENOMEM); + goto 
error1; + } + + if (mad_reg_req) { + reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); + if (!reg_req) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } + /* Make a copy of the MAD registration request */ + memcpy(reg_req, mad_reg_req, sizeof *reg_req); + } + + /* Now, fill in the various structures */ + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; + mad_agent_priv->reg_req = reg_req; + mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_agent_priv->agent.port_num = port_num; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) { + class = port_priv->version[mad_reg_req-> + mgmt_class_version].class; + if (class) { + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, + mad_reg_req)) + goto error3; + } + } + ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, + mgmt_class); + } else { + /* "New" vendor class range */ + vendor = port_priv->version[mad_reg_req-> + mgmt_class_version].vendor; + if (vendor) { + vclass = vendor_class_index(mgmt_class); + vendor_class = vendor->vendor_class[vclass]; + if (vendor_class) { + if (is_vendor_method_in_use( + vendor_class, + mad_reg_req)) + goto error3; + } + } + ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); + } + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; + } + } + + /* Add mad agent into port's agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + + return &mad_agent_priv->agent; + +error3: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + kfree(reg_req); +error2: + kfree(mad_agent_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_agent); + +static inline int is_snooping_sends(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (/*IB_MAD_SNOOP_POSTED_SENDS | + IB_MAD_SNOOP_RMPP_SENDS |*/ + IB_MAD_SNOOP_SEND_COMPLETIONS /*| + IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/)); +} + +static inline int is_snooping_recvs(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (IB_MAD_SNOOP_RECVS /*| + IB_MAD_SNOOP_RMPP_RECVS*/)); +} + +static int register_snoop_agent(struct ib_mad_qp_info *qp_info, + struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_snoop_private **new_snoop_table; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + /* Check for empty slot in array. */ + for (i = 0; i < qp_info->snoop_table_size; i++) + if (!qp_info->snoop_table[i]) + break; + + if (i == qp_info->snoop_table_size) { + /* Grow table. 
*/ + new_snoop_table = kmalloc(sizeof mad_snoop_priv * + qp_info->snoop_table_size + 1, + GFP_ATOMIC); + if (!new_snoop_table) { + i = -ENOMEM; + goto out; + } + if (qp_info->snoop_table) { + memcpy(new_snoop_table, qp_info->snoop_table, + sizeof mad_snoop_priv * + qp_info->snoop_table_size); + kfree(qp_info->snoop_table); + } + qp_info->snoop_table = new_snoop_table; + qp_info->snoop_table_size++; + } + qp_info->snoop_table[i] = mad_snoop_priv; + atomic_inc(&qp_info->snoop_count); +out: + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + return i; +} + +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_snoop_private *mad_snoop_priv; + int qpn; + + /* Validate parameters */ + if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) || + (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + /* Allocate structures */ + mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + if (!mad_snoop_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + /* Now, fill in the various structures */ + memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); + mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; + mad_snoop_priv->agent.device = device; + mad_snoop_priv->agent.recv_handler = recv_handler; + mad_snoop_priv->agent.snoop_handler = snoop_handler; + mad_snoop_priv->agent.context = context; + mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_snoop_priv->agent.port_num = port_num; + mad_snoop_priv->mad_snoop_flags = mad_snoop_flags; + init_waitqueue_head(&mad_snoop_priv->wait); + mad_snoop_priv->snoop_index = register_snoop_agent( + &port_priv->qp_info[qpn], + mad_snoop_priv); + if (mad_snoop_priv->snoop_index < 0) { + ret = ERR_PTR(mad_snoop_priv->snoop_index); + goto error2; + } + + atomic_set(&mad_snoop_priv->refcount, 1); + return &mad_snoop_priv->agent; + +error2: + kfree(mad_snoop_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_snoop); + +static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + /* Note that we could still be handling received MADs */ + + /* + * Canceling all sends results in dropping received response + * MADs, preventing us from queuing additional work + */ + cancel_mads(mad_agent_priv); + + port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->timed_work); + flush_workqueue(port_priv->wq); + + spin_lock_irqsave(&port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + /* XXX: Cleanup pending RMPP receives for this agent */ + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); +} + +static void unregister_mad_snoop(struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_qp_info *qp_info; + unsigned long flags; + + qp_info = 
mad_snoop_priv->qp_info; + spin_lock_irqsave(&qp_info->snoop_lock, flags); + qp_info->snoop_table[mad_snoop_priv->snoop_index] = NULL; + atomic_dec(&qp_info->snoop_count); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + + atomic_dec(&mad_snoop_priv->refcount); + wait_event(mad_snoop_priv->wait, + !atomic_read(&mad_snoop_priv->refcount)); + + kfree(mad_snoop_priv); +} + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_snoop_private *mad_snoop_priv; + + /* If the TID is zero, the agent can only snoop. */ + if (mad_agent->hi_tid) { + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + unregister_mad_agent(mad_agent_priv); + } else { + mad_snoop_priv = container_of(mad_agent, + struct ib_mad_snoop_private, + agent); + unregister_mad_snoop(mad_snoop_priv); + } + return 0; +} +EXPORT_SYMBOL(ib_unregister_mad_agent); + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void snoop_send(struct ib_mad_qp_info *qp_info, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, + send_wr, mad_send_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +static void snoop_recv(struct ib_mad_qp_info *qp_info, + struct ib_mad_recv_wc *mad_recv_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.recv_handler(&mad_snoop_priv->agent, + mad_recv_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent_private *mad_agent_priv, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret; + struct ib_mad_private *mad_priv; + struct ib_mad_send_wc mad_send_wc; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + + if (!smi_handle_dr_smp_send(smp, 
device->node_type, port_num)) { + ret = -EINVAL; + printk(KERN_ERR PFX "Invalid directed route\n"); + goto out; + } + /* Check to post send on QP or process locally */ + ret = smi_check_local_dr_smp(smp, device, port_num); + if (!ret || !device->process_mad) + goto out; + + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + goto out; + } + ret = device->process_mad(device, 0, port_num, smp->dr_slid, + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) { + struct ib_wc wc; + + /* + * Defined behavior is to complete response + * before request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_wc.recv_buf.list); + mad_priv->header.recv_wc.recv_buf.grh = NULL; + mad_priv->header.recv_wc.recv_buf.mad = + &mad_priv->mad.mad; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &mad_priv->header.recv_wc); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: + kmem_cache_free(ib_mad_cache, mad_priv); + ret = 0; + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + ret = -EINVAL; + goto out; + } + + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, send_wr, &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + ret = 1; +out: + return ret; +} + +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if (qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + 
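+		/* Send queue already has max_active requests posted; hold this one on the overflow list for now. */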
spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; + } + return ret; +} + +/* + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; + + /* Validate supplied parameters */ + if (!bad_send_wr) + goto error1; + + if (!mad_agent || !send_wr) + goto error2; + + if (!mad_agent->send_handler) + goto error2; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + /* Walk list of send WRs and post each on send list */ + while (send_wr) { + unsigned long flags; + struct ib_send_wr *next_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + + /* Validate more parameters */ + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + + if (!send_wr->wr.ud.mad_hdr) { + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", send_wr); + goto error2; + } + + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion + */ + next_send_wr = (struct ib_send_wr *)send_wr->next; + + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent_priv, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ + goto next; + } + + /* Allocate MAD send WR tracking structure */ + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_send_wr) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_send_wr_private\n"); + ret = -ENOMEM; + goto error2; + } + + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; + mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; + mad_send_wr->agent = mad_agent; + /* Timeout will be updated after send completes */ + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. 
+ ud.timeout_ms); + mad_send_wr->retry = 0; + /* One reference for each work request to QP + response */ + mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes */ + atomic_inc(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + ret = ib_send_mad(mad_agent_priv, mad_send_wr); + if (ret) { + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + goto error2; + } +next: + send_wr = next_send_wr; + } + return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return ret; +} +EXPORT_SYMBOL(ib_post_send_mad); + +/* + * ib_free_recv_mad - Returns data buffers used to receive + * a MAD to the access layer + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *entry; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *priv; + + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, header); + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { + /* Free previous receive buffer */ + kmem_cache_free(ib_mad_cache, priv); + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + } + + /* Free last buffer */ + kmem_cache_free(ib_mad_cache, priv); +} +EXPORT_SYMBOL(ib_free_recv_mad); + +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf) +{ + printk(KERN_ERR PFX "ib_coalesce_recv_mad() not implemented yet\n"); +} +EXPORT_SYMBOL(ib_coalesce_recv_mad); + +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + return ERR_PTR(-EINVAL); /* XXX: for now */ +} +EXPORT_SYMBOL(ib_redirect_mad_qp); + +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc) +{ + printk(KERN_ERR PFX "ib_process_mad_wc() not implemented yet\n"); + return 0; +} +EXPORT_SYMBOL(ib_process_mad_wc); + +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req) +{ + int i; + + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + if ((*method)->agent[i]) { + printk(KERN_ERR PFX "Method %d already in use\n", i); + return -EINVAL; + } + } + return 0; +} + +static int allocate_method_table(struct ib_mad_mgmt_method_table **method) +{ + /* Allocate management method table */ + *method = kmalloc(sizeof **method, GFP_ATOMIC); + if (!*method) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); + return -ENOMEM; + } + /* Clear management method table */ + memset(*method, 0, sizeof **method); + + return 0; +} + +/* + * Check to see if there are any methods still in use + */ +static int check_method_table(struct ib_mad_mgmt_method_table *method) +{ + int i; + + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) + if (method->agent[i]) + return 
1; + return 0; +} + +/* + * Check to see if there are any method tables for this class still in use + */ +static int check_class_table(struct ib_mad_mgmt_class_table *class) +{ + int i; + + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; +} + +static int check_vendor_class(struct ib_mad_mgmt_vendor_class *vendor_class) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + if (vendor_class->method_table[i]) + return 1; + return 0; +} + +static int find_vendor_oui(struct ib_mad_mgmt_vendor_class *vendor_class, + char *oui) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + /* Is there matching OUI for this vendor class ? */ + if (!memcmp(vendor_class->oui[i], oui, 3)) + return i; + + return -1; +} + +static int check_vendor_table(struct ib_mad_mgmt_vendor_class_table *vendor) +{ + int i; + + for (i = 0; i < MAX_MGMT_VENDOR_RANGE2; i++) + if (vendor->vendor_class[i]) + return 1; + + return 0; +} + +static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, + struct ib_mad_agent_private *agent) +{ + int i; + + /* Remove any methods for this mad agent */ + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + if (method->agent[i] == agent) { + method->agent[i] = NULL; + } + } +} + +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table **class; + struct ib_mad_mgmt_method_table **method; + int i, ret; + + port_priv = agent_priv->qp_info->port_priv; + class = &port_priv->version[mad_reg_req->mgmt_class_version].class; + if (!*class) { + /* Allocate management class table for "new" class version */ + *class = kmalloc(sizeof **class, GFP_ATOMIC); + if (!*class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; + goto error1; + } + /* Clear management class table */ + memset(*class, 0, sizeof(**class)); + /* Allocate method table for this management class */ + method = &(*class)->method_table[mgmt_class]; + if ((ret = allocate_method_table(method))) + goto error2; + } else { + method = &(*class)->method_table[mgmt_class]; + if (!*method) { + /* Allocate method table for this management class */ + if ((ret = allocate_method_table(method))) + goto error1; + } + } + + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error3; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error3: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; + goto error1; +error2: + kfree(*class); + *class = NULL; +error1: + return ret; +} + +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_vendor_class_table **vendor_table; + struct ib_mad_mgmt_vendor_class_table *vendor = NULL; + struct ib_mad_mgmt_vendor_class *vendor_class = NULL; + struct ib_mad_mgmt_method_table **method; + int i, ret = -ENOMEM; + u8 vclass; + + /* "New" vendor (with OUI) 
class */ + vclass = vendor_class_index(mad_reg_req->mgmt_class); + port_priv = agent_priv->qp_info->port_priv; + vendor_table = &port_priv->version[ + mad_reg_req->mgmt_class_version].vendor; + if (!*vendor_table) { + /* Allocate mgmt vendor class table for "new" class version */ + vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + if (!vendor) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class_table\n"); + goto error1; + } + /* Clear management vendor class table */ + memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; + } + if (!(*vendor_table)->vendor_class[vclass]) { + /* Allocate table for this management vendor class */ + vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + if (!vendor_class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class\n"); + goto error2; + } + memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* Is there matching OUI for this vendor class ? */ + if (!memcmp((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3)) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(!*method); + goto check_in_use; + } + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* OUI slot available ? */ + if (!is_vendor_oui((*vendor_table)->vendor_class[ + vclass]->oui[i])) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(*method); + /* Allocate method table for this OUI */ + if ((ret = allocate_method_table(method))) + goto error3; + memcpy((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3); + goto check_in_use; + } + } + printk(KERN_ERR PFX "All OUI slots in use\n"); + goto error3; + +check_in_use: + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error4; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error4: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; +error3: + if (vendor_class) { + (*vendor_table)->vendor_class[vclass] = NULL; + kfree(vendor_class); + } +error2: + if (vendor) { + *vendor_table = NULL; + kfree(vendor); + } +error1: + return ret; +} + +static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + int index; + u8 mgmt_class; + + /* + * Was MAD registration request supplied + * with original registration ? 
+ */ + if (!agent_priv->reg_req) { + goto out; + } + + port_priv = agent_priv->qp_info->port_priv; + class = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].class; + if (!class) + goto vendor_check; + + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* Now, check to see if there are any methods still in use */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + class->method_table[mgmt_class] = NULL; + /* Any management classes left ? */ + if (!check_class_table(class)) { + /* If not, release management class table */ + kfree(class); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version].class = NULL; + } + } + } + +vendor_check: + vendor = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) + goto out; + + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); + vendor_class = vendor->vendor_class[mgmt_class]; + if (vendor_class) { + index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* + * Now, check to see if there are + * any methods still in use + */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + vendor_class->method_table[index] = NULL; + memset(vendor_class->oui[index], 0, 3); + /* Any OUIs left ? */ + if (!check_vendor_class(vendor_class)) { + /* If not, release vendor class table */ + kfree(vendor_class); + vendor->vendor_class[mgmt_class] = NULL; + /* Any other vendor classes left ? */ + if (!check_vendor_table(vendor)) { + kfree(vendor); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version]. + vendor = NULL; + } + } + } + } + } + +out: + return; +} + +static int response_mad(struct ib_mad *mad) +{ + /* Trap represses are responses although response bit is reset */ + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); +} + +static int solicited_mad(struct ib_mad *mad) +{ + /* CM MADs are never solicited */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { + return 0; + } + + /* XXX: Determine whether MAD is using RMPP */ + + /* Not using RMPP */ + /* Is this MAD a response to a previous MAD ? */ + return response_mad(mad); +} + +static struct ib_mad_agent_private * +find_mad_agent(struct ib_mad_port_private *port_priv, + struct ib_mad *mad, + int solicited) +{ + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ + if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
+ */ + hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { + if (entry->agent.hi_tid == hi_tid) { + mad_agent = entry; + break; + } + } + } else { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_vendor_mad *vendor_mad; + int index; + + /* + * Routing is based on version, class, and method + * For "newer" vendor MADs, also based on OUI + */ + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) + goto out; + if (!is_vendor_class(mad->mad_hdr.mgmt_class)) { + class = port_priv->version[ + mad->mad_hdr.class_version].class; + if (!class) + goto out; + method = class->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (method) + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } else { + vendor = port_priv->version[ + mad->mad_hdr.class_version].vendor; + if (!vendor) + goto out; + vendor_class = vendor->vendor_class[vendor_class_index( + mad->mad_hdr.mgmt_class)]; + if (!vendor_class) + goto out; + /* Find matching OUI */ + vendor_mad = (struct ib_vendor_mad *)mad; + index = find_vendor_oui(vendor_class, vendor_mad->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + } + } + + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + printk(KERN_NOTICE PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; + } + } +out: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + return mad_agent; +} + +static int validate_mad(struct ib_mad *mad, u32 qp_num) +{ + int valid = 0; + + /* Make sure MAD base version is understood */ + if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { + printk(KERN_ERR PFX "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + goto out; + } + + /* Filter SMI packets sent to other than QP0 */ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { + if (qp_num == 0) + valid = 1; + } else { + /* Filter GSI packets sent to QP0 */ + if (qp_num != 0) + valid = 1; + } + +out: + return valid; +} + +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet + */ +static struct ib_mad_private * +reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->tid == tid) + return mad_send_wr; + } + + /* + * It's possible to receive the response before we've + * been notified that the send has completed + */ + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + /* Verify request has not been canceled */ + return (mad_send_wr->status == IB_WC_SUCCESS) ? 
+ mad_send_wr : NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + + /* Complete corresponding request */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + ib_free_recv_mad(&recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + /* Timeout = 0 means that we won't wait for a response */ + mad_send_wr->timeout = 0; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Defined behavior is to complete response before request */ + recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + +static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_qp_info *qp_info; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv, *response; + struct ib_mad_list_head *mad_list; + struct ib_mad_agent_private *mad_agent; + int solicited; + + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); + dma_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + + /* Setup MAD receive work completion from "normal" work completion */ + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); + recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; + recv->header.recv_wc.recv_buf.grh = &recv->grh; + + if (atomic_read(&qp_info->snoop_count)) + snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); + + /* Validate MAD */ + if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) + goto out; + + if (recv->mad.mad.mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) + goto out; + if (!smi_check_forward_dr_smp(&recv->mad.smp)) + goto local; + if (!smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(&recv->mad.smp, 
+ port_priv->device, + port_priv->port_num)) + goto out; + } + +local: + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + int ret; + + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* + * Is it better to assume that + * it wouldn't be processed ? + */ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + &recv->mad.mad, + &response->mad.mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; + if (ret & IB_MAD_RESULT_REPLY) { + /* Send response */ + if (!agent_send(response, &recv->grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; + goto out; + } + } + } + + /* Determine corresponding MAD agent for incoming receive MAD */ + solicited = solicited_mad(&recv->mad.mad); + mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); + if (mad_agent) { + ib_mad_complete_recv(mad_agent, recv, solicited); + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv() + */ + recv = NULL; + } + +out: + /* Post another receive request for this QP */ + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); +} + +static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long delay; + + if (list_empty(&mad_agent_priv->wait_list)) { + cancel_delayed_work(&mad_agent_priv->timed_work); + } else { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { + mad_agent_priv->timeout = mad_send_wr->timeout; + cancel_delayed_work(&mad_agent_priv->timed_work); + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + } + } +} + +static void wait_for_response(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr ) +{ + struct ib_mad_send_wr_private *temp_mad_send_wr; + struct list_head *list_item; + unsigned long delay; + + list_del(&mad_send_wr->agent_list); + + delay = mad_send_wr->timeout; + mad_send_wr->timeout += jiffies; + + list_for_each_prev(list_item, &mad_agent_priv->wait_list) { + temp_mad_send_wr = list_entry(list_item, + struct ib_mad_send_wr_private, + agent_list); + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) + break; + } + list_add(&mad_send_wr->agent_list, list_item); + + /* Reschedule a work item if we have a shorter timeout */ + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { + cancel_delayed_work(&mad_agent_priv->timed_work); + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->timed_work, delay); + } +} + +/* + * Process a send work completion + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = mad_send_wc->status; + 
mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + + if (--mad_send_wr->refcount > 0) { + if (mad_send_wr->refcount == 1 && mad_send_wr->timeout && + mad_send_wr->status == IB_WC_SUCCESS) { + wait_for_response(mad_agent_priv, mad_send_wr); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + return; + } + + /* Remove send from MAD agent and notify client of completion */ + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); + + /* Release reference on agent taken when sending */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + +static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send */ + wc->wr_id = mad_send_wr->wr_id; + if (atomic_read(&qp_info->snoop_count)) + snoop_send(qp_info, &mad_send_wr->send_wr, + (struct ib_mad_send_wc *)wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } +} + +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup 
+ */ + return; + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + struct ib_qp_attr *attr; + + /* Transition QP to RTS and fail offending send */ + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } + ib_mad_send_done_handler(port_priv, wc); + } +} + +/* + * IB MAD completion callback + */ +static void ib_mad_completion_handler(void *data) +{ + struct ib_mad_port_private *port_priv; + struct ib_wc wc; + + port_priv = (struct ib_mad_port_private *)data; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); + } +} + +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_list) { + if (mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + } + + /* Empty wait list to prevent receives from finding a request */ + list_splice_init(&mad_agent_priv->wait_list, &cancel_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Report all cancelled requests */ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_list) { + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_list); + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + } +} + +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = 
container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + + if (mad_send_wr->refcount != 0) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + +out: + return; +} +EXPORT_SYMBOL(ib_cancel_mad); + +static void timeout_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags, delay; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; + mad_send_wc.vendor_err = 0; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->wait_list)) { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_send_wr->timeout, jiffies)) { + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + break; + } + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void ib_mad_thread_completion_handler(struct ib_cq *cq) +{ + struct ib_mad_port_private *port_priv = cq->cq_context; + + queue_work(port_priv->wq, &port_priv->work); +} + +/* + * Allocate receive MADs and post receive WRs for them + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) +{ + unsigned long flags; + int post, ret; + struct ib_mad_private *mad_priv; + struct ib_sge sg_list; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; + + /* Initialize common scatter list fields */ + sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; + + /* Initialize common receive WR fields */ + recv_wr.next = NULL; + recv_wr.sg_list = &sg_list; + recv_wr.num_sge = 1; + recv_wr.recv_flags = IB_RECV_SIGNALED; + + do { + /* Allocate and map receive buffer */ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = dma_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&mad_priv->header, 
mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + + /* Post receive WR */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); + if (ret) { + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); + break; + } + } while (post); + + return ret; +} + +/* + * Return all the posted receive MADs + */ +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; + + while (!list_empty(&qp_info->recv_queue.list)) { + + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); + + /* Undo PCI mapping */ + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, recv); + } + + qp_info->recv_queue.count = 0; +} + +/* + * Start the port + */ +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +{ + int ret, i; + struct ib_qp_attr *attr; + struct ib_qp *qp; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); + return -ENOMEM; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 
0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "INIT: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTR: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTS: %d\n", i, ret); + goto out; + } + } + + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion " + "notification: %d\n", ret); + goto out; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); + goto out; + } + } +out: + kfree(attr); + return ret; +} + +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! */ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) +{ + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); + spin_lock_init(&qp_info->snoop_lock); + qp_info->snoop_table = NULL; + qp_info->snoop_table_size = 0; + atomic_set(&qp_info->snoop_count, 0); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = qp_info->port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + /* Use minimum queue sizes unless the CQ is resized */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); + if (qp_info->snoop_table) + kfree(qp_info->snoop_table); +} + +/* + * Open the port + * Create the QP, PD, MR, and CQ if needed + */ +static int ib_mad_port_open(struct ib_device *device, + int port_num) +{ + int ret, cq_size; + struct ib_mad_port_private *port_priv; + 
unsigned long flags; + char name[sizeof "ib_mad123"]; + + /* First, check if port already open at MAD layer */ + port_priv = ib_get_mad_port(device, port_num); + if (port_priv) { + printk(KERN_DEBUG PFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); + return -ENOMEM; + } + memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); + + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + port_priv->cq = ib_create_cq(port_priv->device, + (ib_comp_handler) + ib_mad_thread_completion_handler, + NULL, port_priv, cq_size); + if (IS_ERR(port_priv->cq)) { + printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); + ret = PTR_ERR(port_priv->cq); + goto error3; + } + + port_priv->pd = ib_alloc_pd(device); + if (IS_ERR(port_priv->pd)) { + printk(KERN_ERR PFX "Couldn't create ib_mad PD\n"); + ret = PTR_ERR(port_priv->pd); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR PFX "Couldn't get ib_mad DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; + + snprintf(name, sizeof name, "ib_mad%d", port_num); + port_priv->wq = create_singlethread_workqueue(name); + if (!port_priv->wq) { + ret = -ENOMEM; + goto error8; + } + INIT_WORK(&port_priv->work, ib_mad_completion_handler, port_priv); + + ret = ib_mad_port_start(port_priv); + if (ret) { + printk(KERN_ERR PFX "Couldn't start port\n"); + goto error9; + } + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_mad_port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + return 0; + +error9: + destroy_workqueue(port_priv->wq); +error8: + destroy_mad_qp(&port_priv->qp_info[1]); +error7: + destroy_mad_qp(&port_priv->qp_info[0]); +error6: + ib_dereg_mr(port_priv->mr); +error5: + ib_dealloc_pd(port_priv->pd); +error4: + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); +error3: + kfree(port_priv); + + return ret; +} + +/* + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure + */ +static int ib_mad_port_close(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + port_priv = __ib_get_mad_port(device, port_num); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + printk(KERN_ERR PFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + /* Stop processing completions. 
*/ + flush_workqueue(port_priv->wq); + destroy_workqueue(port_priv->wq); + destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); + ib_dereg_mr(port_priv->mr); + ib_dealloc_pd(port_priv->pd); + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); + /* XXX: Handle deallocation of MAD registration tables */ + + kfree(port_priv); + + return 0; +} + +static void ib_mad_init_device(struct ib_device *device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + ret = ib_agent_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", + device->name, cur_port); + goto error_device_open; + } + } + + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_mad_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client mad_client = { + .name = "mad", + .add = ib_mad_init_device, + .remove = ib_mad_remove_device +}; + +static int __init ib_mad_init_module(void) +{ + int ret; + + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + + ib_mad_cache = kmem_cache_create("ib_mad", + sizeof(struct ib_mad_private), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + if (!ib_mad_cache) { + printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); + ret = -ENOMEM; + goto error1; + } + + INIT_LIST_HEAD(&ib_mad_port_list); + + if (ib_register_client(&mad_client)) { + printk(KERN_ERR PFX "Couldn't register ib_mad client\n"); + ret = -EINVAL; + goto error2; + } + + return 0; + +error2: + kmem_cache_destroy(ib_mad_cache); +error1: + return ret; +} + +static void __exit ib_mad_cleanup_module(void) +{ + ib_unregister_client(&mad_client); + + if (kmem_cache_destroy(ib_mad_cache)) { + printk(KERN_DEBUG PFX "Failed to destroy ib_mad cache\n"); + } +} + +module_init(ib_mad_init_module); +module_exit(ib_mad_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad_priv.h 2004-12-13 09:44:43.123372531 -0800 @@ -0,0 +1,204 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. 
+ * Maintained by: vtrmaint1 at voltaire.com + * + * This program is intended for the purpose of the Infiniband + * protocol stack for Linux Servers. + * + * This software program is free software and you are free to modify + * and/or redistribute it under a choice of one of the following two + * licenses: + * + * 1) under either the GNU General Public License (GPL) Version 2, June 1991, + * a copy of which is in the file LICENSE_GPL_V2.txt in the root directory. + * This GPL license is also available from the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA, or on the + * web at http://www.fsf.org/copyleft/gpl.html + * + * OR + * + * 2) under the terms of "The BSD License", a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, on the web at + * http://www.opensource.org/licenses/bsd-license.php. + * + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * + * + * To obtain a copy of these licenses, the source code to this software or + * for other questions, you may write to Voltaire, Inc., + * Attention: Voltaire openSource maintainer, + * Voltaire, Inc. 54 Middlesex Turnpike Bedford, MA 01730 or + * by Email: vtrmaint1 at voltaire.com + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice and either one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +#ifndef __IB_MAD_PRIV_H__ +#define __IB_MAD_PRIV_H__ + +#include +#include +#include +#include +#include + + +#define PFX "ib_mad: " + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ + +/* QP and CQ parameters */ +#define IB_MAD_QP_SEND_SIZE 128 +#define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_SEND_REQ_MAX_SG 2 +#define IB_MAD_RECV_REQ_MAX_SG 1 + +#define IB_MAD_SEND_Q_PSN 0 + +/* Registration table sizes */ +#define MAX_MGMT_CLASS 80 +#define MAX_MGMT_VERSION 8 +#define MAX_MGMT_OUI 8 +#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 + +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; +}; + +struct ib_mad_private_header { + struct ib_mad_list_head mad_list; + struct ib_mad_recv_wc recv_wc; + DECLARE_PCI_UNMAP_ADDR(mapping) +} __attribute__ ((packed)); + +struct ib_mad_private { + struct ib_mad_private_header header; + struct ib_grh grh; + union { + struct ib_mad mad; + struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; + } mad; +} __attribute__ ((packed)); + +struct ib_mad_agent_private { + struct list_head agent_list; + struct ib_mad_agent agent; + struct ib_mad_reg_req *reg_req; + struct ib_mad_qp_info *qp_info; + + spinlock_t lock; + struct list_head send_list; + struct list_head wait_list; + struct work_struct timed_work; + unsigned long timeout; + + atomic_t refcount; + wait_queue_head_t wait; + u8 rmpp_version; +}; + +struct ib_mad_snoop_private { + struct ib_mad_agent agent; + struct ib_mad_qp_info *qp_info; + int snoop_index; + int mad_snoop_flags; + atomic_t refcount; + wait_queue_head_t wait; +}; + +struct ib_mad_send_wr_private { + struct ib_mad_list_head mad_list; + struct list_head agent_list; + struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; + unsigned long timeout; + int retry; + int refcount; + enum ib_wc_status status; +}; + +struct ib_mad_mgmt_method_table { + struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; +}; + +struct ib_mad_mgmt_class_table { + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; +}; + +struct ib_mad_mgmt_vendor_class { + u8 oui[MAX_MGMT_OUI][3]; + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_OUI]; +}; + +struct ib_mad_mgmt_vendor_class_table { + struct ib_mad_mgmt_vendor_class *vendor_class[MAX_MGMT_VENDOR_RANGE2]; +}; + +struct ib_mad_mgmt_version_table { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; +}; + +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + int max_active; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + struct list_head overflow_list; + spinlock_t snoop_lock; + struct ib_mad_snoop_private **snoop_table; + int snoop_table_size; + atomic_t snoop_count; +}; + +struct ib_mad_port_private { + struct list_head port_list; + struct ib_device *device; + int port_num; + struct ib_cq *cq; + struct ib_pd *pd; + struct ib_mr *mr; + + spinlock_t reg_lock; + struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; + struct list_head agent_list; + struct workqueue_struct *wq; + struct work_struct work; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; +}; + +#endif /* __IB_MAD_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.c 
2004-12-13 09:44:43.044384167 -0800 @@ -0,0 +1,222 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include + + +/* + * Fixup a directed route SMP for sending + * Return 0 if the SMP should be discarded + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- Fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer */ + return 0; + } +} + +/* + * Adjust information for a received SMP + * Return 0 if the SMP should be dropped + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && 
hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM */ + /* C14-13:5 -- Check for unreasonable hop pointer */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue + * Return 0 if the SMP should be completed up the stack + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.h 2004-12-13 09:44:43.148368849 -0800 @@ -0,0 +1,54 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. 
All rights reserved. +*/ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From roland at topspin.com Mon Dec 13 10:09:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:29 -0800 Subject: [openib-general] [PATCH][v3][7/21] Add Mellanox HCA low-level driver In-Reply-To: <20041213109.cVS0twN822l4xQbR@topspin.com> Message-ID: <20041213109.PeK9O1dDWBb3rThl@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API.) Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-13 09:44:40.230798660 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-13 09:44:43.936252779 -0800 @@ -8,4 +8,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +source "drivers/infiniband/hw/mthca/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-13 09:44:40.278791590 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-13 09:44:43.909256756 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-12-13 09:44:43.962248949 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver to produce a bunch of debug + messages. Select this if you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions.
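For anyone trying the driver out, the options above boil down to a .config fragment along these lines (an illustration only; the option names are the ones defined in the Kconfig hunk above, and the SSE doorbell code is left off since it only applies to 32-bit x86 builds):

CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_MTHCA_DEBUG=y
# CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL is not set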
--- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-12-13 09:44:43.990244825 -0800 @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ + mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ + mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ + mthca_provider.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-12-13 09:44:44.017240848 -0800 @@ -0,0 +1,168 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_allocator.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. 
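+ * (For reference, the cq_table and qp_table structures in mthca_dev.h pair a spinlock with their mthca_array, which is the kind of serialization meant here.)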
+ */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. */ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-12-13 09:44:44.041237312 -0800 @@ -0,0 +1,44 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_config_reg.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-12-13 09:44:44.069233188 -0800 @@ -0,0 +1,380 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_dev.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int 
num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; + struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp 
*ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-12-13 09:44:44.095229358 -0800 @@ -0,0 +1,112 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_doorbell.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. 
*/ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-12-13 09:44:44.121225529 -0800 @@ -0,0 +1,922 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_main.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) +{ + int err; + u8 status; + + err = mthca_QUERY_DEV_LIM(mdev, dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + if (dev_lim->min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim->min_page_sz, PAGE_SIZE); + return -ENODEV; + } + if (dev_lim->num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim->num_ports, MTHCA_MAX_PORTS); + return -ENODEV; + } + + mdev->limits.num_ports = dev_lim->num_ports; + mdev->limits.vl_cap = dev_lim->max_vl; + mdev->limits.mtu_cap = dev_lim->max_mtu; + mdev->limits.gid_table_len = dev_lim->max_gids; + mdev->limits.pkey_table_len = dev_lim->max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; + mdev->limits.max_sg = dev_lim->max_sg; + mdev->limits.reserved_qps = dev_lim->reserved_qps; + mdev->limits.reserved_srqs = 
dev_lim->reserved_srqs; + mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + mdev->limits.reserved_cqs = dev_lim->reserved_cqs; + mdev->limits.reserved_eqs = dev_lim->reserved_eqs; + mdev->limits.reserved_mtts = dev_lim->reserved_mtts; + mdev->limits.reserved_mrws = dev_lim->reserved_mrws; + mdev->limits.reserved_uars = dev_lim->reserved_uars; + mdev->limits.reserved_pds = dev_lim->reserved_pds; + + if (dev_lim->flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_ent, num_sg, fw_pages, cur_order; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + fw_pages = mdev->fw.arbel.fw_pages; + num_ent = 0; + /* + * We allocate in as big chunks as we can, up to a maximum of + * 256 KB per chunk. + */ + cur_order = get_order(1 << 18); + while (fw_pages > 0) { + while (1 << cur_order > fw_pages) + --cur_order; + + /* + * We allocate with GFP_HIGHUSER because only the + * firmware is going to touch these pages, so there's + * no need for a kernel virtual address. We use + * __GFP_NOWARN because we'll deal with any allocation + * failures ourselves. 
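+ * (A worked example, assuming 4 KB pages and no allocation failures: a 1000-page FW area would be carved into fifteen 64-page chunks, one 32-page chunk and one 8-page chunk -- 17 scatterlist entries in all.)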
+ */ + mdev->fw.arbel.mem[num_ent].page = alloc_pages(GFP_HIGHUSER | __GFP_NOWARN, + cur_order); + mdev->fw.arbel.mem[num_ent].length = PAGE_SIZE << cur_order; + if (!mdev->fw.arbel.mem[num_ent].page) { + --cur_order; + if (cur_order < 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } else { + ++num_ent; + fw_pages -= 1 << cur_order; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, num_ent, + PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + struct mthca_dev_lim dev_lim; + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + err = -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + 
mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. 
+ */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar2: + pci_release_region(pdev, 2); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. 
We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); From roland at topspin.com Mon Dec 13 10:09:30 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:30 -0800 Subject: [openib-general] [PATCH][v3][8/21] Add Mellanox HCA low-level driver (midlayer interface) In-Reply-To: <20041213109.PeK9O1dDWBb3rThl@topspin.com> Message-ID: <20041213109.tdJgFcyoYZx3znNi@topspin.com> Add midlayer interface code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-12-13 09:44:44.599155121 -0800 @@ -0,0 +1,622 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_provider.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = &dev->pdev->dev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-12-13 09:44:44.636149671 -0800 @@ -0,0 +1,214 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_provider.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef MTHCA_PROVIDER_H +#define MTHCA_PROVIDER_H + +#include +#include + +#define MTHCA_MPT_FLAG_ATOMIC (1 << 14) +#define MTHCA_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define MTHCA_MPT_FLAG_REMOTE_READ (1 << 12) +#define MTHCA_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define MTHCA_MPT_FLAG_LOCAL_READ (1 << 10) + +struct mthca_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct mthca_mr { + struct ib_mr ibmr; + int order; + u32 first_seg; +}; + +struct mthca_pd { + struct ib_pd ibpd; + u32 pd_num; + atomic_t sqp_count; + struct mthca_mr ntmr; +}; + +struct mthca_eq { + struct mthca_dev *dev; + int eqn; + u32 ecr_mask; + u16 msi_x_vector; + u16 msi_x_entry; + int have_irq; + int nent; + int cons_index; + struct mthca_buf_list *page_list; + struct mthca_mr mr; +}; + +struct mthca_av; + +struct mthca_ah { + struct ib_ah ibah; + int on_hca; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct mthca_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct mthca_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. + * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibilty to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized. 
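As an illustration of the access/destroy protocol spelled out above (not part of the patch: lookup_cq(), remove_cq() and the table type are hypothetical stand-ins for the driver's CQ table code, and error handling is omitted), a minimal sketch:

/* Minimal sketch of the completion-path / destroy-path protocol described
 * in the comment above.  Illustration only; lookup_cq()/remove_cq() and
 * struct mthca_cq_table are made-up stand-ins. */
static struct mthca_cq *cq_get_for_event(struct mthca_cq_table *table, int cqn)
{
        struct mthca_cq *cq;

        spin_lock(&table->lock);                /* lock cq_table and look up struct   */
        cq = lookup_cq(table, cqn);             /* hypothetical lookup helper          */
        if (cq)
                atomic_inc(&cq->refcount);      /* take a reference before unlocking   */
        spin_unlock(&table->lock);              /* drop cq_table lock                  */
        return cq;
}

static void cq_put(struct mthca_cq *cq)
{
        if (atomic_dec_and_test(&cq->refcount)) /* last user wakes a waiting destroyer */
                wake_up(&cq->wait);
}

static void cq_destroy(struct mthca_cq_table *table, struct mthca_cq *cq)
{
        spin_lock(&table->lock);
        remove_cq(table, cq);                   /* unhook the table's pointer           */
        spin_unlock(&table->lock);

        atomic_dec(&cq->refcount);              /* drop the reference that pointer held */
        wait_event(cq->wait, !atomic_read(&cq->refcount)); /* wait for in-flight users  */
}

The same pattern applies to QPs; the only difference is which table and wait queue are involved.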
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ From roland at topspin.com Mon Dec 13 10:09:32 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:32 -0800 Subject: [openib-general] [PATCH][v3][9/21] Add Mellanox HCA low-level driver (FW commands) In-Reply-To: <20041213109.tdJgFcyoYZx3znNi@topspin.com> Message-ID: <20041213109.NiBwdaLIPMmwHwiP@topspin.com> Add firmware command processing code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-12-13 09:44:45.011094434 -0800 @@ -0,0 +1,1562 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.c 1321 2004-12-10 19:38:54Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
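For reference, the "strict" values disabled just below appear to be the PRM's roughly 1 ms / 10 ms / 100 ms command classes converted to whole jiffies, rounded up with one extra jiffy of slack so a command started mid-jiffy still gets its full wait. A quick worked example (the HZ value is made up for illustration, not taken from the patch):

/* Illustrative arithmetic for the disabled strict time classes, assuming HZ = 250 (4 ms/jiffy): */
/*   CMD_TIME_CLASS_A = (250 + 999) / 1000 + 1 =  2 jiffies  (  8 ms for the  ~1 ms class)       */
/*   CMD_TIME_CLASS_B = (250 +  99) /  100 + 1 =  4 jiffies  ( 16 ms for the ~10 ms class)       */
/*   CMD_TIME_CLASS_C = (250 +   9) /   10 + 1 = 26 jiffies  (104 ms for the ~100 ms class)      */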
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* __raw_writel may not order writes. */ + wmb(); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = be32_to_cpu(__raw_readl(dev->hcr + HCR_STATUS_OFFSET)) >> 24; + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
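The alignment trick described in the comment above can be seen in isolation with a tiny standalone example (not part of the patch; the address and length are made up, and userspace ffs() stands in for the kernel's):

#include <stdio.h>
#include <strings.h>                    /* userspace ffs(); the kernel provides its own */

/* Illustration of the MAP_FA chunk-size computation used in the loop below. */
int main(void)
{
        unsigned int addr = 0x12340000; /* pretend sg_dma_address()            */
        unsigned int len  = 0x6000;     /* pretend sg_dma_len(): 24 KB          */
        int lg = ffs(addr | len) - 1;   /* lowest set bit of address | length   */

        /* lg == 13 here, so this entry would be passed to FW as three
         * naturally aligned 8 KB (1 << 13) chunks; lg < 12 would be
         * rejected because the FW area must be at least 4 KB aligned. */
        printf("log2 chunk size = %d, chunks = %u\n", lg, len >> lg);
        return 0;
}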
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more signifant bits than minor + * version, so swap here. + */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + /* + * Arbel page size is always 4 KB; round up number of + * system pages needed. 
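The rounding in the next statement is easier to read with numbers plugged in (the counts below are made up for illustration): the firmware reports its size in 4 KB units, so on systems whose pages are larger than 4 KB the count has to be rounded up to whole system pages.

/* Illustration, assuming PAGE_SHIFT = 14 (16 KB pages) and FW reporting 1030 x 4 KB pages: */
/*   fw_pages = (1030 + (1 << (14 - 12)) - 1) >> (14 - 12)                                  */
/*            = (1030 + 3) >> 2 = 258 system pages  (covers 1032 x 4 KB of firmware area)   */
/* With 4 KB system pages (PAGE_SHIFT = 12) the expression leaves fw_pages unchanged.       */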
+ */ + dev->fw.arbel.fw_pages = + (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> + (PAGE_SHIFT - 12); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_RSZ_SRQ_OFFSET 0x33 +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_SG_RQ_OFFSET 0x55 +#define QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET 0x56 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e +#define QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET 0x90 +#define QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET 0x92 +#define QUERY_DEV_LIM_PBL_SZ_OFFSET 0x96 +#define QUERY_DEV_LIM_BMME_FLAGS_OFFSET 0x97 +#define QUERY_DEV_LIM_RSVD_LKEY_OFFSET 0x98 +#define QUERY_DEV_LIM_LAMR_OFFSET 0x9f +#define QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET 0xa0 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, 
outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + 
MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); + dev_lim->hca.arbel.resize_srq = field & 1; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mtt_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mpt_entry_sz = size; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); + dev_lim->hca.arbel.max_pbl_sz = 1 << (field & 0x3f); + MTHCA_GET(dev_lim->hca.arbel.bmme_flags, outbox, + QUERY_DEV_LIM_BMME_FLAGS_OFFSET); + MTHCA_GET(dev_lim->hca.arbel.reserved_lkey, outbox, + QUERY_DEV_LIM_RSVD_LKEY_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_LAMR_OFFSET); + dev_lim->hca.arbel.lam_required = field & 1; + MTHCA_GET(dev_lim->hca.arbel.max_icm_sz, outbox, + QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET); + + if (dev_lim->hca.arbel.bmme_flags & 1) + mthca_dbg(dev, "Base MM extensions: yes " + "(flags %d, max PBL %d, rsvd L_Key %08x)\n", + dev_lim->hca.arbel.bmme_flags, + dev_lim->hca.arbel.max_pbl_sz, + dev_lim->hca.arbel.reserved_lkey); + else + mthca_dbg(dev, "Base MM extensions: no\n"); + + mthca_dbg(dev, "Max ICM size %lld MB\n", + (unsigned long long) dev_lim->hca.arbel.max_icm_sz >> 20); + } else { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); + } + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 
+#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + 
MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-12-13 09:44:45.036090752 -0800 @@ -0,0 +1,265 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.h 1321 2004-12-10 19:38:54Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocaterd resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, + MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + +struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int 
max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; + union { + struct { + int max_avs; + } tavor; + struct { + int resize_srq; + int mtt_entry_sz; + int mpt_entry_sz; + int max_pbl_sz; + u8 bmme_flags; + u32 reserved_lkey; + int lam_required; + u64 max_icm_sz; + } arbel; + } hca; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, u16 token, + u8 status, u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int 
mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) (x), MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ From roland at topspin.com Mon Dec 13 10:09:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:34 -0800 Subject: [openib-general] [PATCH][v3][10/21] Add Mellanox HCA low-level driver (EQ) In-Reply-To: <20041213109.NiBwdaLIPMmwHwiP@topspin.com> Message-ID: <20041213109.SaVuYVABecenGuRB@topspin.com> Add event queue code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-12-13 09:44:45.329047594 -0800 @@ -0,0 +1,663 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_eq.c 1321 2004-12-10 19:38:54Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. 
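[Editorial note -- not part of the original patch. The comment above refers to the __attribute__((packed)) on the structure that follows: on ABIs where a u64 wants 8-byte alignment, the compiler would otherwise insert four bytes of padding before the 64-bit 'start' field, shifting every later field and breaking the layout the hardware expects. A minimal sketch of the difference, assuming a typical 64-bit ABI such as x86_64 and using hypothetical struct names:

    #include <stdint.h>

    struct eqc_padded { uint32_t flags; uint64_t start; };
    /* sizeof == 16, 'start' at offset 8 */

    struct eqc_packed { uint32_t flags; uint64_t start; } __attribute__((packed));
    /* sizeof == 12, 'start' at offset 4, matching the hardware layout */
]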
+ */ +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + + while (next_eqe_sw(eq)) { + int set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + /* cmd_event() may add more commands. + * The card will think the queue has overflowed if + * we don't tell it we've been processing events. 
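[Editorial note -- not part of the original patch. The command-event EQ has only MTHCA_NUM_CMD_EQE (0x80) entries, and completing a firmware command can wake a waiter that immediately posts another one, so many command EQEs can arrive while this loop is still running; pushing the consumer index out with set_eq_ci() for this event type keeps the hardware's producer index from catching up with a stale consumer index and declaring the EQ overflowed.]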
+ */ + set_ci = 1; + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (set_ci) { + wmb(); /* see comment below */ + set_eq_ci(dev, eq->eqn, eq->cons_index); + set_ci = 0; + } + } + + /* + * This barrier makes sure that all updates to + * ownership bits done by set_eqe_hw() hit memory + * before the consumer index is updated. set_eq_ci() + * allows the HCA to possibly write more EQ entries, + * and we want to avoid the exceedingly unlikely + * possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. + */ + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + /* Make sure EQ size is aligned to a power of 2 size. 
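[Editorial note -- not part of the original patch. The loop that follows rounds nent up to the next power of two, so e.g. a request for 100 entries yields a 128-entry EQ. An equivalent standalone sketch of the same idiom:

    static int round_up_pow2(int n)
    {
            int i;

            for (i = 1; i < n; i <<= 1)
                    ; /* nothing */
            return i;
    }

    /* round_up_pow2(100) == 128, round_up_pow2(128) == 128 */
]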
*/ + for (i = 1; i < nent; i <<= 1) + ; /* nothing */ + nent = i; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); 
+ + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} From roland at topspin.com Mon 
Dec 13 10:09:35 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:35 -0800 Subject: [openib-general] [PATCH][v3][11/21] Add Mellanox HCA low-level driver (initialization) In-Reply-To: <20041213109.SaVuYVABecenGuRB@topspin.com> Message-ID: <20041213109.ZLx80T8qLt2sXe9H@topspin.com> Add device initializaton code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-12-13 09:44:45.555014305 -0800 @@ -0,0 +1,215 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_profile.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. 
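[Editorial illustration -- not part of the original patch. Because every total size is a power of two, placing the resources largest-first at the running total keeps each one naturally aligned: areas of 0x100000, 0x40000 and 0x2000 bytes land at ddr_start + 0x0, + 0x100000 and + 0x140000 with no gaps, and each offset is a multiple of that area's own size since everything placed before it is a larger power of two.]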
Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. + */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-12-13 09:44:45.581010475 -0800 @@ -0,0 +1,51 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_profile.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-12-13 09:44:45.607006645 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_reset.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. 
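[Editorial note -- not part of the original patch. The reset itself, further down in this function, is a single 32-bit write of MTHCA_RESET_VALUE to BAR 0 offset MTHCA_RESET_OFFSET (0xf0010), followed by a one-second msleep() and then polling config space until the device answers with something other than 0xffffffff.]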
+ * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. */ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
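[Editorial note -- not part of the original patch. Both restore loops below also skip the dword containing PCI_COMMAND and write it back separately at the end, so neither the bridge nor the HCA is re-enabled on the bus until the rest of its saved configuration is in place.]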
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} From roland at topspin.com Mon Dec 13 10:09:36 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:36 -0800 Subject: [openib-general] [PATCH][v3][12/21] Add Mellanox HCA low-level driver (QP/CQ) In-Reply-To: <20041213109.ZLx80T8qLt2sXe9H@topspin.com> Message-ID: <20041213109.qjNNDyU3lIqRtV2z@topspin.com> Add CQ (completion queue) and QP (queue pair) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-12-13 09:44:45.891964666 -0800 @@ -0,0 +1,817 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cq.c 1298 2004-11-29 03:26:10Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. 
+ */ +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +}; + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +}; + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & + get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, 
u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
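[Editorial note -- not part of the original patch. "Update it" here means the hardware error CQE is recycled in place: its WQE pointer is advanced to new_wqe, its doorbell count is decremented by dbd, its syndrome is rewritten to SYNDROME_WR_FLUSH_ERR, and *free_cqe is left clear, so the same CQE is polled again and reports a flush completion for each remaining work request in the chain.]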
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-12-13 09:44:45.916960983 -0800 @@ -0,0 +1,1479 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_qp.c 1324 2004-12-13 17:55:08Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context 
context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + 
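/*
 * (Aside: illustrative only, not part of this patch.)  For an
 * indirect (page-list) queue, get_send_wqe()/get_recv_wqe() above
 * split a WQE's byte offset into a page index and an offset within
 * that page; PAGE_SIZE is a power of two, so both are cheap shifts
 * and masks.  With hypothetical names:
 *
 *	unsigned long off = (unsigned long) n << wqe_shift;
 *	void *page = page_list[off >> PAGE_SHIFT].buf;
 *	void *wqe  = page + (off & (PAGE_SIZE - 1));   // address of WQE n
 */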
atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + 
.trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + 
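/*
 * (Aside: sketch only, not part of this patch.)  Each state_table
 * entry carries a required and an optional attribute mask per
 * transport; a transition is rejected if a required bit is missing
 * or if any bit outside required | optional | IB_QP_STATE is set.
 * The same check as a hypothetical helper:
 *
 *	static int check_attr_mask(u32 attr_mask, u32 req, u32 opt)
 *	{
 *		if ((req & attr_mask) != req)
 *			return -EINVAL;    // required attribute missing
 *		if (attr_mask & ~(req | opt | IB_QP_STATE))
 *			return -EINVAL;    // unexpected extra attribute
 *		return 0;
 *	}
 */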
cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR 
based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, "modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. 
+ * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + 
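/*
 * (Aside: illustrative, not part of this patch.)  In
 * mthca_alloc_wqe_buf() above, each WQE stride is rounded up to a
 * power of two of at least 64 bytes (shift 6), and the send queue is
 * then placed after the receive queue at an offset aligned to the
 * send stride.  The rounding, with hypothetical names:
 *
 *	int shift = 6;                  // 1 << 6 == 64-byte minimum WQE
 *	while ((1 << shift) < wqe_size)
 *		++shift;                // round the stride up to 2^shift
 */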
atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + 
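/*
 * (Aside: sketch of the numbering used by mthca_alloc_sqp() above,
 * not part of this patch; sqp_start is whatever mthca_init_qp_table()
 * chose.)  The special QPs occupy four consecutive "real" QP numbers
 * starting at sqp_start:
 *
 *	// qpn is 0 for SMI (QP0) or 1 for GSI (QP1)
 *	u32 mqpn = qpn * 2 + sqp_start + port - 1;
 *	// port 1: QP0 -> sqp_start,     QP1 -> sqp_start + 2
 *	// port 2: QP0 -> sqp_start + 1, QP1 -> sqp_start + 3
 */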
pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
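/*
 * (Aside: restating the Q_Key rule applied on this line, not new
 * code.)  A work request whose remote Q_Key has the high bit set
 * means "use the QP's own Q_Key instead":
 *
 *	u32 qkey = (remote_qkey & 0x80000000) ? qp_qkey : remote_qkey;
 */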
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
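/*
 * (Aside: sketch only, not part of this patch.)  The send queue
 * tracks its occupancy in sq.cur rather than comparing producer and
 * consumer indices, so the overflow check at the top of the posting
 * loop is a plain bound on outstanding work requests:
 *
 *	if (outstanding + nreq >= max_wqes)
 *		return -ENOMEM;         // queue full, report *bad_wr
 */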
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (qp->transport == UD) { + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + } else if (qp->transport == MLX) { + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
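/*
 * (Aside: the ordering rule behind the two stores above, expressed
 * with hypothetical helpers; not part of this patch.)  A new WQE only
 * becomes reachable by the HCA once the previous WQE's control words
 * are updated, so its body must be complete first and the two
 * prev_wqe stores are separated by smp_wmb():
 *
 *	prev->nda_op = link_and_opcode(new_wqe);  // where + what
 *	smp_wmb();                                // body and link visible first
 *	prev->ee_nds = dbd_and_size(new_wqe);     // now hardware may follow it
 */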
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} From roland at topspin.com Mon Dec 13 10:09:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:38 -0800 Subject: [openib-general] [PATCH][v3][13/21] Add Mellanox HCA low-level driver (last bits) In-Reply-To: <20041213109.qjNNDyU3lIqRtV2z@topspin.com> Message-ID: <20041213109.xRUlTjiTPPxsdvte@topspin.com> Add code for remaining InfiniBand objects (address vectors, multicast groups, memory regions and protection domains) Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-12-13 09:44:46.775834455 -0800 @@ -0,0 +1,205 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1298 2004-11-29 03:26:10Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +}; + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if (!(dev->mthca_flags & 
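/*
 * (Aside: restating the g_slid encoding used by mthca_create_ah()
 * and mthca_read_ah() above; not part of this patch.)  One byte packs
 * the source path bits and the GRH flag:
 *
 *	u8 g_slid = src_path_bits & 0x7f;   // low seven bits: path bits
 *	if (has_grh)
 *		g_slid |= 0x80;             // top bit: a GRH follows
 */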
MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-12-13 09:44:46.802830478 -0800 @@ -0,0 +1,365 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 1298 2004-11-29 03:26:10Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +}; + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
+ */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + return -EINVAL; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + 
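/*
 * (Aside: simplified sketch of the walk find_mgm() performs above;
 * not part of this patch.)  Groups hash into the MGM table and
 * collisions chain into AMGM entries through next_gid_index, which
 * is stored shifted left by 5:
 *
 *	index = hash;
 *	do {
 *		read_mgm(index, &mgm);
 *		if (gid_is_zero(&mgm) || gid_matches(&mgm, gid))
 *			break;
 *		prev  = index;
 *		index = be32_to_cpu(mgm.next_gid_index) >> 5;
 *	} while (index);
 */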
struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-12-13 09:44:46.828826648 -0800 @@ -0,0 +1,385 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1298 2004-11-29 03:26:10Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* + * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits. + */ +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + 
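/*
 * (Aside: sketch of the buddy relationship used by mthca_alloc_mtt()
 * and mthca_free_mtt() above; not part of this patch.)  At any order,
 * a block's buddy is its segment number with the low bit flipped, so
 * splitting marks the sibling free one order down and freeing merges
 * upward while the buddy is free:
 *
 *	u32 buddy = seg ^ 1;   // blocks pair up as (0,1), (2,3), ...
 *	// merge step: drop the pair bit and move up one order
 *	seg >>= 1;
 *	++order;
 */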
mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & 
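/*
 * (Aside: reading of the page_size encoding in mthca_mr_alloc_phys()
 * above; the 4 KB base is an inference from the "- 12", not stated
 * elsewhere in this patch.)  The MPT page_size field appears to be an
 * exponent relative to 4 KB:
 *
 *	u32 encoded = buffer_size_shift - 12;   // 12 -> 0 (4 KB), 16 -> 4 (64 KB)
 */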
(dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-12-13 09:44:46.861821787 -0800 @@ -0,0 +1,69 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_pd.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} From roland at topspin.com Mon Dec 13 10:09:45 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:45 -0800 Subject: [openib-general] [PATCH][v3][14/21] Add Mellanox HCA low-level driver (MAD) In-Reply-To: <20041213109.xRUlTjiTPPxsdvte@topspin.com> Message-ID: <20041213109.m8TyDSPRhM2X6Nst@topspin.com> Add MAD (management datagram) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-12-13 09:44:47.344750643 -0800 @@ -0,0 +1,314 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = dma_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + DMA_TO_DEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
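+		 *
+		 * Concretely: once sm_lock is dropped below, update_sm_ah()
+		 * may replace and destroy the AH that this send was posted
+		 * with; per the note above, that is tolerated on our hardware.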
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} From roland at topspin.com Mon Dec 13 10:09:45 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:45 -0800 Subject: [openib-general] [PATCH][v3][15/21] IPoIB IPv4 multicast In-Reply-To: <20041213109.m8TyDSPRhM2X6Nst@topspin.com> Message-ID: <20041213109.5NKezuGE9PMejMSM@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-12-13 09:44:47.613711020 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-12-11 15:16:19.000000000 -0800 +++ linux-bk/include/net/ip.h 2004-12-13 09:44:47.641706896 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. 
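+ *
+ * For example, with the P_Key bytes still zero the group 224.0.0.1
+ * maps to the 20-byte hardware address
+ *
+ *   00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:00:00:00:01
+ *
+ * i.e. the reserved byte, the 0xffffff multicast QPN, and the IPoIB
+ * multicast GID ff12:401b::1 built from the low 28 bits of the IPv4
+ * group address.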
+ */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-12-11 15:16:28.000000000 -0800 +++ linux-bk/net/ipv4/arp.c 2004-12-13 09:44:47.650705570 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Dec 13 10:09:46 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:46 -0800 Subject: [openib-general] [PATCH][v3][16/21] IPoIB IPv6 support In-Reply-To: <20041213109.5NKezuGE9PMejMSM@topspin.com> Message-ID: <20041213109.iziHvQZqtmP83gmx@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-12-11 15:16:37.000000000 -0800 +++ linux-bk/include/net/if_inet6.h 2004-12-13 09:44:48.801536031 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-12-11 15:16:33.000000000 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-12-13 09:44:48.840530286 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1095,6 +1096,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1794,6 +1801,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; --- linux-bk.orig/net/ipv6/ndisc.c 2004-12-11 15:16:13.000000000 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-12-13 09:44:48.890522921 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Dec 13 10:09:51 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:51 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041213109.iziHvQZqtmP83gmx@topspin.com> Message-ID: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-13 09:44:43.936252779 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-13 09:44:49.385450009 -0800 @@ -2,7 +2,6 @@ config INFINIBAND tristate "InfiniBand support" - default n ---help--- Core support for InfiniBand (IB). Make sure to also select any protocols you wish to use as well as drivers for your @@ -10,4 +9,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-13 09:44:43.909256756 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-13 09:44:49.342456343 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-12-13 09:44:49.470437489 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. 
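+
+	  For example, tracing might be enabled and the multicast group
+	  state inspected at runtime roughly as follows (a sketch only;
+	  it assumes the module is loaded as ib_ipoib, an interface
+	  named ib0, and the usual /sys/module parameter layout):
+
+	    echo 1 > /sys/module/ib_ipoib/parameters/debug_level
+	    mount -t ipoib_debugfs nodev /mnt
+	    cat /mnt/ib0_mcg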
+ +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. The output can be turned on via the + data_debug_level module parameter; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-12-13 09:44:49.426443970 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-12-13 09:44:49.497433512 -0800 @@ -0,0 +1,340 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib.h 1323 2004-12-11 02:36:04Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_TX_FULL = 0, + IPOIB_FLAG_OPER_UP = 1, + IPOIB_FLAG_ADMIN_UP = 2, + IPOIB_PKEY_ASSIGNED = 3, + IPOIB_PKEY_STOP = 4, + IPOIB_FLAG_SUBINTERFACE = 5, + IPOIB_MCAST_RUN = 6, + IPOIB_STOP_REAPER = 7, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). 
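+ *
+ * A minimal sketch of the ordering when both locks are needed:
+ *
+ *	spin_lock_irqsave(&priv->tx_lock, flags);
+ *	spin_lock(&priv->lock);
+ *	... touch TX-path state and the rest of the private data ...
+ *	spin_unlock(&priv->lock);
+ *	spin_unlock_irqrestore(&priv->tx_lock, flags);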
+ */ +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct rb_root path_tree; + struct list_head path_list; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned int mcast_mtu; + + struct ipoib_buf *rx_ring; + + spinlock_t tx_lock; + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct neighbour *neighbour; + + struct list_head list; +}; + +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) +{ + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +void ipoib_flush_paths(struct net_device *dev); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct 
ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (data_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-12-13 09:44:49.522429829 -0800 @@ -0,0 +1,276 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return 
inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return __ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-12-13 09:44:49.547426147 -0800 @@ -0,0 +1,627 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_ib.c 1323 2004-12-11 02:36:04Z roland $ + */ + +#include +#include + +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +int data_debug_level; + +module_param(data_debug_level, int, 0644); +MODULE_PARM_DESC(data_debug_level, + "Enable data path debug tracing if > 0"); +#endif + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, ¶m, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + + ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + 
wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + unsigned int wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, ¶m, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + if (!(skb = skb_unshare(skb, GFP_ATOMIC))) { + ipoib_warn(priv, "failed to unshare sk_buff. 
Dropping\n"); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len)) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. 
+ */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + return 1; + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) + yield(); + + ipoib_dbg(priv, "All sends and receives done.\n"); + + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assigment Interim Support + * + * The following is initial implementation of delayed P_Key assigment + * mechanism. It is using the same approach implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. 
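+ *
+ * Rough flow, for orientation:
+ *
+ *   ipoib_open()
+ *     -> ipoib_pkey_dev_delay_open(): if the interface P_Key is not
+ *        yet in the port's P_Key table, queue ipoib_pkey_poll() with
+ *        a one second delay and return without bringing the
+ *        interface up.
+ *   ipoib_pkey_poll()
+ *     -> if the P_Key has appeared, call ipoib_open(); otherwise
+ *        requeue itself until IPOIB_PKEY_STOP is set by
+ *        ipoib_ib_dev_down().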
+ */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assigment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-12-13 09:44:49.573422317 -0800 @@ -0,0 +1,1023 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_main.c 1323 2004-12-11 02:36:04Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, gid->raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, 
tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &priv->path_list, list) + __path_free(dev, path); + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct ipoib_dev_priv *priv = netdev_priv(path->dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; + struct sk_buff *skb; + unsigned long flags; + + ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", + status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + + if (status == IB_WC_SUCCESS) { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(path->dev, priv->pd, &av); + } + + spin_lock_irqsave(&priv->lock, flags); + + path->ah = ah; + + if (ah) { + path->pathrec = *pathrec; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + skb_queue_head_init(&skqueue); + + while ((skb = __skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + neigh->ah = path->ah; + kref_get(&path->ah->ref); + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = path->dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } +} + +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + path = kmalloc(sizeof *path, GFP_ATOMIC); + if (!path) + return NULL; + + path->dev = dev; + path->pathrec.dlid = 0; + + skb_queue_head_init(&path->queue); + + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); + + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + + __path_add(dev, path); + + return path; +} + +static int path_rec_start(struct net_device *dev, + struct 
ipoib_path *path)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	path->query_id =
+		ib_sa_path_rec_get(priv->ca, priv->port,
+				   &path->pathrec,
+				   IB_SA_PATH_REC_DGID |
+				   IB_SA_PATH_REC_SGID |
+				   IB_SA_PATH_REC_NUMB_PATH |
+				   IB_SA_PATH_REC_PKEY,
+				   1000, GFP_ATOMIC,
+				   path_rec_completion,
+				   path, &path->query);
+	if (path->query_id < 0) {
+		ipoib_warn(priv, "ib_sa_path_rec_get failed\n");
+		path->query = NULL;
+		return path->query_id;
+	}
+
+	return 0;
+}
+
+static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_path *path;
+	struct ipoib_neigh *neigh;
+
+	neigh = kmalloc(sizeof *neigh, GFP_ATOMIC);
+	if (!neigh) {
+		++priv->stats.tx_dropped;
+		dev_kfree_skb_any(skb);
+		return;
+	}
+
+	skb_queue_head_init(&neigh->queue);
+	neigh->neighbour = skb->dst->neighbour;
+	*to_ipoib_neigh(skb->dst->neighbour) = neigh;
+
+	/*
+	 * We can only be called from ipoib_start_xmit, so we're
+	 * inside tx_lock -- no need to save/restore flags.
+	 */
+	spin_lock(&priv->lock);
+
+	path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4));
+	if (!path) {
+		path = path_rec_create(dev,
+				       (union ib_gid *) (skb->dst->neighbour->ha + 4));
+		if (!path)
+			goto err_path;
+	}
+
+	list_add_tail(&neigh->list, &path->neigh_list);
+
+	if (path->pathrec.dlid) {
+		neigh->ah = path->ah;
+		kref_get(&path->ah->ref);
+
+		ipoib_send(dev, skb, path->ah,
+			   be32_to_cpup((__be32 *) skb->dst->neighbour->ha));
+	} else if (!path->query) {
+		neigh->ah = NULL;
+		__skb_queue_tail(&neigh->queue, skb);
+		if (path_rec_start(dev, path))
+			goto err_list;
+	}
+
+	spin_unlock(&priv->lock);
+	return;
+
+err_list:
+	list_del(&neigh->list);
+
+err_path:
+	/* Clear the neighbour back-pointer and destructor before freeing. */
+	*to_ipoib_neigh(skb->dst->neighbour) = NULL;
+	neigh->neighbour->ops->destructor = NULL;
+	kfree(neigh);
+
+	++priv->stats.tx_dropped;
+	dev_kfree_skb_any(skb);
+
+	spin_unlock(&priv->lock);
+}
+
+static void path_lookup(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(skb->dev);
+
+	/* Look up path record for unicasts */
+	if (skb->dst->neighbour->ha[4] != 0xff) {
+		neigh_add_path(skb, dev);
+		return;
+	}
+
+	/* Add in the P_Key for multicasts */
+	skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff;
+	skb->dst->neighbour->ha[9] = priv->pkey & 0xff;
+	ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb);
+}
+
+static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
+			     struct ipoib_pseudoheader *phdr)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_path *path;
+
+	/*
+	 * We can only be called from ipoib_start_xmit, so we're
+	 * inside tx_lock -- no need to save/restore flags.
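+	 * (ipoib_start_xmit disables interrupts before it takes tx_lock, so a
+	 * plain spin_lock() on priv->lock is sufficient here.)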
+ */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); + return; + } + + ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); + + spin_unlock(&priv->lock); +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + unsigned long flags; + + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + goto out; + } + + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) + __skb_queue_tail(&neigh->queue, skb); + else + goto err; + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } else { + /* unicast GID -- should be ARP reply */ + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + goto out; + } + + /* put the pseudoheader back on and try to send */ + unicast_arp_send(skb, dev, phdr); + } + } + + goto out; + +err: + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. 
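+	 * (ipoib_start_xmit pulls this pseudoheader back off with skb_pull()
+	 * before resolving the destination and sending.)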
+ */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *n) +{ + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; + + ipoib_dbg(priv, + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); + + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. + */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + 
dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. + */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->path_list); + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_fs: + ipoib_unregister_debugfs(); + +err_wq: + destroy_workqueue(ipoib_workqueue); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-12-13 09:44:49.603417898 -0800 @@ -0,0 +1,954 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_multicast.c 1323 2004-12-11 02:36:04Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if 
(dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + 
queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if (!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + 
IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. 
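+	 * (The completion callback passed to ib_sa_mcmember_rec_delete() below
+	 * is NULL, so the result of the query is simply discarded.)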
+ */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. + */ + mcast = NULL; + } + +out: + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, 
list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark all of the entries that are found or don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); 
+ if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-12-13 09:44:49.634413332 -0800 @@ -0,0 +1,243 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_verbs.c 1308 2004-12-03 01:31:40Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
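+	 * (If the P_Key is not present in the cached table, ib_cached_pkey_find()
+	 * fails below, IPOIB_PKEY_ASSIGNED is cleared and no QP is created.)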
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in a INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_reg_phys_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq, + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_qp_destroy failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE || + 
record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE) { + ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-12-13 09:44:49.660409502 -0800 @@ -0,0 +1,166 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib_vlan.c 1271 2004-11-18 22:11:29Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. 
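+	 * (A duplicate P_Key would also collide on the interface name, since
+	 * child interfaces are named <parent>.<P_Key> below.)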
+ */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Mon Dec 13 10:09:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:59 -0800 Subject: [openib-general] [PATCH][v3][19/21] Document InfiniBand ioctl use In-Reply-To: <20041213109.dhdvNgaA6X6YHRdV@topspin.com> Message-ID: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. Signed-off-by: Roland Dreier --- linux-bk.orig/Documentation/ioctl-number.txt 2004-12-11 15:16:34.000000000 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-12-13 09:44:50.671260584 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Mon Dec 13 10:09:57 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:57 -0800 Subject: [openib-general] [PATCH][v3][18/21] Add InfiniBand userspace MAD support In-Reply-To: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> Message-ID: <20041213109.dhdvNgaA6X6YHRdV@topspin.com> Add a driver that provides a character special device for each InfiniBand port. This device allows userspace to send and receive MADs via write() and read() (with some control operations implemented as ioctls). 
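For illustration only (not part of this patch), a minimal userspace consumer might look roughly like the sketch below. The device node name is an assumption here, agent registration through the module's ioctls is omitted, and a raw buffer stands in for the struct ib_user_mad layout defined in the ib_user_mad.h header (not shown in this hunk).

/* Hypothetical sketch: block until one MAD arrives on the umad device and
 * read it.  Assumes the node is /dev/infiniband/umad0 and that a MAD agent
 * has already been registered via the driver's ioctls (not shown here). */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct pollfd pfd;
	char buf[4096];		/* comfortably larger than one struct ib_user_mad */
	ssize_t n;
	int fd;

	fd = open("/dev/infiniband/umad0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	pfd.fd = fd;
	pfd.events = POLLIN;

	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
		n = read(fd, buf, sizeof buf);	/* each read() returns one MAD */
		if (n > 0)
			printf("received a %zd-byte MAD\n", n);
	}

	close(fd);
	return 0;
}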
All operations are 32/64 clean and have been tested with 32-bit userspace running on a ppc64 kernel. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-13 09:44:43.579305364 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-13 09:44:50.192331140 -0800 @@ -1,22 +1,12 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += \ - ib_core.o \ - ib_mad.o \ - ib_sa.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_umad.o -ib_core-objs := \ - packer.o \ - ud_header.o \ - verbs.o \ - sysfs.o \ - device.o \ - fmr_pool.o \ - cache.o +ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ + device.o fmr_pool.o cache.o -ib_mad-objs := \ - mad.o \ - smi.o \ - agent.o +ib_mad-y := mad.o smi.o agent.o -ib_sa-objs := sa_query.o +ib_sa-y := sa_query.o + +ib_umad-y := user_mad.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/user_mad.c 2004-12-13 09:44:50.232325248 -0800 @@ -0,0 +1,694 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device class_dev; + struct ib_device *ib_dev; + struct ib_umad_device *umad_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct kref ref; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + spinlock_t recv_lock; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct rw_semaphore agent_mutex; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *send_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet = + (void *) (unsigned long) send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf.mad, sizeof packet->mad.data); + packet->mad.status = 0; + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + if (queue_packet(file, agent, packet)) + kfree(packet); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = 
filp->private_data; + struct ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + spin_lock_irq(&file->recv_lock); + + while (list_empty(&file->recv_list)) { + spin_unlock_irq(&file->recv_lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + spin_lock_irq(&file->recv_lock); + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + spin_unlock_irq(&file->recv_lock); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + down_read(&file->agent_mutex); + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + /* XXX handle GRH */ + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = dma_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + DMA_TO_DEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + goto err_up; + } + + up_read(&file->agent_mutex); + + return sizeof packet->mad; + +err_up: + up_read(&file->agent_mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct ib_umad_file 
*file, unsigned long arg) +{ + struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + down_write(&file->agent_mutex); + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + down_write(&file->agent_mutex); + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + spin_lock_init(&file->recv_lock); + init_rwsem(&file->agent_mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_dev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port 
= + container_of(class_dev, struct ib_umad_port, class_dev); + + return print_dev_t(buf, port->dev.dev); +} +static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_release_dev(struct kref *ref) +{ + struct ib_umad_device *dev = + container_of(ref, struct ib_umad_device, ref); + + kfree(dev); +} + +static void ib_umad_release_port(struct class_device *class_dev) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + cdev_del(&port->dev); + clear_bit(port->devnum, dev_map); + kref_put(&port->umad_dev->ref, ib_umad_release_dev); +} + +static struct class umad_class = { + .name = "infiniband_mad", + .release = ib_umad_release_port +}; + +static ssize_t show_abi_version(struct class *class, char *buf) +{ + return sprintf(buf, "%d\n", IB_USER_MAD_ABI_VERSION); +} +static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + memset(umad_dev, 0, sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port)); + + kref_init(&umad_dev->ref); + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + umad_dev->port[i - s].umad_dev = umad_dev; + kref_get(&umad_dev->ref); + + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev.class = &umad_class; + umad_dev->port[i - s].class_dev.dev = device->dma_device; + snprintf(umad_dev->port[i - s].class_dev.class_id, + BUS_ID_SIZE, "umad%d", umad_dev->port[i - s].devnum); + if (class_device_register(&umad_dev->port[i - s].class_dev)) + goto err_class; + + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_dev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_port)) + goto err_class; + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - 
s].devnum, dev_map); + +err: + while (--i >= s) + class_device_unregister(&umad_dev->port[i - s].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) + class_device_unregister(&umad_dev->port[i].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static int __init ib_umad_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + ret = class_register(&umad_class); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create class infiniband_mad\n"); + goto out_chrdev; + } + + ret = class_create_file(&umad_class, &class_attr_abi_version); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create abi_version attribute\n"); + goto out_class; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + return 0; + +out_class: + class_unregister(&umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + ib_unregister_client(&umad_client); + class_unregister(&umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_user_mad.h 2004-12-13 09:44:50.261320976 -0800 @@ -0,0 +1,112 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include +#include + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IB_USER_MAD_ABI_VERSION 2 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). 
+ */ + +/** + * ib_user_mad - MAD packet + * @data - Contents of MAD + * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + * + * All multi-byte quantities are stored in network (big endian) byte order. + */ +struct ib_user_mad { + __u8 data[256]; + __u32 id; + __u32 status; + __u32 timeout_ms; + __u32 qpn; + __u32 qkey; + __u16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __u32 flow_label; +}; + +/** + * ib_user_mad_reg_req - MAD registration request + * @id - Set by the kernel; used to identify agent in future requests. + * @qpn - Queue pair number; must be 0 or 1. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + * @mgmt_class - Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + */ +struct ib_user_mad_reg_req { + __u32 id; + __u32 method_mask[4]; + __u8 qpn; + __u8 mgmt_class; + __u8 mgmt_class_version; + __u8 oui[3]; +}; + +#define IB_IOCTL_MAGIC 0x1b + +#define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ + struct ib_user_mad_reg_req) + +#define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) + +#endif /* IB_USER_MAD_H */ From roland at topspin.com Mon Dec 13 10:10:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:10:04 -0800 Subject: [openib-general] [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> Message-ID: <200412131010.qyAMW5NxoiM4CntC@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-12-13 09:44:50.964217426 -0800 @@ -0,0 +1,56 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. 
To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net//create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debufs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in + the data path when data_debug_level is set to 1. However, even with + the output disabled, enabling this configuration option will affect + performance, because it adds tests to the fast path. + +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-12-13 09:44:51.008210945 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. 
+ +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-12-13 09:44:51.441147165 -0800 @@ -0,0 +1,81 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%k" + + can be used. This will create a device node named + + /dev/infiniband/umad0 + + for the first port, and so on. The InfiniBand device and port + associated with this device can be determined from the files + + /sys/class/infiniband_mad/umad0/ibdev + /sys/class/infiniband_mad/umad0/port From roland at topspin.com Mon Dec 13 10:10:07 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:10:07 -0800 Subject: [openib-general] [PATCH][v3][21/21] InfiniBand MAINTAINERS entry In-Reply-To: <200412131010.qyAMW5NxoiM4CntC@topspin.com> Message-ID: <200412131010.czsQg0IIemnlj3gy@topspin.com> Add OpenIB maintainers information to MAINTAINERS. 
Signed-off-by: Roland Dreier --- linux-bk.orig/MAINTAINERS 2004-12-11 15:16:16.000000000 -0800 +++ linux-bk/MAINTAINERS 2004-12-13 09:44:51.816091929 -0800 @@ -1081,6 +1081,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From halr at voltaire.com Mon Dec 13 10:16:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Dec 2004 20:16:23 +0200 Subject: [openib-general] Re: User MAD support for cancel MAD Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175AFF@taurus.voltaire.com> [Sean wrote:] > We could argue that cancel MAD should work based on TID, management class, > and SGID/SLID, rather than simply the TID. Are you saying to have other forms of MAD like cancel by mgmt_class, cancel by SGID and/or SLID in addition to cancel by TID ? We could do this if there is a need. I don't think that need has been demonstrated (at least yet). -- Hal From mshefty at ichips.intel.com Mon Dec 13 10:28:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 10:28:18 -0800 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175AFF@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175AFF@taurus.voltaire.com> Message-ID: <41BDDF42.2080507@ichips.intel.com> Hal Rosenstock wrote: > [Sean wrote:] > >>We could argue that cancel MAD should work based on TID, management class, >>and SGID/SLID, rather than simply the TID. > > > Are you saying to have other forms of MAD like cancel by mgmt_class, cancel > by SGID and/or SLID in addition to cancel by TID ? We could do this if there > is a need. I don't think that need has been demonstrated (at least yet). No - I'm not suggesting that we have more ways to cancel a MAD. I was simply pointing out that we _could_ have cancel_mad work based on TID + mgmt_class + SGID/SLID (which would match uniqueness as defined by the spec), rather than just TID. I think canceling MADs based on the wr_id is sufficient. - Sean From tduffy at sun.com Mon Dec 13 10:44:24 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 13 Dec 2004 10:44:24 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> References: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> Message-ID: <1102963464.9258.11.camel@duffman> On Mon, 2004-12-13 at 10:09 -0800, Roland Dreier wrote: > --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-13 09:44:43.936252779 -0800 > +++ linux-bk/drivers/infiniband/Kconfig 2004-12-13 09:44:49.385450009 -0800 > @@ -2,7 +2,6 @@ > > config INFINIBAND > tristate "InfiniBand support" > - default n > ---help--- > Core support for InfiniBand (IB). Make sure to also select > any protocols you wish to use as well as drivers for your Is there a reason why you put this in in an earlier patch and then take it out later? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Mon Dec 13 10:48:07 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 13 Dec 2004 10:48:07 -0800 Subject: [openib-general] [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <200412131010.qyAMW5NxoiM4CntC@topspin.com> References: <200412131010.qyAMW5NxoiM4CntC@topspin.com> Message-ID: <1102963687.9258.14.camel@duffman> On Mon, 2004-12-13 at 10:10 -0800, Roland Dreier wrote: > Add files to Documentation/infiniband that describe the tree under > /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Do you want to add in Hal's ipoib faq as well? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Mon Dec 13 10:49:36 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:49:36 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1102963464.9258.11.camel@duffman> (Tom Duffy's message of "Mon, 13 Dec 2004 10:44:24 -0800") References: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> <1102963464.9258.11.camel@duffman> Message-ID: <52mzwi58zj.fsf@topspin.com> Tom> Is there a reason why you put this in in an earlier patch and Tom> then take it out later? I guess the reasons are stupidity and bad patch scripts... Doesn't hurt for now, will be fixed in future versions. - R. From tduffy at sun.com Mon Dec 13 10:54:29 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 13 Dec 2004 10:54:29 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <52mzwi58zj.fsf@topspin.com> References: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> <1102963464.9258.11.camel@duffman> <52mzwi58zj.fsf@topspin.com> Message-ID: <1102964069.9258.20.camel@duffman> On Mon, 2004-12-13 at 10:49 -0800, Roland Dreier wrote: > Tom> Is there a reason why you put this in in an earlier patch and > Tom> then take it out later? > > I guess the reasons are stupidity and bad patch scripts... > > Doesn't hurt for now, will be fixed in future versions. Speaking of nits, there are also some formatting issues with the Makefiles that changes in the later patches... But, the end result is you get the "correct" formatting if you apply all the patches. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Mon Dec 13 11:00:30 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 11:00:30 -0800 Subject: [openib-general] [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <1102963687.9258.14.camel@duffman> (Tom Duffy's message of "Mon, 13 Dec 2004 10:48:07 -0800") References: <200412131010.qyAMW5NxoiM4CntC@topspin.com> <1102963687.9258.14.camel@duffman> Message-ID: <52is7658hd.fsf@topspin.com> Tom> Do you want to add in Hal's ipoib faq as well? Good idea. I'll fix up the formatting and add that to the next version as well. - R. 
From roland at topspin.com Mon Dec 13 11:00:48 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 11:00:48 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1102964069.9258.20.camel@duffman> (Tom Duffy's message of "Mon, 13 Dec 2004 10:54:29 -0800") References: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> <1102963464.9258.11.camel@duffman> <52mzwi58zj.fsf@topspin.com> <1102964069.9258.20.camel@duffman> Message-ID: <52ekhu58gv.fsf@topspin.com> Tom> Speaking of nits, there are also some formatting issues with Tom> the Makefiles that changes in the later patches... Thanks, I've fixed those intermediate versions as well. - R. From halr at voltaire.com Mon Dec 13 12:19:09 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Dec 2004 22:19:09 +0200 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> Hi Roland, [You wrote:] > The IPoIB protocol/encapsulation is described in the Internet-Drafts > http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt > http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt The latest I-D is now http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Also, isn't DHCP over IB (http://www.ietf.org/internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt) also supported ? If so, is that part of this or some other patch being submitted ? Thanks. -- Hal From mst at mellanox.co.il Mon Dec 13 12:14:59 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Dec 2004 22:14:59 +0200 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: References: Message-ID: <20041213201459.GA24197@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: User MAD support for cancel MAD": > > The interface issue is not so trivial. An IOCTL may do, but the problem > is the parameter of the IOCTL. The most straight forward way is to > specify a TID to cancel. This requires the usermode to avoid sending the > same mad until it is timeouts. And here I thought it is the kernel that selects the TID. No? mst From mshefty at ichips.intel.com Mon Dec 13 12:27:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 12:27:35 -0800 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: <20041213201459.GA24197@mellanox.co.il> References: <20041213201459.GA24197@mellanox.co.il> Message-ID: <41BDFB37.9070800@ichips.intel.com> Michael S. Tsirkin wrote: > Hello! > Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: User MAD support for cancel MAD": > >>The interface issue is not so trivial. An IOCTL may do, but the problem >>is the parameter of the IOCTL. The most straight forward way is to >>specify a TID to cancel. This requires the usermode to avoid sending the >>same mad until it is timeouts. > > > And here I thought it is the kernel that selects the TID. > No? The MAD layer in the kernel assigns the only half of the TID. This is done to ensure that TIDs are unique between clients. Also, as a side note (and not a direct response), the current ib_cancel_mad API cancels MADs based on work request ID, not TID. 
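To make "half of the TID" concrete -- this mirrors what ib_umad_write()
in the user_mad patch does on every send; mad_hdr below is shorthand for
the MAD header at the start of packet->mad.data:

    /* upper 32 bits: kernel-assigned agent->hi_tid;
     * lower 32 bits: whatever TID userspace put in the MAD */
    mad_hdr->tid = cpu_to_be64(((u64) agent->hi_tid) << 32 |
                               (be64_to_cpu(mad_hdr->tid) & 0xffffffff));

so two agents cannot generate colliding TIDs even if they happen to pick
the same low 32 bits.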
- Sean From halr at voltaire.com Mon Dec 13 12:31:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Dec 2004 22:31:38 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B05@taurus.voltaire.com> [You wrote:] > MGID....................0xff12401bffff0000 : 0x0000000000000016 > [1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. MGID 0xff12401bffff0000 : 0x0000000000000016 corresponds to 224.0.0.22 (IGMP). Only Version 3 Reports are sent with an IP destination address of 224.0.0.22, to which all IGMPv3-capable multicast routers listen. The OpenIB stack is joining this group (as it is a end node) rather than attempting to creating this group (hence the component masks). It is joining as a full rather than send only member due to a limitation with some SMs not supporting send only (or non member) joins. Even if that were supported by the SM, if the group were not created first, the send only join would fail. It makes no sense to send when no one is listening so the fact that it does not join is a benign error (ignore the message). The code retries on the next send to this group anyhow. It takes a multicast router which would be listening to IGMP to create the group. Do you have one configured on your subnet ? -- Hal From robert.j.woodruff at intel.com Mon Dec 13 14:14:21 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 13 Dec 2004 14:14:21 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8563@orsmsx408> >It takes a multicast router which would be listening to IGMP >to create the group. Do you have one configured on your subnet ? >-- Hal The real question is how come the Mellanox PCI-E card does not seem to work with the openib.org IPoIB ? I replaced the PCI-E card with a PCI-X card and it works fine. With the PCI-E card, it appears (from the counters logged in /sys/class/infiniband/mtcha0/ports/1/counters) that the PCI-E card is sending data but not receiving any. From halr at voltaire.com Mon Dec 13 14:14:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Dec 2004 00:14:58 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B12@taurus.voltaire.com> [Woddy wrote:] > The real question is how come the Mellanox PCI-E card does not seem to > work with the openib.org IPoIB ? > I replaced the PCI-E card with a PCI-X card and it works fine. > With the PCI-E card, it appears (from the counters logged in > /sys/class/infiniband/mtcha0/ports/1/counters) > that the PCI-E card is sending data but not receiving any. I believe (yet to be confirmed) that this is fixed in the upcoming Arbel firmware release. -- Hal From robert.j.woodruff at intel.com Mon Dec 13 14:19:56 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 13 Dec 2004 14:19:56 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8581@orsmsx408> >I believe (yet to be confirmed) that this is fixed in the upcoming >Arbel firmware release. >-- Hal I hope so. I am running the 5.4.6-rc4 firmware that I got from Trevor and it still fails. Perhaps Eitan knows if there is some newer version that fixes this issue. 
woody From halr at voltaire.com Mon Dec 13 14:29:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Dec 2004 00:29:21 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B13@taurus.voltaire.com> >>I believe (yet to be confirmed) that this is fixed in the upcoming >>Arbel firmware release. >I hope so. I am running the 5.4.6-rc4 firmware that I got from Trevor >and it still fails. Perhaps Eitan knows if there is some newer version >that fixes this issue. There could be an OpenIB problem. I don't know for sure as I do not have a PCI-E machine or HCA. My only experience is with IPoIB on other stacks with PCI-E. -- Hal From robert.j.woodruff at intel.com Mon Dec 13 14:56:35 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 13 Dec 2004 14:56:35 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> >>I believe (yet to be confirmed) that this is fixed in the upcoming >>Arbel firmware release. >I hope so. I am running the 5.4.6-rc4 firmware that I got from Trevor >and it still fails. Perhaps Eitan knows if there is some newer version >that fixes this issue. >There could be an OpenIB problem. I don't know for sure as I do >not have a PCI-E machine or HCA. My only experience is with IPoIB >on other stacks with PCI-E. >-- Hal I have other stacks that work fine on the same systems with the same hardware so I suppose that mthca could be doing something wrong. However, it sure points to the PCI-E hardware since without changing any software and just the hardware, it works with the PCI-X card and fails with the PCI-E card. They are suppose to be software compatible, so I am not sure what we could be doing wrong. From mshefty at ichips.intel.com Mon Dec 13 15:37:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 15:37:11 -0800 Subject: [openib-general] [PATCH] define SMP attributes in ib_smi.h Message-ID: <20041213153711.2cde885f.mshefty@ichips.intel.com> This patch defines SMP attribute values in ib_smi.h. (I use these in madeye.) It removes/replaces the SMP attribute enums from mthca. There is also a minor change from using cpu_to_be32 to __constant_htonl to allow using #define's in switch statements. 
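To illustrate (a sketch of the kind of use the change enables, e.g. in
madeye -- not code from the patch): a case label must be an integer
constant expression, which __constant_htons() is by construction, so the
attribute IDs can be matched directly against the big-endian attr_id
field:

    switch (mad->mad_hdr.attr_id) {
    case IB_SMP_ATTR_NODE_INFO:     /* __constant_htons(0x0011) */
            printk(KERN_INFO "NodeInfo\n");
            break;
    case IB_SMP_ATTR_PORT_INFO:     /* __constant_htons(0x0015) */
            printk(KERN_INFO "PortInfo\n");
            break;
    default:
            break;
    }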
- Sean Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1324) +++ include/ib_mad.h (working copy) @@ -60,7 +60,7 @@ #define IB_MGMT_MAX_METHODS 128 #define IB_QP0 0 -#define IB_QP1 cpu_to_be32(1) +#define IB_QP1 __constant_htonl(1) #define IB_QP1_QKEY 0x80010000 struct ib_grh { Index: include/ib_smi.h =================================================================== --- include/ib_smi.h (revision 1324) +++ include/ib_smi.h (working copy) @@ -56,7 +56,25 @@ u8 return_path[IB_SMP_MAX_PATH_HOPS]; } __attribute__ ((packed)); -#define IB_SMP_DIRECTION cpu_to_be16(0x8000) +#define IB_SMP_DIRECTION __constant_htons(0x8000) + +/* Subnet management attributes */ +#define IB_SMP_ATTR_NOTICE __constant_htons(0x0002) +#define IB_SMP_ATTR_NODE_DESC __constant_htons(0x0010) +#define IB_SMP_ATTR_NODE_INFO __constant_htons(0x0011) +#define IB_SMP_ATTR_SWITCH_INFO __constant_htons(0x0012) +#define IB_SMP_ATTR_GUID_INFO __constant_htons(0x0014) +#define IB_SMP_ATTR_PORT_INFO __constant_htons(0x0015) +#define IB_SMP_ATTR_PKEY_TABLE __constant_htons(0x0016) +#define IB_SMP_ATTR_SL_TO_VL_TABLE __constant_htons(0x0017) +#define IB_SMP_ATTR_VL_ARB_TABLE __constant_htons(0x0018) +#define IB_SMP_ATTR_LINEAR_FORWARD_TABLE __constant_htons(0x0019) +#define IB_SMP_ATTR_RANDOM_FORWARD_TABLE __constant_htons(0x001A) +#define IB_SMP_ATTR_MCAST_FORWARD_TABLE __constant_htons(0x001B) +#define IB_SMP_ATTR_SM_INFO __constant_htons(0x0020) +#define IB_SMP_ATTR_VENDOR_DIAG __constant_htons(0x0030) +#define IB_SMP_ATTR_LED_INFO __constant_htons(0x0031) +#define IB_SMP_ATTR_VENDOR_MASK __constant_htons(0xFF00) static inline u8 ib_get_smp_direction(struct ib_smp *smp) Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 1324) +++ hw/mthca/mthca_provider.c (working copy) @@ -22,18 +22,11 @@ */ #include +#include #include "mthca_dev.h" #include "mthca_cmd.h" -/* Temporary until we get core support straightened out */ -enum { - IB_SMP_ATTRIB_NODE_INFO = 0x0011, - IB_SMP_ATTRIB_GUID_INFO = 0x0014, - IB_SMP_ATTRIB_PORT_INFO = 0x0015, - IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 -}; - static int mthca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) { @@ -54,7 +47,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_NODE_INFO; err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, in_mad, out_mad, @@ -98,7 +91,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PORT_INFO; in_mad->mad_hdr.attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, @@ -152,7 +145,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PKEY_TABLE; in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); err = mthca_MAD_IFC(to_mdev(ibdev), 1, @@ -191,7 +184,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - 
in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PORT_INFO; in_mad->mad_hdr.attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, @@ -211,7 +204,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_GUID_INFO; in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); err = mthca_MAD_IFC(to_mdev(ibdev), 1, Index: hw/mthca/mthca_mad.c =================================================================== --- hw/mthca/mthca_mad.c (revision 1324) +++ hw/mthca/mthca_mad.c (working copy) @@ -23,18 +23,12 @@ #include #include +#include #include "mthca_dev.h" #include "mthca_cmd.h" enum { - IB_SM_PORT_INFO = 0x0015, - IB_SM_PKEY_TABLE = 0x0016, - IB_SM_SM_INFO = 0x0020, - IB_SM_VENDOR_START = 0xff00 -}; - -enum { MTHCA_VENDOR_CLASS1 = 0x9, MTHCA_VENDOR_CLASS2 = 0xa }; @@ -84,7 +78,7 @@ if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && mad->mad_hdr.method == IB_MGMT_METHOD_SET) { - if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { update_sm_ah(to_mdev(ibdev), port_num, be16_to_cpup((__be16 *) (mad->data + 58)), (*(u8 *) (mad->data + 76)) & 0xf); @@ -95,7 +89,7 @@ ib_dispatch_event(&event); } - if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { event.device = ibdev; event.event = IB_EVENT_PKEY_CHANGE; event.element.port_num = port_num; @@ -211,8 +205,9 @@ * Don't process SMInfo queries or vendor-specific * MADs -- the SMA can't handle them. */ - if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || - be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + if (in_mad->mad_hdr.attr_id == IB_SMP_ATTR_SM_INFO || + ((in_mad->mad_hdr.attr_id & IB_SMP_ATTR_VENDOR_MASK) == + IB_SMP_ATTR_VENDOR_MASK)) return IB_MAD_RESULT_SUCCESS; } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || From mshefty at ichips.intel.com Mon Dec 13 15:54:23 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 15:54:23 -0800 Subject: [openib-general] [PATCH] [RFC] new test directory + test code for MAD snooping In-Reply-To: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> Message-ID: <41BE2BAF.10506@ichips.intel.com> Sean Hefty wrote: > This patch creates a new subdirectory under infiniband called 'test', and adds a > new kernel module called 'madeye' that can be used to snoop and display MADs > (until user-mode smpdump and gmpdump programs are created). > > We need to decide if we want to include this sort of test code in the openib > tree, and where it might best go. Does anyone care where this code gets checked into? Right now I'm leaning on putting it in gen2/test or gen2/utils to keep from polluting gen2/trunk. I'm anticipating creating a test app for the CM as well in order to verify some basic functionality. 
- Sean From jjengla at sandia.gov Mon Dec 13 08:03:35 2004 From: jjengla at sandia.gov (Josh England) Date: Mon, 13 Dec 2004 08:03:35 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> Message-ID: <1102953815.15045.4.camel@localhost> On Mon, 2004-12-13 at 14:56 -0800, Woodruff, Robert J wrote: > I have other stacks that work fine on the same systems with the same > hardware This is what I'm seeing as well. I'm doing IPoIB over PCI-E just fine with VAPI 3.2. Could there be some extra (software) initialization stuff for PCI-E cards that needs to be done? -JE From roland at topspin.com Mon Dec 13 16:14:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 16:14:54 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> (Hal Rosenstock's message of "Mon, 13 Dec 2004 22:19:09 +0200") References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> Message-ID: <52sm694txd.fsf@topspin.com> Hal> The latest I-D is now Hal> http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Thanks, I'll correct this. Hal> Also, isn't DHCP over IB Hal> (http://www.ietf.org/internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt) Hal> also supported ? If so, is that part of this or some other Hal> patch being submitted ? DHCP should work but I don't think it's a kernel issue (I don't think kernel DHCP for NFS root over IPoIB will work unfortunately). - R. From roland at topspin.com Mon Dec 13 16:17:24 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 16:17:24 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> (Robert J. Woodruff's message of "Mon, 13 Dec 2004 14:56:35 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> Message-ID: <52oegx4tt7.fsf@topspin.com> Robert> I have other stacks that work fine on the same systems Robert> with the same hardware so I suppose that mthca could be Robert> doing something wrong. Seems like there must be an mthca bug -- unfortunately I don't have a PCI-e system set up to debug IPoIB so it may be a while before I can debug this (unless someone can get more details about why IPoIB is failing) - R. From roland at topspin.com Mon Dec 13 17:05:24 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 17:05:24 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <52oegx4tt7.fsf@topspin.com> (Roland Dreier's message of "Mon, 13 Dec 2004 16:17:24 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> <52oegx4tt7.fsf@topspin.com> Message-ID: <527jnl4rl7.fsf@topspin.com> By the way, does anyone have a fabric with both PCI-X and PCIe HCAs? It would be interesting to see if a PCI-X HCA receives ARPs sent by a PCIe HCA (and also test the other direction). - R. From hnrose at earthlink.net Mon Dec 13 17:29:32 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Mon, 13 Dec 2004 20:29:32 -0500 Subject: [openib-general] [PATCH] [RFC] new test directory + test codefor MAD snooping References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> <41BE2BAF.10506@ichips.intel.com> Message-ID: <003101c4e17c$5d6c7600$6501a8c0@comcast.net> Sean Hefty wrote: > Does anyone care where this code gets checked into? 
Right now I'm > leaning on putting it in gen2/test or gen2/utils to keep from > polluting gen2/trunk. I think that's a good idea (not to mix this into trunk). Still not sure about what is the best way for test kernel modules. Are there any examples anyone is aware of ? -- Hal From yoshfuji at linux-ipv6.org Mon Dec 13 18:30:23 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Tue, 14 Dec 2004 11:30:23 +0900 (JST) Subject: [openib-general] Re: [PATCH][v3][16/21] IPoIB IPv6 support In-Reply-To: <20041213109.iziHvQZqtmP83gmx@topspin.com> References: <20041213109.5NKezuGE9PMejMSM@topspin.com> <20041213109.iziHvQZqtmP83gmx@topspin.com> Message-ID: <20041214.113023.65866789.yoshfuji@linux-ipv6.org> In article <20041213109.iziHvQZqtmP83gmx at topspin.com> (at Mon, 13 Dec 2004 10:09:46 -0800), Roland Dreier says: > } > @@ -1794,6 +1801,7 @@ > if ((dev->type != ARPHRD_ETHER) && > (dev->type != ARPHRD_FDDI) && > (dev->type != ARPHRD_IEEE802_TR) && > + (dev->type != ARPHRD_INFINIBAND) && > (dev->type != ARPHRD_ARCNET)) { > /* Alas, we support only Ethernet autoconfiguration. */ > return; Please put ARPHRD_INFINIBAND after ARPHRD_ARCNET like other places. --yoshfuji From roland at topspin.com Mon Dec 13 18:36:16 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 18:36:16 -0800 Subject: [openib-general] Re: [PATCH][v3][16/21] IPoIB IPv6 support In-Reply-To: <20041214.113023.65866789.yoshfuji@linux-ipv6.org> (YOSHIFUJI Hideaki's message of "Tue, 14 Dec 2004 11:30:23 +0900 (JST)") References: <20041213109.5NKezuGE9PMejMSM@topspin.com> <20041213109.iziHvQZqtmP83gmx@topspin.com> <20041214.113023.65866789.yoshfuji@linux-ipv6.org> Message-ID: <52y8g138tb.fsf@topspin.com> YOSHIFUJI> Please put ARPHRD_INFINIBAND after ARPHRD_ARCNET like YOSHIFUJI> other places. Thanks, I've made this update. Thanks, Roland From yoshfuji at linux-ipv6.org Mon Dec 13 18:42:59 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Tue, 14 Dec 2004 11:42:59 +0900 (JST) Subject: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <200412131010.qyAMW5NxoiM4CntC@topspin.com> References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> Message-ID: <20041214.114259.16194522.yoshfuji@linux-ipv6.org> In article <200412131010.qyAMW5NxoiM4CntC at topspin.com> (at Mon, 13 Dec 2004 10:10:04 -0800), Roland Dreier says: > Add files to Documentation/infiniband that describe the tree under > /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. > > Signed-off-by: Roland Dreier LICENSE.TXT is missing. Please add one. -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From roland at topspin.com Mon Dec 13 18:58:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 18:58:38 -0800 Subject: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <20041214.114259.16194522.yoshfuji@linux-ipv6.org> (YOSHIFUJI Hideaki's message of "Tue, 14 Dec 2004 11:42:59 +0900 (JST)") References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> Message-ID: <52u0qp37s1.fsf@topspin.com> YOSHIFUJI> LICENSE.TXT is missing. Please add one. I think it would be even better to rewrite the copyright header so it doesn't refer to external files. I will see if this is possible. 
Thanks, Roland From roland at topspin.com Mon Dec 13 19:02:40 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 19:02:40 -0800 Subject: Better kernel license (was Re: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files) In-Reply-To: <52u0qp37s1.fsf@topspin.com> (Roland Dreier's message of "Mon, 13 Dec 2004 18:58:38 -0800") References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> Message-ID: <52pt1d37lb.fsf_-_@topspin.com> How would everyone feel about replacing the copyright notices in the kernel source files with the following? I think it says exactly the same thing, just without referring to any external URLs. - Roland /* * Copyright (c) 2004 Company 1. All rights reserved. * Copyright (c) 2004 Company 2. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * $Id$ */ From yoshfuji at linux-ipv6.org Mon Dec 13 19:16:46 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Tue, 14 Dec 2004 12:16:46 +0900 (JST) Subject: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <52u0qp37s1.fsf@topspin.com> References: <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> Message-ID: <20041214.121646.42527619.yoshfuji@linux-ipv6.org> In article <52u0qp37s1.fsf at topspin.com> (at Mon, 13 Dec 2004 18:58:38 -0800), Roland Dreier says: > YOSHIFUJI> LICENSE.TXT is missing. Please add one. > > I think it would be even better to rewrite the copyright header so it > doesn't refer to external files. I will see if this is possible. Yes, I agree. 
--yoshfuji From tduffy at sun.com Mon Dec 13 19:16:55 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 13 Dec 2004 19:16:55 -0800 Subject: Better kernel license (was Re: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files) In-Reply-To: <52pt1d37lb.fsf_-_@topspin.com> References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> <52pt1d37lb.fsf_-_@topspin.com> Message-ID: <1102994215.1054.5.camel@duffman> On Mon, 2004-12-13 at 19:02 -0800, Roland Dreier wrote: > How would everyone feel about replacing the copyright notices in the > kernel source files with the following? I like this idea. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From shaharf at voltaire.com Tue Dec 14 00:40:11 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 14 Dec 2004 10:40:11 +0200 Subject: [openib-general] Re: User MAD support for cancel MAD Message-ID: > > Are you saying to have other forms of MAD like cancel by mgmt_class, > cancel > > by SGID and/or SLID in addition to cancel by TID ? We could do this if > there > > is a need. I don't think that need has been demonstrated (at least yet). > > No - I'm not suggesting that we have more ways to cancel a MAD. I was > simply pointing out that we _could_ have cancel_mad work based on TID + > mgmt_class + SGID/SLID (which would match uniqueness as defined by the > spec), rather than just TID. > > I think canceling MADs based on the wr_id is sufficient. > > - Sean Sean, I totally agree; I too think canceling MADs should be based on wr_id, but that requires exposing the wr_id to the user-mode level. This may overcome some problems that may result from using TIDs. Shahar From roland at topspin.com Tue Dec 14 08:34:26 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 08:34:26 -0800 Subject: [openib-general] Re: [PATCH] define SMP attributes in ib_smi.h In-Reply-To: <20041213153711.2cde885f.mshefty@ichips.intel.com> (Sean Hefty's message of "Mon, 13 Dec 2004 15:37:11 -0800") References: <20041213153711.2cde885f.mshefty@ichips.intel.com> Message-ID: <52u0qo260d.fsf@topspin.com> The mthca pieces seem fine to me. Once the header changes are in place I will commit the mthca stuff. Thanks, Roland From halr at voltaire.com Tue Dec 14 08:31:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2004 11:31:34 -0500 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver In-Reply-To: <52sm694txd.fsf@topspin.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> <52sm694txd.fsf@topspin.com> Message-ID: <1103041894.4127.10.camel@localhost.localdomain> On Mon, 2004-12-13 at 19:14, Roland Dreier wrote: > Hal> Also, isn't DHCP over IB > Hal> (http://www.ietf.org/internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt) > Hal> also supported ? If so, is that part of this or some other > Hal> patch being submitted ? > > DHCP should work but I don't think it's a kernel issue Where is Linux DHCP kept ? Are the IB changes incorporated into the DHCP source package ? If so, what version is needed ? > (I don't think kernel DHCP for NFS root over IPoIB will work unfortunately). Does root over NFS only work with RARP or BOOTP and not DHCP ? Is this because DHCP (client) is not in the kernel ? Thanks.
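As background for the client-identifier point raised in the replies that follow, DHCP option 61 has the general layout sketched below (per RFC 2132); the IPoIB-specific identifier contents are defined in the draft cited above and are not reproduced here:

#include <stdint.h>

/*
 * On-the-wire layout of DHCP option 61 (client-identifier).  IPoIB
 * cannot fit its 20-byte link-layer address into the fixed 16-byte
 * chaddr field, which is why the client identifier (and the broadcast
 * flag) matter for DHCP over IPoIB.
 */
struct dhcp_client_id_option {
        uint8_t code;   /* always 61 */
        uint8_t len;    /* 1 (type byte) + identifier length */
        uint8_t type;   /* hardware type, or 0 for a non-hardware id */
        uint8_t id[];   /* client identifier bytes */
} __attribute__((packed));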
-- Hal From robert.j.woodruff at intel.com Tue Dec 14 08:49:33 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 14 Dec 2004 08:49:33 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> Roland >Seems like there must be an mthca bug -- unfortunately I don't have a >PCI-e system set up to debug IPoIB so it may be a while before I can >debug this (unless someone can get more details about why IPoIB is failing) > - R. If I get some time, I will try to investigate further. Rolland, to turn on the debug tracing in IPoIB, I just need to enable the IPoIB debug options in .CONFIG, Right ? Do I need to pass any parameters when I load the driver ? Looks like there is something called debug_level, mcast_debug_level, and data_debug_level in ipoib.h. woody From halr at voltaire.com Tue Dec 14 08:48:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2004 11:48:13 -0500 Subject: [openib-general] Re: [PATCH] define SMP attributes in ib_smi.h In-Reply-To: <52u0qo260d.fsf@topspin.com> References: <20041213153711.2cde885f.mshefty@ichips.intel.com> <52u0qo260d.fsf@topspin.com> Message-ID: <1103042893.4127.12.camel@localhost.localdomain> On Tue, 2004-12-14 at 11:34, Roland Dreier wrote: > The mthca pieces seem fine to me. Once the header changes are in > place I will commit the mthca stuff. Header changes (ib_smi.h and ib_mad.h) are now committed. -- Hal From roland at topspin.com Tue Dec 14 08:52:38 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 08:52:38 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> (Robert J. Woodruff's message of "Tue, 14 Dec 2004 08:49:33 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> Message-ID: <52mzwg2561.fsf@topspin.com> Robert> If I get some time, I will try to investigate further. Robert> Rolland, to turn on the debug tracing in IPoIB, I just Robert> need to enable the IPoIB debug options in .CONFIG, Right ? Robert> Do I need to pass any parameters when I load the driver ? Robert> Looks like there is something called debug_level, Robert> mcast_debug_level, and data_debug_level in ipoib.h. Turn on the CONFIG options, then you can turn on debugging by setting any or all of the above module parameters to 1 (also can be changed with the module loaded through /sys/module/ib_ipoib). - R. From halr at voltaire.com Tue Dec 14 09:00:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2004 12:00:52 -0500 Subject: [openib-general] [PATCH] IPoIB FAQ Message-ID: <1103043652.4127.16.camel@localhost.localdomain> Adds more information on enabling debug for IPoIB Index: ipoib_faq.txt =================================================================== --- ipoib_faq.txt (revision 1325) +++ ipoib_faq.txt (working copy) @@ -37,6 +37,13 @@ mount -t ipoib_debugfs none /ipoib_debufs/ cat /ipoib_debugfs/ib0_mcg +There are 3 module parameters for IPoIB debug: +debug_level, mcast_debug_level, and data_debug_level. +If both CONFIG options have been enabled, debugging is +turned on by setting the ib_ipoib module parameters to 1. +(This can also be changed with the module loaded through +/sys/module/ib_ipoib). + Other things to verify and supply to help isolate the problem: 1. 
Verify the firmware version via From tduffy at sun.com Tue Dec 14 09:10:47 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 14 Dec 2004 09:10:47 -0800 Subject: [openib-general] [PATCH][v3][5/21] Add InfiniBand MAD (management datagram) support In-Reply-To: <20041213109.3tK6alRLJABxH4bu@topspin.com> References: <20041213109.3tK6alRLJABxH4bu@topspin.com> Message-ID: <1103044247.1054.11.camel@duffman> On Mon, 2004-12-13 at 10:09 -0800, Roland Dreier wrote: > Add support for handling InfiniBand MADs (management datagrams), > including sending and receiving MADs as well as passing MADs on to > local agents. > > This is required for an SM (subnet manager) to discover and configure > the host, since the SM's query MADs must be passed to the local SMA > (subnet management agent). In addition, this support is used by upper > level protocols to send queries to and receive responses from the SM. This one didn't make it to lkml. I think it was over the size limit. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Tue Dec 14 09:42:25 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 09:42:25 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver In-Reply-To: <1103041894.4127.10.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Dec 2004 11:31:34 -0500") References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> <52sm694txd.fsf@topspin.com> <1103041894.4127.10.camel@localhost.localdomain> Message-ID: <52is7422v2.fsf@topspin.com> Hal> Where is Linux DHCP kept ? Are the IB changes incorporated Hal> into the DHCP source package ? If so, what version is needed ? There are lots of Linux DHCP client packages. I'm not sure if any of them will work with IPoIB. Hal> Does root over NFS only work with RARP or BOOTP and not DHCP Hal> ? Is this because DHCP (cleint) is not in the kernel ? No, there is DHCP client support in the kernel. I don't think it supports client ID though, so it won't work with IPoIB. - Roland From roland at topspin.com Tue Dec 14 09:43:01 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 09:43:01 -0800 Subject: [openib-general] [PATCH][v3][5/21] Add InfiniBand MAD (management datagram) support In-Reply-To: <1103044247.1054.11.camel@duffman> (Tom Duffy's message of "Tue, 14 Dec 2004 09:10:47 -0800") References: <20041213109.3tK6alRLJABxH4bu@topspin.com> <1103044247.1054.11.camel@duffman> Message-ID: <52ekhs22u2.fsf@topspin.com> Tom> This one didn't make it to lkml. I think it was over the Tom> size limit. Ugh, looks like it was just over 100K. I'll split it next time. - R. From roland at topspin.com Tue Dec 14 09:44:59 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 09:44:59 -0800 Subject: [openib-general] Re: [PATCH] define SMP attributes in ib_smi.h In-Reply-To: <1103042893.4127.12.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Dec 2004 11:48:13 -0500") References: <20041213153711.2cde885f.mshefty@ichips.intel.com> <52u0qo260d.fsf@topspin.com> <1103042893.4127.12.camel@localhost.localdomain> Message-ID: <52acsg22qs.fsf@topspin.com> I've committed the mthca piece as well. By the way, in the future please include a Signed-off-by: line. 
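For readers without the tree handy, the kind of definitions this ib_smi.h patch adds look roughly like the following; the attribute values come from the IB Architecture Specification, while the macro names here are illustrative and the in-tree definitions may differ (for example by storing the values in network byte order):

/* Common SM attribute IDs (values per the IBA spec); names are
 * illustrative and may not match ib_smi.h exactly. */
enum {
        IB_SMP_ATTR_NOTICE               = 0x0002,
        IB_SMP_ATTR_NODE_DESC            = 0x0010,
        IB_SMP_ATTR_NODE_INFO            = 0x0011,
        IB_SMP_ATTR_SWITCH_INFO          = 0x0012,
        IB_SMP_ATTR_GUID_INFO            = 0x0014,
        IB_SMP_ATTR_PORT_INFO            = 0x0015,
        IB_SMP_ATTR_PKEY_TABLE           = 0x0016,
        IB_SMP_ATTR_SL_TO_VL_TABLE       = 0x0017,
        IB_SMP_ATTR_VL_ARB_TABLE         = 0x0018,
        IB_SMP_ATTR_LINEAR_FORWARD_TABLE = 0x0019,
        IB_SMP_ATTR_SM_INFO              = 0x0020,
        IB_SMP_ATTR_LED_INFO             = 0x0031,
};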
Thanks, Roland From mshefty at ichips.intel.com Tue Dec 14 09:48:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Dec 2004 09:48:18 -0800 Subject: [openib-general] Re: [PATCH] define SMP attributes in ib_smi.h In-Reply-To: <52acsg22qs.fsf@topspin.com> References: <20041213153711.2cde885f.mshefty@ichips.intel.com> <52u0qo260d.fsf@topspin.com> <1103042893.4127.12.camel@localhost.localdomain> <52acsg22qs.fsf@topspin.com> Message-ID: <41BF2762.6060304@ichips.intel.com> Roland Dreier wrote: > I've committed the mthca piece as well. By the way, in the future > please include a Signed-off-by: line. Thanks for the reminder. I've made a note to _try_ to remind myself next time. (Feel free to reject the patch next time if I don't include it.) - Sean From robert.j.woodruff at intel.com Tue Dec 14 10:00:55 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 14 Dec 2004 10:00:55 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8F7B@orsmsx408> >By the way, does anyone have a fabric with both PCI-X and PCIe HCAs? >It would be interesting to see if a PCI-X HCA receives ARPs sent by a >PCIe HCA (and also test the other direction). > - R. Yes. I have 2 nodes that are PCI-X and 2 nodes that are PCI-E on the same fabric. From mshefty at ichips.intel.com Tue Dec 14 12:05:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Dec 2004 12:05:43 -0800 Subject: [openib-general] [PATCH] [RFC] new test directory + test codefor MAD snooping In-Reply-To: <003101c4e17c$5d6c7600$6501a8c0@comcast.net> References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> <41BE2BAF.10506@ichips.intel.com> <003101c4e17c$5d6c7600$6501a8c0@comcast.net> Message-ID: <41BF4797.8030007@ichips.intel.com> Hal Rosenstock wrote: > Sean Hefty wrote: > >>Does anyone care where this code gets checked into? Right now I'm >>leaning on putting it in gen2/test or gen2/utils to keep from >>polluting gen2/trunk. > > > I think that's a good idea (not to mix this into trunk). Still not sure > about what is the best way for test kernel modules. Are there any > examples anyone is aware of ? I've pushed in the madeye code under gen2/utils. Longer term I think we'll eventually have better testing utilities, but for now I think that sharing some of these component tests is useful for ensuring that we have a basic level of functionality. - Sean From halr at voltaire.com Tue Dec 14 13:45:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Dec 2004 16:45:23 -0500 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com><52sm694txd.fsf@topspin.com><1103041894.4127.10.camel@localhost.localdomain> <52is7422v2.fsf@topspin.com> Message-ID: <00a201c4e226$3bf20d50$6601a8c0@Gripen> Roland Dreier wrote: > Hal> Does root over NFS only work with RARP or BOOTP and not DHCP > Hal> ? Is this because DHCP (cleint) is not in the kernel ? > > No, there is DHCP client support in the kernel. I don't think it > supports client ID though, so it won't work with IPoIB. Right, it looks like net/ipv4/addrconf.c does not currently support option 61 nor does it support the setting of the broadcast flag in DHCPDISCOVER and DHCPREQUEST messages. -- Hal From mst at mellanox.co.il Tue Dec 14 22:09:20 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 15 Dec 2004 08:09:20 +0200 Subject: [openib-general] A Couple of verbs/mthca/firmware questions In-Reply-To: <004001c4defc$a9f2c100$6501a8c0@comcast.net> References: <004001c4defc$a9f2c100$6501a8c0@comcast.net> Message-ID: <20041215060920.GA11501@mellanox.co.il> Hello! Here's the answer from our firmware developers: Quoting r. Hal Rosenstock (hnrose at earthlink.net) "[openib-general] A Couple of verbs/mthca/firmware questions": > Hi, > > I have a couple of questions relative to verbs/mthca/firmware: > > 1. On received packets, the LRH is not visible but there are some WC > fields set. One of those fields is dlid_path_bits. dlid_path_bits seems > to be the same regardless of whether the incoming DR SMP was sent to the > actual DLID or the permissive DLID. Is there a way to distinguish these > cases ? > 1. dlid_path_bits are the 0-7 LSB from the DLID in LRH. permissive LID is not distinguished. > 2. If a status is set in the MAD header and a post_send is issued on > that MAD, should that status be sent in the packet ? Is there any > conversion of the status field which might occur ? 2. no conversion and no modification is done to the MAD. MST From mst at mellanox.co.il Wed Dec 15 04:58:25 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 14:58:25 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support Message-ID: <20041215125825.GA12480@mellanox.co.il> Hello! I'd like to start working on MT25218 (aka Arbel memfree, aka native mode) support for openib gen2. There are many similiarities with MT25208, but there are also major differences, and a lot of extra setup to be done to compensate for the absence of on-board memory, which unfortunately affects not only the set-up commands but also performance-critical data path verbs (e.g. posting WQEs, CQ polling). What would be the best way to organise this code? I guess could be two thinkable ways to do this: 1. Add the functionality to the mthca module, use a run-time flag to describe hardware differences. Thus, the same module will handle devices of all types. 2. Create a sibling or a child directory to mthca, and export common functions from mthca. In this case we will have two kernel modules, sharing some common files. I am leaning towards option 2 for both the performance reasons and for better separation between drivers. Therefore, I propose adding a directory memfree under https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/hw/mthca/memfree and all memfree specific code will go there. Comments welcome, MST From halr at voltaire.com Wed Dec 15 05:05:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 08:05:33 -0500 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215125825.GA12480@mellanox.co.il> References: <20041215125825.GA12480@mellanox.co.il> Message-ID: <1103115933.4126.53.camel@localhost.localdomain> Hi, On Wed, 2004-12-15 at 07:58, Michael S. Tsirkin wrote: > Hello! > I'd like to start working on MT25218 (aka Arbel memfree, aka native mode) > support for openib gen2. > > There are many similiarities with MT25208, but there are also > major differences, and a lot of extra setup to be done > to compensate for the absence of on-board memory, > which unfortunately affects not only the set-up > commands but also performance-critical data path verbs > (e.g. posting WQEs, CQ polling). > > What would be the best way to organise this code? > I guess could be two thinkable ways to do this: > > 1. 
Add the functionality to the mthca module, > use a run-time flag to describe hardware differences. > Thus, the same module will handle devices of all > types. > > 2. Create a sibling or a child directory to mthca, > and export common functions from mthca. > In this case we will have two kernel modules, > sharing some common files. > > I am leaning towards option 2 for both the performance > reasons and for better separation between drivers. > > > Therefore, I propose adding a directory memfree under > > https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/hw/mthca/memfree > > and all memfree specific code will go there. There already is some Arbel native code in mthca. I believe the path being taken is more in lines with approach 1. As Roland is the maintainer of this, you would need to coordinate with him. I would expect we can utilize your offer to accelerate Arbel native support, which a number of people have asked about. Also, with this work, would Tavor also be made to work with DDR hidden ? -- Hal From mst at mellanox.co.il Wed Dec 15 05:24:59 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 15:24:59 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <1103115933.4126.53.camel@localhost.localdomain> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> Message-ID: <20041215132459.GB12480@mellanox.co.il> Hello! Quoting r. Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > Hi, > > On Wed, 2004-12-15 at 07:58, Michael S. Tsirkin wrote: > > Hello! > > I'd like to start working on MT25218 (aka Arbel memfree, aka native mode) > > support for openib gen2. > > > > There are many similiarities with MT25208, but there are also > > major differences, and a lot of extra setup to be done > > to compensate for the absence of on-board memory, > > which unfortunately affects not only the set-up > > commands but also performance-critical data path verbs > > (e.g. posting WQEs, CQ polling). > > > > What would be the best way to organise this code? > > I guess could be two thinkable ways to do this: > > > > 1. Add the functionality to the mthca module, > > use a run-time flag to describe hardware differences. > > Thus, the same module will handle devices of all > > types. > > > > 2. Create a sibling or a child directory to mthca, > > and export common functions from mthca. > > In this case we will have two kernel modules, > > sharing some common files. > > > > I am leaning towards option 2 for both the performance > > reasons and for better separation between drivers. > > > > > > Therefore, I propose adding a directory memfree under > > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/hw/mthca/memfree > > > > and all memfree specific code will go there. > > There already is some Arbel native code in mthca. I believe the path > being taken is more in lines with approach 1. As Roland is the > maintainer of this, you would need to coordinate with him. I would > expect we can utilize your offer to accelerate Arbel native support, > which a number of people have asked about. OK. > Also, with this work, would Tavor also be made to work with DDR hidden ? > > -- Hal > Thats an unrelated issue. With DDR hidden AFAIK the only difference is that FMRs dont work since you have to use a command to access the internal DDR. Do you know of any issues with that? 
MST From halr at voltaire.com Wed Dec 15 05:35:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 08:35:06 -0500 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215132459.GB12480@mellanox.co.il> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> <20041215132459.GB12480@mellanox.co.il> Message-ID: <1103117706.4126.93.camel@localhost.localdomain> On Wed, 2004-12-15 at 08:24, Michael S. Tsirkin wrote: > Thats an unrelated issue. > > With DDR hidden AFAIK the only difference is > that FMRs dont work since you have to use a command to access the > internal DDR. > > Do you know of any issues with that? Currently, the AV needs to be read when sending UD: int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header) if (ah->on_hca) return -EINVAL; on_hca is set in mthca_create_ah when DDR is hidden -- Hal From mst at mellanox.co.il Wed Dec 15 05:47:43 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 15:47:43 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <1103117706.4126.93.camel@localhost.localdomain> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> <20041215132459.GB12480@mellanox.co.il> <1103117706.4126.93.camel@localhost.localdomain> Message-ID: <20041215134743.GA12584@mellanox.co.il> Hello! Quoting r. Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > On Wed, 2004-12-15 at 08:24, Michael S. Tsirkin wrote: > > Thats an unrelated issue. > > > > With DDR hidden AFAIK the only difference is > > that FMRs dont work since you have to use a command to access the > > internal DDR. > > > > Do you know of any issues with that? > > Currently, the AV needs to be read when sending UD: > > int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, > struct ib_ud_header *header) > if (ah->on_hca) > return -EINVAL; > > on_hca is set in mthca_create_ah when DDR is hidden > > -- Hal But ah->on_hca = 0; if (!atomic_read(&pd->sqp_count) && !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { index = mthca_alloc(&dev->av_table.alloc); /* fall back to allocate in host memory */ if (index == -1) goto host_alloc; av = kmalloc(sizeof *av, GFP_KERNEL); if (!av) goto host_alloc; ah->on_hca = 1; ah->avdma = dev->av_table.ddr_av_base + index * MTHCA_AV_SIZE; } So it shouldnt be set when DDR is hidden. mst From halr at voltaire.com Wed Dec 15 05:51:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 08:51:19 -0500 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215134743.GA12584@mellanox.co.il> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> <20041215132459.GB12480@mellanox.co.il> <1103117706.4126.93.camel@localhost.localdomain> <20041215134743.GA12584@mellanox.co.il> Message-ID: <1103118679.4126.111.camel@localhost.localdomain> On Wed, 2004-12-15 at 08:47, Michael S. Tsirkin wrote: > But > > ah->on_hca = 0; > So it shouldnt be set when DDR is hidden. Got it backwards: it is when DDR is not hidden. Sorry. 
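To make the conclusion of this exchange explicit, the two excerpts quoted above combine roughly as follows; this is a condensed sketch of mthca_create_ah() and mthca_read_ah(), not the complete functions:

/* Condensed from the mthca_create_ah() excerpt quoted above. */
ah->on_hca = 0;
if (!atomic_read(&pd->sqp_count) &&
    !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) {
        /*
         * DDR visible: try to place the AV in on-board memory,
         * falling back to host memory if the allocation fails.
         */
        ah->on_hca = 1;
}

/*
 * With the DDR hidden the condition above is never true, so on_hca
 * stays 0, the AV lives in host memory, and the
 *
 *      if (ah->on_hca)
 *              return -EINVAL;
 *
 * check in mthca_read_ah() does not fire for UD sends.
 */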
-- Hal From halr at voltaire.com Wed Dec 15 06:56:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 09:56:46 -0500 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1103122605.4349.7.camel@localhost.localdomain> Dec 15 09:46:01 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000104 Dec 15 09:46:01 localhost kernel: printing eip: Dec 15 09:46:01 localhost kernel: d093f90d Dec 15 09:46:01 localhost kernel: *pde = 0ba10067 Dec 15 09:46:01 localhost kernel: *pte = 00000000 Dec 15 09:46:01 localhost kernel: Oops: 0002 [#1] Dec 15 09:46:01 localhost kernel: Modules linked in: ib_ipoib ib_sa ib_mthca ib_mad ib_core loop ide_cd cdrom lp ipv6 autofs parport_pc parport uhci_hcd ehci_hcd ohci_hcd eepro100 mii evdev usbcore Dec 15 09:46:01 localhost kernel: CPU: 0 Dec 15 09:46:01 localhost kernel: EIP: 0060:[] Not tainted VLI Dec 15 09:46:01 localhost kernel: EFLAGS: 00010202 (2.6.9) Dec 15 09:46:01 localhost kernel: EIP is at path_rec_completion+0x2ad/0x450 [ib_ipoib] Dec 15 09:46:01 localhost kernel: eax: 00000100 ebx: 00000000 ecx: ca39be14 edx: cb4ebb60 Dec 15 09:46:01 localhost kernel: esi: 00000000 edi: cffc6440 ebp: c5f12a80 esp: ca39bdac Dec 15 09:46:01 localhost kernel: ds: 007b es: 007b ss: 0068 Dec 15 09:46:01 localhost kernel: Process ib_mad1 (pid: 8938, threadinfo=ca39a000 task=cfef7160) Dec 15 09:46:01 localhost kernel: Stack: 01000002 00000000 01000004 5d000000 00000286 c126f3a0 01000003 80000000 Dec 15 09:46:01 localhost kernel: 00000282 00000000 02000000 ce19d000 0c0fc060 cef3dce0 00000282 cc0fc060 Dec 15 09:46:01 localhost kernel: 00000206 cef3dce0 0c0fc060 ca39be50 c0108cd4 00000005 ce19d000 ca39be50 Dec 15 09:46:01 localhost kernel: Call Trace: Dec 15 09:46:01 localhost kernel: [] handle_IRQ_event+0x34/0x70 Dec 15 09:46:01 localhost kernel: [] mthca_destroy_ah+0x5c/0x60 [ib_mthca] Dec 15 09:46:01 localhost kernel: [] ib_sa_path_rec_callback+0x5a/0x90 [ib_sa] Dec 15 09:46:01 localhost kernel: [] queue_delayed_work+0x55/0x80 Dec 15 09:46:01 localhost kernel: [] ib_mad_send_done_handler+0x13b/0x230 [ib_mad] Dec 15 09:46:01 localhost kernel: [] send_handler+0x10b/0x350 [ib_sa] Dec 15 09:46:01 localhost kernel: [] schedule+0x306/0x5c0 Dec 15 09:46:01 localhost kernel: [] timeout_sends+0x113/0x2f0 [ib_mad] Dec 15 09:46:01 localhost kernel: [] timeout_sends+0x0/0x2f0 [ib_mad] Dec 15 09:46:01 localhost kernel: [] worker_thread+0x251/0x470 Dec 15 09:46:01 localhost kernel: [] default_wake_function+0x0/0x20 Dec 15 09:46:01 localhost kernel: [] default_wake_function+0x0/0x20 Dec 15 09:46:01 localhost kernel: [] worker_thread+0x0/0x470 Dec 15 09:46:01 localhost kernel: [] kthread+0xaa/0xb0 Dec 15 09:46:01 localhost kernel: [] kthread+0x0/0xb0 Dec 15 09:46:01 localhost kernel: [] kernel_thread_helper+0x5/0x10 Dec 15 09:46:01 localhost kernel: Code: ff 74 24 20 9d 89 f6 8d bc 27 00 00 00 00 8b 44 24 68 8d 4c 24 68 31 d2 39 c8 74 23 89 c2 8b 00 ff 4c 24 70 c7 42 08 00 00 00 00 <89> 48 04 89 44 24 68 c7 42 04 00 00 00 00 c7 02 00 00 00 00 85 From postmaster at mail.iaai.fr Wed Dec 15 07:03:26 2004 From: postmaster at mail.iaai.fr (postmaster at mail.iaai.fr) Date: Wed, 15 Dec 2004 16:03:26 +0100 Subject: [openib-general] WARNING! Blocked mail [Mail Delivery (failure bruno.ballester@iaai.fr)] Message-ID: Votre Email avec l'objet 'Mail Delivery (failure bruno.ballester at iaai.fr)', envoye au(x) destinataire(s) bruno.ballester at iaai.fr contenait un fichier en attachement de type '[scr]'. 
Notre organisation n'accepte pas ce type de fichier pas email. Ce message n'a PAS ete delivre au(x) destinataire(s) du message. Si vous avez la moindre question a ce sujet, veuillez contacter notre administrateur (mailto:systeme.information/mailvirus at mail.iaai.fr). -- Ce message a ete genere par exiscan (Institut des Applications Avancees de l'Internet - http://www.iaai.fr) From halr at voltaire.com Wed Dec 15 07:29:01 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 10:29:01 -0500 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1103124540.4222.13.camel@localhost.localdomain> Dec 15 10:15:56 localhost kernel: ib0: Unicast, no dst: type 0000, QPN 8060000 20:800:1404:2:0:404:fe80:0 >From ipoib_main.c line 522: /* unicast GID -- should be ARP reply */ if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " IPOIB_GID_FMT "\n", skb->dst ? "neigh" : "dst", be16_to_cpup((u16 *) skb->data), be32_to_cpup((u32 *) phdr->hwaddr), IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); dev_kfree_skb_any(skb); ++priv->stats.tx_dropped; goto out; } So it looks like 0x806 is being put in the QPN rather than type on the remote machine which is an x86_64 (Opteron) machine. Not sure whether this is related to what causes the oops or not. -- Hal From halr at voltaire.com Wed Dec 15 07:32:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 10:32:34 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> Message-ID: <1103124702.4222.16.camel@localhost.localdomain> Hi Woody, One thing I would be sure of is the following: In /var/log/messages (or via dmesg), make sure there are no oops (that all processes needed are running). I have seen (just reported) what I think is an alignment issue on x64_64 which is what I think you are using. -- Hal From roland at topspin.com Wed Dec 15 07:51:17 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 07:51:17 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <52mzwg2561.fsf@topspin.com> (Roland Dreier's message of "Tue, 14 Dec 2004 08:52:38 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> Message-ID: <52ekhry2yy.fsf@topspin.com> OK, I think the following change (already committed) should fix IPoIB with PCI Express HCAs. Thanks to Josh England for giving me access to a couple of his machines for debugging, and to Tziporet Koren for pointing out this Arbel "feature." - Roland Index: infiniband/hw/mthca/mthca_av.c =================================================================== --- infiniband/hw/mthca/mthca_av.c (revision 1331) +++ infiniband/hw/mthca/mthca_av.c (working copy) @@ -97,6 +97,9 @@ cpu_to_be32((ah_attr->grh.traffic_class << 20) | ah_attr->grh.flow_label); memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } else { + /* Arbel workaround -- low byte of GID must be 2 */ + av->dgid[3] = cpu_to_be32(2); } if (0) { From roland at topspin.com Wed Dec 15 07:53:16 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 07:53:16 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215125825.GA12480@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 15 Dec 2004 14:58:25 +0200") References: <20041215125825.GA12480@mellanox.co.il> Message-ID: <52acsfy2vn.fsf@topspin.com> Michael> What would be the best way to organise this code? I Michael> guess could be two thinkable ways to do this: Michael> 1. Add the functionality to the mthca module, use a Michael> run-time flag to describe hardware differences. Thus, Michael> the same module will handle devices of all types. I've already started adding some memfree support ot the mthca module (following this approach). So far the code does initialization through MAP_FA and RUN_FW, and calls QUERY_DEV_LIM. By the way, there's no need for a runtime flag, the PCI device ID is sufficient to distinguish Tavor mode from Arbel mode. - Roland From roland at topspin.com Wed Dec 15 07:54:01 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 07:54:01 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <1103115933.4126.53.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 08:05:33 -0500") References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> Message-ID: <526533y2ue.fsf@topspin.com> Hal> Also, with this work, would Tavor also be made to work with Hal> DDR hidden ? This should already work. I've been using DDR-hidden mode to work around IBM pSeries OpenFirmware issues for several months now. - Roland From roland at topspin.com Wed Dec 15 07:56:10 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 07:56:10 -0800 Subject: [openib-general] Re: Better kernel license In-Reply-To: <52pt1d37lb.fsf_-_@topspin.com> (Roland Dreier's message of "Mon, 13 Dec 2004 19:02:40 -0800") References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> <52pt1d37lb.fsf_-_@topspin.com> Message-ID: <52vfb3wo6d.fsf@topspin.com> I haven't gotten much response to my initial query. So does anyone (especially copyright holders) object if I change the licenses headers in our kernel source files to the below? If I don't get any complaints, I'll commit this change on Thursday afternoon. Thanks, Roland /* * Copyright (c) 2004 Company 1. All rights reserved. * Copyright (c) 2004 Company 2. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * $Id$ */ From halr at voltaire.com Wed Dec 15 08:02:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 11:02:03 -0500 Subject: [openib-general] removing ib_mthca module Message-ID: <1103126174.4357.10.camel@localhost.localdomain> After loading mthca and IPoIB, running some IPoIB traffic then removing IPoIB module, and mthca, I got the following: Dec 15 10:48:51 localhost kernel: ib0: ib_dealloc_pd failed is back :-( From halr at voltaire.com Wed Dec 15 08:05:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 11:05:16 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52ekhry2yy.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> Message-ID: <1103126523.4357.18.camel@localhost.localdomain> On Wed, 2004-12-15 at 10:51, Roland Dreier wrote: > Thanks to Josh England for giving me access to > a couple of his machines for debugging, and to Tziporet Koren for > pointing out this Arbel "feature." > + /* Arbel workaround -- low byte of GID must be 2 */ > + av->dgid[3] = cpu_to_be32(2); Is this intended as a temporary or permanent workaround ? If temporary, how temporary (is there a timeframe for the "real" fix) and any idea of what it would entail ? -- Hal From roland at topspin.com Wed Dec 15 08:09:11 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 08:09:11 -0800 Subject: [openib-general] removing ib_mthca module In-Reply-To: <1103126174.4357.10.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 11:02:03 -0500") References: <1103126174.4357.10.camel@localhost.localdomain> Message-ID: <52r7lrwnko.fsf@topspin.com> I found some bugs in the new IPoIB code (although it does fix the neigh destructor oops on module removal). I'll be committing fixes later today once I finish testing. - R. From roland at topspin.com Wed Dec 15 08:13:25 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 08:13:25 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1103126523.4357.18.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 11:05:16 -0500") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> Message-ID: <52mzwfwndm.fsf@topspin.com> Hal> Is this intended as a temporary or permanent workaround ? If Hal> temporary, how temporary (is there a timeframe for the "real" Hal> fix) and any idea of what it would entail ? As I understand it, this is a permanent workaround for an Arbel hardware issue. 
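Reading the committed mthca_av.c hunk together with this explanation, the workaround only touches the address vector built for a send without a GRH; a sketch follows, where the guarding GRH test is an assumption since the quoted hunk does not show it:

if (ah_attr->ah_flags & IB_AH_GRH) {
        /* ... GRH fields filled in as in the quoted hunk ... */
        memcpy(av->dgid, ah_attr->grh.dgid.raw, 16);
} else {
        /*
         * No GRH will be sent, so av->dgid never goes on the wire;
         * Arbel merely requires the low byte of this internal field
         * to read 2.  Port GUIDs and GUIDInfo are untouched.
         */
        av->dgid[3] = cpu_to_be32(2);
}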
- Roland From halr at voltaire.com Wed Dec 15 08:16:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 11:16:57 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52mzwfwndm.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> <52mzwfwndm.fsf@topspin.com> Message-ID: <1103127417.4357.25.camel@localhost.localdomain> On Wed, 2004-12-15 at 11:13, Roland Dreier wrote: > As I understand it, this is a permanent workaround for an Arbel > hardware issue. If that is the case, isn't there more than this that needs to be done ? -- Hal From mst at mellanox.co.il Wed Dec 15 08:28:54 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 18:28:54 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <1103118679.4126.111.camel@localhost.localdomain> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> <20041215132459.GB12480@mellanox.co.il> <1103117706.4126.93.camel@localhost.localdomain> <20041215134743.GA12584@mellanox.co.il> <1103118679.4126.111.camel@localhost.localdomain> Message-ID: <20041215162854.GB13160@mellanox.co.il> Hello! Quoting r. Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > On Wed, 2004-12-15 at 08:47, Michael S. Tsirkin wrote: > > But > > > > ah->on_hca = 0; > > > So it shouldnt be set when DDR is hidden. > > Got it backwards: it is when DDR is not hidden. Sorry. > So what's the problem again? MST From roland at topspin.com Wed Dec 15 08:36:04 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 08:36:04 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1103127417.4357.25.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 11:16:57 -0500") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> <52mzwfwndm.fsf@topspin.com> <1103127417.4357.25.camel@localhost.localdomain> Message-ID: <52ekhrwmbv.fsf@topspin.com> Hal> If that is the case, isn't there more than this that needs to Hal> be done ? The docs say that the low byte of the GID has to be set to 2 when a GRH is not being included in the AV. That's what this patch does. Why would there be anything else required? - Roland From mst at mellanox.co.il Wed Dec 15 08:36:20 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 18:36:20 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <52acsfy2vn.fsf@topspin.com> References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> Message-ID: <20041215163620.GC13160@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > Michael> What would be the best way to organise this code? I > Michael> guess could be two thinkable ways to do this: > > Michael> 1. Add the functionality to the mthca module, use a > Michael> run-time flag to describe hardware differences. Thus, > Michael> the same module will handle devices of all types. > > I've already started adding some memfree support ot the mthca module > (following this approach). So far the code does initialization > through MAP_FA and RUN_FW, and calls QUERY_DEV_LIM. Yep, I see that. 
I think its a valid approach - although I thought that it is better to put the memfree code in a separate file. I guess we can always optimise to compile-time checks it later. > By the way, there's no need for a runtime flag, the PCI device ID is > sufficient to distinguish Tavor mode from Arbel mode. > > - Roland How do you mean? I see you use the hca_type flag? MST From roland at topspin.com Wed Dec 15 08:39:15 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 08:39:15 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215163620.GC13160@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 15 Dec 2004 18:36:20 +0200") References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> <20041215163620.GC13160@mellanox.co.il> Message-ID: <526533wm6k.fsf@topspin.com> Roland> By the way, there's no need for a runtime flag, the PCI Roland> device ID is sufficient to distinguish Tavor mode from Roland> Arbel mode. Michael> How do you mean? I see you use the hca_type flag? Maybe I misunderstood you. But all I meant was that there's no requirement to add any more code to check for HCA type or require the user to set a flag. - Roland From halr at voltaire.com Wed Dec 15 08:51:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 11:51:08 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52ekhrwmbv.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> <52mzwfwndm.fsf@topspin.com> <1103127417.4357.25.camel@localhost.localdomain> <52ekhrwmbv.fsf@topspin.com> Message-ID: <1103129468.4224.5.camel@localhost.localdomain> On Wed, 2004-12-15 at 11:36, Roland Dreier wrote: > Hal> If that is the case, isn't there more than this that needs to > Hal> be done ? > > The docs say that the low byte of the GID has to be set to 2 when a > GRH is not being included in the AV. That's what this patch does. > Why would there be anything else required? Isn't the GID the subnet prefix concatenated with the EUI-64 GUID ? If so, I see 2 issues: 1. GUIDInfo should also reflect this. 2. Uniquification of EUI-64s (flag odd ones as errors) -- Hal From mshefty at ichips.intel.com Wed Dec 15 09:01:19 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Dec 2004 09:01:19 -0800 Subject: [openib-general] Re: Better kernel license In-Reply-To: <52vfb3wo6d.fsf@topspin.com> References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> <52pt1d37lb.fsf_-_@topspin.com> <52vfb3wo6d.fsf@topspin.com> Message-ID: <41C06DDF.7010108@ichips.intel.com> Roland Dreier wrote: > I haven't gotten much response to my initial query. So does anyone > (especially copyright holders) object if I change the licenses headers > in our kernel source files to the below? > > If I don't get any complaints, I'll commit this change on Thursday > afternoon. This looks fine to me. - Sean From mst at mellanox.co.il Wed Dec 15 09:20:14 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 15 Dec 2004 19:20:14 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <526533wm6k.fsf@topspin.com> References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> <20041215163620.GC13160@mellanox.co.il> <526533wm6k.fsf@topspin.com> Message-ID: <20041215172014.GB13365@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > Roland> By the way, there's no need for a runtime flag, the PCI > Roland> device ID is sufficient to distinguish Tavor mode from > Roland> Arbel mode. > > Michael> How do you mean? I see you use the hca_type flag? > > Maybe I misunderstood you. But all I meant was that there's no > requirement to add any more code to check for HCA type or require the > user to set a flag. > Yes, clearly, hca_type is sufficient. MST From jjengla at sandia.gov Wed Dec 15 10:22:27 2004 From: jjengla at sandia.gov (England, Joshua J) Date: Wed, 15 Dec 2004 11:22:27 -0700 Subject: [openib-general] IPoIB still not working Message-ID: <42FD903D5ADE644A87701E946196A11373A53B@ES22SNLNT.srn.sandia.gov> Roland, Thanks for taking the time to debug this issue. I'll definitely pound on the stuff and let you know if anything breaks. -JE -----Original Message----- From: openib-general-bounces at openib.org on behalf of Roland Dreier Sent: Wed 12/15/2004 8:51 AM To: Woodruff, Robert J Cc: openib-general at openib.org Subject: Re: [openib-general] IPoIB still not working OK, I think the following change (already committed) should fix IPoIB with PCI Express HCAs. Thanks to Josh England for giving me access to a couple of his machines for debugging, and to Tziporet Koren for pointing out this Arbel "feature." - Roland Index: infiniband/hw/mthca/mthca_av.c =================================================================== --- infiniband/hw/mthca/mthca_av.c (revision 1331) +++ infiniband/hw/mthca/mthca_av.c (working copy) @@ -97,6 +97,9 @@ cpu_to_be32((ah_attr->grh.traffic_class << 20) | ah_attr->grh.flow_label); memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } else { + /* Arbel workaround -- low byte of GID must be 2 */ + av->dgid[3] = cpu_to_be32(2); } if (0) { _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From libor at topspin.com Wed Dec 15 10:35:32 2004 From: libor at topspin.com (Libor Michalek) Date: Wed, 15 Dec 2004 10:35:32 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <52acsfy2vn.fsf@topspin.com>; from roland@topspin.com on Wed, Dec 15, 2004 at 07:53:16AM -0800 References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> Message-ID: <20041215103532.A26487@topspin.com> On Wed, Dec 15, 2004 at 07:53:16AM -0800, Roland Dreier wrote: > Michael> What would be the best way to organise this code? I > Michael> guess could be two thinkable ways to do this: > > Michael> 1. Add the functionality to the mthca module, use a > Michael> run-time flag to describe hardware differences. Thus, > Michael> the same module will handle devices of all types. > > By the way, there's no need for a runtime flag, the PCI device ID is > sufficient to distinguish Tavor mode from Arbel mode. 
Roland, How is the determination whether to use memfree mode or onboard memory going to be made for Arbel mode? -Libor From robert.j.woodruff at intel.com Wed Dec 15 10:58:57 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 15 Dec 2004 10:58:57 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030F1E67@orsmsx408> Hal> > + /* Arbel workaround -- low byte of GID must be 2 */ > + av->dgid[3] = cpu_to_be32(2); >Is this intended as a temporary or permanent workaround ? If temporary, >how temporary (is there a timeframe for the "real" fix) and any idea of >what it would entail ? >-- Hal Thanks. This fixed my problem with the PCI-E cards also. woody From roland at topspin.com Wed Dec 15 11:19:43 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 11:19:43 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1103129468.4224.5.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 11:51:08 -0500") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> <52mzwfwndm.fsf@topspin.com> <1103127417.4357.25.camel@localhost.localdomain> <52ekhrwmbv.fsf@topspin.com> <1103129468.4224.5.camel@localhost.localdomain> Message-ID: <521xdrwer4.fsf@topspin.com> Hal> Isn't the GID the subnet prefix concatenated with the EUI-64 GUID ? Hal> If so, I see 2 issues: 1. GUIDInfo should also reflect this. Hal> 2. Uniquification of EUI-64s (flag odd ones as errors) This change/workaround only affects the value put into an internal HCA data structure when a GRH/GID is _not_ being sent. - Roland From roland at topspin.com Wed Dec 15 11:20:54 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 11:20:54 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215103532.A26487@topspin.com> (Libor Michalek's message of "Wed, 15 Dec 2004 10:35:32 -0800") References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> <20041215103532.A26487@topspin.com> Message-ID: <52wtvjv04p.fsf@topspin.com> Libor> How is the determination whether to use memfree mode or Libor> onboard memory going to be made for Arbel mode? As I understand it, only memfree mode is supported for Arbel mode. However, if onboard memory is ever supported, I think we would just have the driver prefer using HCA-attached memory to system memory. - R. From robert.j.woodruff at intel.com Wed Dec 15 11:35:47 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 15 Dec 2004 11:35:47 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030F1F3D@orsmsx408> >One thing I would be sure of is the following: >In /var/log/messages (or via dmesg), make sure there are no oops >(that all processes needed are running). >I have seen (just reported) what I think is an alignment issue on x64_64 >which is what I think you are using. >-- Hal I have not seen any oops, but have only done minimal testing so far. I also do not see the message that in /var/log/messages or dmesg, so my stack is not executing that code path. ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " IPOIB_GID_FMT "\n", skb->dst ? 
"neigh" : "dst", be16_to_cpup((u16 *) skb->data), be32_to_cpup((u32 *) phdr->hwaddr), IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr I now plan to re-install my PCI-E cards on the 2 nodes that have PCI-X cards, which will give me a 4 node PCI-E cluster, and load up the very latest code from svn, and run some MPI tests over IPoIB. I will let you know if I see any issues. From tduffy at sun.com Wed Dec 15 13:23:03 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 15 Dec 2004 13:23:03 -0800 Subject: [openib-general] new sparse warnings Message-ID: <1103145783.22163.7.camel@duffman> on x86_64. CHECK /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c:375:6: warning: cast removes address space of expression /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c:375:6: warning: cast removes address space of expression /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c:375:6: warning: cast removes address space of expression /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c:375:6: warning: cast removes address space of expression CC [M] drivers/infiniband/core/user_mad.o -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Wed Dec 15 13:42:46 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 13:42:46 -0800 Subject: [openib-general] new sparse warnings In-Reply-To: <1103145783.22163.7.camel@duffman> (Tom Duffy's message of "Wed, 15 Dec 2004 13:23:03 -0800") References: <1103145783.22163.7.camel@duffman> Message-ID: <52fz27utk9.fsf@topspin.com> I'm puzzled by that sparse warning. I get no complaint on i386, and the line in question seems to be if (put_user(agent_id, (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { which I believe is correctly annotated. A quick glance at shows the definition of put_user() to be way to obfuscated to understand quickly but my first guess would be that there is some subtle difference between the i386 and x86_64 versions that triggers the warning. - R. From roland at topspin.com Wed Dec 15 13:44:57 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 13:44:57 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103124540.4222.13.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 10:29:01 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> Message-ID: <52brcvutgm.fsf@topspin.com> I've committed a few fixes to IPoIB. Please let me know if you still see issues with the latest code. - R. From Tom.Duffy at Sun.COM Wed Dec 15 13:55:39 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Wed, 15 Dec 2004 13:55:39 -0800 Subject: [openib-general] new sparse warnings In-Reply-To: <52fz27utk9.fsf@topspin.com> References: <1103145783.22163.7.camel@duffman> <52fz27utk9.fsf@topspin.com> Message-ID: <1103147739.22163.10.camel@duffman> On Wed, 2004-12-15 at 13:42 -0800, Roland Dreier wrote: > I'm puzzled by that sparse warning. I get no complaint on i386, and > the line in question seems to be > > if (put_user(agent_id, > (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { > > which I believe is correctly annotated. 
A quick glance at > shows the definition of put_user() to be way to obfuscated to > understand quickly but my first guess would be that there is some > subtle difference between the i386 and x86_64 versions that triggers > the warning. Yeah, I started down this route as well, and ran into the same confusion. I thought maybe somebody smarter than me could figure it out... -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Wed Dec 15 14:23:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 17:23:25 -0500 Subject: [openib-general] [PATCH] mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp Message-ID: <1103149405.4224.277.camel@localhost.localdomain> mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp Signed-off-by: Hal Rosenstock Index: mad_priv.h =================================================================== --- mad_priv.h (revision 1332) +++ mad_priv.h (working copy) @@ -114,6 +114,8 @@ struct list_head wait_list; struct work_struct timed_work; unsigned long timeout; + struct list_head local_list; + struct work_struct local_work; atomic_t refcount; wait_queue_head_t wait; @@ -143,6 +145,15 @@ enum ib_wc_status status; }; +struct ib_mad_local_private { + struct list_head completion_list; + struct ib_mad_private *mad_priv; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; +}; + struct ib_mad_mgmt_method_table { struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; }; Index: mad.c =================================================================== --- mad.c (revision 1332) +++ mad.c (working copy) @@ -87,6 +87,7 @@ static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); +static void local_completions(void *data); static int solicited_mad(struct ib_mad *mad); static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *agent_priv, @@ -356,6 +357,9 @@ INIT_LIST_HEAD(&mad_agent_priv->send_list); INIT_LIST_HEAD(&mad_agent_priv->wait_list); INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); + INIT_LIST_HEAD(&mad_agent_priv->local_list); + INIT_WORK(&mad_agent_priv->local_work, local_completions, + mad_agent_priv); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); @@ -640,9 +644,10 @@ struct ib_smp *smp, struct ib_send_wr *send_wr) { - int ret; + int ret, alloc_flags; + unsigned long flags; + struct ib_mad_local_private *local; struct ib_mad_private *mad_priv; - struct ib_mad_send_wc mad_send_wc; struct ib_device *device = mad_agent_priv->agent.device; u8 port_num = mad_agent_priv->agent.port_num; @@ -656,12 +661,22 @@ if (!ret || !device->process_mad) goto out; - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? 
- GFP_ATOMIC : GFP_KERNEL); + if (in_atomic() || irqs_disabled()) + alloc_flags = GFP_ATOMIC; + else + alloc_flags = GFP_KERNEL; + local = kmalloc(sizeof *local, alloc_flags); + if (!local) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); + goto out; + } + local->mad_priv = NULL; + mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags); if (!mad_priv) { ret = -ENOMEM; printk(KERN_ERR PFX "No memory for local response MAD\n"); + kfree(local); goto out; } ret = device->process_mad(device, 0, port_num, smp->dr_slid, @@ -675,39 +690,9 @@ * there is a recv handler */ if (solicited_mad(&mad_priv->mad.mad) && - mad_agent_priv->agent.recv_handler) { - struct ib_wc wc; - - /* - * Defined behavior is to complete response - * before request - */ - wc.wr_id = send_wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = sizeof(struct ib_mad); - wc.src_qp = IB_QP0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = IB_LID_PERMISSIVE; - wc.sl = 0; - wc.dlid_path_bits = 0; - mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = - sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_wc.recv_buf.list); - mad_priv->header.recv_wc.recv_buf.grh = NULL; - mad_priv->header.recv_wc.recv_buf.mad = - &mad_priv->mad.mad; - if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) - snoop_recv(mad_agent_priv->qp_info, - &mad_priv->header.recv_wc, - IB_MAD_SNOOP_RECVS); - mad_agent_priv->agent.recv_handler( - &mad_agent_priv->agent, - &mad_priv->header.recv_wc); - } else + mad_agent_priv->agent.recv_handler) + local->mad_priv = mad_priv; + else kmem_cache_free(ib_mad_cache, mad_priv); break; case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: @@ -715,23 +700,31 @@ break; case IB_MAD_RESULT_SUCCESS: kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); ret = 0; goto out; default: kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); ret = -EINVAL; goto out; } - /* Complete send */ - mad_send_wc.status = IB_WC_SUCCESS; - mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = send_wr->wr_id; - if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) - snoop_send(mad_agent_priv->qp_info, send_wr, &mad_send_wc, - IB_MAD_SNOOP_SEND_COMPLETIONS); - mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, - &mad_send_wc); + local->send_wr = *send_wr; + local->send_wr.sg_list = local->sg_list; + memcpy(local->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + local->send_wr.next = NULL; + local->tid = send_wr->wr.ud.mad_hdr->tid; + local->wr_id = send_wr->wr_id; + /* Reference MAD agent until local completion handled */ + atomic_inc(&mad_agent_priv->refcount); + /* Queue local completion to local list */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&local->completion_list, &mad_agent_priv->local_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->local_work); ret = 1; out: return ret; @@ -2020,6 +2013,73 @@ } EXPORT_SYMBOL(ib_cancel_mad); +static void local_completions(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_local_private *local; + unsigned long flags; + struct ib_wc wc; + struct ib_mad_send_wc mad_send_wc; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->local_list)) { + local = list_entry(mad_agent_priv->local_list.next, + struct ib_mad_local_private, + 
completion_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + if (local->mad_priv) { + /* + * Defined behavior is to complete response + * before request + */ + wc.wr_id = local->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + local->mad_priv->header.recv_wc.wc = &wc; + local->mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&local->mad_priv->header.recv_wc.recv_buf.list); + local->mad_priv->header.recv_wc.recv_buf.grh = NULL; + local->mad_priv->header.recv_wc.recv_buf.mad = + &local->mad_priv->mad.mad; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &local->mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &local->mad_priv->header.recv_wc); + } + + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = local->wr_id; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, &local->send_wr, + &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&local->completion_list); + atomic_dec(&mad_agent_priv->refcount); + kfree(local); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + static void timeout_sends(void *data) { struct ib_mad_agent_private *mad_agent_priv; From halr at voltaire.com Wed Dec 15 15:08:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 18:08:28 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52brcvutgm.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> Message-ID: <1103151733.4224.280.camel@localhost.localdomain> On Wed, 2004-12-15 at 16:44, Roland Dreier wrote: > I've committed a few fixes to IPoIB. Please let me know if you still > see issues with the latest code. 
Still get oops on ping -b immediately after bringing up ib0 on the IP subnet :-( Dec 15 17:59:45 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000b Dec 15 17:59:45 localhost kernel: printing eip: Dec 15 17:59:45 localhost kernel: d09a5900 Dec 15 17:59:45 localhost kernel: *pde = 0ba0e067 Dec 15 17:59:45 localhost kernel: *pte = 00000000 Dec 15 17:59:45 localhost kernel: Oops: 0000 [#1] Dec 15 17:59:45 localhost kernel: Modules linked in: ib_ipoib ib_sa ib_mthca ib_mad ib_core loop ide_cd cdrom lp ipv6 autofs parport_pc parport uhci_hcd ehci_hcd ohci_hcd eepro100 mii evdev usbcore Dec 15 17:59:45 localhost kernel: CPU: 0 Dec 15 17:59:45 localhost kernel: EIP: 0060:[] Not tainted VLI Dec 15 17:59:45 localhost kernel: EFLAGS: 00010203 (2.6.9) Dec 15 17:59:45 localhost kernel: EIP is at path_rec_completion+0x2a0/0x450 [ib_ipoib] Dec 15 17:59:45 localhost kernel: eax: 0000000b ebx: 00000000 ecx: c3cb5e14 edx: 0000000b Dec 15 17:59:45 localhost kernel: esi: 00000000 edi: cff66a60 ebp: ce175a80 esp: c3cb5dac Dec 15 17:59:45 localhost kernel: ds: 007b es: 007b ss: 0068 Dec 15 17:59:45 localhost kernel: Process ib_mad1 (pid: 25367, threadinfo=c3cb4000 task=c306f420) Dec 15 17:59:45 localhost kernel: Stack: c1d2c800 c1d2cc80 00000001 c6b22420 00000087 00000001 c3cb5e1c c0108cd4 Dec 15 17:59:45 localhost kernel: 00000282 00000000 c3cb5e1c c03a3a80 c03a3a80 00000005 c6b22420 c0109304 Dec 15 17:59:45 localhost kernel: 00000005 00000282 c128aa00 0000000a 00000286 00000082 d0944f00 00000246 Dec 15 17:59:45 localhost kernel: Call Trace: Dec 15 17:59:45 localhost kernel: [] handle_IRQ_event+0x34/0x70 Dec 15 17:59:45 localhost kernel: [] do_IRQ+0x194/0x350 Dec 15 17:59:45 localhost kernel: [] ib_sa_path_rec_callback+0x5a/0x90 [ib_sa] Dec 15 17:59:45 localhost kernel: [] ib_mad_send_done_handler+0x13b/0x230 [ib_mad] Dec 15 17:59:45 localhost kernel: [] send_handler+0x10b/0x350 [ib_sa] Dec 15 17:59:45 localhost kernel: [] schedule+0x306/0x5c0 Dec 15 17:59:45 localhost kernel: [] timeout_sends+0x113/0x2f0 [ib_mad] Dec 15 17:59:45 localhost kernel: [] timeout_sends+0x0/0x2f0 [ib_mad] Dec 15 17:59:45 localhost kernel: [] worker_thread+0x251/0x470 Dec 15 17:59:45 localhost kernel: [] default_wake_function+0x0/0x20 Dec 15 17:59:45 localhost kernel: [] default_wake_function+0x0/0x20 Dec 15 17:59:45 localhost kernel: [] worker_thread+0x0/0x470 Dec 15 17:59:45 localhost kernel: [] kthread+0xaa/0xb0 Dec 15 17:59:45 localhost kernel: [] kthread+0x0/0xb0 Dec 15 17:59:45 localhost kernel: [] kernel_thread_helper+0x5/0x10 Dec 15 17:59:45 localhost kernel: Code: 45 08 85 c0 75 7b c7 45 04 00 00 00 00 ff 74 24 20 9d 89 f6 8d bc 27 00 00 00 00 8b 44 24 68 8d 4c 24 68 31 d2 39 c8 74 23 89 c2 <8b> 00 ff 4c 24 70 c7 42 08 00 00 00 00 89 48 04 89 44 24 68 c7 From halr at voltaire.com Wed Dec 15 15:11:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 18:11:56 -0500 Subject: [openib-general] [PATCH] mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp In-Reply-To: <1103149405.4224.277.camel@localhost.localdomain> References: <1103149405.4224.277.camel@localhost.localdomain> Message-ID: <1103152316.4146.4.camel@localhost.localdomain> On Wed, 2004-12-15 at 17:23, Hal Rosenstock wrote: > mad: Use thread context for callbacks due to locally handled MADs in > handle_outgoing_smp I forgot to comment on the following relative to this change: I did not implement cancel looking at the local completion list for possible matches. 
If that is incorrect, it is straightforward to add this in. -- Hal From mshefty at ichips.intel.com Wed Dec 15 15:18:31 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Dec 2004 15:18:31 -0800 Subject: [openib-general] [PATCH] mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp In-Reply-To: <1103152316.4146.4.camel@localhost.localdomain> References: <1103149405.4224.277.camel@localhost.localdomain> <1103152316.4146.4.camel@localhost.localdomain> Message-ID: <41C0C647.1040901@ichips.intel.com> Hal Rosenstock wrote: > On Wed, 2004-12-15 at 17:23, Hal Rosenstock wrote: > >>mad: Use thread context for callbacks due to locally handled MADs in >>handle_outgoing_smp > > > I forgot to comment on the following relative to this change: > I did not implement cancel looking at the local completion list for > possible matches. If that is incorrect, it is straightforward to add > this in. I think that's okay. Cancel will simply return (since it doesn't return anything), and the user's callback will be invoked soon after or before with the completion. And it looked like the reference counting was there to handle deregistration. - Sean From halr at voltaire.com Wed Dec 15 15:22:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 18:22:35 -0500 Subject: [openib-general] [PATCH] mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp In-Reply-To: <41C0C647.1040901@ichips.intel.com> References: <1103149405.4224.277.camel@localhost.localdomain> <1103152316.4146.4.camel@localhost.localdomain> <41C0C647.1040901@ichips.intel.com> Message-ID: <1103152954.4146.8.camel@localhost.localdomain> On Wed, 2004-12-15 at 18:18, Sean Hefty wrote: > And it looked like the reference counting was there to handle deregistration. Yes. Thanks. -- Hal From roland at topspin.com Wed Dec 15 15:36:30 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 15:36:30 -0800 Subject: [openib-general] new sparse warnings In-Reply-To: <1103145783.22163.7.camel@duffman> (Tom Duffy's message of "Wed, 15 Dec 2004 13:23:03 -0800") References: <1103145783.22163.7.camel@duffman> Message-ID: <521xdruoap.fsf@topspin.com> I have no x86_64 sparse setup to test with, but does this kernel patch fix it? Index: linux-bk/include/asm-x86_64/uaccess.h =================================================================== --- linux-bk.orig/include/asm-x86_64/uaccess.h 2004-12-11 15:16:44.000000000 -0800 +++ linux-bk/include/asm-x86_64/uaccess.h 2004-12-15 15:35:47.482091664 -0800 @@ -172,7 +172,7 @@ /* FIXME: this hack is definitely wrong -AK */ struct __large_struct { unsigned long buf[100]; }; -#define __m(x) (*(struct __large_struct *)(x)) +#define __m(x) (*(struct __large_struct __user *)(x)) /* * Tell gcc we read from memory instead of writing: this is because From tduffy at sun.com Wed Dec 15 15:43:47 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 15 Dec 2004 15:43:47 -0800 Subject: [openib-general] new sparse warnings In-Reply-To: <521xdruoap.fsf@topspin.com> References: <1103145783.22163.7.camel@duffman> <521xdruoap.fsf@topspin.com> Message-ID: <1103154227.12446.2.camel@duffman> On Wed, 2004-12-15 at 15:36 -0800, Roland Dreier wrote: > I have no x86_64 sparse setup to test with, but does this kernel patch fix it? Yes, this fixed the sparse warning. Thanks, I knew somebody smarter would be able to fix it :) -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Wed Dec 15 16:12:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 19:12:29 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103151733.4224.280.camel@localhost.localdomain> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> Message-ID: <1103155948.4127.1.camel@localhost.localdomain> On Wed, 2004-12-15 at 18:08, Hal Rosenstock wrote: > On Wed, 2004-12-15 at 16:44, Roland Dreier wrote: > > I've committed a few fixes to IPoIB. Please let me know if you still > > see issues with the latest code. > > Still get oops on ping -b immediately after bringing up ib0 on the IP > subnet :-( This is due to the following: ib_sa_path_rec_callback: sa_query 0xc0db0788 status 0xffffff92 mad 0x00000000 which invokes query->callback(status, NULL, query->context); ipoib_main.c: static void path_rec_completion(int status, struct ib_sa_path_rec *pathrec, void *path_ptr) path_rec_completion is using the pathrec parameter as a pointer without checking it for NULL first. Is the same also true for any other SA client callbacks like multicast ? Also, what I do see when I do a broadcast ping is that the path record is obtained over and over rather than being requested once and cached. Is that what is supposed to be happening now ? -- Hal From mshefty at ichips.intel.com Wed Dec 15 16:25:23 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Dec 2004 16:25:23 -0800 Subject: [openib-general] CM header file Message-ID: <41C0D5F3.7010307@ichips.intel.com> Listed below is a start at an updated CM API. Several details are missing, but there should be enough there to evaluate the direction of the CM. Some design notes: * User's are responsible for destroying "connection identifiers". This was done to ensure that reference counting can be handled properly. * The CM will allocate connection identifiers for clients upon receiving REQ or SIDR_REQ messages. The new identifiers are returned to clients via a callback. See below. * QP transitions are expected to be done by the clients. * The CM handles formatting CM MADs and tracking CM states. Questions: * Connection identifiers cannot be destroyed from within a CM callback. I think that we can support this by adding a flag (destroying within callback) to the destroy call. Is this worth adding? * Should listening clients call an "accept" routine to wait for a connection request? Currently, the API operates asynchronously and inokves a CM event handler. * Should we allow forcing a connection into the established state, regardless of its previous state? This would let the user connect without using the CM, and then give control of the connection to the CM. * Should calls to send the LAP/APR set the QP's alternate path information? Related to this, should the CM support a call to force path migration on the QP. - Sean /* * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available at * , or the OpenIB.org BSD * license, available in the LICENSE.TXT file accompanying this * software. These details are also available at * . 
* * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * Copyright (c) 2004 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004 Voltaire Corporation. All rights reserved. * * $Id$ */ #if !defined(IB_CM_H) #define IB_CM_H #include struct ib_path_record; //*** TBD: define me somewhere else (ib_smi.h?) enum ib_cm_state { IB_CM_IDLE, IB_CM_LISTEN, IB_CM_REQ_SENT, IB_CM_REQ_RCVD, IB_CM_MRA_REQ_SENT IB_CM_MRA_REQ_RCVD, IB_CM_REP_SENT, IB_CM_REP_RCVD, IB_CM_MRA_REP_SENT, IB_CM_MRA_REP_RCVD, IB_CM_ESTABLISHED, IB_CM_LAP_SENT, IB_CM_LAP_RCVD, IB_CM_MRA_LAP_SENT, IB_CM_MRA_LAP_RCVD, IB_CM_DREQ_SENT, IB_CM_DREQ_RCVD, IB_CM_TIMEWAIT, IB_CM_SIDR_REQ_SENT, IB_CM_SIDR_REQ_RCVD }; enum ib_cm_event_type { IB_CM_REQ_TIMEOUT, IB_CM_REQ_RECEIVED, IB_CM_REP_TIMEOUT, IB_CM_REP_RECEIVED, IB_CM_RTU_RECEIVED, IB_CM_DREQ_TIMEOUT IB_CM_DREQ_RECEIVED, IB_CM_DREP_RECEIVED, IB_CM_MRA_RECEIVED, IB_CM_LAP_TIMEOUT, IB_CM_LAP_RECEIVED, IB_CM_APR_RECEIVED }; struct ib_cm_event { /* a whole lot more stuff goes here */ void *private_data; u8 private_data_len; enum ib_cm_event_type event; }; typedef void (*ib_cm_handler)(struct ib_cm_id *, struct ib_cm_event *); struct ib_cm_id { ib_cm_handler cm_handler; void *context; u64 service_id; enum ib_cm_state state; }; /** * ib_create_cm_id - Allocate a connection identifier. * @cm_handler: Callback invoked to notify the user of CM events. * @context: User specified context associated with the connection * identifier. * * Connection identifiers are used to track connection states and * listen requests. */ struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, void *context); /** * ib_destroy_cm_id - Destroy a connection identifier. * @cm_id: Connection identifier to destroy. * * This call blocks until the connection identifier is destroyed. */ int ib_destroy_cm_id(struct ib_cm_id *cm_id); //*** TBD : add flags to allow calling routine from CM callback... /** * ib_cm_listen - Initiates listening on the specified service ID for * connection and service ID resolution requests. * @cm_id: Connection identifier associated with the listen request. * @service_id: Service identifier matched against incoming connection * and service ID resolution requests. */ int ib_cm_listen(struct ib_cm_id *cm_id, u64 service_id); struct ib_cm_req_param { struct ib_qp *qp; struct ib_path_record *primary_path; struct ib_path_record *alternate_path; u64 service_id; int timeout_ms; void *private_data; u8 private_data_len; u8 responder_resources; u8 initiator_depth; u8 remote_cm_response_timeout; u8 flow_control; u8 local_cm_response_timeout; u8 retry_count; u8 rnr_retry_count; u8 max_cm_retries; }; /** * ib_send_cm_req - Sends a connection request to the remote node. * @cm_id: Connection identifier that will be associated with the * connection request. * @param: Connection request information needed to establish the * connection. 
*/ int ib_send_cm_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param); struct ib_cm_rep_param { struct ib_qp *qp; void *private_data; u8 reply_private_data_len; u8 responder_resources; u8 initiator_depth; u8 target_ack_delay; u8 failover_accepted; u8 flow_control; u8 rnr_retry_count; }; /** * ib_send_cm_rep - Sends a connection reply in response to a connection * request. * @cm_id: Connection identifier that will be associated with the * connection request. * @param: Connection reply information needed to establish the * connection. */ int ib_send_cm_rep(struct ib_cm_id *cm_id, struct ib_cm_req_param *param); /** * ib_send_cm_rtu - Sends a connection ready to use message in response * to a connection reply message. * @cm_id: Connection identifier associated with the connection request. * @private_data: Optional user-defined private data sent with the * ready to use message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_rtu(struct ib_cm_id *cm_id, void *private_data, u8 private_data_len); /** * ib_send_cm_dreq - Sends a disconnection request for an existing * connection. * @cm_id: Connection identifier associated with the connection being * released. * @private_data: Optional user-defined private data sent with the * disconnection request message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_dreq(struct ib_cm_id *cm_id, void *private_data, u8 private_data_len); /** * ib_send_cm_drep - Sends a disconnection reply to a disconnection request. * @cm_id: Connection identifier associated with the connection being * released. * @private_data: Optional user-defined private data sent with the * disconnection reply message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_drep(struct ib_cm_id *cm_id, void *private_data, u8 private_data_len); /** * ib_cm_establish - Forces a connection state to established. * @cm_id: Connection identifier to transition to established. * * This routine should be invoked by users who receive messages on a * connected QP before an RTU has been received. */ int ib_cm_establish(struct ib_cm_id *id); //*** TBD: should we allow a user for force any cm_id to established? 
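/*
 * Illustrative sketch only -- not part of the proposed API.  Roughly how an
 * active-side consumer might drive a connection with the calls declared
 * above, assuming the consumer owns the QP transitions.  my_cm_handler,
 * my_qp and my_path are placeholder names, and the parameter values are
 * arbitrary.
 *
 *	static void my_cm_handler(struct ib_cm_id *cm_id,
 *				  struct ib_cm_event *event)
 *	{
 *		switch (event->event) {
 *		case IB_CM_REP_RECEIVED:
 *			// consumer transitions its QP to RTR/RTS here
 *			ib_send_cm_rtu(cm_id, NULL, 0);
 *			break;
 *		case IB_CM_REP_TIMEOUT:
 *		case IB_CM_DREQ_RECEIVED:
 *			// tear down; the cm_id is destroyed outside the callback
 *			break;
 *		default:
 *			break;
 *		}
 *	}
 *
 *	static int my_connect(struct ib_qp *my_qp,
 *			      struct ib_path_record *my_path, u64 service_id)
 *	{
 *		struct ib_cm_req_param param = {
 *			.qp		= my_qp,
 *			.primary_path	= my_path,
 *			.service_id	= service_id,
 *			.timeout_ms	= 2000,
 *			.retry_count	= 5,
 *			.max_cm_retries	= 15,
 *		};
 *		struct ib_cm_id *cm_id;
 *
 *		cm_id = ib_create_cm_id(my_cm_handler, NULL);
 *		if (!cm_id)
 *			return -ENOMEM;
 *		return ib_send_cm_req(cm_id, &param);
 *	}
 */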
enum ib_cm_rej_reason { IB_CM_REJ_NO_QP = __constant_htons(1) IB_CM_REJ_NO_EEC = __constant_htons(2) IB_CM_REJ_NO_RESOURCES = __constant_htons(3) IB_CM_REJ_TIMEOUT = __constant_htons(4) IB_CM_REJ_UNSUPPORTED = __constant_htons(5) IB_CM_REJ_INVALID_COMM_ID = __constant_htons(6) IB_CM_REJ_INVALID_COMM_INSTANCE = __constant_htons(7) IB_CM_REJ_INVALID_SERVICE_ID = __constant_htons(8) IB_CM_REJ_INVALID_TRANSPORT_TYPE = __constant_htons(9) IB_CM_REJ_STALE_CONN = __constant_htons(10) IB_CM_REJ_RDC_NOT_EXIST = __constant_htons(11) IB_CM_REJ_INVALID_GID = __constant_htons(12) IB_CM_REJ_INVALID_LID = __constant_htons(13) IB_CM_REJ_INVALID_SL = __constant_htons(14) IB_CM_REJ_INVALID_TRAFFIC_CLASS = __constant_htons(15) IB_CM_REJ_INVALID_HOP_LIMIT = __constant_htons(16) IB_CM_REJ_INVALID_PACKET_RATE = __constant_htons(17) IB_CM_REJ_INVALID_ALT_GID = __constant_htons(18) IB_CM_REJ_INVALID_ALT_LID = __constant_htons(19) IB_CM_REJ_INVALID_ALT_SL = __constant_htons(20) IB_CM_REJ_INVALID_ALT_TRAFFIC_CLASS = __constant_htons(21) IB_CM_REJ_INVALID_ALT_HOP_LIMIT = __constant_htons(22) IB_CM_REJ_INVALID_ALT_PACKET_RATE = __constant_htons(23) IB_CM_REJ_PORT_REDIRECT = __constant_htons(24) IB_CM_REJ_INVALID_MTU = __constant_htons(26) IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES = __constant_htons(27) IB_CM_REJ_CONSUMER_DEFINED = __constant_htons(28) IB_CM_REJ_INVALID_RNR_RETRY = __constant_htons(29) IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = __constant_htons(30) IB_CM_REJ_INVALID_CLASS_VERSION = __constant_htons(31) IB_CM_REJ_INVALID_FLOW_LABEL = __constant_htons(32) IB_CM_REJ_INVALID_ALT_FLOW_LABEL = __constant_htons(33) }; /** * ib_send_cm_rej - Sends a connection rejection message to the * remote node. * @cm_id: Connection identifier associated with the connection being * rejected. * @reason: Reason for the connection request rejection. * @ari: Optional additional rejection information. * @ari_length: Size of the additional rejection information, in bytes. * @private_data: Optional user-defined private data sent with the * rejection message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_rej(struct ib_cm_id *cm_id, enum ib_cm_rej_reason reason, void *ari, u8 ari_length, void *private_data, u8 private_data_len); /** * ib_send_cm_mra - Sends a message receipt acknowledgement to a connection * message. * @cm_id: Connection identifier associated with the connection message. * @service_timeout: The maximum time required for the sender to reply to * to the connection message. * @private_data: Optional user-defined private data sent with the * message receipt acknowledgement. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_mra(struct ib_cm_id *cm_id, u8 service_timeout, void *private_data, u8 private_data_len); /** * ib_send_cm_lap - Sends a load alternate path request. * @cm_id: Connection identifier associated with the load alternate path * message. * @alternate_path: A path record that identifies the alternate path to * load. * @private_data: Optional user-defined private data sent with the * load alternate path message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_lap(struct ib_cm_id *cm_id, struct ib_path_record *alternate_path, void *private_data, u8 private_data_len); //*** TBD: should LAP/APR set the QP's alternate path information... 
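/*
 * Illustrative sketch only -- not part of the proposed API.  One possible
 * shape of the failover exchange built on ib_send_cm_lap() above and
 * ib_send_cm_apr()/enum ib_cm_apr_status declared below, assuming the
 * consumer (not the CM) modifies the QP's alternate path.  new_path is a
 * placeholder for a path record the consumer has already resolved.
 *
 *	// Active side, on an established connection:
 *	ret = ib_send_cm_lap(cm_id, new_path, NULL, 0);
 *
 *	// In the consumer's ib_cm_handler:
 *	case IB_CM_LAP_RECEIVED:
 *		// passive side: load the alternate path into the QP
 *		// (or have the CM do it, per the TBD above), then accept.
 *		ib_send_cm_apr(cm_id, IB_CM_APR_SUCCESS, NULL, 0, NULL, 0);
 *		break;
 *	case IB_CM_APR_RECEIVED:
 *		// active side: alternate path armed; migration can later be
 *		// forced by modifying the QP's path_mig_state.
 *		break;
 */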
enum ib_cm_apr_status { IB_CM_APR_SUCCESS IB_CM_APR_INVALID_COMM_ID IB_CM_APR_UNSUPPORTED IB_CM_APR_REJECT IB_CM_APR_REDIRECT IB_CM_APR_IS_CURRENT IB_CM_APR_INVALID_QPN_EECN IB_CM_APR_INVALID_LID IB_CM_APR_INVALID_GID IB_CM_APR_INVALID_FLOW_LABEL IB_CM_APR_INVALID_TCLASS IB_CM_APR_INVALID_HOP_LIMIT IB_CM_APR_INVALID_PACKET_RATE IB_CM_APR_INVALID_SL }; /** * ib_send_cm_apr - Sends an alternate path response message in response to * a load alternate path request. * @cm_id: Connection identifier associated with the alternate path response. * @status: Reply status sent with the alternate path response. * @info: Optional additional information sent with the alternate path * response. * @info_length: Size of the additional information, in bytes. * @private_data: Optional user-defined private data sent with the * alternate path response message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_apr(struct ib_cm_id *cm_id, enum ib_cm_apr_status status, void *info, u8 info_length, void *private_data, u8 private_data_len); /** * ib_cm_migrate_path - Forces a connected QP to use the loaded alternate * path. * @cm_id: Connection identifier associated with the alternate path response. */ //int ib_cm_migrate_path(struct ib_cm_id *cm_id); //*** TBD: should this be part of CM, since it only modifies the QP state? struct ib_cm_sidr_req_param { struct ib_path_record *path; u64 service_id; int timeout_ms; void *private_data; u8 private_data_len; u16 pkey; }; /** * ib_send_cm_sidr_req - Sends a service ID resolution request to the * remote node. * @cm_id: Communication identifier that will be associated with the * service ID resolution request. * @param: Service ID resolution request information. */ int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, struct ib_cm_sidr_req_param *param); enum ib_cm_sidr_status { IB_SIDR_SUCCESS, IB_SIDR_UNSUPPORTED, IB_SIDR_REJECT, IB_SIDR_NO_QP, IB_SIDR_REDIRECT, IB_SIDR_UNSUPPORTED_VERSION }; struct ib_cm_sidr_rep_param { u32 qp_num; u32 qkey; enum ib_cm_sidr_status status; void *info; u8 info_length; void *private_data; u8 private_data_len; }; /** * ib_send_cm_sidr_rep - Sends a service ID resolution request to the * remote node. * @cm_id: Communication identifier associated with the received service ID * resolution request. * @param: Service ID resolution reply information. */ int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, struct ib_cm_sidr_rep_param *param); #endif /* IB_CM_H */ From roland at topspin.com Wed Dec 15 17:14:32 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 17:14:32 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103155948.4127.1.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 19:12:29 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> Message-ID: <52wtvjt56v.fsf@topspin.com> Hal> This is due to the following: ib_sa_path_rec_callback: Hal> sa_query 0xc0db0788 status 0xffffff92 mad 0x00000000 which Hal> invokes query-> callback(status, NULL, query->context); Hal> ipoib_main.c: static void path_rec_completion(int status, Hal> struct ib_sa_path_rec *pathrec, void *path_ptr) Hal> path_rec_completion is using the pathrec parameter as a Hal> pointer without checking it for NULL first. Hmm... are you sure this is what causes the oops? 
path_rec_completion() will only dereference the pathrec parameter if its local variable ah is non-NULL: if (ah) { path->pathrec = *pathrec; and ah can only be set to non-NULL if status is successful (ah is initialized to NULL and the only place it can be changed is ah = ipoib_create_ah(path->dev, priv->pd, &av); which is inside a test of status. Can you give the exact sequence you use to duplicate this? I haven't been able to make it happen in my network. Hal> Also, what I do see when I do a broadcast ping is that the Hal> path record is obtained over and over rather than being Hal> requested once and cached. Is that what is supposed to be Hal> happening now ? No, that shouldn't happen. I'll try to figure out what's happening. - R. From halr at voltaire.com Wed Dec 15 17:18:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 20:18:21 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52wtvjt56v.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> Message-ID: <1103159901.4127.17.camel@localhost.localdomain> On Wed, 2004-12-15 at 20:14, Roland Dreier wrote: > Hal> This is due to the following: ib_sa_path_rec_callback: > Hal> sa_query 0xc0db0788 status 0xffffff92 mad 0x00000000 which > Hal> invokes query-> callback(status, NULL, query->context); > > Hal> ipoib_main.c: static void path_rec_completion(int status, > Hal> struct ib_sa_path_rec *pathrec, void *path_ptr) > > Hal> path_rec_completion is using the pathrec parameter as a > Hal> pointer without checking it for NULL first. > > Hmm... are you sure this is what causes the oops? No but it definitely oops in that callback. I didn't trace it in path_rec_completion; only glanced at the code. Don't the debug statements deference through NULL regardless of status ? > path_rec_completion() will only dereference the pathrec parameter if > its local variable ah is non-NULL: > > if (ah) { > path->pathrec = *pathrec; > > and ah can only be set to non-NULL if status is successful (ah is > initialized to NULL and the only place it can be changed is > > ah = ipoib_create_ah(path->dev, priv->pd, &av); > > which is inside a test of status. > > Can you give the exact sequence you use to duplicate this? I haven't > been able to make it happen in my network. Have you gotten a negative status on the callback (and NULL pathrec) ? I've yet to see this response on the analyzer as there are too many to go through right now. I do see it with extra debug I put in to narrow this down. > Hal> Also, what I do see when I do a broadcast ping is that the > Hal> path record is obtained over and over rather than being > Hal> requested once and cached. Is that what is supposed to be > Hal> happening now ? > > No, that shouldn't happen. I'll try to figure out what's happening. Thanks. 
-- Hal From roland at topspin.com Wed Dec 15 17:38:49 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 17:38:49 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103159901.4127.17.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 20:18:21 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> Message-ID: <52llbzt42e.fsf@topspin.com> Hal> No but it definitely oops in that callback. I didn't trace it Hal> in path_rec_completion; only glanced at the code. Don't the Hal> debug statements deference through NULL regardless of status ? Good point, I missed those uses. I've just pushed a patch that should fix this problem. Hal> Have you gotten a negative status on the callback (and NULL Hal> pathrec) ? I've yet to see this response on the analyzer as Hal> there are too many to go through right now. I do see it with Hal> extra debug I put in to narrow this down. No, I guess that's why I never saw this problem -- I've never had a path record callback fail. Hal> Also, what I do see when I do a broadcast ping is that the Hal> path record is obtained over and over rather than being Hal> requested once and cached. Is that what is supposed to be Hal> happening now ? I can't duplicate this one. If I do something like ifconfig ib0 10.0.0.1 ping -b 10.255.255.255 I only see one path record lookup for each remote system. You seem to be seeing a path record lookup fail, which of course will cause the lookup to be retried the next time we want to send something to that destination. Do you know why the the path record lookup is failing? - R. From libor at topspin.com Wed Dec 15 17:48:14 2004 From: libor at topspin.com (Libor Michalek) Date: Wed, 15 Dec 2004 17:48:14 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C0D5F3.7010307@ichips.intel.com>; from mshefty@ichips.intel.com on Wed, Dec 15, 2004 at 04:25:23PM -0800 References: <41C0D5F3.7010307@ichips.intel.com> Message-ID: <20041215174814.C26487@topspin.com> On Wed, Dec 15, 2004 at 04:25:23PM -0800, Sean Hefty wrote: > Listed below is a start at an updated CM API. Several details are > missing, but there should be enough there to evaluate the direction of > the CM. Some design notes: > Questions: > > * Connection identifiers cannot be destroyed from within a CM callback. > I think that we can support this by adding a flag (destroying within > callback) to the destroy call. Is this worth adding? The other option is to destroy the connection if the consumer returns an error value from the callback. Along the same lines, will the consumer be allowed to call the corresponding response function from a callback? (e.g. ib_send_cm_rep() from the REQ callback) If not then the same behaviour could also be achieved with a callback return value. > * Should listening clients call an "accept" routine to wait for a > connection request? Currently, the API operates asynchronously and > inokves a CM event handler. I don't think this is necessary. Presumably the biggest reason to use "accept" is to force the consumer to use its own thread to handle CM state changes, thus avoiding CM or MAD thread deadlock if there is a buggy consumer, or is there another reason to add this step? 
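To make the callback-driven model concrete, here is a rough sketch of a passive side that does everything from the handler (written against the draft header; it assumes ib_send_cm_rep() takes the ib_cm_rep_param defined there, and my_qp is just a placeholder for the consumer's QP):

        extern struct ib_qp *my_qp;     /* placeholder: consumer-owned QP */

        static void my_server_handler(struct ib_cm_id *cm_id,
                                      struct ib_cm_event *event)
        {
                struct ib_cm_rep_param rep = {
                        .qp = my_qp,
                };

                switch (event->event) {
                case IB_CM_REQ_RECEIVED:
                        /* consumer moves my_qp to INIT/RTR here, then replies
                           directly from the callback */
                        ib_send_cm_rep(cm_id, &rep);
                        break;
                case IB_CM_RTU_RECEIVED:
                        /* connection usable; consumer moves my_qp to RTS */
                        break;
                default:
                        break;
                }
        }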
However, a nice to have feature which I've grown use to is the ability to listen to an entire range of service IDs using a value/mask combo. Thanks for taking this on, once you have something minimal I'll be happy to try it out, I've been waiting. :) -Libor From halr at voltaire.com Wed Dec 15 18:21:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 21:21:56 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52llbzt42e.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> Message-ID: <1103163716.4126.30.camel@localhost.localdomain> On Wed, 2004-12-15 at 20:38, Roland Dreier wrote: > Hal> No but it definitely oops in that callback. I didn't trace it > Hal> in path_rec_completion; only glanced at the code. Don't the > Hal> debug statements deference through NULL regardless of status ? > > Good point, I missed those uses. I've just pushed a patch that should > fix this problem. Still oops-es :-( > Hal> Have you gotten a negative status on the callback (and NULL > Hal> pathrec) ? I've yet to see this response on the analyzer as > Hal> there are too many to go through right now. I do see it with > Hal> extra debug I put in to narrow this down. > > No, I guess that's why I never saw this problem -- I've never had a > path record callback fail. I get the callback error but the packets on the wire look OK (see below). > Hal> Also, what I do see when I do a broadcast ping is that the > Hal> path record is obtained over and over rather than being > Hal> requested once and cached. Is that what is supposed to be > Hal> happening now ? > > I can't duplicate this one. If I do something like > > ifconfig ib0 10.0.0.1 > ping -b 10.255.255.255 > > I only see one path record lookup for each remote system. I'm using /24 rather than /8 (class C rather than class A). So my config is ifconfig ib0 192.168.0.1 ping -b 192.168.0.255 I have 3 nodes (this is an x86 and the other 2 are x86_64). I doubt this makes a difference in terms of these issues. > You seem to be seeing a path record lookup fail, which of course will > cause the lookup to be retried the next time we want to send something > to that destination. Do you know why the the path record lookup is failing? Just looked at the trace and all SA GetResp(PathRecord) had status 0 but ib_sa_path_rec_callback: sa_query 0xc373fe48 status 0xffffff92 mad 0x00000000 This status is #define ETIMEDOUT 110 /* Connection timed out */ Can you shorten your timeout down to 1 msec and see what happens ? I do see one SA Get which was not responded to. There are a variety of reasons this could occur. 
-- Hal From roland at topspin.com Wed Dec 15 20:18:09 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 20:18:09 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103163716.4126.30.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 21:21:56 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> <1103163716.4126.30.camel@localhost.localdomain> Message-ID: <527jniub9a.fsf@topspin.com> Hal> Can you shorten your timeout down to 1 msec and see what happens ? OK, that let me reproduce the oops, and I pushed a fix out (skqueue was used uninitialized). - R. From tziporet at mellanox.co.il Wed Dec 15 23:28:20 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 16 Dec 2004 09:28:20 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support Message-ID: <506C3D7B14CDD411A52C00025558DED6064BEBAA@mtlex01.yok.mtl.com> The current (5.0.1) MemFree FW does not support the HCA-attached memory. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Wednesday, December 15, 2004 9:21 PM To: Libor Michalek Cc: openib-general at openib.org Subject: Re: [openib-general] MT25218 (aka Arbel memfree) support Libor> How is the determination whether to use memfree mode or Libor> onboard memory going to be made for Arbel mode? As I understand it, only memfree mode is supported for Arbel mode. However, if onboard memory is ever supported, I think we would just have the driver prefer using HCA-attached memory to system memory. - R. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From tziporet at mellanox.co.il Thu Dec 16 00:40:36 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 16 Dec 2004 10:40:36 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <506C3D7B14CDD411A52C00025558DED6064BEBB0@mtlex01.yok.mtl.com> Roland is correct here - this was the only change we had to do for Arbel. Tavor FW was enhanced to get this workaround too. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Wednesday, December 15, 2004 9:20 PM To: Hal Rosenstock Cc: openib-general at openib.org Subject: Re: [openib-general] IPoIB still not working Hal> Isn't the GID the subnet prefix concatenated with the EUI-64 GUID ? Hal> If so, I see 2 issues: 1. GUIDInfo should also reflect this. Hal> 2. Uniquification of EUI-64s (flag odd ones as errors) This change/workaround only affects the value put into an internal HCA data structure when a GRH/GID is _not_ being sent. - Roland _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed...
URL: From halr at voltaire.com Thu Dec 16 04:55:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 07:55:36 -0500 Subject: [openib-general] [PATCH] [RFC] new test directory + test codefor MAD snooping In-Reply-To: <41BF4797.8030007@ichips.intel.com> References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> <41BE2BAF.10506@ichips.intel.com> <003101c4e17c$5d6c7600$6501a8c0@comcast.net> <41BF4797.8030007@ichips.intel.com> Message-ID: <1103201736.4126.95.camel@localhost.localdomain> On Tue, 2004-12-14 at 15:05, Sean Hefty wrote: > I've pushed in the madeye code under gen2/utils. I think it would be better as something like gen2/trunk/src/tests, gen2/trunk/tests, gen2/trunk/src/utils, or /gen2/trunk/utils so it is all in one place and can be obtained with a single checkout. Would there be an issue if it was under src ? Would this cause some submittal problem ? -- Hal From halr at voltaire.com Thu Dec 16 05:32:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 08:32:11 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <527jniub9a.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> <1103163716.4126.30.camel@localhost.localdomain> <527jniub9a.fsf@topspin.com> Message-ID: <1103203545.4126.143.camel@localhost.localdomain> On Wed, 2004-12-15 at 23:18, Roland Dreier wrote: > Hal> Can you shorten your timeout down to 1 msec and see what happens ? > > OK, that let me reproduce the oops, and I pushed a fix out (skqueue > was used unitialized). That fixed the oops :-) Thanks. I am still seeing the continual retransmission of SA Get(PathRecords) even after I terminate the ping -b. The status on the callback is 0. When I remove ipoib module, I get: Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc26052e8 status 0x0 mad 0xce5e1a4c Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc26052a8 status 0x0 mad 0xce5e18ec Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc2605268 status 0x0 mad 0xce5e178c Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc2605228 status 0x0 mad 0xce5e162c Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc26051e8 status 0x0 mad 0xce5e14cc Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib0: ib_dealloc_pd failed When I then remove mthca, I get: Dec 16 08:23:22 localhost kernel: ool_destroy mthca_av, c9f32000 busy Dec 16 08:23:22 localhost kernel: ib_mthca 0000:02:00.0: dma_pool_destroy mthca_av, c5bc5000 busy The latter is repeated over and over for different AVs. So there appear to me to be 2 problems: 1. In this error case, why can't resources be freed ? 2. Why does the Get(PR) keep being retried ? 
-- Hal From roland at topspin.com Thu Dec 16 06:58:35 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 06:58:35 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103203545.4126.143.camel@localhost.localdomain> (Hal Rosenstock's message of "16 Dec 2004 08:32:11 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> <1103163716.4126.30.camel@localhost.localdomain> <527jniub9a.fsf@topspin.com> <1103203545.4126.143.camel@localhost.localdomain> Message-ID: <52wtvis31g.fsf@topspin.com> Hal> I am still seeing the continual retransmission of SA Hal> Get(PathRecords) even after I terminate the ping -b. The Hal> status on the callback is 0. OK, I found another bug and pushed the change out. Are things better now? - Roland From halr at voltaire.com Thu Dec 16 07:46:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 10:46:31 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52wtvis31g.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> <1103163716.4126.30.camel@localhost.localdomain> <527jniub9a.fsf@topspin.com> <1103203545.4126.143.camel@localhost.localdomain> <52wtvis31g.fsf@topspin.com> Message-ID: <1103211990.4327.14.camel@localhost.localdomain> On Thu, 2004-12-16 at 09:58, Roland Dreier wrote: > Hal> I am still seeing the continual retransmission of SA > Hal> Get(PathRecords) even after I terminate the ping -b. The > Hal> status on the callback is 0. > > OK, I found another bug and pushed the change out. Are things better now? Yes, that's better. Still have the partial connectivity problem. I can see the ARP going out on the broadcast group followed by ARPs coming in on the broadcast group followed by the PathRecord requests/responses with the SA followed by the unicast ARP and ICMP. After the unicast ARP to one of the nodes, it is not heard from. Yet if I initiate it at one of the remote nodes, connectivity is restored. -- Hal From fzago at systemfabricworks.com Thu Dec 16 07:51:49 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 09:51:49 -0600 Subject: [openib-general] CM header file In-Reply-To: <20041215174814.C26487@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> Message-ID: <41C1AF15.8000802@systemfabricworks.com> > > However, a nice to have feature which I've grown use to is the ability >to listen to an entire range of service IDs using a value/mask combo. > > I second this request. It is very useful in some cases. Also, I didn't see a provision for peer-to-peer connections. Frank.
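For illustration only, the value/mask listen being asked for might look roughly like this as an extension of the draft API (hypothetical sketch, not something in the posted header):

        /* Match any REQ whose service ID equals service_id under service_mask;
         * a mask of ~0ULL degenerates to today's exact-match listen. */
        int ib_cm_listen(struct ib_cm_id *cm_id, u64 service_id,
                         u64 service_mask);

        /* e.g. claim a whole 32-bit block of service IDs: */
        ret = ib_cm_listen(cm_id, 0x1234000000000000ULL,
                           0xFFFFFFFF00000000ULL);

with the CM matching an incoming REQ as (req_service_id & service_mask) == (service_id & service_mask).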
From halr at voltaire.com Thu Dec 16 07:54:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 10:54:46 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <42FD903D5ADE644A87701E946196A11373A53B@ES22SNLNT.srn.sandia.gov> References: <42FD903D5ADE644A87701E946196A11373A53B@ES22SNLNT.srn.sandia.gov> Message-ID: <1103212485.4327.19.camel@localhost.localdomain> On Wed, 2004-12-15 at 13:22, England, Joshua J wrote: > I'll definitely pound on the stuff and let you know if anything > breaks. You are using the 4.3.5 firmware, right ? I want to put the proper info into the IPoIB FAQ. Thanks. -- Hal From jjengla at sandia.gov Thu Dec 16 09:14:41 2004 From: jjengla at sandia.gov (England, Joshua J) Date: Thu, 16 Dec 2004 10:14:41 -0700 Subject: [openib-general] IPoIB still not working Message-ID: <42FD903D5ADE644A87701E946196A11373A546@ES22SNLNT.srn.sandia.gov> They're on 4.5.3. -JE -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thu 12/16/2004 8:54 AM To: England, Joshua J Cc: Roland Dreier; Robert J Woodruff; openib-general at openib.org Subject: RE: [openib-general] IPoIB still not working On Wed, 2004-12-15 at 13:22, England, Joshua J wrote: > I'll definitely pound on the stuff and let you know if anything > breaks. You are using the 4.3.5 firmware, right ? I want to put the proper info into the IPoIB FAQ. Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Dec 16 09:19:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 12:19:23 -0500 Subject: [openib-general] [PATCH] IPoIB FAQ Message-ID: <1103217563.4327.67.camel@localhost.localdomain> reflect proper Arbel firmware revision Signed-off-by: Hal Rosenstock Index: ipoib_faq.txt =================================================================== --- ipoib_faq.txt (revision 1346) +++ ipoib_faq.txt (working copy) @@ -50,8 +50,8 @@ cat /sys/class/infiniband/mthca0/fw_ver -For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version -4.6.1 is recommended. +For PCI-X HCAs, version 3.2.0 or later is recommended. For PCIe HCAs, +version 4.5.3 or later is recommended. 2. Make sure the IB modules are loaded: /sbin/lsmod | grep ib_ From robert.j.woodruff at intel.com Thu Dec 16 09:31:14 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 09:31:14 -0800 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003125600@orsmsx408> >Still have the partial connectivity problem. I can see the ARP going out >on the broadcast group followed by ARPs coming oin on the broadcast >group followed by the PathRecord requests/responses with the SA followed >by the unicast ARP and ICMP. After the unicast ARP to one of the nodes, >it is not heard from. Yet if I initiate it at one of the remote nodes, >connectivity is restored. >-- Hal I also seem to be having some partial connectivity problems. The first 2 nodes seem to be able to communicate, but adding the 3rd and 4th nodes, they cannot ping the first 2. From roland at topspin.com Thu Dec 16 09:35:20 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 09:35:20 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003125600@orsmsx408> (Robert J. 
Woodruff's message of "Thu, 16 Dec 2004 09:31:14 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003125600@orsmsx408> Message-ID: <52y8fyqh7r.fsf@topspin.com> Robert> I also seem to be having some partial connectivity Robert> problems. The first 2 nodes seem to be able to Robert> communicate, but adding the 3rd and 4th nodes, they cannot Robert> ping the first 2. Are you running the latest code from svn? I fixed a bug this morning that would cause problems with more than 2 nodes. Thanks, Roland From mshefty at ichips.intel.com Thu Dec 16 09:36:59 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 09:36:59 -0800 Subject: [openib-general] [PATCH] [RFC] new test directory + test codefor MAD snooping In-Reply-To: <1103201736.4126.95.camel@localhost.localdomain> References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> <41BE2BAF.10506@ichips.intel.com> <003101c4e17c$5d6c7600$6501a8c0@comcast.net> <41BF4797.8030007@ichips.intel.com> <1103201736.4126.95.camel@localhost.localdomain> Message-ID: <41C1C7BB.6090106@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-12-14 at 15:05, Sean Hefty wrote: > >>I've pushed in the madeye code under gen2/utils. > > > I think it would be better as something like > gen2/trunk/src/tests, gen2/trunk/tests, gen2/trunk/src/utils, > or /gen2/trunk/utils so it is all in one place and can be > obtained with a single checkout. > > Would there be an issue if it was under src ? Would this cause > some submittal problem ? I have no issue with where the code goes. - Sean From robert.j.woodruff at intel.com Thu Dec 16 09:41:55 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 09:41:55 -0800 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003125642@orsmsx408> >Are you running the latest code from svn? I fixed a bug this morning >that would cause problems with more than 2 nodes. >Thanks, > Roland Thanks, I will grab it and give it a try. I am running 1335 and I know that you pushed a couple of fixes late yesterday and this morning after I downloaded the code yesterday afternoon. I also saw one oops, but saw a note that you pushed a fix for that also. woody -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Thursday, December 16, 2004 9:35 AM To: Woodruff, Robert J Cc: Hal Rosenstock; openib-general at openib.org Subject: Re: [openib-general] IPoIB oops on path record completion Robert> I also seem to be having some partial connectivity Robert> problems. The first 2 nodes seem to be able to Robert> communicate, but adding the 3rd and 4th nodes, they cannot Robert> ping the first 2. Are you running the latest code from svn? I fixed a bug this morning that would cause problems with more than 2 nodes. Thanks, Roland From halr at voltaire.com Thu Dec 16 09:38:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 12:38:38 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52y8fyqh7r.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125600@orsmsx408> <52y8fyqh7r.fsf@topspin.com> Message-ID: <1103218556.4327.77.camel@localhost.localdomain> On Thu, 2004-12-16 at 12:35, Roland Dreier wrote: > Are you running the latest code from svn? I fixed a bug this morning > that would cause problems with more than 2 nodes. I am. 
-- Hal From mshefty at ichips.intel.com Thu Dec 16 09:47:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 09:47:54 -0800 Subject: [openib-general] CM header file In-Reply-To: <20041215174814.C26487@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> Message-ID: <41C1CA4A.6020901@ichips.intel.com> Libor Michalek wrote: > The other option is to destroy the connection if the consumer returns > an error value from the callback. I'll have to think about this. As a personal preference I try to avoid having callbacks return values. But then I'm not thrilled about passing in flags to destroy to handle this either. > Along the same lines, will the > consumer be allowed to call the corresponding response function from > a callback? (e.g. ib_send_cm_rep() from the REQ callback) If not then > the same behaviour could also be achieved with a callback return value. This is a goal of the implementation, and I don't foresee any reason why it can't be done. >>* Should listening clients call an "accept" routine to wait for a >> connection request? Currently, the API operates asynchronously and >> inokves a CM event handler. > > I don't think this is necessary. Presumably the biggest reason to use > "accept" is to force the consumer to use its own thread to handle CM state > changes, thus avoiding CM or MAD thread deadlock if there is a buggy > consumer, or is there another reason to add this step? I didn't think it was necessary either, but wanted to mention it as a possible idea. > However, a nice to have feature which I've grown use to is the ability > to listen to an entire range of service IDs using a value/mask combo. I'll add this. I saw the mask in the Topspin CM API, but didn't look into why it was there, so removed it. Thanks for the feedback. - Sean From robert.j.woodruff at intel.com Thu Dec 16 10:53:31 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 10:53:31 -0800 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1AC79F16F5C5284499BB9591B33D6F000312582B@orsmsx408> >Are you running the latest code from svn? I fixed a bug this morning >that would cause problems with more than 2 nodes. >Thanks, > Roland With the 1348 version I just downloaded, I can now ping from all nodes to all other nodes. I will not try to install and run some MPI tests and/or other tests to exercise the stack. woody From mshefty at ichips.intel.com Thu Dec 16 11:54:57 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 11:54:57 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C1AF15.8000802@systemfabricworks.com> References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> <41C1AF15.8000802@systemfabricworks.com> Message-ID: <41C1E811.9080405@ichips.intel.com> frank zago wrote: > >> >> However, a nice to have feature which I've grown use to is the ability >> to listen to an entire range of service IDs using a value/mask combo. >> >> > > I second this request. It very usefull in some cases. > > Also, I didn't see a provision for peer to peer connections. I'm not sure that peer-to-peer needs to be exposed by the API. The CM should be able to determine the connection model when matching a received MAD with a local service ID. I.e. does the local service ID match with a listen request or a connection request... 
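To make the service ID matching concrete (including the value/mask listens agreed to above), here is a minimal sketch. Illustration only, not part of the patch: the cm_listen bookkeeping structure and list are invented here; only ib_cm_listen()'s service_id/service_mask parameters and struct ib_cm_id come from the posted header.

#include <linux/list.h>
#include <linux/types.h>

struct cm_listen {
	struct list_head list;
	struct ib_cm_id *cm_id;
	u64 service_id;
	u64 service_mask;
};

static LIST_HEAD(cm_listen_list);

/*
 * Match an incoming REQ's service ID against the registered listens.
 * A mask of ~0 gives the usual exact match; a narrower mask lets one
 * listen cover a whole range of service IDs.
 */
static struct ib_cm_id *cm_find_listen(u64 service_id)
{
	struct cm_listen *l;

	list_for_each_entry(l, &cm_listen_list, list)
		if ((service_id & l->service_mask) ==
		    (l->service_id & l->service_mask))
			return l->cm_id;

	return NULL;
}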
- Sean From halr at voltaire.com Thu Dec 16 11:53:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 14:53:46 -0500 Subject: [openib-general] CM header file In-Reply-To: <41C0D5F3.7010307@ichips.intel.com> References: <41C0D5F3.7010307@ichips.intel.com> Message-ID: <1103226826.4327.149.camel@localhost.localdomain> On Wed, 2004-12-15 at 19:25, Sean Hefty wrote: Here are some initial CM comments: Would UC as well as RC be supported ? If so, UC can wait a little for implementing if this adds time. /** > * ib_send_cm_mra - Sends a message receipt acknowledgement to a > connection > * message. > * @cm_id: Connection identifier associated with the connection message. > * @service_timeout: The maximum time required for the sender to reply to > * to the connection message. > * @private_data: Optional user-defined private data sent with the > * message receipt acknowledgement. > * @private_data_len: Size of the private data buffer, in bytes. > */ Couldn't outgoing MRAs be automatically generated rather than forcing the CM client to do this ? Would the CM be 1.1 compliant ? If so, would 1.2 be a future thing ? > struct ib_path_record; //*** TBD: define me somewhere else (ib_smi.h?) Is this the SA PathRecord ? There is ib_sa_path_rec in ib_sa.h. More later as I go through this in more detail... -- Hal From mshefty at ichips.intel.com Thu Dec 16 11:55:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 11:55:56 -0800 Subject: [openib-general] [PATCH] initial CM module Message-ID: <20041216115556.25cdb461.mshefty@ichips.intel.com> This patch adds in the initial CM API and module code. The module loads, unloads, and allocates/deallocates connection structures, but that's about it. This patch does not include changes needed to Kconfig or the Makefile, since I'm not sure that it makes sense to change these yet. I will commit this unless there are any objections. - Sean Index: include/ib_cm.h =================================================================== --- include/ib_cm.h (revision 0) +++ include/ib_cm.h (revision 0) @@ -0,0 +1,387 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ +#if !defined(IB_CM_H) +#define IB_CM_H + +#include +#include + +enum ib_cm_state { + IB_CM_IDLE, + IB_CM_LISTEN, + IB_CM_REQ_SENT, + IB_CM_REQ_RCVD, + IB_CM_MRA_REQ_SENT, + IB_CM_MRA_REQ_RCVD, + IB_CM_REP_SENT, + IB_CM_REP_RCVD, + IB_CM_MRA_REP_SENT, + IB_CM_MRA_REP_RCVD, + IB_CM_ESTABLISHED, + IB_CM_LAP_SENT, + IB_CM_LAP_RCVD, + IB_CM_MRA_LAP_SENT, + IB_CM_MRA_LAP_RCVD, + IB_CM_DREQ_SENT, + IB_CM_DREQ_RCVD, + IB_CM_TIMEWAIT, + IB_CM_SIDR_REQ_SENT, + IB_CM_SIDR_REQ_RCVD +}; + +enum ib_cm_event_type { + IB_CM_REQ_TIMEOUT, + IB_CM_REQ_RECEIVED, + IB_CM_REP_TIMEOUT, + IB_CM_REP_RECEIVED, + IB_CM_RTU_RECEIVED, + IB_CM_DREQ_TIMEOUT, + IB_CM_DREQ_RECEIVED, + IB_CM_DREP_RECEIVED, + IB_CM_MRA_RECEIVED, + IB_CM_LAP_TIMEOUT, + IB_CM_LAP_RECEIVED, + IB_CM_APR_RECEIVED +}; + +struct ib_cm_event { + /* a whole lot more stuff goes here */ + void *private_data; + u8 private_data_len; + enum ib_cm_event_type event; +}; + +struct ib_cm_id; + +typedef void (*ib_cm_handler)(struct ib_cm_id *cm_id, + struct ib_cm_event *event); + +struct ib_cm_id { + ib_cm_handler cm_handler; + void *context; + u64 service_id; + enum ib_cm_state state; +}; + +/** + * ib_create_cm_id - Allocate a connection identifier. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the connection + * identifier. + * + * Connection identifiers are used to track connection states and + * listen requests. + */ +struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, + void *context); + +/** + * ib_destroy_cm_id - Destroy a connection identifier. + * @cm_id: Connection identifier to destroy. + * + * This call blocks until the connection identifier is destroyed. + */ +int ib_destroy_cm_id(struct ib_cm_id *cm_id); +//*** TBD : add flags to allow calling routine from CM callback... + +/** + * ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. + */ +int ib_cm_listen(struct ib_cm_id *cm_id, + u64 service_id, + u64 service_mask); + +struct ib_cm_req_param { + struct ib_qp *qp; + struct ib_path_record *primary_path; + struct ib_path_record *alternate_path; + u64 service_id; + int timeout_ms; + void *private_data; + u8 private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 remote_cm_response_timeout; + u8 flow_control; + u8 local_cm_response_timeout; + u8 retry_count; + u8 rnr_retry_count; + u8 max_cm_retries; +}; + +/** + * ib_send_cm_req - Sends a connection request to the remote node. + * @cm_id: Connection identifier that will be associated with the + * connection request. + * @param: Connection request information needed to establish the + * connection. + */ +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + +struct ib_cm_rep_param { + struct ib_qp *qp; + void *private_data; + u8 reply_private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 target_ack_delay; + u8 failover_accepted; + u8 flow_control; + u8 rnr_retry_count; +}; + +/** + * ib_send_cm_rep - Sends a connection reply in response to a connection + * request. + * @cm_id: Connection identifier that will be associated with the + * connection request. 
+ * @param: Connection reply information needed to establish the + * connection. + */ +int ib_send_cm_rep(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + +/** + * ib_send_cm_rtu - Sends a connection ready to use message in response + * to a connection reply message. + * @cm_id: Connection identifier associated with the connection request. + * @private_data: Optional user-defined private data sent with the + * ready to use message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_rtu(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_dreq - Sends a disconnection request for an existing + * connection. + * @cm_id: Connection identifier associated with the connection being + * released. + * @private_data: Optional user-defined private data sent with the + * disconnection request message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_dreq(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_drep - Sends a disconnection reply to a disconnection request. + * @cm_id: Connection identifier associated with the connection being + * released. + * @private_data: Optional user-defined private data sent with the + * disconnection reply message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_drep(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_cm_establish - Forces a connection state to established. + * @cm_id: Connection identifier to transition to established. + * + * This routine should be invoked by users who receive messages on a + * connected QP before an RTU has been received. + */ +int ib_cm_establish(struct ib_cm_id *id); + +enum ib_cm_rej_reason { + IB_CM_REJ_NO_QP = __constant_htons(1), + IB_CM_REJ_NO_EEC = __constant_htons(2), + IB_CM_REJ_NO_RESOURCES = __constant_htons(3), + IB_CM_REJ_TIMEOUT = __constant_htons(4), + IB_CM_REJ_UNSUPPORTED = __constant_htons(5), + IB_CM_REJ_INVALID_COMM_ID = __constant_htons(6), + IB_CM_REJ_INVALID_COMM_INSTANCE = __constant_htons(7), + IB_CM_REJ_INVALID_SERVICE_ID = __constant_htons(8), + IB_CM_REJ_INVALID_TRANSPORT_TYPE = __constant_htons(9), + IB_CM_REJ_STALE_CONN = __constant_htons(10), + IB_CM_REJ_RDC_NOT_EXIST = __constant_htons(11), + IB_CM_REJ_INVALID_GID = __constant_htons(12), + IB_CM_REJ_INVALID_LID = __constant_htons(13), + IB_CM_REJ_INVALID_SL = __constant_htons(14), + IB_CM_REJ_INVALID_TRAFFIC_CLASS = __constant_htons(15), + IB_CM_REJ_INVALID_HOP_LIMIT = __constant_htons(16), + IB_CM_REJ_INVALID_PACKET_RATE = __constant_htons(17), + IB_CM_REJ_INVALID_ALT_GID = __constant_htons(18), + IB_CM_REJ_INVALID_ALT_LID = __constant_htons(19), + IB_CM_REJ_INVALID_ALT_SL = __constant_htons(20), + IB_CM_REJ_INVALID_ALT_TRAFFIC_CLASS = __constant_htons(21), + IB_CM_REJ_INVALID_ALT_HOP_LIMIT = __constant_htons(22), + IB_CM_REJ_INVALID_ALT_PACKET_RATE = __constant_htons(23), + IB_CM_REJ_PORT_REDIRECT = __constant_htons(24), + IB_CM_REJ_INVALID_MTU = __constant_htons(26), + IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES = __constant_htons(27), + IB_CM_REJ_CONSUMER_DEFINED = __constant_htons(28), + IB_CM_REJ_INVALID_RNR_RETRY = __constant_htons(29), + IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = __constant_htons(30), + IB_CM_REJ_INVALID_CLASS_VERSION = __constant_htons(31), + IB_CM_REJ_INVALID_FLOW_LABEL = __constant_htons(32), + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = __constant_htons(33) +}; + +/** + * ib_send_cm_rej - Sends a connection 
rejection message to the + * remote node. + * @cm_id: Connection identifier associated with the connection being + * rejected. + * @reason: Reason for the connection request rejection. + * @ari: Optional additional rejection information. + * @ari_length: Size of the additional rejection information, in bytes. + * @private_data: Optional user-defined private data sent with the + * rejection message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_rej(struct ib_cm_id *cm_id, + enum ib_cm_rej_reason reason, + void *ari, + u8 ari_length, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_mra - Sends a message receipt acknowledgement to a connection + * message. + * @cm_id: Connection identifier associated with the connection message. + * @service_timeout: The maximum time required for the sender to reply to + * to the connection message. + * @private_data: Optional user-defined private data sent with the + * message receipt acknowledgement. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_mra(struct ib_cm_id *cm_id, + u8 service_timeout, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_lap - Sends a load alternate path request. + * @cm_id: Connection identifier associated with the load alternate path + * message. + * @alternate_path: A path record that identifies the alternate path to + * load. + * @private_data: Optional user-defined private data sent with the + * load alternate path message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_lap(struct ib_cm_id *cm_id, + struct ib_path_record *alternate_path, + void *private_data, + u8 private_data_len); + +enum ib_cm_apr_status { + IB_CM_APR_SUCCESS, + IB_CM_APR_INVALID_COMM_ID, + IB_CM_APR_UNSUPPORTED, + IB_CM_APR_REJECT, + IB_CM_APR_REDIRECT, + IB_CM_APR_IS_CURRENT, + IB_CM_APR_INVALID_QPN_EECN, + IB_CM_APR_INVALID_LID, + IB_CM_APR_INVALID_GID, + IB_CM_APR_INVALID_FLOW_LABEL, + IB_CM_APR_INVALID_TCLASS, + IB_CM_APR_INVALID_HOP_LIMIT, + IB_CM_APR_INVALID_PACKET_RATE, + IB_CM_APR_INVALID_SL +}; + +/** + * ib_send_cm_apr - Sends an alternate path response message in response to + * a load alternate path request. + * @cm_id: Connection identifier associated with the alternate path response. + * @status: Reply status sent with the alternate path response. + * @info: Optional additional information sent with the alternate path + * response. + * @info_length: Size of the additional information, in bytes. + * @private_data: Optional user-defined private data sent with the + * alternate path response message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_apr(struct ib_cm_id *cm_id, + enum ib_cm_apr_status status, + void *info, + u8 info_length, + void *private_data, + u8 private_data_len); + +struct ib_cm_sidr_req_param { + struct ib_path_record *path; + u64 service_id; + int timeout_ms; + void *private_data; + u8 private_data_len; + u16 pkey; +}; + +/** + * ib_send_cm_sidr_req - Sends a service ID resolution request to the + * remote node. + * @cm_id: Communication identifier that will be associated with the + * service ID resolution request. + * @param: Service ID resolution request information. 
+ */ +int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param); + +enum ib_cm_sidr_status { + IB_SIDR_SUCCESS, + IB_SIDR_UNSUPPORTED, + IB_SIDR_REJECT, + IB_SIDR_NO_QP, + IB_SIDR_REDIRECT, + IB_SIDR_UNSUPPORTED_VERSION +}; + +struct ib_cm_sidr_rep_param { + u32 qp_num; + u32 qkey; + enum ib_cm_sidr_status status; + void *info; + u8 info_length; + void *private_data; + u8 private_data_len; +}; + +/** + * ib_send_cm_sidr_rep - Sends a service ID resolution request to the + * remote node. + * @cm_id: Communication identifier associated with the received service ID + * resolution request. + * @param: Service ID resolution reply information. + */ +int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param); + +#endif /* IB_CM_H */ Index: core/cm.c =================================================================== --- core/cm.c (revision 0) +++ core/cm.c (revision 0) @@ -0,0 +1,292 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void cm_add_one(struct ib_device *device); +static void cm_remove_one(struct ib_device *device); + +static struct ib_client cm_client = { + .name = "cm", + .add = cm_add_one, + .remove = cm_remove_one +}; + +struct cm_port { + struct ib_mad_agent *mad_agent; +}; + +struct ib_cm_id_private { + struct ib_cm_id id; + + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; +}; + +struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, + void *context) +{ + struct ib_cm_id_private *cm_id_priv; + + cm_id_priv = kmalloc(sizeof *cm_id_priv, GFP_KERNEL); + if (!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->id.service_id = 0; + cm_id_priv->id.state = IB_CM_IDLE; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + + spin_lock_init(&cm_id_priv->lock); + init_waitqueue_head(&cm_id_priv->wait); + atomic_set(&cm_id_priv->refcount, 1); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(ib_create_cm_id); + +static void reset_cm_state(struct ib_cm_id_private *cm_id_priv) +{ + /* reject connections if establishing */ + /* disconnect established connections */ + /* update timewait info */ +} + +int ib_destroy_cm_id(struct ib_cm_id *cm_id) +{ + struct ib_cm_id_private *cm_id_priv; + unsigned long flags; + + cm_id_priv = container_of(cm_id, struct ib_cm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch(cm_id->state) { + case IB_CM_IDLE: + case IB_CM_LISTEN: + case IB_CM_TIMEWAIT: + break; /* Connection is ready to be destroyed. */ + default: + reset_cm_state(cm_id_priv); + break; + } + cm_id->state = IB_CM_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + atomic_dec(&cm_id_priv->refcount); + wait_event(cm_id_priv->wait, + !atomic_read(&cm_id_priv->refcount)); + kfree(cm_id_priv); + return 0; +} +EXPORT_SYMBOL(ib_destroy_cm_id); + +int ib_cm_listen(struct ib_cm_id *cm_id, + u64 service_id, + u64 service_mask) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_cm_listen); + +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_req); + +int ib_send_cm_rep(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rep); + +int ib_send_cm_rtu(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rtu); + +int ib_send_cm_dreq(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_dreq); + +int ib_send_cm_drep(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_drep); + +int ib_cm_establish(struct ib_cm_id *id) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_cm_establish); + +int ib_send_cm_rej(struct ib_cm_id *cm_id, + enum ib_cm_rej_reason reason, + void *ari, + u8 ari_length, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rej); + +int ib_send_cm_mra(struct ib_cm_id *cm_id, + u8 service_timeout, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_mra); + +int ib_send_cm_lap(struct ib_cm_id *cm_id, + struct ib_path_record *alternate_path, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_lap); + +int ib_send_cm_apr(struct ib_cm_id *cm_id, + enum 
ib_cm_apr_status status, + void *info, + u8 info_length, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_apr); + +int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_sidr_req); + +int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_sidr_rep); + +static void send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct cm_port *port; + + port = (struct cm_port *)mad_agent->context; +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct cm_port *port; + + port = (struct cm_port *)mad_agent->context; +} + +static void cm_add_one(struct ib_device *device) +{ + struct cm_port *port; + struct ib_mad_reg_req reg_req; + u8 i; + + port = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); + if (!port) + goto out; + + memset(®_req, 0, sizeof reg_req); + reg_req.mgmt_class = IB_MGMT_CLASS_CM; + reg_req.mgmt_class_version = 1; + set_bit(IB_MGMT_METHOD_GET, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_SET, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_GET_RESP, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); + for (i = 1; i <= device->phys_port_cnt; i++) { + port[i].mad_agent = ib_register_mad_agent(device, i, + IB_QPT_GSI, + ®_req, + 0, + send_handler, + recv_handler, + &port[i]); + } + +out: + ib_set_client_data(device, &cm_client, port); +} + +static void cm_remove_one(struct ib_device *device) +{ + struct cm_port *port; + int i; + + port = (struct cm_port *)ib_get_client_data(device, &cm_client); + if (!port) + return; + + for (i = 1; i <= device->phys_port_cnt; i++) { + if (!IS_ERR(port[i].mad_agent)) + ib_unregister_mad_agent(port[i].mad_agent); + } + kfree(port); +} + +static int __init ib_cm_init(void) +{ + return ib_register_client(&cm_client); +} + +static void __exit ib_cm_cleanup(void) +{ + ib_unregister_client(&cm_client); +} + +module_init(ib_cm_init); +module_exit(ib_cm_cleanup); From mshefty at ichips.intel.com Thu Dec 16 11:58:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 11:58:46 -0800 Subject: [openib-general] [PATCH] new CM test utility Message-ID: <20041216115846.11f7e956.mshefty@ichips.intel.com> This patch adds in a new test utility framework for CM development. It's currently located in the gen2/utils directory. I will commit this change unless there are any objections. - Sean Index: util/cmpost/Kconfig =================================================================== --- util/cmpost/Kconfig (revision 0) +++ util/cmpost/Kconfig (revision 0) @@ -0,0 +1,6 @@ +config INFINIBAND_CMPOST + tristate "Connection manager test utility for InfiniBand" + depends on INFINIBAND + ---help--- + Test module for Infiniband connection manager. + Index: util/cmpost/cmpost.c =================================================================== --- util/cmpost/cmpost.c (revision 0) +++ util/cmpost/cmpost.c (revision 0) @@ -0,0 +1,100 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * + * $Id$ + */ + +#include +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand CM test utility"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void cmpost_remove_one(struct ib_device *device); +static void cmpost_add_one(struct ib_device *device); + +static struct ib_client cmpost_client = { + .name = "cmpost", + .add = cmpost_add_one, + .remove = cmpost_remove_one +}; + +struct cmpost_port { + struct ib_cm_id *listen_cm_id; + struct ib_cm_id *conn_cm_id; +}; + +void cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ +} + +static void cmpost_add_one(struct ib_device *device) +{ + struct cmpost_port *port; + int i; + + port = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); + if (!port) + goto out; + + for (i = 1; i <= device->phys_port_cnt; i++) { + port[i].listen_cm_id = ib_create_cm_id(cm_handler, &port[i]); + port[i].conn_cm_id = ib_create_cm_id(cm_handler, &port[i]); + } + +out: + ib_set_client_data(device, &cmpost_client, port); +} + +static void cmpost_remove_one(struct ib_device *device) +{ + struct cmpost_port *port; + int i; + + port = (struct cmpost_port *) + ib_get_client_data(device, &cmpost_client); + if (!port) + return; + + for (i = 1; i <= device->phys_port_cnt; i++) { + if (!IS_ERR(port[i].listen_cm_id)) + ib_destroy_cm_id(port[i].listen_cm_id); + if (!IS_ERR(port[i].conn_cm_id)) + ib_destroy_cm_id(port[i].conn_cm_id); + } + kfree(port); +} + +static int __init ib_cmpost_init(void) +{ + return ib_register_client(&cmpost_client); +} + +static void __exit ib_cmpost_cleanup(void) +{ + ib_unregister_client(&cmpost_client); +} + +module_init(ib_cmpost_init); +module_exit(ib_cmpost_cleanup); Index: util/cmpost/Makefile =================================================================== --- util/cmpost/Makefile (revision 0) +++ util/cmpost/Makefile (revision 0) @@ -0,0 +1,6 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_CMPOST) += ib_cmpost.o + +ib_cmpost-y := cmpost.o \ + From mshefty at ichips.intel.com Thu Dec 16 12:14:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 12:14:18 -0800 Subject: [openib-general] CM header file In-Reply-To: <1103226826.4327.149.camel@localhost.localdomain> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> Message-ID: <41C1EC9A.5030005@ichips.intel.com> Hal Rosenstock wrote: > Would UC as well as RC be supported ? If so, UC can wait a little for > implementing if this adds time. I'm not sure that RC or UC matters much to the CM. The CM will probably just read the value from the QP type. > /** > >> * ib_send_cm_mra - Sends a message receipt acknowledgement to a >>connection >> * message. >> * @cm_id: Connection identifier associated with the connection message. >> * @service_timeout: The maximum time required for the sender to reply to >> * to the connection message. 
>> * @private_data: Optional user-defined private data sent with the >> * message receipt acknowledgement. >> * @private_data_len: Size of the private data buffer, in bytes. >> */ > > Couldn't outgoing MRAs be automatically generated rather than forcing > the CM client to do this ? I'm not sure that there's much value in abstracting this call. Also, there would need to be a way for the client to specify the private data and service timeout, which could change per message. > Would the CM be 1.1 compliant ? If so, would 1.2 be a future thing ? I was looking at the 1.2 version of the spec, but was planning on pulling code from the existing Topspin CM. I'm guessing that it's 1.1 compliant. Depending on the differences between 1.1 and 1.2, I can try to support both. >>struct ib_path_record; //*** TBD: define me somewhere else (ib_smi.h?) > > Is this the SA PathRecord ? There is ib_sa_path_rec in ib_sa.h. Thanks - I included ib_sa.h from ib_cm.h. One final note, I'm hoping that a more abstracted CM could be layered on top of this one, if it were desired. E.g. one that performs QP transitions, automatically generates MRAs, retries requests, etc. - Sean From halr at voltaire.com Thu Dec 16 12:19:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 15:19:32 -0500 Subject: [openib-general] IPoIB Partial Connectivity Scenario Message-ID: <1103228372.4327.167.camel@localhost.localdomain> I've looked at the remote side to understand what it was (or wasn't doing). The partial connectivity stems from an issue in resolving the path on the remote side. I have a proposal: Rather than a single SA Get(PathRecord) with a 1 second timeout, what about a retry or two with a smaller (0.33 - 0.5 sec) timeout ? SA Get/GetResp is inherently unreliable and these could be retried. -- Hal From halr at voltaire.com Thu Dec 16 12:28:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 15:28:21 -0500 Subject: [openib-general] CM header file In-Reply-To: <41C1EC9A.5030005@ichips.intel.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> Message-ID: <1103228901.4327.188.camel@localhost.localdomain> On Thu, 2004-12-16 at 15:14, Sean Hefty wrote: > Hal Rosenstock wrote: > > > Would UC as well as RC be supported ? If so, UC can wait a little for > > implementing if this adds time. > > I'm not sure that RC or UC matters much to the CM. The CM will > probably just read the value from the QP type. So it's just a pass through in terms of the components ? > > Would the CM be 1.1 compliant ? If so, would 1.2 be a future thing ? > > I was looking at the 1.2 version of the spec, but was planning on > pulling code from the existing Topspin CM. I'm guessing that it's 1.1 > compliant. I would state it a little differently: It's intended to be 1.1 compliant. > Depending on the differences between 1.1 and 1.2, I can try > to support both. The class version for CM was bumped for 1.2. A 1.2 CM could be made to be compatible with 1.1 (revert back to the previous class version and formats) but this is more work. It depends on what this needs to interoperate with as to what the requirement is here. 
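To make the compatibility point concrete, a check on received REQs could look like the sketch below. Illustration only: cm_reject_req() is a made-up helper standing in for building and posting the REJ, the numeric class versions are my assumption about how the version was bumped, and I am assuming the MAD header exposes a class_version field.

static u8 cm_class_version = 1;	/* bump to 2 if/when 1.2 formats are added */

static int cm_check_class_version(struct ib_cm_id *cm_id,
				  struct ib_mad_hdr *hdr)
{
	if (hdr->class_version == cm_class_version)
		return 0;

	/* IB_CM_REJ_INVALID_CLASS_VERSION is already in the posted ib_cm.h */
	cm_reject_req(cm_id, IB_CM_REJ_INVALID_CLASS_VERSION);
	return -EINVAL;
}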
-- Hal From mshefty at ichips.intel.com Thu Dec 16 12:37:49 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 12:37:49 -0800 Subject: [openib-general] CM header file In-Reply-To: <1103228901.4327.188.camel@localhost.localdomain> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <1103228901.4327.188.camel@localhost.localdomain> Message-ID: <41C1F21D.4010709@ichips.intel.com> Hal Rosenstock wrote: >>>Would UC as well as RC be supported ? If so, UC can wait a little for >>>implementing if this adds time. >> >>I'm not sure that RC or UC matters much to the CM. The CM will >>probably just read the value from the QP type. > > So it's just a pass through in terms of the components ? I believe so. I think that the only place the CM needs to know the difference is when setting the transport service type in the REQ message. >> Depending on the differences between 1.1 and 1.2, I can try >>to support both. > > The class version for CM was bumped for 1.2. A 1.2 CM could be made to > be compatible with 1.1 (revert back to the previous class version and > formats) but this is more work. It depends on what this needs to > interoperate with as to what the requirement is here. I think it makes sense to target 1.1 for now. If the amount of effort to support 1.2 is minimal, I will add in that support. - Sean From halr at voltaire.com Thu Dec 16 12:41:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 15:41:32 -0500 Subject: [openib-general] GUID/EUI-64 Issue In-Reply-To: <1102514316.4141.843.camel@localhost.localdomain> References: <1102514316.4141.843.camel@localhost.localdomain> Message-ID: <1103229681.4327.214.camel@localhost.localdomain> On Wed, 2004-12-08 at 08:58, Hal Rosenstock wrote: > Hi, > > Did we come to closure on how to handle the GUID/EUI-64 issue ? > > -- Hal > > On Thu, 2004-11-11 at 13:11, Roland Dreier wrote: > > My only questions are: > > > > + eui[0] ^= 2; > > > > I remember some discussion about whether IBTA GUIDs are already > > modified EUI-64 or not. Is this the correct transformation or > should > > we be doing something like "eui[0] |= 2;" (ie assume the universal > bit > > should always be set in our IPv6 address)? > > IBTA GUIDs are EUI-64. The only issue I recall was whether the > polarity > of the U/G bit was consistent with IEEE. This was updated at IBA 1.2. > It > now says "manufacturer assigns EUI-64 with global scope set. May also > assign additional EUI-64 with local scope." Here's what the IPoIB I-D says on this: [AARCH] requires the interface identifier be created in the "Modified EUI-64" format when derived from an EUI-64 identifier. [IBTA] is unclear if the GUID should use IEEE EUI-64 format or the "Modified EUI-64" format. Therefore, when creating an interface identifier from the GUID an implementation MUST do the following: => Determine if the GUID is a modified EUI-64 identifier ("u" bit is toggled) as defined by [AARCH] => If the GUID is a modified EUI-64 identifier then the "u" bit MUST NOT be toggled when creating the interface identifier => If the GUID is an unmodified EUI-64 identifier then the "u" bit MUST be toggled in compliance with [AARCH] This confusion is due to IBA 1.1 and previous versions. It has been fixed at IBA 1.2. 
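In code, the I-D's rules boil down to something like this. Illustration only: ipoib_guid_to_iid() is not a function in the tree, and how an implementation decides that a GUID is "already modified" is exactly the ambiguity described above.

#include <linux/types.h>
#include <linux/string.h>

#define EUI64_UL_BIT 0x02	/* universal/local bit in octet 0 */

static void ipoib_guid_to_iid(const u8 guid[8], int guid_is_modified,
			      u8 iid[8])
{
	memcpy(iid, guid, 8);

	/*
	 * AARCH wants the interface identifier in Modified EUI-64 form:
	 * toggle the u bit only when the GUID is an unmodified EUI-64
	 * (the "eui[0] ^= 2" case); if the vendor already produced a
	 * modified identifier, leave the bit alone.
	 */
	if (!guid_is_modified)
		iid[0] ^= EUI64_UL_BIT;
}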
-- Hal From roland at topspin.com Thu Dec 16 12:49:51 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 12:49:51 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C1E811.9080405@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Dec 2004 11:54:57 -0800") References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> <41C1AF15.8000802@systemfabricworks.com> <41C1E811.9080405@ichips.intel.com> Message-ID: <52oeguq87k.fsf@topspin.com> Sean> I'm not sure that peer-to-peer needs to be exposed by the Sean> API. The CM should be able to determine the connection Sean> model when matching a received MAD with a local service ID. Sean> I.e. does the local service ID match with a listen request Sean> or a connection request... Nope, a consumer has to say whether an active connection request is peer-to-peer or not. It's entirely possible for two systems to be listening for the same service, and want to make two connections (ie the normal TCP-like case where system A connects to port X on system B at the same time as system B connects to the same port X on system A). Or the application may want peer-to-peer semantics. There's no way for the CM to know without being told. - R. From roland at topspin.com Thu Dec 16 12:53:05 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 12:53:05 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <20041216115556.25cdb461.mshefty@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Dec 2004 11:55:56 -0800") References: <20041216115556.25cdb461.mshefty@ichips.intel.com> Message-ID: <52k6riq826.fsf@topspin.com> Can you use the new copyright header I posted? Also it's good to hold off on Makefile/Kconfig changes for now, since that will simplify generating patches for upstream merging. If it gets too difficult to hold back on the CM, I would suggest developing the CM on a branch. - R. From halr at voltaire.com Thu Dec 16 12:55:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 15:55:08 -0500 Subject: [openib-general] IPoIB Path Static Rate Message-ID: <1103230507.4327.224.camel@localhost.localdomain> Hi Roland, It looks to me like after obtaining the PathRecord, the static rate is not used when the AV is created. Shouldn't it be ? Is there an issue with doing this ? There is a similar issue with the multicast AVs as well. I know there is an assumption that everything is 4x but I am not sure that is good one to lock in. -- Hal From roland at topspin.com Thu Dec 16 13:01:39 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 13:01:39 -0800 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <1103228372.4327.167.camel@localhost.localdomain> (Hal Rosenstock's message of "16 Dec 2004 15:19:32 -0500") References: <1103228372.4327.167.camel@localhost.localdomain> Message-ID: <52fz26q7nw.fsf@topspin.com> Hal> I have a proposal: Rather than a single SA Get(PathRecord) Hal> with a 1 second timeout, what about a retry or two with a Hal> smaller (0.33 - 0.5 sec) timeout ? SA Get/GetResp is Hal> inherently unreliable and these could be retried. Sure, that sounds reasonable. I had thought that future packets would cause the path record to be retried, but maybe that's not happening. - R. 
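As a sketch of what the proposed retry could look like on the ipoib side (illustration only: path_rec_start() here stands in for the existing call into the SA query code with an added per-attempt timeout, and query_retries_left is an invented field, initialized to IPOIB_PATH_QUERY_RETRIES when the first query is posted):

#define IPOIB_PATH_QUERY_MS		400	/* 0.33-0.5 sec per attempt */
#define IPOIB_PATH_QUERY_RETRIES	2

static void path_rec_completion(int status, struct ib_sa_path_rec *pathrec,
				void *path_ptr)
{
	struct ipoib_path *path = path_ptr;

	if (status == -ETIMEDOUT && path->query_retries_left-- > 0) {
		/* SA Get/GetResp is unreliable; reissue the query with
		 * the shorter per-attempt timeout instead of giving up. */
		path_rec_start(path, IPOIB_PATH_QUERY_MS);
		return;
	}

	/* existing success/failure handling continues here */
}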
From mshefty at ichips.intel.com Thu Dec 16 13:09:38 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 13:09:38 -0800 Subject: [openib-general] CM header file In-Reply-To: <52oeguq87k.fsf@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> <41C1AF15.8000802@systemfabricworks.com> <41C1E811.9080405@ichips.intel.com> <52oeguq87k.fsf@topspin.com> Message-ID: <41C1F992.4010908@ichips.intel.com> Roland Dreier wrote: > Sean> I'm not sure that peer-to-peer needs to be exposed by the > Sean> API. The CM should be able to determine the connection > Sean> model when matching a received MAD with a local service ID. > Sean> I.e. does the local service ID match with a listen request > Sean> or a connection request... > > Nope, a consumer has to say whether an active connection request is > peer-to-peer or not. It's entirely possible for two systems to be > listening for the same service, and want to make two connections (ie > the normal TCP-like case where system A connects to port X on system B > at the same time as system B connects to the same port X on system > A). Or the application may want peer-to-peer semantics. There's no > way for the CM to know without being told. My thinking was that the connection model isn't carried in the CM MADs however, so a receiving CM has to determine how to match the connection request based on what the remote user requested, which isn't known. I guess that by not having this parameter, an incoming request could be matched incorrectly with an outgoing request, if the local listening service isn't up. I'll add a peer_to_peer parameter to ib_cm_req_param. - Sean From mshefty at ichips.intel.com Thu Dec 16 13:11:33 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 13:11:33 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <52k6riq826.fsf@topspin.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <52k6riq826.fsf@topspin.com> Message-ID: <41C1FA05.6020103@ichips.intel.com> Roland Dreier wrote: > Can you use the new copyright header I posted? Will do. > Also it's good to hold off on Makefile/Kconfig changes for now, since > that will simplify generating patches for upstream merging. If it > gets too difficult to hold back on the CM, I would suggest developing > the CM on a branch. It's not a problem holding off these changes. - Sean From halr at voltaire.com Thu Dec 16 13:12:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 16:12:33 -0500 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <52fz26q7nw.fsf@topspin.com> References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> Message-ID: <1103231553.4327.230.camel@localhost.localdomain> On Thu, 2004-12-16 at 16:01, Roland Dreier wrote: > Sure, that sounds reasonable. I had thought that future packets would > cause the path record to be retried, but maybe that's not happening. If that were to happen that would be fine too. I don't see them on subsequent packets. Not sure what is going on. Can you artificially drop a GetResp ? That might reproduce what I am seeing. 
-- Hal From halr at voltaire.com Thu Dec 16 14:05:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 17:05:07 -0500 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <1103231553.4327.230.camel@localhost.localdomain> References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> <1103231553.4327.230.camel@localhost.localdomain> Message-ID: <1103234707.4214.8.camel@localhost.localdomain> On the remote node to which connectivity fails, it has a stale arp cache entry which does not seem to go away as if the timer is not started. Is that possible ? Is there a case where the ARP entry is created but not timed ? /sbin/ip neigh show dev ib0 192.168.0.1 lladdr 00:00:04:04:fe:80:00:00:00:00:00:00:00:54:42:b1:00:00:09:01 nud stale The other nodes' entries (they have bidirectional connectivity) do go from reachable to delay to stale to "gone". -- Hal From fzago at systemfabricworks.com Thu Dec 16 14:10:52 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 16:10:52 -0600 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <20041216115556.25cdb461.mshefty@ichips.intel.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> Message-ID: <41C207EC.301@systemfabricworks.com> Hi Sean, >This patch adds in the initial CM API and module code. The module loads, >unloads, and allocates/deallocates connection structures, but that's about it. >This patch does not include changes needed to Kconfig or the Makefile, since I'm >not sure that it makes sense to change these yet. > >I will commit this unless there are any objections. >[...] > >+int ib_send_cm_req(struct ib_cm_id *cm_id, >+ struct ib_cm_req_param *param); >+ > >+int ib_send_cm_rep(struct ib_cm_id *cm_id, >+ struct ib_cm_req_param *param); >+ >+int ib_send_cm_rtu(struct ib_cm_id *cm_id, >+ void *private_data, >+ u8 private_data_len); > > I've used several CM and I found this kind of interface to be painful to use. I'd rather see an interface similar to Topspin's where you register a CM callback, get CM events and react (or not) to these. With the interface you propose it takes maybe 200 lines of code to establish a simple connection, while with a callback it can be down to 30 lines. It should be as easy as possible for an application or a driver to establish a connection. I shouldn't have to rewrite a CM state machine every time I need a connection. Frank. From mshefty at ichips.intel.com Thu Dec 16 14:30:17 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 14:30:17 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C207EC.301@systemfabricworks.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> Message-ID: <41C20C79.9010406@ichips.intel.com> frank zago wrote: >> +int ib_send_cm_req(struct ib_cm_id *cm_id, >> + struct ib_cm_req_param *param); >> + The Topspin API: int ib_cm_connect(struct ib_cm_active_param *param, struct ib_path_record *primary_path, struct ib_path_record *alternate_path, tTS_IB_SERVICE_ID service_id, int peer_to_peer, ib_cm_callback_func function, void *arg, tTS_IB_CM_COMM_ID *comm_id) Main difference here is that the parameters have been folded into one. (I'm not sure why ib_cm_active_param didn't just include the others.) >> +int ib_send_cm_rep(struct ib_cm_id *cm_id, >> + struct ib_cm_req_param *param); int ib_cm_accept(tTS_IB_CM_COMM_ID comm_id, struct ib_cm_passive_param *param) Nearly identical. 
Missing field was added to ib_cm_req_param that was lacking in ib_cm_passive_param. >> +int ib_send_cm_rtu(struct ib_cm_id *cm_id, >> + void *private_data, >> + u8 private_data_len); int ib_cm_confirm(tTS_IB_CM_COMM_ID comm_id, void *rtu_private_data, int rtu_private_data_len) Same. > > I've used several CM and I found this kind of interface to be painful to > use. > I'd rather see an interface similar to Topspin's where you register a CM > callback, get CM events and react (or not) to these. As pointed out above, this interface is similar to Topspin's. > With the interface you propose it takes maybe 200 lines of code to > establish a simple connection, while with a callback it can be down to > 30 lines. It should be as easy as possible for an application or a > driver to establish a connection. I shouldn't have to rewrite a CM state > machine every time I need a connection. This API does use a callback model based on events. You register for a callback by calling ib_create_cm_id. - Sean From fzago at systemfabricworks.com Thu Dec 16 14:35:09 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 16:35:09 -0600 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C20C79.9010406@ichips.intel.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> <41C20C79.9010406@ichips.intel.com> Message-ID: <41C20D9D.7080502@systemfabricworks.com> >> With the interface you propose it takes maybe 200 lines of code to >> establish a simple connection, while with a callback it can be down >> to 30 lines. It should be as easy as possible for an application or a >> driver to establish a connection. I shouldn't have to rewrite a CM >> state machine every time I need a connection. > > > This API does use a callback model based on events. You register for > a callback by calling ib_create_cm_id. > > - Sean > > I was under the impression that with your proposal, an application would have to send REQ/REP/RTU/... mads. Is that the case? From libor at topspin.com Thu Dec 16 14:37:58 2004 From: libor at topspin.com (Libor Michalek) Date: Thu, 16 Dec 2004 14:37:58 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C207EC.301@systemfabricworks.com>; from fzago@systemfabricworks.com on Thu, Dec 16, 2004 at 04:10:52PM -0600 References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> Message-ID: <20041216143758.D26487@topspin.com> On Thu, Dec 16, 2004 at 04:10:52PM -0600, frank zago wrote: > > I've used several CM and I found this kind of interface to be painful to > use. > I'd rather see an interface similar to Topspin's where you register a CM > callback, get CM events and react (or not) to these. > > With the interface you propose it takes maybe 200 lines of code to > establish a simple connection, while with a callback it can be down to > 30 lines. It should be as easy as possible for an application or a > driver to establish a connection. I shouldn't have to rewrite a CM state > machine every time I need a connection. This ties into what I was saying about an error return value from the consumer callback being treated as a connection handle destroy request. There were three return types supported: error - connection handle destroy request defer - CM requires an API call to continue. (e.g. REQ callback requires a send_rep() call to continue) What the Sean's API does by default. success - CM generates callback response. (e.g. 
REQ callback completed successfully so the CM generates a REP response.) For this to work the corresponsing consumer callback needs to have return parameters to retrieve the parameters to generate the next CM response. So the callback consumer must set the return parameters if it uses the success return value. However, I'm not sure that not having this ability is that big of a code bloat, since you will call the corresponsing API function from the callback, and presumably set similar parameters for the API call as you would set for the callback return parameters. The bigger increase in consumer line count is having each consumer perform the QP state transitions. -Libor From mshefty at ichips.intel.com Thu Dec 16 14:41:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 14:41:56 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C20D9D.7080502@systemfabricworks.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> <41C20C79.9010406@ichips.intel.com> <41C20D9D.7080502@systemfabricworks.com> Message-ID: <41C20F34.6000404@ichips.intel.com> frank zago wrote: > >>> With the interface you propose it takes maybe 200 lines of code to >>> establish a simple connection, while with a callback it can be down >>> to 30 lines. It should be as easy as possible for an application or a >>> driver to establish a connection. I shouldn't have to rewrite a CM >>> state machine every time I need a connection. >> >> >> >> This API does use a callback model based on events. You register for >> a callback by calling ib_create_cm_id. >> >> - Sean >> >> > I was under the impression that with your proposal, an application would > have to send REQ/REP/RTU/... mads. Is that the case? The MADs are allocated and formated by the CM. The function names are descriptive of the action taken by the CM, but the user does not actually deal with MADs. I'm open to changing the name, but I would prefer it reflect what the CM is doing. The CM also manages the connection state and will fail any call done while in the wrong state. (This should be similar to the Topspin code.) - Sean From libor at topspin.com Thu Dec 16 14:42:26 2004 From: libor at topspin.com (Libor Michalek) Date: Thu, 16 Dec 2004 14:42:26 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C1EC9A.5030005@ichips.intel.com>; from mshefty@ichips.intel.com on Thu, Dec 16, 2004 at 12:14:18PM -0800 References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> Message-ID: <20041216144226.E26487@topspin.com> On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: > > One final note, I'm hoping that a more abstracted CM could be layered > on top of this one, if it were desired. E.g. one that performs QP > transitions, automatically generates MRAs, retries requests, etc. Are you suggesting that this proposed CM layer would not be responsible for retries and timeouts? I'm not sure how useful that would be, seems like every CM consumer would need/want these capabilities. 
-Libor From mshefty at ichips.intel.com Thu Dec 16 14:47:45 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 14:47:45 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <20041216143758.D26487@topspin.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> <20041216143758.D26487@topspin.com> Message-ID: <41C21091.5000305@ichips.intel.com> Libor Michalek wrote: > This ties into what I was saying about an error return value from the > consumer callback being treated as a connection handle destroy request. > There were three return types supported: I'm not opposed to this. I just haven't thought about it enough. > error - connection handle destroy request > defer - CM requires an API call to continue. (e.g. REQ callback > requires a send_rep() call to continue) What the Sean's > API does by default. > success - CM generates callback response. (e.g. REQ callback completed > successfully so the CM generates a REP response.) > > However, I'm not sure that not having this ability is that big of a > code bloat, since you will call the corresponsing API function from > the callback, and presumably set similar parameters for the API call > as you would set for the callback return parameters. The bigger increase > in consumer line count is having each consumer perform the QP state > transitions. I agree. I'm not sure that success is needed. This adds several parameters that need to be returned from the callback to save a single function call. - Sean From mshefty at ichips.intel.com Thu Dec 16 14:51:29 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 14:51:29 -0800 Subject: [openib-general] CM header file In-Reply-To: <20041216144226.E26487@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> Message-ID: <41C21171.8010407@ichips.intel.com> Libor Michalek wrote: > On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: > >>One final note, I'm hoping that a more abstracted CM could be layered >>on top of this one, if it were desired. E.g. one that performs QP >>transitions, automatically generates MRAs, retries requests, etc. > > > Are you suggesting that this proposed CM layer would not be responsible > for retries and timeouts? I'm not sure how useful that would be, seems > like every CM consumer would need/want these capabilities. I was trying to match the existing MAD API. The CM would perform timeouts, but not retries. Consumers could retry request immediately upon notification of a timeout. This lets the client change the timeout value. (I'm negotiable on this, but the cost of having clients initiate retries is basically one function call.) I have given some thought to how retries should work. I've thought about adding a new call, ib_retry_cm_send() - or something like that, that resends the last message sent. Or the callback could indicate to retry. 
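With the header as posted, a consumer-driven retry would look roughly like this. Sketch only: ib_retry_cm_send() does not exist yet and is just the idea floated above, and struct my_conn with its fields is invented for the example.

struct my_conn {
	struct ib_cm_req_param req_param;
	int req_retries_left;
	int failed;
};

static void my_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
{
	struct my_conn *conn = cm_id->context;

	switch (event->event) {
	case IB_CM_REQ_TIMEOUT:
		if (conn->req_retries_left-- > 0) {
			/* back off and resend the same REQ */
			conn->req_param.timeout_ms *= 2;
			ib_send_cm_req(cm_id, &conn->req_param);
			/* or, if it gets added: ib_retry_cm_send(cm_id); */
		} else {
			/* give up; tear the id down outside the callback
			 * until destroy-from-callback is sorted out */
			conn->failed = 1;
		}
		break;
	default:
		break;
	}
}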
- Sean From fzago at systemfabricworks.com Thu Dec 16 15:02:06 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 17:02:06 -0600 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C21091.5000305@ichips.intel.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> <20041216143758.D26487@topspin.com> <41C21091.5000305@ichips.intel.com> Message-ID: <41C213EE.5070602@systemfabricworks.com> Sean Hefty wrote: > Libor Michalek wrote: > >> This ties into what I was saying about an error return value from the >> consumer callback being treated as a connection handle destroy request. >> There were three return types supported: > > > I'm not opposed to this. I just haven't thought about it enough. > >> error - connection handle destroy request >> defer - CM requires an API call to continue. (e.g. REQ callback >> requires a send_rep() call to continue) What the Sean's >> API does by default. >> success - CM generates callback response. (e.g. REQ callback completed >> successfully so the CM generates a REP response.) >> >> However, I'm not sure that not having this ability is that big of a >> code bloat, since you will call the corresponsing API function from >> the callback, and presumably set similar parameters for the API call >> as you would set for the callback return parameters. The bigger increase >> in consumer line count is having each consumer perform the QP state >> transitions. > > > I agree. I'm not sure that success is needed. This adds several > parameters that need to be returned from the callback to save a single > function call. It doesn't just save a function call. For instance: cm_handler(cm_id, event, param) { ... case CM_REQ: if (not yet connected) { param.req.userdata[0]=0x5A; param.req.userdatalen=1; ret = OK; } else { ret = REJECT; } .... } Here, CM has prepared with defaults values the userdata that the handler can override. It hides CM internals: no allocation of a REQ structure, no cleaning of structures, no sending of mads. Frank. From fzago at systemfabricworks.com Thu Dec 16 15:05:41 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 17:05:41 -0600 Subject: [openib-general] CM header file In-Reply-To: <41C21171.8010407@ichips.intel.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> Message-ID: <41C214C5.9090704@systemfabricworks.com> > > I was trying to match the existing MAD API. The CM would perform > timeouts, but not retries. Consumers could retry request immediately > upon notification of a timeout. This lets the client change the > timeout value. (I'm negotiable on this, but the cost of having > clients initiate retries is basically one function call.) > > I have given some thought to how retries should work. I've thought > about adding a new call, ib_retry_cm_send() - or something like that, > that resends the last message sent. Or the callback could indicate to > retry. I have the same problem here than with the handler. A client should be able to set timeouts and retries, and let CM take care of the painful stuff. The complexity should be in the CM, not repeated amongst the clients. 
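For instance, with the fields already in the posted ib_cm_req_param, the client-visible part could shrink to little more than the following. Sketch only: MY_SERVICE_ID and my_connect() are placeholders, and how timeout_ms/max_cm_retries would map onto the REQ's timeout fields is up to the CM implementation.

static int my_connect(struct ib_cm_id *cm_id, struct ib_qp *qp,
		      struct ib_path_record *path)
{
	struct ib_cm_req_param param = {
		.qp		= qp,
		.primary_path	= path,
		.service_id	= MY_SERVICE_ID,	/* placeholder */
		.timeout_ms	= 500,	/* per-attempt CM timeout */
		.max_cm_retries	= 3,	/* CM retransmits the REQ itself */
		.retry_count	= 7,	/* QP retries, not CM retries */
	};

	return ib_send_cm_req(cm_id, &param);
}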
Frank From libor at topspin.com Thu Dec 16 15:21:35 2004 From: libor at topspin.com (Libor Michalek) Date: Thu, 16 Dec 2004 15:21:35 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C21171.8010407@ichips.intel.com>; from mshefty@ichips.intel.com on Thu, Dec 16, 2004 at 02:51:29PM -0800 References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> Message-ID: <20041216152135.F26487@topspin.com> On Thu, Dec 16, 2004 at 02:51:29PM -0800, Sean Hefty wrote: > Libor Michalek wrote: > > On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: > > > >>One final note, I'm hoping that a more abstracted CM could be layered > >>on top of this one, if it were desired. E.g. one that performs QP > >>transitions, automatically generates MRAs, retries requests, etc. > > > > > > Are you suggesting that this proposed CM layer would not be responsible > > for retries and timeouts? I'm not sure how useful that would be, seems > > like every CM consumer would need/want these capabilities. > > I was trying to match the existing MAD API. The CM would perform > timeouts, but not retries. Consumers could retry request immediately > upon notification of a timeout. This lets the client change the > timeout value. (I'm negotiable on this, but the cost of having clients > initiate retries is basically one function call.) I was thinking the cost is a bit higher than a function call. The data used to generate the CM MAD needs to be stored somewhere to generate the retry if necessary. I'll need to take a closer look at which parameters you're expecting from the consumer and which you are going to save as part of the internal CM connection structure. > I have given some thought to how retries should work. I've thought > about adding a new call, ib_retry_cm_send() - or something like that, > that resends the last message sent. Or the callback could indicate to > retry. Oh, so the CM would have enough information to generate the retry, but the consumer would need to notify the CM to do so? -Libor From roland at topspin.com Thu Dec 16 15:28:29 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 15:28:29 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C1F992.4010908@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Dec 2004 13:09:38 -0800") References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> <41C1AF15.8000802@systemfabricworks.com> <41C1E811.9080405@ichips.intel.com> <52oeguq87k.fsf@topspin.com> <41C1F992.4010908@ichips.intel.com> Message-ID: <52brctrffm.fsf@topspin.com> Sean> My thinking was that the connection model isn't carried in Sean> the CM MADs however, so a receiving CM has to determine how Sean> to match the connection request based on what the remote Sean> user requested, which isn't known. Sean> I guess that by not having this parameter, an incoming Sean> request could be matched incorrectly with an outgoing Sean> request, if the local listening service isn't up. I'll add Sean> a peer_to_peer parameter to ib_cm_req_param. Right, peer-to-peer comes in when the CM receives a REQ after it has sent its own REQ. So the active connect (which generated the send of a REQ) needs to have a flag saying whether subsequent incoming REQs should go through peer compare or be treated independently. - R.
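To make the preceding exchange concrete, here is roughly what the "defer" model Sean describes looks like from the consumer side, fleshing out Frank's pseudocode: the REQ callback does its own QP transitions and then continues the handshake with a single explicit call. This is only a sketch; IB_CM_REQ_RECEIVED, ib_send_cm_rep(), ib_send_cm_rej() and the example_* helpers are placeholder names, not the API actually proposed in this thread.

#include <ib_cm.h>

/* Sketch only -- the event code, send calls and helpers below are
 * placeholders; only ib_cm_id/ib_cm_event come from the posted header. */
static void example_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
{
	switch (event->event) {
	case IB_CM_REQ_RECEIVED:
		if (example_can_accept(cm_id)) {
			/* consumer owns the QP: move it through INIT/RTR itself... */
			example_transition_qp(cm_id);
			/* ...then the "one function call" that answers the REQ */
			ib_send_cm_rep(cm_id, NULL, 0);
		} else {
			ib_send_cm_rej(cm_id, NULL, 0);
		}
		break;
	default:
		break;
	}
}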
From roland at topspin.com Thu Dec 16 15:31:32 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 15:31:32 -0800 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <1103230507.4327.224.camel@localhost.localdomain> (Hal Rosenstock's message of "16 Dec 2004 15:55:08 -0500") References: <1103230507.4327.224.camel@localhost.localdomain> Message-ID: <527jnhrfaj.fsf@topspin.com> Hal> It looks to me like after obtaining the PathRecord, the Hal> static rate is not used when the AV is created. Shouldn't it Hal> be ? Is there an issue with doing this ? There is a similar Hal> issue with the multicast AVs as well. I know there is an Hal> assumption that everything is 4x but I am not sure that is Hal> good one to lock in. Ultimately we probably will want to do static rate correctly, but I decided to punt on this for now because it's not trivial to determine the correct IPD to use, and IPoIB doesn't really inject packets at rates beyond 1X anyway. - R. From mshefty at ichips.intel.com Thu Dec 16 15:35:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 15:35:18 -0800 Subject: [openib-general] CM header file In-Reply-To: <20041216152135.F26487@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <20041216152135.F26487@topspin.com> Message-ID: <41C21BB6.1060403@ichips.intel.com> Libor Michalek wrote: >>I have given some thought to how retries should work. I've thought >>about adding a new call, ib_retry_cm_send() - or something like that, >>that resends the last message sent. Or the callback could indicate to >>retry. > > Oh, so the CM would have enough information to generate the retry, but > the consumer would need to notify the CM to do so? I think it makes sense in the case of a timeout for the CM to optimize for the retry case. (Such as keeping the last sent MAD around for retransmission.) The argument used against putting retries into the MAD layer was to allow the consumer to set the retry policy, such as increasing the timeout value. A major difference between the MAD API and the CM API is that the CM abstracts the MAD from the user, making the retry issue a more difficult to do efficiently. So, while I think it would be nice to let the user initiate retries, I want to make sure that it doesn't have a substantial impact on performance. - Sean From mshefty at ichips.intel.com Thu Dec 16 15:41:31 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 15:41:31 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C214C5.9090704@systemfabricworks.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> Message-ID: <41C21D2B.5000509@ichips.intel.com> frank zago wrote: >> I was trying to match the existing MAD API. The CM would perform >> timeouts, but not retries. Consumers could retry request immediately >> upon notification of a timeout. This lets the client change the >> timeout value. (I'm negotiable on this, but the cost of having >> clients initiate retries is basically one function call.) >> >> I have given some thought to how retries should work. 
I've thought >> about adding a new call, ib_retry_cm_send() - or something like that, >> that resends the last message sent. Or the callback could indicate to >> retry. > > I have the same problem here than with the handler. A client should be > able to set timeouts and retries, and let CM take care of the painful > stuff. > The complexity should be in the CM, not repeated amongst the clients. I am trying to avoid adding policy or hard-coding default values into the CM, and follow the same design decisions that we used with the current APIs. Long term I think simpler interfaces will be a better solution. Areas where the majority of clients will need to implement the exact same code make sense to push into the CM. So far there doesn't seem to be any disagreement that the CM has features that it doesn't need. And the list of desired features seem to be: * Perform QP transitions for the user. * Provide default values when establishing a connection. * Perform retransmissions. * Destroy connection identifiers from/after a callback. Are there others? - Sean From roland at topspin.com Thu Dec 16 15:44:12 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 15:44:12 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C21BB6.1060403@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Dec 2004 15:35:18 -0800") References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <20041216152135.F26487@topspin.com> <41C21BB6.1060403@ichips.intel.com> Message-ID: <52y8fxq04z.fsf@topspin.com> Sean> I think it makes sense in the case of a timeout for the CM Sean> to optimize for the retry case. (Such as keeping the last Sean> sent MAD around for retransmission.) The argument used Sean> against putting retries into the MAD layer was to allow the Sean> consumer to set the retry policy, such as increasing the Sean> timeout value. Sean> A major difference between the MAD API and the CM API is Sean> that the CM abstracts the MAD from the user, making the Sean> retry issue a more difficult to do efficiently. So, while I Sean> think it would be nice to let the user initiate retries, I Sean> want to make sure that it doesn't have a substantial impact Sean> on performance. The CM retry policy is specified much more tightly than for general MADs. The total number of retries is limited by the "Max CM Retries" field and the timeout waiting for each response is also part of the CM protocol. - Roland From mshefty at ichips.intel.com Thu Dec 16 15:55:33 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 15:55:33 -0800 Subject: [openib-general] CM header file In-Reply-To: <52y8fxq04z.fsf@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <20041216152135.F26487@topspin.com> <41C21BB6.1060403@ichips.intel.com> <52y8fxq04z.fsf@topspin.com> Message-ID: <41C22075.6050506@ichips.intel.com> Roland Dreier wrote: > The CM retry policy is specified much more tightly than for general > MADs. The total number of retries is limited by the "Max CM Retries" > field and the timeout waiting for each response is also part of the CM > protocol. 
It appears then that given max_cm_retries, remote_cm_response_timeout, and primary_path->packet_life_time (or local_ack_timeout), the CM can issue retries automatically for the user. Does this sound correct? I will plan on having the code issue retries. (I'm assuming that the Topspin CM does this, which is where I'm planning on pulling code over from.) - Sean From fzago at systemfabricworks.com Thu Dec 16 15:55:45 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 17:55:45 -0600 Subject: [openib-general] CM header file In-Reply-To: <41C21D2B.5000509@ichips.intel.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> Message-ID: <41C22081.2010806@systemfabricworks.com> > > Areas where the majority of clients will need to implement the exact > same code make sense to push into the CM. So far there doesn't seem > to be any disagreement that the CM has features that it doesn't need. > And the list of desired features seem to be: > > * Perform QP transitions for the user. > * Provide default values when establishing a connection. > * Perform retransmissions. > * Destroy connection identifiers from/after a callback. > > Are there others? What about: * CM creates the QP when necessary, unless the app provides one. * CM will destroy the QP on connection closure if it owns it. Frank. From robert.j.woodruff at intel.com Thu Dec 16 16:05:43 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 16:05:43 -0800 Subject: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> I am now seeing a new failure. I bring up 2 nodes and initially can ping between the nodes. Then I try to run netpipe, and after the message size gets a little past 4K, it hangs. I see the same behavior running MPI over TCP. This used to work. I look in the dmesg log and see the following: ib0: send complete, wrid 0 ib0: called: id 1, op 0, status: 0 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: send complete, wrid 1 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=389 address=00000100bff20b00 qpn=0x000404 ib_mthca 0000:04:00.0: CQ overrun on CQN 00000082 ib0: called: id 2, op 0, status: 0 ib0: send complete, wrid 2 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 After it gets into this state, the interface is dead. /sbin/ip neigh show dev ib0 192.168.0.1 nud failed Any ideas ? woody From roland at topspin.com Thu Dec 16 16:29:20 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 16:29:20 -0800 Subject: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> (Robert J.
Woodruff's message of "Thu, 16 Dec 2004 16:05:43 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> Message-ID: <52u0qlpy1r.fsf@topspin.com> Robert> 0000:04:00.0: CQ overrun on CQN 00000082 This appears to be an issue with the latest FW (I see it with Tavor FW 3.3.1 but not 3.2.0). I am working with Mellanox on finding out whether it's a FW bug or a problem with mthca. For now you can work around it by changing IPOIB_NUM_WC = 4, to IPOIB_NUM_WC = 1, in ipoib.h. - Roland From robert.j.woodruff at intel.com Thu Dec 16 16:32:00 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 16:32:00 -0800 Subject: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315C183@orsmsx408> >This appears to be an issue with the latest FW (I see it with Tavor FW >3.3.1 but not 3.2.0). I am working with Mellanox on finding out >whether it's a FW bug or a problem with mthca. >For now you can work around it by changing > IPOIB_NUM_WC = 4, >to > IPOIB_NUM_WC = 1, >in ipoib.h. > - Roland Ok, thanks, I will give it a try. woody From robert.j.woodruff at intel.com Thu Dec 16 17:07:02 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 17:07:02 -0800 Subject: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315C204@orsmsx408> >This appears to be an issue with the latest FW (I see it with Tavor FW >3.3.1 but not 3.2.0). I am working with Mellanox on finding out >whether it's a FW bug or a problem with mthca. >For now you can work around it by changing > IPOIB_NUM_WC = 4, >to > IPOIB_NUM_WC = 1, >in ipoib.h. > - Roland Thanks, That fixed the problem. From halr at voltaire.com Fri Dec 17 06:05:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Dec 2004 09:05:24 -0500 Subject: [openib-general] Latest mthca_qp.c appears to be broken with gcc 3.4.2 Message-ID: <1103292324.4214.627.camel@localhost.localdomain> drivers/infiniband/hw/mthca/mthca_qp.c: In function `mthca_post_send': drivers/infiniband/hw/mthca/mthca_qp.c:1249: error: label at end of compound statement gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3) From halr at voltaire.com Fri Dec 17 06:17:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Dec 2004 09:17:54 -0500 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <1103234707.4214.8.camel@localhost.localdomain> References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> <1103231553.4327.230.camel@localhost.localdomain> <1103234707.4214.8.camel@localhost.localdomain> Message-ID: <1103293074.4214.647.camel@localhost.localdomain> On Thu, 2004-12-16 at 17:05, Hal Rosenstock wrote: > On the remote node to which connectivity fails, it has a stale arp cache > entry which does not seem to go away as if the timer is not started. Is > that possible ? Is there a case where the ARP entry is created but not > timed ? > > /sbin/ip neigh show dev ib0 > 192.168.0.1 lladdr > 00:00:04:04:fe:80:00:00:00:00:00:00:00:54:42:b1:00:00:09:01 nud stale > > The other nodes' entries (they have bidirectional connectivity) do go > from reachable to delay to stale to "gone". With the latest tree (1354), this problem is now gone. Was your change to ipoib_main.c intended to fix this ? r1353 | roland | 2004-12-16 23:10:24 -0500 (Thu, 16 Dec 2004) | 4 lines Be slightly more careful about when we queue packets. Thanks. 
-- Hal From roland at topspin.com Fri Dec 17 07:08:42 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 07:08:42 -0800 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <1103293074.4214.647.camel@localhost.localdomain> (Hal Rosenstock's message of "17 Dec 2004 09:17:54 -0500") References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> <1103231553.4327.230.camel@localhost.localdomain> <1103234707.4214.8.camel@localhost.localdomain> <1103293074.4214.647.camel@localhost.localdomain> Message-ID: <52llbxotc5.fsf@topspin.com> Hal> With the latest tree (1354), this problem is now gone. Was Hal> your change to ipoib_main.c intended to fix this ? No, it should just prevent queues from growing arbitrarily long and start dropping packets. - R. From roland at topspin.com Fri Dec 17 07:09:02 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 07:09:02 -0800 Subject: [openib-general] Re: Latest mthca_qp.c appears to be broken with gcc 3.4.2 In-Reply-To: <1103292324.4214.627.camel@localhost.localdomain> (Hal Rosenstock's message of "17 Dec 2004 09:05:24 -0500") References: <1103292324.4214.627.camel@localhost.localdomain> Message-ID: <52hdmlotbl.fsf@topspin.com> Hal> drivers/infiniband/hw/mthca/mthca_qp.c: In function `mthca_post_send': Hal> drivers/infiniband/hw/mthca/mthca_qp.c:1249: error: label at end of compound statement Thanks, I've fixed this. - R. From halr at voltaire.com Fri Dec 17 07:27:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Dec 2004 10:27:26 -0500 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <52llbxotc5.fsf@topspin.com> References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> <1103231553.4327.230.camel@localhost.localdomain> <1103234707.4214.8.camel@localhost.localdomain> <1103293074.4214.647.camel@localhost.localdomain> <52llbxotc5.fsf@topspin.com> Message-ID: <1103297246.4214.668.camel@localhost.localdomain> On Fri, 2004-12-17 at 10:08, Roland Dreier wrote: > Hal> With the latest tree (1354), this problem is now gone. Was > Hal> your change to ipoib_main.c intended to fix this ? > > No, it should just prevent queues from growing arbitrarily long and > start dropping packets. Guess the PR requests/responses aren't being lost right now... From iod00d at hp.com Fri Dec 17 08:49:04 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 17 Dec 2004 08:49:04 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C22081.2010806@systemfabricworks.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> Message-ID: <20041217164904.GC20977@esmail.cup.hp.com> On Thu, Dec 16, 2004 at 05:55:45PM -0600, frank zago wrote: > >Are there others? > > What about: > * CM creates the QP when necessary, unless the app provides one. Sorry, I can't get enthusiastic about this idea. It adds complexity and code where it's usually not required. If someone wants to another layer of inline code that handles the allocation for the caller (and deallocation when it's done), that's works for me. But it shouldn't be part of the basic design. 
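For what it's worth, the optional convenience layer Grant alludes to could live entirely outside the CM, along these lines. ib_create_qp()/ib_destroy_qp() are the existing verbs; ib_send_cm_req() and the argument list used here are only assumed for illustration. The point is simply that the consumer, not the CM, owns the QP's lifetime.

#include <ib_verbs.h>
#include <ib_cm.h>

/* Sketch of a thin wrapper kept out of the CM proper; ib_send_cm_req()
 * and its signature are illustrative assumptions. */
static struct ib_qp *example_connect_with_new_qp(struct ib_pd *pd,
						 struct ib_qp_init_attr *init_attr,
						 struct ib_cm_id *cm_id,
						 struct ib_cm_req_param *param)
{
	struct ib_qp *qp;
	int ret;

	qp = ib_create_qp(pd, init_attr);
	if (IS_ERR(qp))
		return qp;

	ret = ib_send_cm_req(cm_id, qp, param);	/* hypothetical call */
	if (ret) {
		ib_destroy_qp(qp);
		return ERR_PTR(ret);
	}
	/* caller still owns qp and destroys it when the connection goes away */
	return qp;
}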
my $0.02, grant From roland at topspin.com Fri Dec 17 08:53:15 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 08:53:15 -0800 Subject: [openib-general] CM header file In-Reply-To: <20041217164904.GC20977@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 17 Dec 2004 08:49:04 -0800") References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> Message-ID: <52d5x8q32c.fsf@topspin.com> Frank> What about: * CM creates the QP when necessary, unless the Frank> app provides one. Grant> Sorry, I can't get enthusiastic about this idea. It adds Grant> complexity and code where it's usually not required. If Grant> someone wants to another layer of inline code that handles Grant> the allocation for the caller (and deallocation when it's Grant> done), that's works for me. But it shouldn't be part of Grant> the basic design. I agree. In the original design of the Topspin drivers, we had the CM handle allocating and freeing QPs. However it leads to all sorts of object lifetime issues and ends up as more trouble than it's worth. - R. From halr at voltaire.com Fri Dec 17 09:03:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Dec 2004 12:03:38 -0500 Subject: [openib-general] CM header file In-Reply-To: <52d5x8q32c.fsf@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> <52d5x8q32c.fsf@topspin.com> Message-ID: <1103303017.4214.781.camel@localhost.localdomain> On Fri, 2004-12-17 at 11:53, Roland Dreier wrote: > I agree. In the original design of the Topspin drivers, we had the CM > handle allocating and freeing QPs. However it leads to all sorts of > object lifetime issues and ends up as more trouble than it's worth. But not doing this does eliminate some optimizations that can be done in terms of this which are useful for certain applications which setup and tear down connections frequently. -- Hal From roland at topspin.com Fri Dec 17 09:18:32 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 09:18:32 -0800 Subject: [openib-general] CM header file In-Reply-To: <1103303017.4214.781.camel@localhost.localdomain> (Hal Rosenstock's message of "17 Dec 2004 12:03:38 -0500") References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> <52d5x8q32c.fsf@topspin.com> <1103303017.4214.781.camel@localhost.localdomain> Message-ID: <528y7wq1w7.fsf@topspin.com> Hal> But not doing this does eliminate some optimizations that can Hal> be done in terms of this which are useful for certain Hal> applications which setup and tear down connections Hal> frequently. 
I can't parse this very well (too many negatives with "not doing" and "does eliminate") -- are you agreeing with Grant and me or supporting Frank's original suggestion? - R. From mshefty at ichips.intel.com Fri Dec 17 09:39:24 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 17 Dec 2004 09:39:24 -0800 Subject: [openib-general] CM header file In-Reply-To: <52d5x8q32c.fsf@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> <52d5x8q32c.fsf@topspin.com> Message-ID: <41C319CC.2070900@ichips.intel.com> Roland Dreier wrote: > Frank> What about: * CM creates the QP when necessary, unless the > Frank> app provides one. > > Grant> Sorry, I can't get enthusiastic about this idea. It adds > Grant> complexity and code where it's usually not required. If > Grant> someone wants to another layer of inline code that handles > Grant> the allocation for the caller (and deallocation when it's > Grant> done), that's works for me. But it shouldn't be part of > Grant> the basic design. > > I agree. In the original design of the Topspin drivers, we had the CM > handle allocating and freeing QPs. However it leads to all sorts of > object lifetime issues and ends up as more trouble than it's worth. I agree with Grant and Roland. Again, I believe that the initial CM should be fairly focused on what it does. If more advanced features are desired, then a separate module should be layered on top of this one. That doesn't mean that those features can't be in the openib stack, I just don't think they necessarily belong in the lower layers. I do want to continue collecting feedback, to make sure that nothing that should be part of this layer gets missed. - Sean From libor at topspin.com Fri Dec 17 11:38:53 2004 From: libor at topspin.com (Libor Michalek) Date: Fri, 17 Dec 2004 11:38:53 -0800 Subject: [openib-general] CM header file In-Reply-To: <52d5x8q32c.fsf@topspin.com>; from roland@topspin.com on Fri, Dec 17, 2004 at 08:53:15AM -0800 References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> <52d5x8q32c.fsf@topspin.com> Message-ID: <20041217113853.A29699@topspin.com> On Fri, Dec 17, 2004 at 08:53:15AM -0800, Roland Dreier wrote: > Frank> What about: * CM creates the QP when necessary, unless the > Frank> app provides one. > > Grant> Sorry, I can't get enthusiastic about this idea. It adds > Grant> complexity and code where it's usually not required. If > Grant> someone wants to another layer of inline code that handles > Grant> the allocation for the caller (and deallocation when it's > Grant> done), that's works for me. But it shouldn't be part of > Grant> the basic design. > > I agree. In the original design of the Topspin drivers, we had the CM > handle allocating and freeing QPs. However it leads to all sorts of > object lifetime issues and ends up as more trouble than it's worth. True. 
It resulted in numerous race conditions revolving around object lifetime, as a result most of the software that used the CM ended up using it in consumer provided QP mode. At the time of the design I was the biggest proponent for having CM allocated QPs, but in retrospect it was a bad idea. -Libor From robert.j.woodruff at intel.com Fri Dec 17 12:00:00 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 17 Dec 2004 12:00:00 -0800 Subject: [openib-general] Kernel assertion Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> I was running a 4 node cluster running an MPI application over IPoIB and one of the system's died with the following messages logged to /var/log/messages. The svn rev. is 1355. 2 of the nodes are PCI-E cards and 2 nodes are PCI-X. The system that asserted was one of the PCI-E systems. Dec 17 11:45:33 iclust-16 rsh(pam_unix)[4605]: session closed for user woody Dec 17 11:46:30 iclust-16 kernel: ib_mthca 0000:04:00.0: SQ full (64 posted, 64 max, 0 nreq) Dec 17 11:46:30 iclust-16 kernel: ib0: post_send failed Dec 17 11:46:39 iclust-16 kernel: ib_mthca 0000:04:00.0: SQ full (64 posted, 64 max, 0 nreq) Dec 17 11:46:39 iclust-16 kernel: ib0: post_send failed Dec 17 11:46:39 iclust-16 kernel: KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c (1616) Dec 17 11:49:49 iclust-16 syslogd 1.4.1: restart. Any ideas ? woody From roland at topspin.com Fri Dec 17 12:04:26 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 12:04:26 -0800 Subject: [openib-general] Kernel assertion In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> (Robert J. Woodruff's message of "Fri, 17 Dec 2004 12:00:00 -0800") References: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> Message-ID: <52vfb0ofn9.fsf@topspin.com> Robert> I was running a 4 node cluster running an MPI application Robert> over IPoIB and one of the system's died with the following Robert> messages logged to /var/log/messages. The svn rev. is Robert> 1355. 2 of the nodes are PCI-E cards and 2 nodes are Robert> PCI-X. The system that asserted was one of the PCI-E Robert> systems. Not sure, I'll take a look at how this could happen. - R. From roland at topspin.com Fri Dec 17 12:45:07 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 12:45:07 -0800 Subject: [openib-general] Kernel assertion In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> (Robert J. Woodruff's message of "Fri, 17 Dec 2004 12:00:00 -0800") References: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> Message-ID: <52r7loodrg.fsf@topspin.com> I've been able to reproduce this here by reducing the size of my TX queue and lowering the MTU, so I should be able to fix it soon. Thanks, Roland From roland at topspin.com Fri Dec 17 13:57:40 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 13:57:40 -0800 Subject: [openib-general] LLTX and netif_stop_queue Message-ID: <52llbwoaej.fsf@topspin.com> While testing my IP-over-InfiniBand driver, I discovered that if a net device sets NETIF_F_LLTX, it seems the device's hard_start_xmit method can be called even after a netif_stop_queue(). This is because in the LLTX case, qdisc_restart() holds no locks while calling hard_start_xmit, so something like the following can happen: CPU 1 CPU 2 qdisc_restart: drop queue lock call hard_start_xmit() net driver: acquire TX lock queue packet to HW acquire queue lock... 
qdisc_restart: drop queue lock call hard_start_xmit: queue full, call netif_stop_queue() release TX lock net driver: acquire TX lock queue is already full! Is my understanding correct? If so it seems the patch below would make sense. (e1000 seems to handle this properly already) Thanks, Roland Since tg3 and sungem now use lockless TX (NETIF_F_LLTX), it's possible for their hard_start_xmit method to be called even after they call netif_stop_queue. Therefore a full queue no longer indicates a bug -- this patch fixes the comment and removes the KERN_ERR printk. Signed-off-by: Roland Dreier Index: linux-bk/drivers/net/sungem.c =================================================================== --- linux-bk.orig/drivers/net/sungem.c 2004-12-16 15:56:19.000000000 -0800 +++ linux-bk/drivers/net/sungem.c 2004-12-17 13:46:43.307064457 -0800 @@ -976,12 +976,10 @@ return NETDEV_TX_LOCKED; } - /* This is a hard error, log it. */ + /* This may happen, since we have NETIF_F_LLTX set */ if (TX_BUFFS_AVAIL(gp) <= (skb_shinfo(skb)->nr_frags + 1)) { netif_stop_queue(dev); spin_unlock_irqrestore(&gp->tx_lock, flags); - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", - dev->name); return NETDEV_TX_BUSY; } Index: linux-bk/drivers/net/tg3.c =================================================================== --- linux-bk.orig/drivers/net/tg3.c 2004-12-16 15:56:06.000000000 -0800 +++ linux-bk/drivers/net/tg3.c 2004-12-17 13:46:25.952622672 -0800 @@ -3076,12 +3076,10 @@ return NETDEV_TX_LOCKED; } - /* This is a hard error, log it. */ + /* This may happen, since we have NETIF_F_LLTX set */ if (unlikely(TX_BUFFS_AVAIL(tp) <= (skb_shinfo(skb)->nr_frags + 1))) { netif_stop_queue(dev); spin_unlock_irqrestore(&tp->tx_lock, flags); - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", - dev->name); return NETDEV_TX_BUSY; } From roland at topspin.com Fri Dec 17 14:04:52 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 14:04:52 -0800 Subject: [openib-general] Kernel assertion In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> (Robert J. Woodruff's message of "Fri, 17 Dec 2004 12:00:00 -0800") References: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> Message-ID: <52hdmkoa2j.fsf@topspin.com> OK, I checked in a fix for this. I'm actually not sure if it was really a bug in IPoIB or a glitch in the network stack (see my recent post to netdev, cc'ed to openib-general, about 'LLTX and netif_stop_queue') but in any case I made the messages go away in my setup. By the way, it might be interesting to see if increasing IPOIB_RX_RING_SIZE and/or IPOIB_TX_RING_SIZE in ipoib.h had any effect on your performance (make sure to keep them powers of 2). Thanks, Roland From robert.j.woodruff at intel.com Fri Dec 17 14:19:03 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 17 Dec 2004 14:19:03 -0800 Subject: [openib-general] Kernel assertion Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315CA9D@orsmsx408> Roland> By the way, it might be interesting to see if increasing >IPOIB_RX_RING_SIZE and/or IPOIB_TX_RING_SIZE in ipoib.h had any effect >on your performance (make sure to keep them powers of 2). >Thanks, > Roland Ok thanks, I will pull a fresh set of code and give it a try. I can also try increasing the ring queue size. 
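For readers following along, the driver-side pattern the patch above leaves in place looks roughly like this condensed sketch; example_priv, its tx_free count and the posting step are made up, while the return codes and locking mirror what sungem/tg3 already do. With NETIF_F_LLTX set, finding the ring full here is a legal race, not a driver bug.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>

struct example_priv {
	spinlock_t	tx_lock;
	int		tx_free;	/* free TX ring entries (illustrative) */
};

static int example_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct example_priv *priv = netdev_priv(dev);
	unsigned long flags;

	local_irq_save(flags);
	if (!spin_trylock(&priv->tx_lock)) {
		local_irq_restore(flags);
		return NETDEV_TX_LOCKED;	/* core will retry */
	}

	if (priv->tx_free <= skb_shinfo(skb)->nr_frags + 1) {
		/* Another CPU may have just filled the ring and stopped the
		 * queue; with LLTX this is not a hard error. */
		netif_stop_queue(dev);
		spin_unlock_irqrestore(&priv->tx_lock, flags);
		return NETDEV_TX_BUSY;
	}

	/* ... post skb to the hardware ring, decrement tx_free,
	 *     and netif_stop_queue() if the ring is now full ... */

	spin_unlock_irqrestore(&priv->tx_lock, flags);
	return NETDEV_TX_OK;
}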
From mshefty at ichips.intel.com Fri Dec 17 17:23:21 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 17 Dec 2004 17:23:21 -0800 Subject: [openib-general] [PATCH] CM listen update Message-ID: <20041217172321.4b9026bb.mshefty@ichips.intel.com> The following patch implements a call to ib_cm_listen. Listen requests are stored using a red-black tree. A service mask was also added to the listen request. The changes only affect the CM module. - Sean Index: include/ib_cm.h =================================================================== --- include/ib_cm.h (revision 1359) +++ include/ib_cm.h (working copy) @@ -93,6 +93,7 @@ ib_cm_handler cm_handler; void *context; u64 service_id; + u64 service_mask; enum ib_cm_state state; }; @@ -125,7 +126,8 @@ * @service_id: Service identifier matched against incoming connection * and service ID resolution requests. * @service_mask: Mask applied to service ID used to listen across a - * range of service IDs. + * range of service IDs. If set to 0, the service ID is matched + * exactly. */ int ib_cm_listen(struct ib_cm_id *cm_id, u64 service_id, @@ -136,7 +138,6 @@ struct ib_path_record *primary_path; struct ib_path_record *alternate_path; u64 service_id; - int timeout_ms; void *private_data; u8 private_data_len; u8 peer_to_peer; Index: core/cm.c =================================================================== --- core/cm.c (revision 1359) +++ core/cm.c (working copy) @@ -34,6 +34,7 @@ * $Id$ */ #include +#include #include #include @@ -51,6 +52,11 @@ .remove = cm_remove_one }; +static struct ib_cm { + spinlock_t lock; + struct rb_root service_table; +} cm; + struct cm_port { struct ib_mad_agent *mad_agent; }; @@ -58,11 +64,51 @@ struct ib_cm_id_private { struct ib_cm_id id; + struct rb_node node; spinlock_t lock; wait_queue_head_t wait; atomic_t refcount; }; +static struct ib_cm_id_private *find_cm_service(u64 service_id) +{ + struct rb_node *node = cm.service_table.rb_node; + struct ib_cm_id_private *cm_id_priv; + + while (node) { + cm_id_priv = rb_entry(node, struct ib_cm_id_private, node); + if ((cm_id_priv->id.service_mask & service_id) == + (cm_id_priv->id.service_mask & cm_id_priv->id.service_id)) + return cm_id_priv; + + if (service_id < cm_id_priv->id.service_id) + node = node->rb_left; + else + node = node->rb_right; + } + return NULL; +} + +static void insert_cm_service(struct ib_cm_id_private *cm_id_priv) +{ + struct rb_node **link = &cm.service_table.rb_node; + struct rb_node *parent = NULL; + struct ib_cm_id_private *cur_cm_id_priv; + u64 service_id = cm_id_priv->id.service_id; + + while (*link) { + parent = *link; + cur_cm_id_priv = rb_entry(parent, struct ib_cm_id_private, + node); + if (service_id < cur_cm_id_priv->id.service_id) + link = &(*link)->rb_left; + else + link = &(*link)->rb_right; + } + rb_link_node(&cm_id_priv->node, parent, link); + rb_insert_color(&cm_id_priv->node, &cm.service_table); +} + struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, void *context) { @@ -72,7 +118,7 @@ if (!cm_id_priv) return ERR_PTR(-ENOMEM); - cm_id_priv->id.service_id = 0; + memset(cm_id_priv, 0, sizeof *cm_id_priv); cm_id_priv->id.state = IB_CM_IDLE; cm_id_priv->id.cm_handler = cm_handler; cm_id_priv->id.context = context; @@ -95,16 +141,21 @@ int ib_destroy_cm_id(struct ib_cm_id *cm_id) { struct ib_cm_id_private *cm_id_priv; - unsigned long flags; + unsigned long flags, flags2; cm_id_priv = container_of(cm_id, struct ib_cm_id_private, id); spin_lock_irqsave(&cm_id_priv->lock, flags); switch(cm_id->state) { - case IB_CM_IDLE: 
case IB_CM_LISTEN: + spin_lock_irqsave(&cm.lock, flags2); + rb_erase(&cm_id_priv->node, &cm.service_table); + spin_lock_irqrestore(&cm.lock, flags2); + break; + case IB_CM_IDLE: + break; case IB_CM_TIMEWAIT: - break; /* Connection is ready to be destroyed. */ + break; default: reset_cm_state(cm_id_priv); break; @@ -113,8 +164,7 @@ spin_unlock_irqrestore(&cm_id_priv->lock, flags); atomic_dec(&cm_id_priv->refcount); - wait_event(cm_id_priv->wait, - !atomic_read(&cm_id_priv->refcount)); + wait_event(cm_id_priv->wait, !atomic_read(&cm_id_priv->refcount)); kfree(cm_id_priv); return 0; } @@ -124,7 +174,33 @@ u64 service_id, u64 service_mask) { - return -EINVAL; + struct ib_cm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct ib_cm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id->state != IB_CM_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = -EINVAL; + goto out; + } + cm_id->state = IB_CM_LISTEN; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + cm_id->service_id = service_id; + cm_id->service_mask = service_mask ? service_mask : ~0ULL; + + spin_lock_irqsave(&cm.lock, flags); + if (find_cm_service(service_id)) { + /* No one else is able to change the cm_id state. */ + cm_id->state = IB_CM_IDLE; + ret = -EBUSY; + } else + insert_cm_service(cm_id_priv); + spin_unlock_irqrestore(&cm.lock, flags); +out: + return ret; } EXPORT_SYMBOL(ib_cm_listen); @@ -291,6 +367,9 @@ static int __init ib_cm_init(void) { + memset(&cm, 0, sizeof cm); + spin_lock_init(&cm.lock); + cm.service_table = RB_ROOT; return ib_register_client(&cm_client); } From mshefty at ichips.intel.com Fri Dec 17 17:27:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 17 Dec 2004 17:27:30 -0800 Subject: [openib-general] [PATCH] new test code for ib_cm_listen Message-ID: <20041217172730.0f6e2fdc.mshefty@ichips.intel.com> This patch adds basic test code for ib_cm_listen. It tests listen insertions, removal, duplicate insertions, component mask usage, and all error handling cases. - Sean Index: util/cmpost/cmpost.c =================================================================== --- util/cmpost/cmpost.c (revision 1352) +++ util/cmpost/cmpost.c (working copy) @@ -55,14 +55,69 @@ struct ib_cm_id *conn_cm_id; }; +/* Last test number used: 8 */ +#define VERIFY_RESULT(test, result, value) \ + printk("cmpost test %d : ", test); \ + if (result == value) { \ + printk("passed\n"); \ + } else { \ + printk("FAILED! 
result = %d\n", result); \ + return result; \ + } + void cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { } +static int test_listen() +{ + struct ib_cm_id * listen_ids[3]; + int i, ret; + + for (i = 0; i < 3; i++) { + listen_ids[i] = ib_create_cm_id(cm_handler, NULL); + if (IS_ERR(listen_ids[i])) + return PTR_ERR(listen_ids[i]); + } + + ret = ib_cm_listen(listen_ids[0], 0x1000, 0); + VERIFY_RESULT(1, ret, 0); + ret = ib_cm_listen(listen_ids[0], 0x1000, 0); + VERIFY_RESULT(2, ret, -EINVAL); + + ret = ib_cm_listen(listen_ids[1], 0x1001, 0); + VERIFY_RESULT(3, ret, 0); + ret = ib_cm_listen(listen_ids[2], 0x1000, 0); + VERIFY_RESULT(4, ret, -EBUSY); + ret = ib_cm_listen(listen_ids[2], 0x1001, 0xF00F); + VERIFY_RESULT(5, ret, -EBUSY); + ret = ib_cm_listen(listen_ids[2], 0x1100, 0x1100); + VERIFY_RESULT(6, ret, 0); + + ib_destroy_cm_id(listen_ids[0]); + listen_ids[0] = ib_create_cm_id(cm_handler, NULL); + if (IS_ERR(listen_ids[0])) + return PTR_ERR(listen_ids[0]); + ret = ib_cm_listen(listen_ids[0], 0x1011, 0); + VERIFY_RESULT(7, ret, 0); + ib_destroy_cm_id(listen_ids[0]); + listen_ids[0] = ib_create_cm_id(cm_handler, NULL); + if (IS_ERR(listen_ids[0])) + return PTR_ERR(listen_ids[0]); + ret = ib_cm_listen(listen_ids[0], 0x1000, 0); + VERIFY_RESULT(8, ret, 0); + + for (i = 0; i < 3; i++) { + if (listen_ids[i] && !IS_ERR(listen_ids[i])) + ib_destroy_cm_id(listen_ids[i]); + } + return 0; +} + static void cmpost_add_one(struct ib_device *device) { struct cmpost_port *port; - int i; + int i, ret; port = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); if (!port) @@ -73,6 +128,11 @@ port[i].conn_cm_id = ib_create_cm_id(cm_handler, &port[i]); } + ret = test_listen(); + if (ret) { + printk("FAILURE: test_listen failed: %d\n", ret); + goto out; + } out: ib_set_client_data(device, &cmpost_client, port); } From robert.j.woodruff at intel.com Fri Dec 17 17:45:19 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 17 Dec 2004 17:45:19 -0800 Subject: [openib-general] Kernel assertion Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315CC7C@orsmsx408> >Roland wrote, >OK, I checked in a fix for this. I'm actually not sure if it was >really a bug in IPoIB or a glitch in the network stack (see my recent >post to netdev, cc'ed to openib-general, about 'LLTX and >netif_stop_queue') but in any case I made the messages go away in my >setup. >By the way, it might be interesting to see if increasing >IPOIB_RX_RING_SIZE and/or IPOIB_TX_RING_SIZE in ipoib.h had any effect >on your performance (make sure to keep them powers of 2). >Thanks, > Roland Ok this seems to fix the assertion and I am now able to run a 4 node cluster OK. I have 2 PCI-E HCA nodes and 2 PCI-X HCA nodes. I still had to modify IPOIB_NUM_WC = 4, to IPOIB_NUM_WC = 1, and with that change it seems to run fine with the 4.6-0-rc4 firmware. I still cannot seen to get it to run reliably with the 4.3.5 firmware, so that is why I am running PCI-X cards in the other 2 nodes. Guess I need to investigate how to use tvflash to upgrade all the nodes to 4.6.0-rc4. I will let it run over the week-end running MPI jobs to test the stability, but it has been running for about 1 hour without any problems. I also played around with the ring buffer sizes, I tried the default 64-128, 128-256, and 256-512. It really did not seem to make much difference in performance (on MPI Pallas benchmark). See attached logs, mpi.64-128.log, mpi.128-256.log, and mpi.256-512.log for the numbers running across the PCI-E HCAs. 
I also made a run between the 2 PCI cards. The PCI-E cards look a lot better in performance, but I am not sure if the PCI slots are PCI-X 133 or not, so that could explain why it is so much slower on the PCI-X cards. woody -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi-256-512.log Type: application/octet-stream Size: 21692 bytes Desc: mpi-256-512.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi-64-128.log Type: application/octet-stream Size: 21692 bytes Desc: mpi-64-128.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi-64-128-pci.log Type: application/octet-stream Size: 21691 bytes Desc: mpi-64-128-pci.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi-128-256.log Type: application/octet-stream Size: 21692 bytes Desc: mpi-128-256.log URL: From halr at voltaire.com Fri Dec 17 21:21:45 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2004 00:21:45 -0500 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <527jnhrfaj.fsf@topspin.com> References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> Message-ID: <1103347112.4214.795.camel@localhost.localdomain> On Thu, 2004-12-16 at 18:31, Roland Dreier wrote: > Ultimately we probably will want to do static rate correctly, but I > decided to punt on this for now because it's not trivial to determine > the correct IPD to use, and IPoIB doesn't really inject packets at > rates beyond 1X anyway. From Woody's MPI data (logs in http://openib.org/pipermail/openib-general/2004-December/007167.html), there appear to be some numbers above 1x to me. 1x is 2.5 Gbps, which minus signalling overhead is really a 2 Gbps data rate. There are numbers in the log in excess of 350 MBytes/sec which is 2.8 Gbps. -- Hal From davem at davemloft.net Fri Dec 17 21:44:32 2004 From: davem at davemloft.net (David S. Miller) Date: Fri, 17 Dec 2004 21:44:32 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <52llbwoaej.fsf@topspin.com> References: <52llbwoaej.fsf@topspin.com> Message-ID: <20041217214432.07b7b21e.davem@davemloft.net> On Fri, 17 Dec 2004 13:57:40 -0800 Roland Dreier wrote: > While testing my IP-over-InfiniBand driver, I discovered that if a net > device sets NETIF_F_LLTX, it seems the device's hard_start_xmit method > can be called even after a netif_stop_queue(). > > This is because in the LLTX case, qdisc_restart() holds no locks while > calling hard_start_xmit, so something like the following can happen: ... > if (TX_BUFFS_AVAIL(gp) <= (skb_shinfo(skb)->nr_frags + 1)) { > netif_stop_queue(dev); > spin_unlock_irqrestore(&gp->tx_lock, flags); > - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", > - dev->name); > return NETDEV_TX_BUSY; > } I understand the bug, but we're not going to fix it this way. This is a crucial invariant that we need to check for because it indicates a pretty serious state error except in this bug case you've discovered. Perhaps one way to fix this is to add a pointer to a spinlock to the netdev struct, and have the upper level grab that when NETIF_F_LLTX is set when doing queue state checks. Actually, that could end up being racy too. If we can't find a good fix for this, besides removing the necessary debugging message, we might have to pull NETIF_F_LLTX out or disable it temporarily until we figure out a way.
From roland at topspin.com Sat Dec 18 07:35:49 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 07:35:49 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <20041217214432.07b7b21e.davem@davemloft.net> (David S. Miller's message of "Fri, 17 Dec 2004 21:44:32 -0800") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> Message-ID: <528y7vobze.fsf@topspin.com> David> I understand the bug, but we're not going to fix it this David> way. This is a crucial invariant that we need to check for David> because it indicates a pretty serious state error except in David> this bug case you've discovered. OK, that's fair enough. I was pretty bummed myself when I realized hard_start_xmit was getting called even after I stopped the queue. Thanks for confirming my analysis. David> Perhaps one way to fix this is to add a pointer to a David> spinlock to the netdev struct, and have hold that the upper David> level grab that when NETIF_F_LLTX when doing queue state David> checks. Actually, that could end up being racy too. I may be missing something, but it seems to me that we get all of the benefits of LLTX by just documenting that device drivers can use the xmit_lock member of struct net_device to synchronize other parts of the driver against hard_start_xmit. I guess the driver also should set xmit_lock_owner to -1 after it acquires xmit_lock. This would mean that NETIF_F_LLTX can go away, and LLTX drivers would just replace their use of their own private tx_lock with xmit_lock (except in hard_start_xmit, where the upper layer already holds xmit_lock). By the way, am I correct in thinking that the use of xmit_lock_owner in qdisc_restart() is racy? if (!spin_trylock(&dev->xmit_lock)) { /* get the lock */ if (!spin_trylock(&dev->xmit_lock)) { /* fail */ if (dev->xmit_lock_owner == smp_processor_id()) { /* test the wrong value */ /* set the value too late */ dev->xmit_lock_owner = smp_processor_id(); I don't see a simple way to fix this unfortunately. Thanks, Roland From roland at topspin.com Sat Dec 18 07:46:56 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 07:46:56 -0800 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <1103347112.4214.795.camel@localhost.localdomain> (Hal Rosenstock's message of "18 Dec 2004 00:21:45 -0500") References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> <1103347112.4214.795.camel@localhost.localdomain> Message-ID: <521xdnobgv.fsf@topspin.com> Hal> From Woody's MPI data (logs in Hal> http://openib.org/pipermail/openib-general/2004-December/007167.html), Hal> there appear to be some numbers above 1x to me. 1x is 2.5 Hal> Gbps minus signalling is really a 2 Gbps data rate. There are Hal> numbers in the log in excess of 350 MBytes/sec which is 2.6 Hal> Gbps. Fair enough but I still don't think there are many fabrics right now where not setting static rate for IPoIB will even be required, let alone have a practical effect. I don't object in principle to doing this right, it just seems like a fair bit of work to write code to retrieve the local PortInfo to get the active link speed and link width so that we can calculate our local rate and get the right IPD. To solve this, do you think it makes sense to add the active link speed and link width to struct ib_port_attr so that it's easy for ULPs to get them? Otherwise ULPs would have to do their own PortInfo queries. 
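As a rough illustration of the calculation being punted on: the static rate mechanism works by programming an inter-packet delay (IPD) into the address handle so that the effective injection rate is about link_rate / (IPD + 1). Assuming the active width/speed were available as suggested, a ULP could derive the IPD with something like the helper below; the function name and the "multiples of 1X" encoding are made up for illustration and are not part of ib_port_attr or the verbs.

/* Illustrative only: rates are expressed as multiples of a 1X SDR link
 * (1 = 2.5 Gbps signalling, 4 = 10 Gbps, 12 = 30 Gbps). */
static inline int example_rate_to_ipd(int local_link_rate, int path_static_rate)
{
	if (path_static_rate <= 0 || local_link_rate <= path_static_rate)
		return 0;			/* no throttling needed */

	/* effective rate ~= local_link_rate / (IPD + 1) */
	return local_link_rate / path_static_rate - 1;
}

/* e.g. a 4X local port sending across a 1X path: example_rate_to_ipd(4, 1) == 3,
 * i.e. inject at most one packet in every four "slots". */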
- Roland From halr at voltaire.com Sat Dec 18 08:42:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2004 11:42:07 -0500 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <521xdnobgv.fsf@topspin.com> References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> <1103347112.4214.795.camel@localhost.localdomain> <521xdnobgv.fsf@topspin.com> Message-ID: <1103388127.4214.1218.camel@localhost.localdomain> On Sat, 2004-12-18 at 10:46, Roland Dreier wrote: > Hal> From Woody's MPI data (logs in > Hal> http://openib.org/pipermail/openib-general/2004-December/007167.html), > Hal> there appear to be some numbers above 1x to me. 1x is 2.5 > Hal> Gbps minus signalling is really a 2 Gbps data rate. There are > Hal> numbers in the log in excess of 350 MBytes/sec which is 2.6 > Hal> Gbps. > > Fair enough but I still don't think there are many fabrics right now > where not setting static rate for IPoIB will even be required, let > alone have a practical effect. Agreed. > I don't object in principle to doing this right, it just seems like a > fair bit of work to write code to retrieve the local PortInfo to get > the active link speed and link width so that we can calculate our > local rate and get the right IPD. Link speed active (other than locked at 2.5 Gbps) was added at IBA 1.2. > To solve this, do you think it makes sense to add the active link > speed and link width to struct ib_port_attr so that it's easy for ULPs > to get them? Otherwise ULPs would have to do their own PortInfo queries. That makes sense to me. There's one catch: since link width/speed active can change under the covers, this info cannot be cached by the application or there needs to be a tie in to an event for either of these (1.2) or link width active (1.1) changing. -- Hal From roland at topspin.com Sat Dec 18 09:55:48 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 09:55:48 -0800 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <1103388127.4214.1218.camel@localhost.localdomain> (Hal Rosenstock's message of "18 Dec 2004 11:42:07 -0500") References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> <1103347112.4214.795.camel@localhost.localdomain> <521xdnobgv.fsf@topspin.com> <1103388127.4214.1218.camel@localhost.localdomain> Message-ID: <52wtvfmqxn.fsf@topspin.com> Hal> That makes sense to me. There's one catch: since link Hal> width/speed active can change under the covers, this info Hal> cannot be cached by the application or there needs to be a Hal> tie in to an event for either of these (1.2) or link width Hal> active (1.1) changing. Surely link width and/or speed can't change without the port state changing, can they? As I understand it, the link layer can't renegotiate this sort of thing without bringing the link down. In which case ULPs only need to refresh their rate information when they get one of the existing port state change events. (If the link rate can change without any notice, then static rate for RC is meaningless since a link could change from 4X to 1X without disturbing existing connections). 
- Roland From roland at topspin.com Sat Dec 18 09:58:15 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 09:58:15 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <528y7vobze.fsf@topspin.com> (Roland Dreier's message of "Sat, 18 Dec 2004 07:35:49 -0800") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <528y7vobze.fsf@topspin.com> Message-ID: <52sm63mqtk.fsf@topspin.com> Roland> I may be missing something, but it seems to me that we get Roland> all of the benefits of LLTX by just documenting that Roland> device drivers can use the xmit_lock member of struct Roland> net_device to synchronize other parts of the driver Roland> against hard_start_xmit. I guess the driver also should Roland> set xmit_lock_owner to -1 after it acquires xmit_lock. Thinking about this a little more, I realize that there's no reason for the driver to set xmit_lock_owner -- if the driver is able to acquire the lock, then xmit_lock_owner will already be -1. So it seems LLTX can be replaced by just having drivers use net_device.xmit_lock instead of their own private tx_lock. Assuming this works (and I don't see anything wrong with it) then this seems like a pretty nice solution: we remove some code from the networking core and get rid of all the "trylock" logic in driver's hard_start_xmit. - Roland From halr at voltaire.com Sat Dec 18 09:58:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2004 12:58:44 -0500 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <52wtvfmqxn.fsf@topspin.com> References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> <1103347112.4214.795.camel@localhost.localdomain> <521xdnobgv.fsf@topspin.com> <1103388127.4214.1218.camel@localhost.localdomain> <52wtvfmqxn.fsf@topspin.com> Message-ID: <1103392724.4214.1221.camel@localhost.localdomain> On Sat, 2004-12-18 at 12:55, Roland Dreier wrote: > Surely link width and/or speed can't change without the port state > changing, can they? As I understand it, the link layer can't > renegotiate this sort of thing without bringing the link down. In > which case ULPs only need to refresh their rate information when they > get one of the existing port state change events. Yes, port state change is sufficient to take care of this. > (If the link rate can change without any notice, then static rate for > RC is meaningless since a link could change from 4X to 1X without > disturbing existing connections). -- Hal From roland at topspin.com Sat Dec 18 10:26:40 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 10:26:40 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <52sm63mqtk.fsf@topspin.com> (Roland Dreier's message of "Sat, 18 Dec 2004 09:58:15 -0800") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <528y7vobze.fsf@topspin.com> <52sm63mqtk.fsf@topspin.com> Message-ID: <52oegrmpi7.fsf@topspin.com> Roland> So it seems LLTX can be replaced by just having drivers Roland> use net_device.xmit_lock instead of their own private Roland> tx_lock. Assuming this works (and I don't see anything Roland> wrong with it) then this seems like a pretty nice Roland> solution: we remove some code from the networking core and Roland> get rid of all the "trylock" logic in driver's Roland> hard_start_xmit. Actually trying it instead of talking out of my ass... 
Just doing this naively without changing the net core can deadlock because the net core acquires dev->xmit_lock without disabling interrupts. So if the driver tries to use xmit_lock in its interrupt handler, it will deadlock if the interrupt occurs during hard_start_xmit. Even doing local_irq_save() in hard_start_xmit isn't good enough, because there's still a window between the net core's call to hard_start_xmit and the actual local_irq_save where xmit_lock is held with interrupts on. Maybe it makes sense to change NETIF_F_LLTX to NETIF_F_TX_IRQ_DIS or something like that and have that flag mean "disable interrupts when calling hard_start_xmit." (We could just do this unconditionally, but I'm not sure if any drivers rely on having interrupts enabled during hard_start_xmit and I'm worried about making a change in semantics like that -- not to mention some drivers may not need interrupts disabled and may not want the cost). - Roland From shaharf at voltaire.com Sun Dec 19 10:26:16 2004 From: shaharf at voltaire.com (shaharf) Date: Sun, 19 Dec 2004 20:26:16 +0200 Subject: [openib-general] osm Message-ID: Hi all, I took the liberty to commit the user-mode tree even though it is far from complete or stable. However, the tree contains several applications and utilities that may help all of us, even at this preliminary stage. I renamed the 1362 osm tree to be osm.old and I started a new osm tree that is not based on the old, rather it is based on gen1 osm tree. I also changed the tree structure. I have no intention to stick to the old and very ugly osm tree structure (that was inherited from the old ibal tree). I also rewrote all Makefiles. The new tree is (rev 1366): Userspace: Makefile make.rules make.inc include libcommon/* libumad/* util/ mad_test/* diags/ host ibstat scripts net smpdump osm/ complib include complib iba opensm To make this tree, please edit make.inc and set OPENIB_ROOT to the .../trunk/src dir. Then do make && make install Libraries will be installed at $OPENIB_ROOT/lib and binaries at $OPENIB_ROOT/bin. If you don't like that, change it in make.inc. Hopefully everything will compile and install, and then you can run opensm with $OPENIB_ROOT/bin/opensm & It will run on the first existing port. Please verify that it is active. You can override that by using "-g portguid_in_hex". In any case of problems where you want me to assist, please run opensm with -V and send me the log file (/tmp/osm.log). Other utilities: Ibstat - shows host adapter status Ibstatus - almost the same, but implemented as a script Mad_test - simple umad interface test program based on Roland's test program (ported to C) Smpdump - simple solicited SMP query tool. Shows its output as a hex dump (unless requested otherwise, e.g. using -s). Don't forget to modprobe ib_umad and configure udev before using the userspace stuff. Shahar -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Sun Dec 19 11:10:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2004 14:10:54 -0500 Subject: [openib-general] osm In-Reply-To: References: Message-ID: <1103483454.4214.1248.camel@localhost.localdomain> On Sun, 2004-12-19 at 13:26, shaharf wrote: > I took the liberty to commit the user-mode tree even > though it is far from complete or stable. Great. Continue to inform us as different tools become stable. > I renamed the 1362 osm tree to be osm.old and I started a > new osm tree that is not based on the old, rather it is based on gen1 > osm tree.
I also changed the tree structure. I have no intension to > stick to the old and very ugly osm tree structure (that was inherited > from the old ibal tree). Does anyone think we need to keep osm.old in the gen2 tree ? > I also rewrote all Makefiles. A subsequent step will be to do the build using autotools. A lot of what you wrote should end up in a readme or FAQ. -- Hal From hadi at cyberus.ca Sun Dec 19 11:31:15 2004 From: hadi at cyberus.ca (jamal) Date: 19 Dec 2004 14:31:15 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <20041217214432.07b7b21e.davem@davemloft.net> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> Message-ID: <1103484675.1050.158.camel@jzny.localdomain> On Sat, 2004-12-18 at 00:44, David S. Miller wrote: > Perhaps one way to fix this is to add a pointer to a spinlock to > the netdev struct, and have hold that the upper level grab that > when NETIF_F_LLTX when doing queue state checks. Actually, that > could end up being racy too. How about releasing the qlock only when the LLTX transmit lock is grabbed? That should bring it to par with what it was originally. cheers, jamal From hadi at cyberus.ca Sun Dec 19 11:33:55 2004 From: hadi at cyberus.ca (jamal) Date: 19 Dec 2004 14:33:55 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <528y7vobze.fsf@topspin.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <528y7vobze.fsf@topspin.com> Message-ID: <1103484835.1050.160.camel@jzny.localdomain> On Sat, 2004-12-18 at 10:35, Roland Dreier wrote: > By the way, am I correct in thinking that the use of xmit_lock_owner > in qdisc_restart() is racy? No. > if (!spin_trylock(&dev->xmit_lock)) { > /* get the lock */ > > if (!spin_trylock(&dev->xmit_lock)) { > /* fail */ > if (dev->xmit_lock_owner == smp_processor_id()) { > /* test the wrong value */ > > /* set the value too late */ > dev->xmit_lock_owner = smp_processor_id(); > The setting is protected by the queue lock. So no race. cheers, jamal From hadi at cyberus.ca Sun Dec 19 11:54:51 2004 From: hadi at cyberus.ca (jamal) Date: 19 Dec 2004 14:54:51 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103484675.1050.158.camel@jzny.localdomain> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> Message-ID: <1103486090.1047.166.camel@jzny.localdomain> On Sun, 2004-12-19 at 14:31, jamal wrote: > How about releasing the qlock only when the LLTX transmit lock is > grabbed? That should bring it to par with what it was originally. Something like two attached patches... one showing how to do it for e1000. untested (not even compiled) cheers, jamal -------------- next part -------------- --- a/net/sched/bak.sch_generic.c 2004-12-19 13:46:19.799356432 -0500 +++ b/net/sched/sch_generic.c 2004-12-19 13:49:14.384815408 -0500 @@ -128,12 +128,11 @@ } /* Remember that the driver is grabbed by us. 
*/ dev->xmit_lock_owner = smp_processor_id(); + /* And release queue */ + spin_unlock(&dev->queue_lock); } { - /* And release queue */ - spin_unlock(&dev->queue_lock); - if (!netif_queue_stopped(dev)) { int ret; if (netdev_nit) @@ -141,15 +140,14 @@ ret = dev->hard_start_xmit(skb, dev); if (ret == NETDEV_TX_OK) { + dev->xmit_lock_owner = -1; if (!nolock) { - dev->xmit_lock_owner = -1; spin_unlock(&dev->xmit_lock); } spin_lock(&dev->queue_lock); return -1; } if (ret == NETDEV_TX_LOCKED && nolock) { - spin_lock(&dev->queue_lock); goto collision; } } -------------- next part -------------- --- a/drivers/net/e1000/b.e1000_main.c 2004-12-19 13:50:59.481838232 -0500 +++ b/drivers/net/e1000/e1000_main.c 2004-12-19 13:53:34.326298296 -0500 @@ -1809,6 +1809,10 @@ return NETDEV_TX_LOCKED; } + /* new from sch_generic for LLTX */ + spin_unlock(&dev->queue_lock); + dev->xmit_lock_owner = smp_processor_id(); + /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) { From hadi at znyx.com Sun Dec 19 12:02:24 2004 From: hadi at znyx.com (Jamal Hadi Salim) Date: 19 Dec 2004 15:02:24 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103486090.1047.166.camel@jzny.localdomain> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <1103486090.1047.166.camel@jzny.localdomain> Message-ID: <1103486544.1049.169.camel@jzny.localdomain> On Sun, 2004-12-19 at 14:54, jamal wrote: > On Sun, 2004-12-19 at 14:31, jamal wrote: > > > How about releasing the qlock only when the LLTX transmit lock is > > grabbed? That should bring it to par with what it was originally. > > Something like two attached patches... one showing how to do it for > e1000. untested (not even compiled) After attempting to compile, p2 take2. cheers, jamal -------------- next part -------------- --- 2610-rc3-bk12/drivers/net/e1000/bak.e1000_main.c 2004-12-19 13:50:59.000000000 -0500 +++ 2610-rc3-bk12/drivers/net/e1000/e1000_main.c 2004-12-19 13:58:17.000000000 -0500 @@ -1809,6 +1809,10 @@ return NETDEV_TX_LOCKED; } + /* new from sch_generic for LLTX */ + spin_unlock(&netdev->queue_lock); + netdev->xmit_lock_owner = smp_processor_id(); + /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) { From roland at topspin.com Sun Dec 19 14:35:26 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 14:35:26 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103484675.1050.158.camel@jzny.localdomain> (jamal's message of "19 Dec 2004 14:31:15 -0500") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> Message-ID: <52fz21ncgh.fsf@topspin.com> jamal> How about releasing the qlock only when the LLTX transmit jamal> lock is grabbed? That should bring it to par with what it jamal> was originally. This seems a little risky. I can't point to a specific deadlock but it doesn't seem right on general principles to unlock in a different order than you nested the locks when acquiring them -- if I understand correctly, you're suggesting lock(queue_lock), lock(tx_lock), unlock(queue_lock), unlock(tx_lock). 
Thanks, Roland From hadi at cyberus.ca Sun Dec 19 15:06:32 2004 From: hadi at cyberus.ca (jamal) Date: 19 Dec 2004 18:06:32 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <52fz21ncgh.fsf@topspin.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <52fz21ncgh.fsf@topspin.com> Message-ID: <1103497592.1046.235.camel@jzny.localdomain> On Sun, 2004-12-19 at 17:35, Roland Dreier wrote: > jamal> How about releasing the qlock only when the LLTX transmit > jamal> lock is grabbed? That should bring it to par with what it > jamal> was originally. > > This seems a little risky. I can't point to a specific deadlock but > it doesn't seem right on general principles to unlock in a different > order than you nested the locks when acquiring them -- if I understand > correctly, you're suggesting lock(queue_lock), lock(tx_lock), > unlock(queue_lock), unlock(tx_lock). There is no deadlock. That's exactly how things work. Try the patches I posted. cheers, jamal From roland at topspin.com Sun Dec 19 15:11:29 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 15:11:29 -0800 Subject: [openib-general] osm In-Reply-To: (shaharf@voltaire.com's message of "Sun, 19 Dec 2004 20:26:16 +0200") References: Message-ID: <52brcpnase.fsf@topspin.com> shaharf> Hi all, I took the liberty to commit the user-mode tree shaharf> even though it is far from complete or stable. However, shaharf> the tree contains several applications and utilities that shaharf> may help all of us, even at this preliminary stage. That's great; better to commit early and get feedback than to work privately and end up going in the wrong direction. A couple of quick suggestions: - Add copyright/license headers to all your source files - Use GNU autotools instead of writing your own Makefiles - Roland From roland at topspin.com Sun Dec 19 15:16:58 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 15:16:58 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103497592.1046.235.camel@jzny.localdomain> (jamal's message of "19 Dec 2004 18:06:32 -0500") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <52fz21ncgh.fsf@topspin.com> <1103497592.1046.235.camel@jzny.localdomain> Message-ID: <527jndnaj9.fsf@topspin.com> Roland> This seems a little risky. I can't point to a specific Roland> deadlock but it doesn't seem right on general principles Roland> to unlock in a different order than you nested the locks Roland> when acquiring them -- if I understand correctly, you're Roland> suggesting lock(queue_lock), lock(tx_lock), Roland> unlock(queue_lock), unlock(tx_lock). jamal> There is no deadlock. That's exactly how things work. Try jamal> the patches I posted. Thinking about it more, I guess it's OK. I was just thinking about the basic general rule that locks need to be acquired/released in LIFO order to avoid deadlock (e.g. lock(A), lock(B), unlock(B), unlock(A)). However in this case unlocking queue_lock after acquiring the driver's tx_lock seems to be OK because the driver does a trylock on tx_lock in the xmit path, so it can't deadlock. At worst the trylock will just fail to get the lock.
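To make the trylock argument concrete, here is a small standalone sketch (plain userspace pthreads, not the actual net core or driver code; the queue_lock/tx_lock names just mirror the discussion) of the ordering in question: take queue_lock, trylock tx_lock, drop queue_lock while still holding tx_lock, transmit, then drop tx_lock. Because the inner lock is only ever taken with a trylock, the non-LIFO release order cannot deadlock; a contended trylock simply fails and the caller backs off with queue_lock still held.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tx_lock    = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 if a packet was "transmitted", -1 on a tx_lock collision. */
static int xmit_one(const char *who)
{
	pthread_mutex_lock(&queue_lock);	/* dequeue under queue_lock */

	if (pthread_mutex_trylock(&tx_lock)) {
		/* collision: someone else is in the transmit path;
		 * back off (the real code would requeue and retry) */
		pthread_mutex_unlock(&queue_lock);
		return -1;
	}

	pthread_mutex_unlock(&queue_lock);	/* hand over: keep only tx_lock */
	printf("%s: hard_start_xmit()\n", who);	/* driver transmit stands in here */
	pthread_mutex_unlock(&tx_lock);
	return 0;
}

int main(void)
{
	xmit_one("cpu0");
	xmit_one("cpu1");
	return 0;
}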
Thanks, Roland From roland at topspin.com Sun Dec 19 21:23:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 21:23:14 -0800 Subject: [openib-general] osm In-Reply-To: <52brcpnase.fsf@topspin.com> (Roland Dreier's message of "Sun, 19 Dec 2004 15:11:29 -0800") References: <52brcpnase.fsf@topspin.com> Message-ID: <523by1mtkt.fsf@topspin.com> One more suggestion: consider using libsysfs (which every modern distro will ship) instead reimplementing it in libcommon/sysfs.c. - R. From roland at topspin.com Sun Dec 19 22:14:43 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:43 -0800 Subject: [openib-general] [PATCH][v4][0/24] Second InfiniBand merge candidate patch set Message-ID: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> The following series of patches is the latest version of the OpenIB InfiniBand drivers. We believe that this version is suitable for merging when 2.6.11 opens (or into -mm immediately), although of course we are willing to go through as many more iterations as required to fix any remaining issues. The previous posting last week did not generate many comments. We have fixed all the minor glitches identified, and updated the copyright/license comments in the source files so that no external files or URLs other than the main COPYING file are referenced (the license remains unchanged as dual GPL/BSD). In addition a fair bit of testing has been done and several bugs have been fixed over the past week. The IP-over-InfiniBand driver now appears to work well on a broad range of hardware. Merging these patches should be low-risk: the only non-documentation changes outside of the new drivers/infiniband tree are a small set of networking patches to handle devices of type AF_INFINIBAND where appropriate. These patches have been posted to netdev at oss.sgi.com multiple times. Linus/Andrew - is there any further that you want to see done, or do these patches look acceptable to you? Thanks, Roland Dreier OpenIB Alliance www.openib.org From roland at topspin.com Sun Dec 19 22:14:46 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:46 -0800 Subject: [openib-general] [PATCH][v4][1/24] Add core InfiniBand support (public headers) In-Reply-To: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> Message-ID: <200412192214.PJDxbje135ozyhtX@topspin.com> Add public headers for core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_cache.h 2004-12-19 22:04:11.573993560 -0800 @@ -0,0 +1,53 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef _IB_CACHE_H +#define _IB_CACHE_H + +#include + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid); +int ib_cached_pkey_get(struct ib_device *device_handle, + u8 port, + int index, + u16 *pkey); +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index); + +#endif /* _IB_CACHE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_fmr_pool.h 2004-12-19 22:04:11.597990024 -0800 @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_fmr_pool.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#if !defined(IB_FMR_POOL_H) +#define IB_FMR_POOL_H + +#include + +struct ib_fmr_pool; + +/** + * struct ib_fmr_pool_param - Parameters for creating FMR pool + * @max_pages_per_fmr:Maximum number of pages per map request. + * @access:Access flags for FMRs in pool. + * @pool_size:Number of FMRs to allocate for pool. + * @dirty_watermark:Flush is triggered when @dirty_watermark dirty + * FMRs are present. + * @flush_function:Callback called when unmapped FMRs are flushed and + * more FMRs are possibly available for mapping + * @flush_arg:Context passed to user's flush function. + * @cache:If set, FMRs may be reused after unmapping for identical map + * requests. 
+ */ +struct ib_fmr_pool_param { + int max_pages_per_fmr; + enum ib_access_flags access; + int pool_size; + int dirty_watermark; + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + unsigned cache:1; +}; + +struct ib_pool_fmr { + struct ib_fmr *fmr; + struct ib_fmr_pool *pool; + struct list_head list; + struct hlist_node cache_node; + int ref_count; + int remap_count; + u64 io_virtual_address; + int page_list_len; + u64 page_list[0]; +}; + +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr); + +#endif /* IB_FMR_POOL_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_pack.h 2004-12-19 22:04:11.622986340 -0800 @@ -0,0 +1,245 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_pack.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef IB_PACK_H +#define IB_PACK_H + +#include + +enum { + IB_LRH_BYTES = 8, + IB_GRH_BYTES = 40, + IB_BTH_BYTES = 12, + IB_DETH_BYTES = 8 +}; + +struct ib_field { + size_t struct_offset_bytes; + size_t struct_size_bytes; + int offset_words; + int offset_bits; + int size_bits; + char *field_name; +}; + +#define RESERVED \ + .field_name = "reserved" + +/* + * This macro cleans up the definitions of constants for BTH opcodes. + * It is used to define constants such as IB_OPCODE_UD_SEND_ONLY, + * which becomes IB_OPCODE_UD + IB_OPCODE_SEND_ONLY, and this gives + * the correct value. + * + * In short, user code should use the constants defined using the + * macro rather than worrying about adding together other constants. 
+*/ +#define IB_OPCODE(transport, op) \ + IB_OPCODE_ ## transport ## _ ## op = \ + IB_OPCODE_ ## transport + IB_OPCODE_ ## op + +enum { + /* transport types -- just used to define real constants */ + IB_OPCODE_RC = 0x00, + IB_OPCODE_UC = 0x20, + IB_OPCODE_RD = 0x40, + IB_OPCODE_UD = 0x60, + + /* operations -- just used to define real constants */ + IB_OPCODE_SEND_FIRST = 0x00, + IB_OPCODE_SEND_MIDDLE = 0x01, + IB_OPCODE_SEND_LAST = 0x02, + IB_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + IB_OPCODE_SEND_ONLY = 0x04, + IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + IB_OPCODE_RDMA_WRITE_FIRST = 0x06, + IB_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + IB_OPCODE_RDMA_WRITE_LAST = 0x08, + IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + IB_OPCODE_RDMA_WRITE_ONLY = 0x0a, + IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + IB_OPCODE_RDMA_READ_REQUEST = 0x0c, + IB_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + IB_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + IB_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + IB_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + IB_OPCODE_ACKNOWLEDGE = 0x11, + IB_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + IB_OPCODE_COMPARE_SWAP = 0x13, + IB_OPCODE_FETCH_ADD = 0x14, + + /* real constants follow -- see comment about above IB_OPCODE() + macro for more details */ + + /* RC */ + IB_OPCODE(RC, SEND_FIRST), + IB_OPCODE(RC, SEND_MIDDLE), + IB_OPCODE(RC, SEND_LAST), + IB_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, SEND_ONLY), + IB_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_FIRST), + IB_OPCODE(RC, RDMA_WRITE_MIDDLE), + IB_OPCODE(RC, RDMA_WRITE_LAST), + IB_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_ONLY), + IB_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_READ_REQUEST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RC, ACKNOWLEDGE), + IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RC, COMPARE_SWAP), + IB_OPCODE(RC, FETCH_ADD), + + /* UC */ + IB_OPCODE(UC, SEND_FIRST), + IB_OPCODE(UC, SEND_MIDDLE), + IB_OPCODE(UC, SEND_LAST), + IB_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, SEND_ONLY), + IB_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_FIRST), + IB_OPCODE(UC, RDMA_WRITE_MIDDLE), + IB_OPCODE(UC, RDMA_WRITE_LAST), + IB_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_ONLY), + IB_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + + /* RD */ + IB_OPCODE(RD, SEND_FIRST), + IB_OPCODE(RD, SEND_MIDDLE), + IB_OPCODE(RD, SEND_LAST), + IB_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, SEND_ONLY), + IB_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_FIRST), + IB_OPCODE(RD, RDMA_WRITE_MIDDLE), + IB_OPCODE(RD, RDMA_WRITE_LAST), + IB_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_ONLY), + IB_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_READ_REQUEST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RD, ACKNOWLEDGE), + IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RD, COMPARE_SWAP), + IB_OPCODE(RD, FETCH_ADD), + + /* UD */ + IB_OPCODE(UD, SEND_ONLY), + IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) +}; + +enum { + IB_LNH_RAW = 0, + IB_LNH_IP = 1, + IB_LNH_IBA_LOCAL = 2, + IB_LNH_IBA_GLOBAL = 3 +}; + +struct ib_unpacked_lrh { + u8 virtual_lane; + u8 link_version; + u8 service_level; + u8 
link_next_header; + __be16 destination_lid; + __be16 packet_length; + __be16 source_lid; +}; + +struct ib_unpacked_grh { + u8 ip_version; + u8 traffic_class; + __be32 flow_label; + __be16 payload_length; + u8 next_header; + u8 hop_limit; + union ib_gid source_gid; + union ib_gid destination_gid; +}; + +struct ib_unpacked_bth { + u8 opcode; + u8 solicited_event; + u8 mig_req; + u8 pad_count; + u8 transport_header_version; + __be16 pkey; + __be32 destination_qpn; + u8 ack_req; + __be32 psn; +}; + +struct ib_unpacked_deth { + __be32 qkey; + __be32 source_qpn; +}; + +struct ib_ud_header { + struct ib_unpacked_lrh lrh; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf); + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure); + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header); + +#endif /* IB_PACK_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_verbs.h 2004-12-19 22:04:11.647982657 -0800 @@ -0,0 +1,1236 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#if !defined(IB_VERBS_H) +#define IB_VERBS_H + +#include +#include +#include + +union ib_gid { + u8 raw[16]; + struct { + u64 subnet_prefix; + u64 interface_id; + } global; +}; + +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + +enum ib_device_cap_flags { + IB_DEVICE_RESIZE_MAX_WR = 1, + IB_DEVICE_BAD_PKEY_CNTR = (1<<1), + IB_DEVICE_BAD_QKEY_CNTR = (1<<2), + IB_DEVICE_RAW_MULTI = (1<<3), + IB_DEVICE_AUTO_PATH_MIG = (1<<4), + IB_DEVICE_CHANGE_PHY_PORT = (1<<5), + IB_DEVICE_UD_AV_PORT_ENFORCE = (1<<6), + IB_DEVICE_CURR_QP_STATE_MOD = (1<<7), + IB_DEVICE_SHUTDOWN_PORT = (1<<8), + IB_DEVICE_INIT_TYPE = (1<<9), + IB_DEVICE_PORT_ACTIVE_EVENT = (1<<10), + IB_DEVICE_SYS_IMAGE_GUID = (1<<11), + IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), + IB_DEVICE_SRQ_RESIZE = (1<<13), + IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_RQ_SIG_TYPE = (1<<15) +}; + +enum ib_atomic_cap { + IB_ATOMIC_NONE, + IB_ATOMIC_HCA, + IB_ATOMIC_GLOB +}; + +struct ib_device_attr { + u64 fw_ver; + u64 node_guid; + u64 sys_image_guid; + u64 max_mr_size; + u64 page_size_cap; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + int max_qp; + int max_qp_wr; + int device_cap_flags; + int max_sge; + int max_sge_rd; + int max_cq; + int max_cqe; + int max_mr; + int max_pd; + int max_qp_rd_atom; + int max_ee_rd_atom; + int max_res_rd_atom; + int max_qp_init_rd_atom; + int max_ee_init_rd_atom; + enum ib_atomic_cap atomic_cap; + int max_ee; + int max_rdd; + int max_mw; + int max_raw_ipv6_qp; + int max_raw_ethy_qp; + int max_mcast_grp; + int max_mcast_qp_attach; + int max_total_mcast_qp_attach; + int max_ah; + int max_fmr; + int max_map_per_fmr; + int max_srq; + int max_srq_wr; + int max_srq_sge; + u16 max_pkeys; + u8 local_ca_ack_delay; +}; + +enum ib_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +enum ib_static_rate { + IB_STATIC_RATE_FULL = 0, + IB_STATIC_RATE_12X_TO_4X = 2, + IB_STATIC_RATE_4X_TO_1X = 3, + IB_STATIC_RATE_12X_TO_1X = 11 +}; + +enum ib_port_state { + IB_PORT_NOP = 0, + IB_PORT_DOWN = 1, + IB_PORT_INIT = 2, + IB_PORT_ARMED = 3, + IB_PORT_ACTIVE = 4, + IB_PORT_ACTIVE_DEFER = 5 +}; + +enum ib_port_cap_flags { + IB_PORT_SM = (1<<31), + IB_PORT_NOTICE_SUP = (1<<30), + IB_PORT_TRAP_SUP = (1<<29), + IB_PORT_AUTO_MIGR_SUP = (1<<27), + IB_PORT_SL_MAP_SUP = (1<<26), + IB_PORT_MKEY_NVRAM = (1<<25), + IB_PORT_PKEY_NVRAM = (1<<24), + IB_PORT_LED_INFO_SUP = (1<<23), + IB_PORT_SM_DISABLED = (1<<22), + IB_PORT_SYS_IMAGE_GUID_SUP = (1<<21), + IB_PORT_PKEY_SW_EXT_PORT_TRAP_SUP = (1<<20), + IB_PORT_CM_SUP = (1<<16), + IB_PORT_SNMP_TUNNEL_SUP = (1<<15), + IB_PORT_REINIT_SUP = (1<<14), + IB_PORT_DEVICE_MGMT_SUP = (1<<13), + IB_PORT_VENDOR_CLASS_SUP = (1<<12), + IB_PORT_DR_NOTICE_SUP = (1<<11), + IB_PORT_PORT_NOTICE_SUP = (1<<10), + IB_PORT_BOOT_MGMT_SUP = (1<<9) +}; + +struct ib_port_attr { + enum ib_port_state state; + enum ib_mtu max_mtu; + enum ib_mtu active_mtu; + int gid_tbl_len; + u32 port_cap_flags; + u32 max_msg_sz; + u32 bad_pkey_cntr; + u32 qkey_viol_cntr; + u16 pkey_tbl_len; + u16 lid; + u16 sm_lid; + u8 lmc; + u8 max_vl_num; + u8 sm_sl; + u8 subnet_timeout; + u8 init_type_reply; +}; + +enum ib_device_modify_flags { + 
IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 +}; + +struct ib_device_modify { + u64 sys_image_guid; +}; + +enum ib_port_modify_flags { + IB_PORT_SHUTDOWN = 1, + IB_PORT_INIT_TYPE = (1<<2), + IB_PORT_RESET_QKEY_CNTR = (1<<3) +}; + +struct ib_port_modify { + u32 set_port_cap_mask; + u32 clr_port_cap_mask; + u8 init_type; +}; + +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + +struct ib_global_route { + union ib_gid dgid; + u32 flow_label; + u8 sgid_index; + u8 hop_limit; + u8 traffic_class; +}; + +enum { + IB_MULTICAST_QPN = 0xffffff +}; + +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + +struct ib_ah_attr { + struct ib_global_route grh; + u16 dlid; + u8 sl; + u8 src_path_bits; + u8 static_rate; + u8 ah_flags; + u8 port_num; +}; + +enum ib_wc_status { + IB_WC_SUCCESS, + IB_WC_LOC_LEN_ERR, + IB_WC_LOC_QP_OP_ERR, + IB_WC_LOC_EEC_OP_ERR, + IB_WC_LOC_PROT_ERR, + IB_WC_WR_FLUSH_ERR, + IB_WC_MW_BIND_ERR, + IB_WC_BAD_RESP_ERR, + IB_WC_LOC_ACCESS_ERR, + IB_WC_REM_INV_REQ_ERR, + IB_WC_REM_ACCESS_ERR, + IB_WC_REM_OP_ERR, + IB_WC_RETRY_EXC_ERR, + IB_WC_RNR_RETRY_EXC_ERR, + IB_WC_LOC_RDD_VIOL_ERR, + IB_WC_REM_INV_RD_REQ_ERR, + IB_WC_REM_ABORT_ERR, + IB_WC_INV_EECN_ERR, + IB_WC_INV_EEC_STATE_ERR, + IB_WC_FATAL_ERR, + IB_WC_RESP_TIMEOUT_ERR, + IB_WC_GENERAL_ERR +}; + +enum ib_wc_opcode { + IB_WC_SEND, + IB_WC_RDMA_WRITE, + IB_WC_RDMA_READ, + IB_WC_COMP_SWAP, + IB_WC_FETCH_ADD, + IB_WC_BIND_MW, +/* + * Set value of IB_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & IB_WC_RECV). + */ + IB_WC_RECV = 1 << 7, + IB_WC_RECV_RDMA_WITH_IMM +}; + +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + +struct ib_wc { + u64 wr_id; + enum ib_wc_status status; + enum ib_wc_opcode opcode; + u32 vendor_err; + u32 byte_len; + __be32 imm_data; + u32 src_qp; + int wc_flags; + u16 pkey_index; + u16 slid; + u8 sl; + u8 dlid_path_bits; + u8 port_num; /* valid only for DR SMPs on switches */ +}; + +enum ib_cq_notify { + IB_CQ_SOLICITED, + IB_CQ_NEXT_COMP +}; + +struct ib_qp_cap { + u32 max_send_wr; + u32 max_recv_wr; + u32 max_send_sge; + u32 max_recv_sge; + u32 max_inline_data; +}; + +enum ib_sig_type { + IB_SIGNAL_ALL_WR, + IB_SIGNAL_REQ_WR +}; + +enum ib_qp_type { + /* + * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries + * here (and in that order) since the MAD layer uses them as + * indices into a 2-entry table. 
+ */ + IB_QPT_SMI, + IB_QPT_GSI, + + IB_QPT_RC, + IB_QPT_UC, + IB_QPT_UD, + IB_QPT_RAW_IPV6, + IB_QPT_RAW_ETY +}; + +struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + struct ib_qp_cap cap; + enum ib_sig_type sq_sig_type; + enum ib_sig_type rq_sig_type; + enum ib_qp_type qp_type; + u8 port_num; /* special QP types only */ +}; + +enum ib_rnr_timeout { + IB_RNR_TIMER_655_36 = 0, + IB_RNR_TIMER_000_01 = 1, + IB_RNR_TIMER_000_02 = 2, + IB_RNR_TIMER_000_03 = 3, + IB_RNR_TIMER_000_04 = 4, + IB_RNR_TIMER_000_06 = 5, + IB_RNR_TIMER_000_08 = 6, + IB_RNR_TIMER_000_12 = 7, + IB_RNR_TIMER_000_16 = 8, + IB_RNR_TIMER_000_24 = 9, + IB_RNR_TIMER_000_32 = 10, + IB_RNR_TIMER_000_48 = 11, + IB_RNR_TIMER_000_64 = 12, + IB_RNR_TIMER_000_96 = 13, + IB_RNR_TIMER_001_28 = 14, + IB_RNR_TIMER_001_92 = 15, + IB_RNR_TIMER_002_56 = 16, + IB_RNR_TIMER_003_84 = 17, + IB_RNR_TIMER_005_12 = 18, + IB_RNR_TIMER_007_68 = 19, + IB_RNR_TIMER_010_24 = 20, + IB_RNR_TIMER_015_36 = 21, + IB_RNR_TIMER_020_48 = 22, + IB_RNR_TIMER_030_72 = 23, + IB_RNR_TIMER_040_96 = 24, + IB_RNR_TIMER_061_44 = 25, + IB_RNR_TIMER_081_92 = 26, + IB_RNR_TIMER_122_88 = 27, + IB_RNR_TIMER_163_84 = 28, + IB_RNR_TIMER_245_76 = 29, + IB_RNR_TIMER_327_68 = 30, + IB_RNR_TIMER_491_52 = 31 +}; + +enum ib_qp_attr_mask { + IB_QP_STATE = 1, + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), + IB_QP_ACCESS_FLAGS = (1<<3), + IB_QP_PKEY_INDEX = (1<<4), + IB_QP_PORT = (1<<5), + IB_QP_QKEY = (1<<6), + IB_QP_AV = (1<<7), + IB_QP_PATH_MTU = (1<<8), + IB_QP_TIMEOUT = (1<<9), + IB_QP_RETRY_CNT = (1<<10), + IB_QP_RNR_RETRY = (1<<11), + IB_QP_RQ_PSN = (1<<12), + IB_QP_MAX_QP_RD_ATOMIC = (1<<13), + IB_QP_ALT_PATH = (1<<14), + IB_QP_MIN_RNR_TIMER = (1<<15), + IB_QP_SQ_PSN = (1<<16), + IB_QP_MAX_DEST_RD_ATOMIC = (1<<17), + IB_QP_PATH_MIG_STATE = (1<<18), + IB_QP_CAP = (1<<19), + IB_QP_DEST_QPN = (1<<20) +}; + +enum ib_qp_state { + IB_QPS_RESET, + IB_QPS_INIT, + IB_QPS_RTR, + IB_QPS_RTS, + IB_QPS_SQD, + IB_QPS_SQE, + IB_QPS_ERR +}; + +enum ib_mig_state { + IB_MIG_MIGRATED, + IB_MIG_REARM, + IB_MIG_ARMED +}; + +struct ib_qp_attr { + enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; + enum ib_mtu path_mtu; + enum ib_mig_state path_mig_state; + u32 qkey; + u32 rq_psn; + u32 sq_psn; + u32 dest_qp_num; + int qp_access_flags; + struct ib_qp_cap cap; + struct ib_ah_attr ah_attr; + struct ib_ah_attr alt_ah_attr; + u16 pkey_index; + u16 alt_pkey_index; + u8 en_sqd_async_notify; + u8 sq_draining; + u8 max_rd_atomic; + u8 max_dest_rd_atomic; + u8 min_rnr_timer; + u8 port_num; + u8 timeout; + u8 retry_cnt; + u8 rnr_retry; + u8 alt_port_num; + u8 alt_timeout; +}; + +enum ib_wr_opcode { + IB_WR_RDMA_WRITE, + IB_WR_RDMA_WRITE_WITH_IMM, + IB_WR_SEND, + IB_WR_SEND_WITH_IMM, + IB_WR_RDMA_READ, + IB_WR_ATOMIC_CMP_AND_SWP, + IB_WR_ATOMIC_FETCH_AND_ADD +}; + +enum ib_send_flags { + IB_SEND_FENCE = 1, + IB_SEND_SIGNALED = (1<<1), + IB_SEND_SOLICITED = (1<<2), + IB_SEND_INLINE = (1<<3) +}; + +enum ib_recv_flags { + IB_RECV_SIGNALED = 1 +}; + +struct ib_sge { + u64 addr; + u32 length; + u32 lkey; +}; + +struct ib_send_wr { + struct ib_send_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + enum ib_wr_opcode opcode; + int send_flags; + u32 imm_data; + union { + struct { + u64 remote_addr; + u32 rkey; + } rdma; + struct { + u64 remote_addr; + u64 compare_add; + u64 swap; + u32 rkey; + } atomic; + struct { + struct ib_ah *ah; + struct ib_mad_hdr 
*mad_hdr; + u32 remote_qpn; + u32 remote_qkey; + int timeout_ms; /* valid for MADs only */ + u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ + } ud; + } wr; +}; + +struct ib_recv_wr { + struct ib_recv_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + int recv_flags; +}; + +enum ib_access_flags { + IB_ACCESS_LOCAL_WRITE = 1, + IB_ACCESS_REMOTE_WRITE = (1<<1), + IB_ACCESS_REMOTE_READ = (1<<2), + IB_ACCESS_REMOTE_ATOMIC = (1<<3), + IB_ACCESS_MW_BIND = (1<<4) +}; + +struct ib_phys_buf { + u64 addr; + u64 size; +}; + +struct ib_mr_attr { + struct ib_pd *pd; + u64 device_virt_addr; + u64 size; + int mr_access_flags; + u32 lkey; + u32 rkey; +}; + +enum ib_mr_rereg_flags { + IB_MR_REREG_TRANS = 1, + IB_MR_REREG_PD = (1<<1), + IB_MR_REREG_ACCESS = (1<<2) +}; + +struct ib_mw_bind { + struct ib_mr *mr; + u64 wr_id; + u64 addr; + u32 length; + int send_flags; + int mw_access_flags; +}; + +struct ib_fmr_attr { + int max_pages; + int max_maps; + u8 page_size; +}; + +struct ib_pd { + struct ib_device *device; + atomic_t usecnt; /* count all resources */ +}; + +struct ib_ah { + struct ib_device *device; + struct ib_pd *pd; +}; + +typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); + +struct ib_cq { + struct ib_device *device; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ +}; + +struct ib_srq { + struct ib_device *device; + struct ib_pd *pd; + void *srq_context; + atomic_t usecnt; +}; + +struct ib_qp { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + u32 qp_num; +}; + +struct ib_mr { + struct ib_device *device; + struct ib_pd *pd; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ +}; + +struct ib_mw { + struct ib_device *device; + struct ib_pd *pd; + u32 rkey; +}; + +struct ib_fmr { + struct ib_device *device; + struct ib_pd *pd; + struct list_head list; + u32 lkey; + u32 rkey; +}; + +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + +#define IB_DEVICE_NAME_MAX 64 + +struct ib_cache { + rwlock_t lock; + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct ib_gid_cache **gid_cache; +}; + +struct ib_device { + struct device *dma_device; + + char name[IB_DEVICE_NAME_MAX]; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + + struct ib_cache cache; + + u32 flags; + + int (*query_device)(struct ib_device *device, + struct ib_device_attr *device_attr); + int (*query_port)(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr); + int (*query_gid)(struct ib_device *device, + u8 port_num, int index, + union ib_gid *gid); + int (*query_pkey)(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + int (*modify_device)(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + int (*modify_port)(struct 
ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + struct ib_pd * (*alloc_pd)(struct ib_device *device); + int (*dealloc_pd)(struct ib_pd *pd); + struct ib_ah * (*create_ah)(struct ib_pd *pd, + struct ib_ah_attr *ah_attr); + int (*modify_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*query_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*destroy_ah)(struct ib_ah *ah); + struct ib_qp * (*create_qp)(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + int (*modify_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + int (*query_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int (*destroy_qp)(struct ib_qp *qp); + int (*post_send)(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + int (*post_recv)(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + struct ib_cq * (*create_cq)(struct ib_device *device, + int cqe); + int (*destroy_cq)(struct ib_cq *cq); + int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*poll_cq)(struct ib_cq *cq, int num_entries, + struct ib_wc *wc); + int (*peek_cq)(struct ib_cq *cq, int wc_cnt); + int (*req_notify_cq)(struct ib_cq *cq, + enum ib_cq_notify cq_notify); + int (*req_ncomp_notif)(struct ib_cq *cq, + int wc_cnt); + struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, + int mr_access_flags); + struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + int (*query_mr)(struct ib_mr *mr, + struct ib_mr_attr *mr_attr); + int (*dereg_mr)(struct ib_mr *mr); + int (*rereg_phys_mr)(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + struct ib_mw * (*alloc_mw)(struct ib_pd *pd); + int (*bind_mw)(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + int (*dealloc_mw)(struct ib_mw *mw); + struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + int (*map_phys_fmr)(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova); + int (*unmap_fmr)(struct list_head *fmr_list); + int (*dealloc_fmr)(struct ib_fmr *fmr); + int (*attach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*detach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); + + struct class_device class_dev; + struct kobject ports_parent; + struct list_head port_list; + + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + + u8 node_type; + u8 phys_port_cnt; +}; + +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); + +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int 
ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr); + +int ib_query_port(struct ib_device *device, + u8 port_num, struct ib_port_attr *port_attr); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + +/** + * ib_alloc_pd - Allocates an unused protection domain. + * @device: The device on which to allocate the protection domain. + * + * A protection domain object provides an association between QPs, shared + * receive queues, address handles, memory regions, and memory windows. + */ +struct ib_pd *ib_alloc_pd(struct ib_device *device); + +/** + * ib_dealloc_pd - Deallocates a protection domain. + * @pd: The protection domain to deallocate. + */ +int ib_dealloc_pd(struct ib_pd *pd); + +/** + * ib_create_ah - Creates an address handle for the given address vector. + * @pd: The protection domain associated with the address handle. + * @ah_attr: The attributes of the address vector. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. + */ +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); + +/** + * ib_modify_ah - Modifies the address vector associated with an address + * handle. + * @ah: The address handle to modify. + * @ah_attr: The new address vector attributes to associate with the + * address handle. + */ +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_query_ah - Queries the address vector associated with an address + * handle. + * @ah: The address handle to query. + * @ah_attr: The address vector attributes associated with the address + * handle. + */ +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_destroy_ah - Destroys an address handle. + * @ah: The address handle to destroy. + */ +int ib_destroy_ah(struct ib_ah *ah); + +/** + * ib_create_qp - Creates a QP associated with the specified protection + * domain. + * @pd: The protection domain associated with the QP. + * @qp_init_attr: A list of initial attributes required to create the QP. + */ +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_modify_qp - Modifies the attributes for the specified QP and then + * transitions the QP to the given state. + * @qp: The QP to modify. + * @qp_attr: On input, specifies the QP attributes to modify. On output, + * the current values of selected QP attributes are returned. + * @qp_attr_mask: A bit-mask used to specify which attributes of the QP + * are being modified. + */ +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + +/** + * ib_query_qp - Returns the attribute list and current values for the + * specified QP. + * @qp: The QP to query. + * @qp_attr: The attributes of the specified QP. + * @qp_attr_mask: A bit-mask used to select specific attributes to query. + * @qp_init_attr: Additional attributes of the selected QP. 
+ * + * The qp_attr_mask may be used to limit the query to gathering only the + * selected attributes. + */ +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_destroy_qp - Destroys the specified QP. + * @qp: The QP to destroy. + */ +int ib_destroy_qp(struct ib_qp *qp); + +/** + * ib_post_send - Posts a list of work requests to the send queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @send_wr: A list of work requests to post on the send queue. + * @bad_send_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + return qp->device->post_send(qp, send_wr, bad_send_wr); +} + +/** + * ib_post_recv - Posts a list of work requests to the receive queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @recv_wr: A list of work requests to post on the receive queue. + * @bad_recv_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return qp->device->post_recv(qp, recv_wr, bad_recv_wr); +} + +/** + * ib_create_cq - Creates a CQ on the specified device. + * @device: The device on which to create the CQ. + * @comp_handler: A user-specified callback that is invoked when a + * completion event occurs on the CQ. + * @event_handler: A user-specified callback that is invoked when an + * asynchronous event not associated with a completion occurs on the CQ. + * @cq_context: Context associated with the CQ returned to the user via + * the associated completion and event handlers. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe); + +/** + * ib_resize_cq - Modifies the capacity of the CQ. + * @cq: The CQ to resize. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +int ib_resize_cq(struct ib_cq *cq, int cqe); + +/** + * ib_destroy_cq - Destroys the specified CQ. + * @cq: The CQ to destroy. + */ +int ib_destroy_cq(struct ib_cq *cq); + +/** + * ib_poll_cq - poll a CQ for completion(s) + * @cq:the CQ being polled + * @num_entries:maximum number of completions to return + * @wc:array of at least @num_entries &struct ib_wc where completions + * will be returned + * + * Poll a CQ for (possibly multiple) completions. If the return value + * is < 0, an error occurred. If the return value is >= 0, it is the + * number of completions returned. If the return value is + * non-negative and < num_entries, then the CQ was emptied. + */ +static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, + struct ib_wc *wc) +{ + return cq->device->poll_cq(cq, num_entries, wc); +} + +/** + * ib_peek_cq - Returns the number of unreaped completions currently + * on the specified CQ. + * @cq: The CQ to peek. + * @wc_cnt: A minimum number of unreaped completions to check for. 
+ * + * If the number of unreaped completions is greater than or equal to wc_cnt, + * this function returns wc_cnt, otherwise, it returns the actual number of + * unreaped completions. + */ +int ib_peek_cq(struct ib_cq *cq, int wc_cnt); + +/** + * ib_req_notify_cq - Request completion notification on a CQ. + * @cq: The CQ to generate an event for. + * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will + * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP, + * notification will occur on the next completion. + */ +static inline int ib_req_notify_cq(struct ib_cq *cq, + enum ib_cq_notify cq_notify) +{ + return cq->device->req_notify_cq(cq, cq_notify); +} + +/** + * ib_req_ncomp_notif - Request completion notification when there are + * at least the specified number of unreaped completions on the CQ. + * @cq: The CQ to generate an event for. + * @wc_cnt: The number of unreaped completions that should be on the + * CQ before an event is generated. + */ +static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) +{ + return cq->device->req_ncomp_notif ? + cq->device->req_ncomp_notif(cq, wc_cnt) : + -ENOSYS; +} + +/** + * ib_get_dma_mr - Returns a memory region for system memory that is + * usable for DMA. + * @pd: The protection domain associated with the memory region. + * @mr_access_flags: Specifies the memory access rights. + */ +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +/** + * ib_reg_phys_mr - Prepares a virtually addressed memory region for use + * by an HCA. + * @pd: The protection domain associated assigned to the registered region. + * @phys_buf_array: Specifies a list of physical buffers to use in the + * memory region. + * @num_phys_buf: Specifies the size of the phys_buf_array. + * @mr_access_flags: Specifies the memory access rights. + * @iova_start: The offset of the region's starting I/O virtual address. + */ +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_rereg_phys_mr - Modifies the attributes of an existing memory region. + * Conceptually, this call performs the functions deregister memory region + * followed by register physical memory region. Where possible, + * resources are reused instead of deallocated and reallocated. + * @mr: The memory region to modify. + * @mr_rereg_mask: A bit-mask used to indicate which of the following + * properties of the memory region are being modified. + * @pd: If %IB_MR_REREG_PD is set in mr_rereg_mask, this field specifies + * the new protection domain to associated with the memory region, + * otherwise, this parameter is ignored. + * @phys_buf_array: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies a list of physical buffers to use in the new + * translation, otherwise, this parameter is ignored. + * @num_phys_buf: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies the size of the phys_buf_array, otherwise, this + * parameter is ignored. + * @mr_access_flags: If %IB_MR_REREG_ACCESS is set in mr_rereg_mask, this + * field specifies the new memory access rights, otherwise, this + * parameter is ignored. + * @iova_start: The offset of the region's starting I/O virtual address. 
+ */ +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_query_mr - Retrieves information about a specific memory region. + * @mr: The memory region to retrieve information about. + * @mr_attr: The attributes of the specified memory region. + */ +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +/** + * ib_dereg_mr - Deregisters a memory region and removes it from the + * HCA translation table. + * @mr: The memory region to deregister. + */ +int ib_dereg_mr(struct ib_mr *mr); + +/** + * ib_alloc_mw - Allocates a memory window. + * @pd: The protection domain associated with the memory window. + */ +struct ib_mw *ib_alloc_mw(struct ib_pd *pd); + +/** + * ib_bind_mw - Posts a work request to the send queue of the specified + * QP, which binds the memory window to the given address range and + * remote access attributes. + * @qp: QP to post the bind work request on. + * @mw: The memory window to bind. + * @mw_bind: Specifies information about the memory window, including + * its address range, remote access rights, and associated memory region. + */ +static inline int ib_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + /* XXX reference counting in corresponding MR? */ + return mw->device->bind_mw ? + mw->device->bind_mw(qp, mw, mw_bind) : + -ENOSYS; +} + +/** + * ib_dealloc_mw - Deallocates a memory window. + * @mw: The memory window to deallocate. + */ +int ib_dealloc_mw(struct ib_mw *mw); + +/** + * ib_alloc_fmr - Allocates a unmapped fast memory region. + * @pd: The protection domain associated with the unmapped region. + * @mr_access_flags: Specifies the memory access rights. + * @fmr_attr: Attributes of the unmapped region. + * + * A fast memory region must be mapped before it can be used as part of + * a work request. + */ +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +/** + * ib_map_phys_fmr - Maps a list of physical pages to a fast memory region. + * @fmr: The fast memory region to associate with the pages. + * @page_list: An array of physical pages to map to the fast memory region. + * @list_len: The number of pages in page_list. + * @iova: The I/O virtual address to use with the mapped region. + */ +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova) +{ + return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova); +} + +/** + * ib_unmap_fmr - Removes the mapping from a list of fast memory regions. + * @fmr_list: A linked list of fast memory regions to unmap. + */ +int ib_unmap_fmr(struct list_head *fmr_list); + +/** + * ib_dealloc_fmr - Deallocates a fast memory region. + * @fmr: The fast memory region to deallocate. + */ +int ib_dealloc_fmr(struct ib_fmr *fmr); + +/** + * ib_attach_mcast - Attaches the specified QP to a multicast group. + * @qp: QP to attach to the multicast group. The QP must be type + * IB_QPT_UD. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + * + * In order to send and receive multicast packets, subnet + * administration must have created the multicast group and configured + * the fabric appropriately. The port associated with the specified + * QP must also be a member of the multicast group. 
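+ *
+ * Illustrative sketch (the gid/lid variables are placeholders; how they
+ * are obtained from subnet administration is outside the scope of this
+ * file): once the group exists, a UD QP joins and later leaves with:
+ *
+ *        ret = ib_attach_mcast(qp, &mcast_gid, mcast_lid);
+ *        ...
+ *        ret = ib_detach_mcast(qp, &mcast_gid, mcast_lid);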
+ */ +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +/** + * ib_detach_mcast - Detaches the specified QP from a multicast group. + * @qp: QP to detach from the multicast group. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + */ +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +#endif /* IB_VERBS_H */ From roland at topspin.com Sun Dec 19 22:14:50 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:50 -0800 Subject: [openib-general] [PATCH][v4][2/24] Add core InfiniBand support In-Reply-To: <200412192214.PJDxbje135ozyhtX@topspin.com> Message-ID: <200412192214.uWyITZT30vxqMJuO@topspin.com> Add implementation of core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-19 22:04:11.935940222 -0800 @@ -0,0 +1,10 @@ +menu "InfiniBand support" + +config INFINIBAND + tristate "InfiniBand support" + ---help--- + Core support for InfiniBand (IB). Make sure to also select + any protocols you wish to use as well as drivers for your + InfiniBand hardware. + +endmenu --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Makefile 2004-12-19 22:04:11.960936538 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND) += core/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-19 22:04:11.985932855 -0800 @@ -0,0 +1,6 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND) += ib_core.o + +ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ + device.o fmr_pool.o cache.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/cache.c 2004-12-19 22:04:12.287888357 -0800 @@ -0,0 +1,328 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "core_priv.h" + +struct ib_pkey_cache { + int table_len; + u16 table[0]; +}; + +struct ib_gid_cache { + int table_len; + union ib_gid table[0]; +}; + +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; + +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} + +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; +} + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid) +{ + struct ib_gid_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.gid_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_gid_get); + +int ib_cached_pkey_get(struct ib_device *device, + u8 port, + int index, + u16 *pkey) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_get); + +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int i; + int ret = -ENOENT; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; + } + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_find); + +static void ib_cache_update(struct ib_device *device, + u8 port) +{ + struct ib_port_attr *tprops = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; + int i; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return; + + ret = ib_query_port(device, port, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", + ret, device->name); + goto err; + } + + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; + + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + for (i = 0; i < gid_cache->table_len; ++i) { + ret 
= ib_query_gid(device, port, i, gid_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + write_lock_irq(&device->cache.lock); + + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; + + device->cache.pkey_cache[port - start_port(device)] = pkey_cache; + device->cache.gid_cache [port - start_port(device)] = gid_cache; + + write_unlock_irq(&device->cache.lock); + + kfree(old_pkey_cache); + kfree(old_gid_cache); + kfree(tprops); + return; + +err: + kfree(pkey_cache); + kfree(gid_cache); + kfree(tprops); +} + +static void ib_cache_task(void *work_ptr) +{ + struct ib_update_work *work = work_ptr; + + ib_cache_update(work->device, work->port_num); + kfree(work); +} + +static void ib_cache_event(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct ib_update_work *work; + + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (work) { + INIT_WORK(&work->work, ib_cache_task, work); + work->device = event->device; + work->port_num = event->element.port_num; + schedule_work(&work->work); + } + } +} + +void ib_cache_setup_one(struct ib_device *device) +{ + int p; + + rwlock_init(&device->cache.lock); + + device->cache.pkey_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + device->cache.gid_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache) { + printk(KERN_WARNING "Couldn't allocate cache " + "for %s\n", device->name); + goto err; + } + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + device->cache.pkey_cache[p] = NULL; + device->cache.gid_cache [p] = NULL; + ib_cache_update(device, p + start_port(device)); + } + + INIT_IB_EVENT_HANDLER(&device->cache.event_handler, + device, ib_cache_event); + if (ib_register_event_handler(&device->cache.event_handler)) + goto err_cache; + + return; + +err_cache: + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + +err: + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +void ib_cache_cleanup_one(struct ib_device *device) +{ + int p; + + ib_unregister_event_handler(&device->cache.event_handler); + flush_scheduled_work(); + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +struct ib_client cache_client = { + .name = "cache", + .add = ib_cache_setup_one, + .remove = ib_cache_cleanup_one +}; + +int __init ib_cache_setup(void) +{ + return ib_register_client(&cache_client); +} + +void __exit ib_cache_cleanup(void) +{ + ib_unregister_client(&cache_client); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/core_priv.h 2004-12-19 22:04:12.312884674 -0800 @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: core_priv.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef _CORE_PRIV_H +#define _CORE_PRIV_H + +#include +#include + +#include + +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); + +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + +int ib_cache_setup(void); +void ib_cache_cleanup(void); + +#endif /* _CORE_PRIV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/device.c 2004-12-19 22:04:12.238895577 -0800 @@ -0,0 +1,614 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("core kernel InfiniBand API"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + +static LIST_HEAD(device_list); +static LIST_HEAD(client_list); + +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. + */ +static DECLARE_MUTEX(device_sem); + +static int ib_device_check_mandatory(struct ib_device *device) +{ +#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } + static const struct { + size_t offset; + char *name; + } mandatory_table[] = { + IB_MANDATORY_FUNC(query_device), + IB_MANDATORY_FUNC(query_port), + IB_MANDATORY_FUNC(query_pkey), + IB_MANDATORY_FUNC(query_gid), + IB_MANDATORY_FUNC(alloc_pd), + IB_MANDATORY_FUNC(dealloc_pd), + IB_MANDATORY_FUNC(create_ah), + IB_MANDATORY_FUNC(destroy_ah), + IB_MANDATORY_FUNC(create_qp), + IB_MANDATORY_FUNC(modify_qp), + IB_MANDATORY_FUNC(destroy_qp), + IB_MANDATORY_FUNC(post_send), + IB_MANDATORY_FUNC(post_recv), + IB_MANDATORY_FUNC(create_cq), + IB_MANDATORY_FUNC(destroy_cq), + IB_MANDATORY_FUNC(poll_cq), + IB_MANDATORY_FUNC(req_notify_cq), + IB_MANDATORY_FUNC(get_dma_mr), + IB_MANDATORY_FUNC(dereg_mr) + }; + int i; + + for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { + if (!*(void **) ((void *) device + mandatory_table[i].offset)) { + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); + return -EINVAL; + } + } + + return 0; +} + +static struct ib_device *__ib_device_get_by_name(const char *name) +{ + struct ib_device *device; + + list_for_each_entry(device, &device_list, core_list) + if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) + return device; + + return NULL; +} + + +static int alloc_name(char *name) +{ + long *inuse; + char buf[IB_DEVICE_NAME_MAX]; + struct ib_device *device; + int i; + + inuse = (long *) get_zeroed_page(GFP_KERNEL); + if (!inuse) + return -ENOMEM; + + list_for_each_entry(device, &device_list, core_list) { + if (!sscanf(device->name, name, &i)) + continue; + if (i < 0 || i >= PAGE_SIZE * 8) + continue; + snprintf(buf, sizeof buf, name, i); + if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) + set_bit(i, inuse); + } + + i = find_first_zero_bit(inuse, PAGE_SIZE * 8); + free_page((unsigned long) inuse); + snprintf(buf, sizeof buf, name, i); + + if (__ib_device_get_by_name(buf)) + return -ENFILE; + + strlcpy(name, buf, IB_DEVICE_NAME_MAX); + return 0; +} + +/** + * ib_alloc_device - allocate an IB device struct + * @size:size of structure to allocate + * + * Low-level drivers should use ib_alloc_device() to allocate &struct + * ib_device. @size is the size of the structure to be allocated, + * including any private data used by the low-level driver. + * ib_dealloc_device() must be used to free structures allocated with + * ib_alloc_device(). 
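+ *
+ * A minimal sketch (illustrative; "struct my_hca" stands in for a
+ * low-level driver's private structure whose first member is a
+ * struct ib_device):
+ *
+ *        struct my_hca *hca;
+ *
+ *        hca = (struct my_hca *) ib_alloc_device(sizeof *hca);
+ *        if (!hca)
+ *                return -ENOMEM;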
+ */ +struct ib_device *ib_alloc_device(size_t size) +{ + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +/** + * ib_dealloc_device - free an IB device struct + * @device:structure to free + * + * Free a structure allocated with ib_alloc_device(). + */ +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_unregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + +/** + * ib_register_device - Register an IB device with IB core + * @device:Device to register + * + * Low-level drivers use ib_register_device() to register their + * devices with the IB core. All registered clients will receive a + * callback for each device that is added. @device must be allocated + * with ib_alloc_device(). + */ +int ib_register_device(struct ib_device *device) +{ + int ret; + + down(&device_sem); + + if (strchr(device->name, '%')) { + ret = alloc_name(device->name); + if (ret) + goto out; + } + + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + + INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); + + ret = ib_device_register_sysfs(device); + if (ret) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out; + } + + list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + + { + struct ib_client *client; + + list_for_each_entry(client, &client_list, list) + if (client->add && !add_client_context(device, client)) + client->add(device); + } + + out: + up(&device_sem); + return ret; +} +EXPORT_SYMBOL(ib_register_device); + +/** + * ib_unregister_device - Unregister an IB device + * @device:Device to unregister + * + * Unregister an IB device. All clients will receive a remove callback. 
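+ *
+ * Sketch of a typical removal path (placeholder names; assumes the
+ * driver embeds its struct ib_device in a private "hca" structure):
+ *
+ *        ib_unregister_device(&hca->ib_dev);
+ *        ib_dealloc_device(&hca->ib_dev);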
+ */ +void ib_unregister_device(struct ib_device *device) +{ + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + device->reg_state = IB_DEV_UNREGISTERED; +} +EXPORT_SYMBOL(ib_unregister_device); + +/** + * ib_register_client - Register an IB client + * @client:Client to register + * + * Upper level users of the IB drivers can use ib_register_client() to + * register callbacks for IB device addition and removal. When an IB + * device is added, each registered client's add method will be called + * (in the order the clients were registered), and when a device is + * removed, each client's remove method will be called (in the reverse + * order that clients were registered). In addition, when + * ib_register_client() is called, the client will receive an add + * callback for all devices already registered. + */ +int ib_register_client(struct ib_client *client) +{ + struct ib_device *device; + + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add && !add_client_context(device, client)) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +/** + * ib_unregister_client - Unregister an IB client + * @client:Client to unregister + * + * Upper level users use ib_unregister_client() to remove their client + * registration. When ib_unregister_client() is called, the client + * will receive a remove callback for each IB device still registered. + */ +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + } + list_del(&client->list); + + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +/** + * ib_get_client_data - Get IB client context + * @device:Device to get context for + * @client:Client to get context for + * + * ib_get_client_data() returns client context set with + * ib_set_client_data(). + */ +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +/** + * ib_set_client_data - Get IB client context + * @device:Device to set context for + * @client:Client to set context for + * @data:Context to set + * + * ib_set_client_data() sets client context that can be retrieved with + * ib_get_client_data(). 
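+ *
+ * Sketch (placeholder names): a client's add callback can allocate
+ * per-device state and make it retrievable later:
+ *
+ *        static void my_add_one(struct ib_device *device)
+ *        {
+ *                struct my_state *st = kmalloc(sizeof *st, GFP_KERNEL);
+ *
+ *                if (st)
+ *                        ib_set_client_data(device, &my_client, st);
+ *        }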
+ */ +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + goto out; + } + + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); + +out: + spin_unlock_irqrestore(&device->client_data_lock, flags); +} +EXPORT_SYMBOL(ib_set_client_data); + +/** + * ib_register_event_handler - Register an IB event handler + * @event_handler:Handler to register + * + * ib_register_event_handler() registers an event handler that will be + * called back when asynchronous IB events occur (as defined in + * chapter 11 of the InfiniBand Architecture Specification). This + * callback may occur in interrupt context. + */ +int ib_register_event_handler (struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +/** + * ib_unregister_event_handler - Unregister an event handler + * @event_handler:Handler to unregister + * + * Unregister an event handler registered with + * ib_register_event_handler(). + */ +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +/** + * ib_dispatch_event - Dispatch an asynchronous event + * @event:Event to dispatch + * + * Low-level drivers must call ib_dispatch_event() to dispatch the + * event to all registered event handlers when an asynchronous event + * occurs. + */ +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + +/** + * ib_query_device - Query IB device attributes + * @device:Device to query + * @device_attr:Device attributes + * + * ib_query_device() returns the attributes of a device through the + * @device_attr pointer. + */ +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr) +{ + return device->query_device(device, device_attr); +} +EXPORT_SYMBOL(ib_query_device); + +/** + * ib_query_port - Query IB port attributes + * @device:Device to query + * @port_num:Port number to query + * @port_attr:Port attributes + * + * ib_query_port() returns the attributes of a port through the + * @port_attr pointer. 
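+ *
+ * For example (sketch; "attr" is a local variable, not part of the
+ * interface), a consumer can test whether port 1 is usable:
+ *
+ *        struct ib_port_attr attr;
+ *
+ *        if (!ib_query_port(device, 1, &attr) &&
+ *            attr.state == IB_PORT_ACTIVE)
+ *                ...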
+ */ +int ib_query_port(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr) +{ + return device->query_port(device, port_num, port_attr); +} +EXPORT_SYMBOL(ib_query_port); + +/** + * ib_query_gid - Get GID table entry + * @device:Device to query + * @port_num:Port number to query + * @index:GID table index to query + * @gid:Returned GID + * + * ib_query_gid() fetches the specified GID table entry. + */ +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid) +{ + return device->query_gid(device, port_num, index, gid); +} +EXPORT_SYMBOL(ib_query_gid); + +/** + * ib_query_pkey - Get P_Key table entry + * @device:Device to query + * @port_num:Port number to query + * @index:P_Key table index to query + * @pkey:Returned P_Key + * + * ib_query_pkey() fetches the specified P_Key table entry. + */ +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey) +{ + return device->query_pkey(device, port_num, index, pkey); +} +EXPORT_SYMBOL(ib_query_pkey); + +/** + * ib_modify_device - Change IB device attributes + * @device:Device to modify + * @device_modify_mask:Mask of attributes to change + * @device_modify:New attribute values + * + * ib_modify_device() changes a device's attributes as specified by + * the @device_modify_mask and @device_modify structure. + */ +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device(device, device_modify_mask, + device_modify); +} +EXPORT_SYMBOL(ib_modify_device); + +/** + * ib_modify_port - Modifies the attributes for the specified port. + * @device: The device to modify. + * @port_num: The number of the port to modify. + * @port_modify_mask: Mask used to specify which attributes of the port + * to change. + * @port_modify: New attribute values for the port. + * + * ib_modify_port() changes a port's attributes as specified by the + * @port_modify_mask and @port_modify structure. + */ +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port(device, port_num, port_modify_mask, + port_modify); +} +EXPORT_SYMBOL(ib_modify_port); + +static int __init ib_core_init(void) +{ + int ret; + + ret = ib_sysfs_setup(); + if (ret) + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + + return ret; +} + +static void __exit ib_core_cleanup(void) +{ + ib_cache_cleanup(); + ib_sysfs_cleanup(); +} + +module_init(ib_core_init); +module_exit(ib_core_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/fmr_pool.c 2004-12-19 22:04:12.262892041 -0800 @@ -0,0 +1,507 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: fmr_pool.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +enum { + IB_FMR_MAX_REMAPS = 32, + + IB_FMR_HASH_BITS = 8, + IB_FMR_HASH_SIZE = 1 << IB_FMR_HASH_BITS, + IB_FMR_HASH_MASK = IB_FMR_HASH_SIZE - 1 +}; + +/* + * If an FMR is not in use, then the list member will point to either + * its pool's free_list (if the FMR can be mapped again; that is, + * remap_count < IB_FMR_MAX_REMAPS) or its pool's dirty_list (if the + * FMR needs to be unmapped before being remapped). In either of + * these cases it is a bug if the ref_count is not 0. In other words, + * if ref_count is > 0, then the list member must not be linked into + * either free_list or dirty_list. + * + * The cache_node member is used to link the FMR into a cache bucket + * (if caching is enabled). This is independent of the reference + * count of the FMR. When a valid FMR is released, its ref_count is + * decremented, and if ref_count reaches 0, the FMR is placed in + * either free_list or dirty_list as appropriate. However, it is not + * removed from the cache and may be "revived" if a call to + * ib_fmr_register_physical() occurs before the FMR is remapped. In + * this case we just increment the ref_count and remove the FMR from + * free_list/dirty_list. + * + * Before we remap an FMR from free_list, we remove it from the cache + * (to prevent another user from obtaining a stale FMR). When an FMR + * is released, we add it to the tail of the free list, so that our + * cache eviction policy is "least recently used." + * + * All manipulation of ref_count, list and cache_node is protected by + * pool_lock to maintain consistency. 
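+ *
+ * To summarize the lifecycle described above:
+ *
+ *        in use:     ref_count > 0, not linked on free_list or dirty_list
+ *        released:   ref_count == 0, tail of free_list if it can still be
+ *                    remapped, dirty_list if it must be unmapped first
+ *        cache_node: kept hashed until the FMR is actually remapped, so a
+ *                    released FMR can be revived straight from the cache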
+ */ + +struct ib_fmr_pool { + spinlock_t pool_lock; + + int pool_size; + int max_pages; + int dirty_watermark; + int dirty_len; + struct list_head free_list; + struct list_head dirty_list; + struct hlist_head *cache_bucket; + + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + + struct task_struct *thread; + + atomic_t req_ser; + atomic_t flush_ser; + + wait_queue_head_t force_wait; +}; + +static inline u32 ib_fmr_hash(u64 first_page) +{ + return jhash_2words((u32) first_page, + (u32) (first_page >> 32), + 0); +} + +/* Caller must hold pool_lock */ +static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool, + u64 *page_list, + int page_list_len, + u64 io_virtual_address) +{ + struct hlist_head *bucket; + struct ib_pool_fmr *fmr; + struct hlist_node *pos; + + if (!pool->cache_bucket) + return NULL; + + bucket = pool->cache_bucket + ib_fmr_hash(*page_list); + + hlist_for_each_entry(fmr, pos, bucket, cache_node) + if (io_virtual_address == fmr->io_virtual_address && + page_list_len == fmr->page_list_len && + !memcmp(page_list, fmr->page_list, + page_list_len * sizeof *page_list)) + return fmr; + + return NULL; +} + +static void ib_fmr_batch_release(struct ib_fmr_pool *pool) +{ + int ret; + struct ib_pool_fmr *fmr; + LIST_HEAD(unmap_list); + LIST_HEAD(fmr_list); + + spin_lock_irq(&pool->pool_lock); + + list_for_each_entry(fmr, &pool->dirty_list, list) { + hlist_del_init(&fmr->cache_node); + fmr->remap_count = 0; + list_add_tail(&fmr->fmr->list, &fmr_list); + +#ifdef DEBUG + if (fmr->ref_count !=0) { + printk(KERN_WARNING "Unmapping FMR 0x%08x with ref count %d", + fmr, fmr->ref_count); + } +#endif + } + + list_splice(&pool->dirty_list, &unmap_list); + INIT_LIST_HEAD(&pool->dirty_list); + pool->dirty_len = 0; + + spin_unlock_irq(&pool->pool_lock); + + if (list_empty(&unmap_list)) { + return; + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "ib_unmap_fmr returned %d", ret); + + spin_lock_irq(&pool->pool_lock); + list_splice(&unmap_list, &pool->free_list); + spin_unlock_irq(&pool->pool_lock); +} + +static int ib_fmr_cleanup_thread(void *pool_ptr) +{ + struct ib_fmr_pool *pool = pool_ptr; + + do { + if (pool->dirty_len >= pool->dirty_watermark || + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) < 0) { + ib_fmr_batch_release(pool); + + atomic_inc(&pool->flush_ser); + wake_up_interruptible(&pool->force_wait); + + if (pool->flush_function) + pool->flush_function(pool, pool->flush_arg); + } + + set_current_state(TASK_INTERRUPTIBLE); + if (pool->dirty_len < pool->dirty_watermark && + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) >= 0 && + !kthread_should_stop()) + schedule(); + __set_current_state(TASK_RUNNING); + } while (!kthread_should_stop()); + + return 0; +} + +/** + * ib_create_fmr_pool - Create an FMR pool + * @pd:Protection domain for FMRs + * @params:FMR pool parameters + * + * Create a pool of FMRs. Return value is pointer to new pool or + * error code if creation failed. 
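+ *
+ * Sketch (the numbers are illustrative, and IB_ACCESS_LOCAL_WRITE is just
+ * an example access flag):
+ *
+ *        struct ib_fmr_pool_param params = {
+ *                .pool_size         = 32,
+ *                .dirty_watermark   = 8,
+ *                .max_pages_per_fmr = 64,
+ *                .access            = IB_ACCESS_LOCAL_WRITE,
+ *                .cache             = 1,
+ *        };
+ *
+ *        pool = ib_create_fmr_pool(pd, &params);
+ *        if (IS_ERR(pool))
+ *                return PTR_ERR(pool);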
+ */ +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params) +{ + struct ib_device *device; + struct ib_fmr_pool *pool; + int i; + int ret; + + if (!params) + return ERR_PTR(-EINVAL); + + device = pd->device; + if (!device->alloc_fmr || !device->dealloc_fmr || + !device->map_phys_fmr || !device->unmap_fmr) { + printk(KERN_WARNING "Device %s does not support fast memory regions", + device->name); + return ERR_PTR(-ENOSYS); + } + + pool = kmalloc(sizeof *pool, GFP_KERNEL); + if (!pool) { + printk(KERN_WARNING "couldn't allocate pool struct"); + return ERR_PTR(-ENOMEM); + } + + pool->cache_bucket = NULL; + + pool->flush_function = params->flush_function; + pool->flush_arg = params->flush_arg; + + INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->dirty_list); + + if (params->cache) { + pool->cache_bucket = + kmalloc(IB_FMR_HASH_SIZE * sizeof *pool->cache_bucket, + GFP_KERNEL); + if (!pool->cache_bucket) { + printk(KERN_WARNING "Failed to allocate cache in pool"); + ret = -ENOMEM; + goto out_free_pool; + } + + for (i = 0; i < IB_FMR_HASH_SIZE; ++i) + INIT_HLIST_HEAD(pool->cache_bucket + i); + } + + pool->pool_size = 0; + pool->max_pages = params->max_pages_per_fmr; + pool->dirty_watermark = params->dirty_watermark; + pool->dirty_len = 0; + spin_lock_init(&pool->pool_lock); + atomic_set(&pool->req_ser, 0); + atomic_set(&pool->flush_ser, 0); + init_waitqueue_head(&pool->force_wait); + + pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); + if (IS_ERR(pool->thread)) { + printk(KERN_WARNING "couldn't start cleanup thread"); + ret = PTR_ERR(pool->thread); + goto out_free_pool; + } + + { + struct ib_pool_fmr *fmr; + struct ib_fmr_attr attr = { + .max_pages = params->max_pages_per_fmr, + .max_maps = IB_FMR_MAX_REMAPS, + .page_size = PAGE_SHIFT + }; + + for (i = 0; i < params->pool_size; ++i) { + fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), + GFP_KERNEL); + if (!fmr) { + printk(KERN_WARNING "failed to allocate fmr struct " + "for FMR %d", i); + goto out_fail; + } + + fmr->pool = pool; + fmr->remap_count = 0; + fmr->ref_count = 0; + INIT_HLIST_NODE(&fmr->cache_node); + + fmr->fmr = ib_alloc_fmr(pd, params->access, &attr); + if (IS_ERR(fmr->fmr)) { + printk(KERN_WARNING "fmr_create failed for FMR %d", i); + kfree(fmr); + goto out_fail; + } + + list_add_tail(&fmr->list, &pool->free_list); + ++pool->pool_size; + } + } + + return pool; + + out_free_pool: + kfree(pool->cache_bucket); + kfree(pool); + + return ERR_PTR(ret); + + out_fail: + ib_destroy_fmr_pool(pool); + + return ERR_PTR(-ENOMEM); +} +EXPORT_SYMBOL(ib_create_fmr_pool); + +/** + * ib_destroy_fmr_pool - Free FMR pool + * @pool:FMR pool to free + * + * Destroy an FMR pool and free all associated resources. + */ +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool) +{ + struct ib_pool_fmr *fmr; + struct ib_pool_fmr *tmp; + int i; + + kthread_stop(pool->thread); + ib_fmr_batch_release(pool); + + i = 0; + list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) { + ib_dealloc_fmr(fmr->fmr); + list_del(&fmr->list); + kfree(fmr); + ++i; + } + + if (i < pool->pool_size) + printk(KERN_WARNING "pool still has %d regions registered", + pool->pool_size - i); + + kfree(pool->cache_bucket); + kfree(pool); + + return 0; +} +EXPORT_SYMBOL(ib_destroy_fmr_pool); + +/** + * ib_flush_fmr_pool - Invalidate all unmapped FMRs + * @pool:FMR pool to flush + * + * Ensure that all unmapped FMRs are fully invalidated. 
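+ *
+ * For example (sketch), a consumer about to reuse or free memory that was
+ * covered by pool FMRs can unmap them and then force the invalidate to
+ * finish:
+ *
+ *        ib_fmr_pool_unmap(pool_fmr);
+ *        ib_flush_fmr_pool(pool);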
+ */ +int ib_flush_fmr_pool(struct ib_fmr_pool *pool) +{ + int serial; + + atomic_inc(&pool->req_ser); + /* + * It's OK if someone else bumps req_ser again here -- we'll + * just wait a little longer. + */ + serial = atomic_read(&pool->req_ser); + + wake_up_process(pool->thread); + + if (wait_event_interruptible(pool->force_wait, + atomic_read(&pool->flush_ser) - + atomic_read(&pool->req_ser) >= 0)) + return -EINTR; + + return 0; +} +EXPORT_SYMBOL(ib_flush_fmr_pool); + +/** + * ib_fmr_pool_map_phys - + * @pool:FMR pool to allocate FMR from + * @page_list:List of pages to map + * @list_len:Number of pages in @page_list + * @io_virtual_address:I/O virtual address for new FMR + * + * Map an FMR from an FMR pool. + */ +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address) +{ + struct ib_fmr_pool *pool = pool_handle; + struct ib_pool_fmr *fmr; + unsigned long flags; + int result; + + if (list_len < 1 || list_len > pool->max_pages) + return ERR_PTR(-EINVAL); + + spin_lock_irqsave(&pool->pool_lock, flags); + fmr = ib_fmr_cache_lookup(pool, + page_list, + list_len, + *io_virtual_address); + if (fmr) { + /* found in cache */ + ++fmr->ref_count; + if (fmr->ref_count == 1) { + list_del(&fmr->list); + } + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return fmr; + } + + if (list_empty(&pool->free_list)) { + spin_unlock_irqrestore(&pool->pool_lock, flags); + return ERR_PTR(-EAGAIN); + } + + fmr = list_entry(pool->free_list.next, struct ib_pool_fmr, list); + list_del(&fmr->list); + hlist_del_init(&fmr->cache_node); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + result = ib_map_phys_fmr(fmr->fmr, page_list, list_len, + *io_virtual_address); + + if (result) { + spin_lock_irqsave(&pool->pool_lock, flags); + list_add(&fmr->list, &pool->free_list); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + printk(KERN_WARNING "fmr_map returns %d", + result); + + return ERR_PTR(result); + } + + ++fmr->remap_count; + fmr->ref_count = 1; + + if (pool->cache_bucket) { + fmr->io_virtual_address = *io_virtual_address; + fmr->page_list_len = list_len; + memcpy(fmr->page_list, page_list, list_len * sizeof(*page_list)); + + spin_lock_irqsave(&pool->pool_lock, flags); + hlist_add_head(&fmr->cache_node, + pool->cache_bucket + ib_fmr_hash(fmr->page_list[0])); + spin_unlock_irqrestore(&pool->pool_lock, flags); + } + + return fmr; +} +EXPORT_SYMBOL(ib_fmr_pool_map_phys); + +/** + * ib_fmr_pool_unmap - Unmap FMR + * @fmr:FMR to unmap + * + * Unmap an FMR. The FMR mapping may remain valid until the FMR is + * reused (or until ib_flush_fmr_pool() is called). + */ +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr) +{ + struct ib_fmr_pool *pool; + unsigned long flags; + + pool = fmr->pool; + + spin_lock_irqsave(&pool->pool_lock, flags); + + --fmr->ref_count; + if (!fmr->ref_count) { + if (fmr->remap_count < IB_FMR_MAX_REMAPS) { + list_add_tail(&fmr->list, &pool->free_list); + } else { + list_add_tail(&fmr->list, &pool->dirty_list); + ++pool->dirty_len; + wake_up_process(pool->thread); + } + } + +#ifdef DEBUG + if (fmr->ref_count < 0) + printk(KERN_WARNING "FMR %p has ref count %d < 0", + fmr, fmr->ref_count); +#endif + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_fmr_pool_unmap); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/packer.c 2004-12-19 22:04:12.010929171 -0800 @@ -0,0 +1,201 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: packer.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +static u64 value_read(int offset, int size, void *structure) +{ + switch (size) { + case 1: return *(u8 *) (structure + offset); + case 2: return be16_to_cpup((__be16 *) (structure + offset)); + case 4: return be32_to_cpup((__be32 *) (structure + offset)); + case 8: return be64_to_cpup((__be64 *) (structure + offset)); + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + return 0; + } +} + +/** + * ib_pack - Pack a structure into a buffer + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @structure:Structure to pack from + * @buf:Buffer to pack into + * + * ib_pack() packs a list of structure fields into a buffer, + * controlled by the array of fields in @desc. 
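+ *
+ * Sketch (a single hypothetical 16-bit field of "struct my_hdr"; the
+ * offsets are examples, not a real wire format):
+ *
+ *        static const struct ib_field my_hdr_desc[] = {
+ *                { .struct_offset_bytes = offsetof(struct my_hdr, dlid),
+ *                  .struct_size_bytes   = 2,
+ *                  .offset_words        = 0,
+ *                  .offset_bits         = 16,
+ *                  .size_bits           = 16,
+ *                  .field_name          = "dlid" }
+ *        };
+ *
+ *        ib_pack(my_hdr_desc, ARRAY_SIZE(my_hdr_desc), &hdr, buf);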
+ */ +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + __be32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be32(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be32 *) buf + desc[i].offset_words; + *addr = (*addr & ~mask) | (cpu_to_be32(val) & mask); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + __be64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be64(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be64 *) ((__be32 *) buf + desc[i].offset_words); + *addr = (*addr & ~mask) | (cpu_to_be64(val) & mask); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + if (desc[i].struct_size_bytes) + memcpy(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + structure + desc[i].struct_offset_bytes, + desc[i].size_bits / 8); + else + memset(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + 0, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_pack); + +static void value_write(int offset, int size, u64 val, void *structure) +{ + switch (size * 8) { + case 8: *( u8 *) (structure + offset) = val; break; + case 16: *(__be16 *) (structure + offset) = cpu_to_be16(val); break; + case 32: *(__be32 *) (structure + offset) = cpu_to_be32(val); break; + case 64: *(__be64 *) (structure + offset) = cpu_to_be64(val); break; + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + } +} + +/** + * ib_unpack - Unpack a buffer into a structure + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @buf:Buffer to unpack from + * @structure:Structure to unpack into + * + * ib_pack() unpacks a list of structure fields from a buffer, + * controlled by the array of fields in @desc. 
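+ *
+ * With the descriptor array from the ib_pack() sketch above, the reverse
+ * direction is simply:
+ *
+ *        ib_unpack(my_hdr_desc, ARRAY_SIZE(my_hdr_desc), buf, &hdr);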
+ */ +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (!desc[i].struct_size_bytes) + continue; + + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + u32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be32 *) buf + desc[i].offset_words; + val = (be32_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + u64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be64 *) buf + desc[i].offset_words; + val = (be64_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + memcpy(structure + desc[i].struct_offset_bytes, + buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sysfs.c 2004-12-19 22:04:12.213899261 -0800 @@ -0,0 +1,695 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include "core_priv.h" + +#include + +struct ib_port { + struct kobject kobj; + struct ib_device *ibdev; + struct attribute_group gid_group; + struct attribute **gid_attr; + struct attribute_group pkey_group; + struct attribute **pkey_attr; + u8 port_num; +}; + +struct port_attribute { + struct attribute attr; + ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf); + ssize_t (*store)(struct ib_port *, struct port_attribute *, + const char *buf, size_t count); +}; + +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) + +#define PORT_ATTR_RO(_name) \ +struct port_attribute port_attr_##_name = __ATTR_RO(_name) + +struct port_table_attribute { + struct port_attribute attr; + int index; +}; + +static ssize_t port_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + + if (!port_attr->show) + return 0; + + return port_attr->show(p, port_attr, buf); +} + +static struct sysfs_ops port_sysfs_ops = { + .show = port_attr_show +}; + +static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + static const char *state_name[] = { + [IB_PORT_NOP] = "NOP", + [IB_PORT_DOWN] = "DOWN", + [IB_PORT_INIT] = "INIT", + [IB_PORT_ARMED] = "ARMED", + [IB_PORT_ACTIVE] = "ACTIVE", + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" + }; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && attr.state <= ARRAY_SIZE(state_name) ? 
+ state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +#define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + 
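+        /*
+         * Build a Performance Management GET(PortCounters) MAD for this
+         * port, hand it to the driver's process_mad method, and pull the
+         * requested counter out of the reply using the offset and width
+         * encoded in the attribute's index.
+         */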
+ in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + if (!in_mad || !out_mad) { + ret = -ENOMEM; + goto out; + } + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + + in_mad->data[41] = p->port_num; /* PortSelect field */ + + if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff, + in_mad, out_mad) & + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { + ret = -EINVAL; + goto out; + } + + switch (width) { + case 4: + ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >> + (offset % 4)) & 0xf); + break; + case 8: + ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]); + break; + case 16: + ret = sprintf(buf, "%u\n", + be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8))); + break; + case 32: + ret = sprintf(buf, "%u\n", + be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8))); + break; + default: + ret = 0; + } + +out: + kfree(in_mad); + kfree(out_mad); + + return ret; +} + +static PORT_PMA_ATTR(symbol_error , 0, 16, 32); +static PORT_PMA_ATTR(link_error_recovery , 1, 8, 48); +static PORT_PMA_ATTR(link_downed , 2, 8, 56); +static PORT_PMA_ATTR(port_rcv_errors , 3, 16, 64); +static PORT_PMA_ATTR(port_rcv_remote_physical_errors, 4, 16, 80); +static PORT_PMA_ATTR(port_rcv_switch_relay_errors , 5, 16, 96); +static PORT_PMA_ATTR(port_xmit_discards , 6, 16, 112); +static PORT_PMA_ATTR(port_xmit_constraint_errors , 7, 8, 128); +static PORT_PMA_ATTR(port_rcv_constraint_errors , 8, 8, 136); +static PORT_PMA_ATTR(local_link_integrity_errors , 9, 4, 152); +static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10, 4, 156); +static PORT_PMA_ATTR(VL15_dropped , 11, 16, 176); +static PORT_PMA_ATTR(port_xmit_data , 12, 32, 192); +static PORT_PMA_ATTR(port_rcv_data , 13, 32, 224); +static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); +static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); + +static struct attribute *pma_attrs[] = { + &port_pma_attr_symbol_error.attr.attr, + &port_pma_attr_link_error_recovery.attr.attr, + &port_pma_attr_link_downed.attr.attr, + &port_pma_attr_port_rcv_errors.attr.attr, + &port_pma_attr_port_rcv_remote_physical_errors.attr.attr, + &port_pma_attr_port_rcv_switch_relay_errors.attr.attr, + &port_pma_attr_port_xmit_discards.attr.attr, + &port_pma_attr_port_xmit_constraint_errors.attr.attr, + &port_pma_attr_port_rcv_constraint_errors.attr.attr, + &port_pma_attr_local_link_integrity_errors.attr.attr, + &port_pma_attr_excessive_buffer_overrun_errors.attr.attr, + &port_pma_attr_VL15_dropped.attr.attr, + &port_pma_attr_port_xmit_data.attr.attr, + &port_pma_attr_port_rcv_data.attr.attr, + &port_pma_attr_port_xmit_packets.attr.attr, + &port_pma_attr_port_rcv_packets.attr.attr, + NULL +}; + +static struct attribute_group pma_group = { + .name = "counters", + .attrs = pma_attrs +}; + +static void ib_port_release(struct kobject *kobj) +{ + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + struct attribute *a; + int i; + + for (i = 0; (a = p->gid_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + for (i = 0; (a = p->pkey_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + kfree(p->gid_attr); + kfree(p); +} + +static struct kobj_type port_type = { + .release = ib_port_release, + .sysfs_ops = &port_sysfs_ops, + 
.default_attrs = port_default_attrs +}; + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *cdev, char **envp, + int num_envp, char *buf, int size) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + int i = 0, len = 0; + + if (add_hotplug_env_var(envp, num_envp, &i, buf, size, &len, + "NAME=%s", dev->name)) + return -ENOMEM; + + /* + * It might be nice to pass the node GUID to hotplug, but + * right now the only way to get it is to query the device + * provider, and this can crash during device removal because + * we are will be running after driver removal has started. + * We could add a node_guid field to struct ib_device, or we + * could just let the hotplug script read the node GUID from + * sysfs when devices are added. + */ + + envp[i] = NULL; + return 0; +} + +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = sysfs_create_group(&p->kobj, &pma_group); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + 
kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + int ret; + int i; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + INIT_LIST_HEAD(&device->port_list); + + ret = class_device_register(class_dev); + if (ret) + goto err; + + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; +} + +void ib_device_unregister_sysfs(struct ib_device *device) +{ + struct kobject *p, *t; + struct ib_port *port; + 
+ list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/ud_header.c 2004-12-19 22:04:12.034925635 -0800 @@ -0,0 +1,365 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ud_header.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include + +#define STRUCT_FIELD(header, field) \ + .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ + .struct_size_bytes = sizeof ((struct ib_unpacked_ ## header *) 0)->field, \ + .field_name = #header ":" #field + +static const struct ib_field lrh_table[] = { + { STRUCT_FIELD(lrh, virtual_lane), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, link_version), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, service_level), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 4 }, + { RESERVED, + .offset_words = 0, + .offset_bits = 12, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, link_next_header), + .offset_words = 0, + .offset_bits = 14, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, destination_lid), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 5 }, + { STRUCT_FIELD(lrh, packet_length), + .offset_words = 1, + .offset_bits = 5, + .size_bits = 11 }, + { STRUCT_FIELD(lrh, source_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 } +}; + +static const struct ib_field grh_table[] = { + { STRUCT_FIELD(grh, ip_version), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(grh, traffic_class), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 8 }, + { STRUCT_FIELD(grh, flow_label), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 20 }, + { STRUCT_FIELD(grh, payload_length), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(grh, next_header), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 8 }, + { STRUCT_FIELD(grh, hop_limit), + .offset_words = 1, + .offset_bits = 24, + .size_bits = 8 }, + { STRUCT_FIELD(grh, source_gid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { STRUCT_FIELD(grh, destination_gid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 } +}; + +static const struct ib_field bth_table[] = { + { STRUCT_FIELD(bth, opcode), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, solicited_event), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 1 }, + { STRUCT_FIELD(bth, mig_req), + .offset_words = 0, + .offset_bits = 9, + .size_bits = 1 }, + { STRUCT_FIELD(bth, pad_count), + .offset_words = 0, + .offset_bits = 10, + .size_bits = 2 }, + { STRUCT_FIELD(bth, transport_header_version), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 4 }, + { STRUCT_FIELD(bth, pkey), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, destination_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 }, + { STRUCT_FIELD(bth, ack_req), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 2, + .offset_bits = 1, + .size_bits = 7 }, + { STRUCT_FIELD(bth, psn), + .offset_words = 2, + .offset_bits = 8, + .size_bits = 24 } +}; + +static const struct ib_field deth_table[] = { + { STRUCT_FIELD(deth, qkey), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(deth, source_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 } +}; + +/** + * ib_ud_header_init - Initialize UD header structure + * @payload_bytes:Length of packet payload + 
* @grh_present:GRH flag (if non-zero, GRH will be included) + * @header:Structure to initialize + * + * ib_ud_header_init() initializes the lrh.link_version, lrh.link_next_header, + * lrh.packet_length, grh.ip_version, grh.payload_length, + * grh.next_header, bth.opcode, bth.pad_count and + * bth.transport_header_version fields of a &struct ib_ud_header given + * the payload length and whether a GRH will be included. + */ +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) { + header_len += IB_GRH_BYTES; + } + + header->lrh.link_version = 0; + header->lrh.link_next_header = + grh_present ? IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL; + header->lrh.packet_length = (IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) / 4; /* round up */ + + header->grh_present = grh_present; + if (grh_present) { + header->lrh.packet_length += IB_GRH_BYTES / 4; + + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + cpu_to_be16s(&header->lrh.packet_length); + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_ud_header_init); + +/** + * ib_ud_header_pack - Pack UD header struct into wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() packs the UD header structure @header into wire + * format in the buffer @buf. + */ +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(lrh_table, ARRAY_SIZE(lrh_table), + &header->lrh, buf); + len += IB_LRH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(ib_ud_header_pack); + +/** + * ib_ud_header_unpack - Unpack UD header struct from wire format + * @header:UD header struct + * @buf:Buffer to unpack from + * + * ib_ud_header_unpack() unpacks the UD header structure @header from wire + * format in the buffer @buf.
+ */ +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header) +{ + ib_unpack(lrh_table, ARRAY_SIZE(lrh_table), + buf, &header->lrh); + buf += IB_LRH_BYTES; + + if (header->lrh.link_version != 0) { + printk(KERN_WARNING "Invalid LRH.link_version %d\n", + header->lrh.link_version); + return -EINVAL; + } + + switch (header->lrh.link_next_header) { + case IB_LNH_IBA_LOCAL: + header->grh_present = 0; + break; + + case IB_LNH_IBA_GLOBAL: + header->grh_present = 1; + ib_unpack(grh_table, ARRAY_SIZE(grh_table), + buf, &header->grh); + buf += IB_GRH_BYTES; + + if (header->grh.ip_version != 6) { + printk(KERN_WARNING "Invalid GRH.ip_version %d\n", + header->grh.ip_version); + return -EINVAL; + } + if (header->grh.next_header != 0x1b) { + printk(KERN_WARNING "Invalid GRH.next_header 0x%02x\n", + header->grh.next_header); + return -EINVAL; + } + break; + + default: + printk(KERN_WARNING "Invalid LRH.link_next_header %d\n", + header->lrh.link_next_header); + return -EINVAL; + } + + ib_unpack(bth_table, ARRAY_SIZE(bth_table), + buf, &header->bth); + buf += IB_BTH_BYTES; + + switch (header->bth.opcode) { + case IB_OPCODE_UD_SEND_ONLY: + header->immediate_present = 0; + break; + case IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE: + header->immediate_present = 1; + break; + default: + printk(KERN_WARNING "Invalid BTH.opcode 0x%02x\n", + header->bth.opcode); + return -EINVAL; + } + + if (header->bth.transport_header_version != 0) { + printk(KERN_WARNING "Invalid BTH.transport_header_version %d\n", + header->bth.transport_header_version); + return -EINVAL; + } + + ib_unpack(deth_table, ARRAY_SIZE(deth_table), + buf, &header->deth); + buf += IB_DETH_BYTES; + + if (header->immediate_present) + memcpy(&header->immediate_data, buf, sizeof header->immediate_data); + + return 0; +} +EXPORT_SYMBOL(ib_ud_header_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/verbs.c 2004-12-19 22:04:12.171905449 -0800 @@ -0,0 +1,433 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include + +/* Protection domains */ + +struct ib_pd *ib_alloc_pd(struct ib_device *device) +{ + struct ib_pd *pd; + + pd = device->alloc_pd(device); + + if (!IS_ERR(pd)) { + pd->device = device; + atomic_set(&pd->usecnt, 0); + } + + return pd; +} +EXPORT_SYMBOL(ib_alloc_pd); + +int ib_dealloc_pd(struct ib_pd *pd) +{ + if (atomic_read(&pd->usecnt)) + return -EBUSY; + + return pd->device->dealloc_pd(pd); +} +EXPORT_SYMBOL(ib_dealloc_pd); + +/* Address handles */ + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct ib_ah *ah; + + ah = pd->device->create_ah(pd, ah_attr); + + if (!IS_ERR(ah)) { + ah->device = pd->device; + ah->pd = pd; + atomic_inc(&pd->usecnt); + } + + return ah; +} +EXPORT_SYMBOL(ib_create_ah); + +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_ah); + +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? + ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_ah); + +int ib_destroy_ah(struct ib_ah *ah) +{ + struct ib_pd *pd; + int ret; + + pd = ah->pd; + ret = ah->device->destroy_ah(ah); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_destroy_ah); + +/* Queue pairs */ + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct ib_qp *qp; + + qp = pd->device->create_qp(pd, qp_init_attr); + + if (!IS_ERR(qp)) { + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; + atomic_inc(&pd->usecnt); + atomic_inc(&qp_init_attr->send_cq->usecnt); + atomic_inc(&qp_init_attr->recv_cq->usecnt); + if (qp_init_attr->srq) + atomic_inc(&qp_init_attr->srq->usecnt); + } + + return qp; +} +EXPORT_SYMBOL(ib_create_qp); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask) +{ + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); +} +EXPORT_SYMBOL(ib_modify_qp); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? 
+ qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_qp); + +int ib_destroy_qp(struct ib_qp *qp) +{ + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_srq *srq; + int ret; + + pd = qp->pd; + scq = qp->send_cq; + rcq = qp->recv_cq; + srq = qp->srq; + + ret = qp->device->destroy_qp(qp); + if (!ret) { + atomic_dec(&pd->usecnt); + atomic_dec(&scq->usecnt); + atomic_dec(&rcq->usecnt); + if (srq) + atomic_dec(&srq->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_destroy_qp); + +/* Completion queues */ + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe) +{ + struct ib_cq *cq; + + cq = device->create_cq(device, cqe); + + if (!IS_ERR(cq)) { + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->cq_context = cq_context; + atomic_set(&cq->usecnt, 0); + } + + return cq; +} +EXPORT_SYMBOL(ib_create_cq); + +int ib_destroy_cq(struct ib_cq *cq) +{ + if (atomic_read(&cq->usecnt)) + return -EBUSY; + + return cq->device->destroy_cq(cq); +} +EXPORT_SYMBOL(ib_destroy_cq); + +int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + int ret; + + if (!cq->device->resize_cq) + return -ENOSYS; + + ret = cq->device->resize_cq(cq, &cqe); + if (!ret) + cq->cqe = cqe; + + return ret; +} +EXPORT_SYMBOL(ib_resize_cq); + +/* Memory regions */ + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *mr; + + mr = pd->device->get_dma_mr(pd, mr_access_flags); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_get_dma_mr); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *mr; + + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_reg_phys_mr); + +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_pd *old_pd; + int ret; + + if (!mr->device->rereg_phys_mr) + return -ENOSYS; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + old_pd = mr->pd; + + ret = mr->device->rereg_phys_mr(mr, mr_rereg_mask, pd, + phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!ret && (mr_rereg_mask & IB_MR_REREG_PD)) { + atomic_dec(&old_pd->usecnt); + atomic_inc(&pd->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_rereg_phys_mr); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? 
+ mr->device->query_mr(mr, mr_attr) : -ENOSYS; +} +EXPORT_SYMBOL(ib_query_mr); + +int ib_dereg_mr(struct ib_mr *mr) +{ + struct ib_pd *pd; + int ret; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + pd = mr->pd; + ret = mr->device->dereg_mr(mr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dereg_mr); + +/* Memory windows */ + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *mw; + + if (!pd->device->alloc_mw) + return ERR_PTR(-ENOSYS); + + mw = pd->device->alloc_mw(pd); + if (!IS_ERR(mw)) { + mw->device = pd->device; + mw->pd = pd; + atomic_inc(&pd->usecnt); + } + + return mw; +} +EXPORT_SYMBOL(ib_alloc_mw); + +int ib_dealloc_mw(struct ib_mw *mw) +{ + struct ib_pd *pd; + int ret; + + pd = mw->pd; + ret = mw->device->dealloc_mw(mw); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_mw); + +/* "Fast" memory regions */ + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *fmr; + + if (!pd->device->alloc_fmr) + return ERR_PTR(-ENOSYS); + + fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr); + if (!IS_ERR(fmr)) { + fmr->device = pd->device; + fmr->pd = pd; + atomic_inc(&pd->usecnt); + } + + return fmr; +} +EXPORT_SYMBOL(ib_alloc_fmr); + +int ib_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + + if (list_empty(fmr_list)) + return 0; + + fmr = list_entry(fmr_list->next, struct ib_fmr, list); + return fmr->device->unmap_fmr(fmr_list); +} +EXPORT_SYMBOL(ib_unmap_fmr); + +int ib_dealloc_fmr(struct ib_fmr *fmr) +{ + struct ib_pd *pd; + int ret; + + pd = fmr->pd; + ret = fmr->device->dealloc_fmr(fmr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_fmr); + +/* Multicast groups */ + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_attach_mcast); + +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->detach_mcast ? + qp->device->detach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_detach_mcast); From roland at topspin.com Sun Dec 19 22:14:53 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:53 -0800 Subject: [openib-general] [PATCH][v4][3/24] Hook up drivers/infiniband In-Reply-To: <200412192214.uWyITZT30vxqMJuO@topspin.com> Message-ID: <200412192214.qTsUPCeCMCi5dvjR@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. 
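With these hooks in place the new directory builds like any other driver subtree. As a purely illustrative sketch (only CONFIG_INFINIBAND is confirmed by the Makefile hunk below; the other symbols are assumed to be provided by drivers/infiniband/Kconfig elsewhere in this series), a .config enabling a modular build might contain:

CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_IPOIB=m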
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/Kconfig 2004-12-19 21:09:51.000000000 -0800 +++ linux-bk/drivers/Kconfig 2004-12-19 22:04:12.806811886 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu --- linux-bk.orig/drivers/Makefile 2004-12-19 21:10:06.000000000 -0800 +++ linux-bk/drivers/Makefile 2004-12-19 22:04:12.806811886 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Sun Dec 19 22:14:53 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:53 -0800 Subject: [openib-general] [PATCH][v4][4/24] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <200412192214.qTsUPCeCMCi5dvjR@topspin.com> Message-ID: <200412192214.Lv59G3vg3yqeEo9J@topspin.com> Add public headers for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_mad.h 2004-12-19 22:04:13.037777849 -0800 @@ -0,0 +1,404 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30 +#define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 __constant_htonl(1) +#define IB_QP1_QKEY 0x80010000 + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +} __attribute__ ((packed)); + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +} __attribute__ ((packed)); + +struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +} __attribute__ ((packed)); + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +} __attribute__ ((packed)); + +struct ib_vendor_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 reserved; + u8 oui[3]; + u8 data[216]; +} __attribute__ ((packed)); + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent: MAD agent that sent the MAD. + * @mad_send_wc: Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ +typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent: MAD agent requesting the received MAD. + * @mad_recv_wc: Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. All data buffers given + * to registered agents through this routine are owned by the receiving + * client, except for snooping agents. Clients snooping MADs should not + * modify the data referenced by @mad_recv_wc. 
+ */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device: Reference to device registration is on. + * @qp: Reference to QP used for sending and receiving MADs. + * @recv_handler: Callback handler for a received MAD. + * @send_handler: Callback handler for a sent MAD. + * @snoop_handler: Callback handler for snooped sent MADs. + * @context: User-specified context associated with this registration. + * @hi_tid: Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + * @port_num: Port number on which QP is registered + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + ib_mad_snoop_handler snoop_handler; + void *context; + u32 hi_tid; + u8 port_num; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id: Work request identifier associated with the send MAD request. + * @status: Completion status. + * @vendor_err: Optional vendor error information returned with a failed + * request. + */ +struct ib_mad_send_wc { + u64 wr_id; + enum ib_wc_status status; + u32 vendor_err; +}; + +/** + * ib_mad_recv_buf - received MAD buffer information. + * @list: Reference to next data buffer for a received RMPP MAD. + * @grh: References a data buffer containing the global route header. + * The data referenced by this buffer is only valid if the GRH is + * valid. + * @mad: References the start of the received MAD. + */ +struct ib_mad_recv_buf { + struct list_head list; + struct ib_grh *grh; + struct ib_mad *mad; +}; + +/** + * ib_mad_recv_wc - received MAD information. + * @wc: Completion information for the received data. + * @recv_buf: Specifies the location of the received data buffer(s). + * @mad_len: The length of the received MAD, without duplicated headers. + * + * For received responses, the wr_id field of the wc is set to the wr_id + * for the corresponding send request. + */ +struct ib_mad_recv_wc { + struct ib_wc *wc; + struct ib_mad_recv_buf recv_buf; + int mad_len; +}; + +/** + * ib_mad_reg_req - MAD registration request + * @mgmt_class: Indicates which management class of MADs should be received + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version: Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + * @method_mask: The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + */ +struct ib_mad_reg_req { + u8 mgmt_class; + u8 mgmt_class_version; + u8 oui[3]; + DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); +}; + +/** + * ib_register_mad_agent - Register to send/receive MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP to access. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_reg_req: Specifies which unsolicited MADs should be received + * by the caller. This parameter may be NULL if the caller only + * wishes to receive solicited responses. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version.
+ * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +enum ib_mad_snoop_flags { + /*IB_MAD_SNOOP_POSTED_SENDS = 1,*/ + /*IB_MAD_SNOOP_RMPP_SENDS = (1<<1),*/ + IB_MAD_SNOOP_SEND_COMPLETIONS = (1<<2), + /*IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS = (1<<3),*/ + IB_MAD_SNOOP_RECVS = (1<<4) + /*IB_MAD_SNOOP_RMPP_RECVS = (1<<5),*/ + /*IB_MAD_SNOOP_REDIRECTED_QPS = (1<<6)*/ +}; + +/** + * ib_register_mad_snoop - Register to snoop sent and received MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP traffic to snoop. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_snoop_flags: Specifies information where snooping occurs. + * @send_handler: The callback routine invoked for a snooped send. + * @recv_handler: The callback routine invoked for a snooped receive. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_unregister_mad_agent - Unregisters a client from using MAD services. + * @mad_agent: Corresponding MAD registration request to deregister. + * + * After invoking this routine, MAD services are no longer usable by the + * client on the associated QP. + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); + +/** + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client. + * @mad_agent: Specifies the associated registration to post the send to. + * @send_wr: Specifies the information needed to send the MAD(s). + * @bad_send_wr: Specifies the MAD on which an error was encountered. + * + * Sent MADs are not guaranteed to complete in the order that they were posted. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc: Work completion information for a received MAD. + * @buf: User-provided data buffer to receive the coalesced buffers. The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc: Work completion information for a received MAD. + * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_cancel_mad - Cancels an outstanding send MAD operation. 
+ * @mad_agent: Specifies the registration associated with sent MAD. + * @wr_id: Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp: Reference to a QP that requires MAD services. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_mad_post_send. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent: Specifies the registered MAD service using the redirected QP. + * @wc: References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request. + */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_smi.h 2004-12-19 22:04:13.062774166 -0800 @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION __constant_htons(0x8000) + +/* Subnet management attributes */ +#define IB_SMP_ATTR_NOTICE __constant_htons(0x0002) +#define IB_SMP_ATTR_NODE_DESC __constant_htons(0x0010) +#define IB_SMP_ATTR_NODE_INFO __constant_htons(0x0011) +#define IB_SMP_ATTR_SWITCH_INFO __constant_htons(0x0012) +#define IB_SMP_ATTR_GUID_INFO __constant_htons(0x0014) +#define IB_SMP_ATTR_PORT_INFO __constant_htons(0x0015) +#define IB_SMP_ATTR_PKEY_TABLE __constant_htons(0x0016) +#define IB_SMP_ATTR_SL_TO_VL_TABLE __constant_htons(0x0017) +#define IB_SMP_ATTR_VL_ARB_TABLE __constant_htons(0x0018) +#define IB_SMP_ATTR_LINEAR_FORWARD_TABLE __constant_htons(0x0019) +#define IB_SMP_ATTR_RANDOM_FORWARD_TABLE __constant_htons(0x001A) +#define IB_SMP_ATTR_MCAST_FORWARD_TABLE __constant_htons(0x001B) +#define IB_SMP_ATTR_SM_INFO __constant_htons(0x0020) +#define IB_SMP_ATTR_VENDOR_DIAG __constant_htons(0x0030) +#define IB_SMP_ATTR_LED_INFO __constant_htons(0x0031) +#define IB_SMP_ATTR_VENDOR_MASK __constant_htons(0xFF00) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + +#endif /* IB_SMI_H */ From roland at topspin.com Sun Dec 19 22:14:57 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:57 -0800 Subject: [openib-general] [PATCH][v4][6/24] Add InfiniBand MAD (management datagram) support (private headers) In-Reply-To: <200412192214.LiAN0y3IR8YkPxhM@topspin.com> Message-ID: <200412192214.OnqdUjlwc4A94uzk@topspin.com> Add MAD layer private implementation headers. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.h 2004-12-19 22:04:13.613692980 -0800 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern spinlock_t ib_agent_port_list_lock; + +extern int ib_agent_port_open(struct ib_device *device, + int port_num); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent_priv.h 2004-12-19 22:04:13.636689591 -0800 @@ -0,0 +1,64 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#ifndef __IB_AGENT_PRIV_H__ +#define __IB_AGENT_PRIV_H__ + +#include + +#define SPFX "ib_agent: " + +struct ib_agent_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_agent_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *dr_smp_agent; /* DR SM class */ + struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ + struct ib_mr *mr; +}; + +#endif /* __IB_AGENT_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad_priv.h 2004-12-19 22:04:13.660686054 -0800 @@ -0,0 +1,194 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#ifndef __IB_MAD_PRIV_H__ +#define __IB_MAD_PRIV_H__ + +#include +#include +#include +#include +#include + + +#define PFX "ib_mad: " + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ + +/* QP and CQ parameters */ +#define IB_MAD_QP_SEND_SIZE 128 +#define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_SEND_REQ_MAX_SG 2 +#define IB_MAD_RECV_REQ_MAX_SG 1 + +#define IB_MAD_SEND_Q_PSN 0 + +/* Registration table sizes */ +#define MAX_MGMT_CLASS 80 +#define MAX_MGMT_VERSION 8 +#define MAX_MGMT_OUI 8 +#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 + +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; +}; + +struct ib_mad_private_header { + struct ib_mad_list_head mad_list; + struct ib_mad_recv_wc recv_wc; + DECLARE_PCI_UNMAP_ADDR(mapping) +} __attribute__ ((packed)); + +struct ib_mad_private { + struct ib_mad_private_header header; + struct ib_grh grh; + union { + struct ib_mad mad; + struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; + } mad; +} __attribute__ ((packed)); + +struct ib_mad_agent_private { + struct list_head agent_list; + struct ib_mad_agent agent; + struct ib_mad_reg_req *reg_req; + struct ib_mad_qp_info *qp_info; + + spinlock_t lock; + struct list_head send_list; + struct list_head wait_list; + struct work_struct timed_work; + unsigned long timeout; + struct list_head local_list; + struct work_struct local_work; + + atomic_t refcount; + wait_queue_head_t wait; + u8 rmpp_version; +}; + +struct ib_mad_snoop_private { + struct ib_mad_agent agent; + struct ib_mad_qp_info *qp_info; + int snoop_index; + int mad_snoop_flags; + atomic_t refcount; + wait_queue_head_t wait; +}; + +struct ib_mad_send_wr_private { + struct ib_mad_list_head mad_list; + struct list_head agent_list; + struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; + unsigned long timeout; + int retry; + int refcount; + enum ib_wc_status status; +}; + +struct ib_mad_local_private { + struct list_head completion_list; + struct ib_mad_private *mad_priv; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; +}; + +struct ib_mad_mgmt_method_table { + struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; +}; + +struct ib_mad_mgmt_class_table { + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; +}; + +struct ib_mad_mgmt_vendor_class { + u8 oui[MAX_MGMT_OUI][3]; + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_OUI]; +}; + +struct ib_mad_mgmt_vendor_class_table { + struct ib_mad_mgmt_vendor_class *vendor_class[MAX_MGMT_VENDOR_RANGE2]; +}; + +struct ib_mad_mgmt_version_table { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; +}; + +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + int max_active; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + struct list_head overflow_list; + spinlock_t snoop_lock; + struct ib_mad_snoop_private **snoop_table; + int snoop_table_size; + atomic_t snoop_count; +}; + +struct ib_mad_port_private { + struct list_head port_list; + struct ib_device *device; + int port_num; + struct ib_cq *cq; + struct ib_pd *pd; + struct ib_mr *mr; + + spinlock_t reg_lock; + struct 
ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; + struct list_head agent_list; + struct workqueue_struct *wq; + struct work_struct work; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; +}; + +#endif /* __IB_MAD_PRIV_H__ */ From roland at topspin.com Sun Dec 19 22:14:55 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:55 -0800 Subject: [openib-general] [PATCH][v4][5/24] Add InfiniBand MAD (management datagram) support In-Reply-To: <200412192214.Lv59G3vg3yqeEo9J@topspin.com> Message-ID: <200412192214.LiAN0y3IR8YkPxhM@topspin.com> Add support for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. This is required for an SM (subnet manager) to discover and configure the host, since the SM's query MADs must be passed to the local SMA (subnet management agent). In addition, this support is used by upper level protocols to send queries to and receive responses from the SM. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-19 22:04:11.985932855 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-19 22:04:13.291740424 -0800 @@ -1,6 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o + +ib_mad-y := mad.o smi.o agent.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.c 2004-12-19 22:04:13.339733352 -0800 @@ -0,0 +1,399 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include + +#include + +#include + +#include "smi.h" +#include "agent_priv.h" +#include "mad_priv.h" + + +spinlock_t ib_agent_port_list_lock; +static LIST_HEAD(ib_agent_port_list); + +extern kmem_cache_t *ib_mad_cache; + + +/* + * Caller must hold ib_agent_port_list_lock + */ +static inline struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->dr_smp_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if ((entry->dr_smp_agent == mad_agent) || + (entry->lr_smp_agent == mad_agent) || + (entry->perf_mgmt_agent == mad_agent)) + return entry; + } + } + return NULL; +} + +static inline struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + entry = __ib_get_agent_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return entry; +} + +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, + struct ib_mad_private *mad_priv, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_agent_send_wr *agent_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); + if (!agent_send_wr) + goto out; + agent_send_wr->mad = mad_priv; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad_priv->mad, + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad_priv->mad); + gather_list.lkey = (*port_priv->mr).lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + ah_attr.ah_flags = 0; /* No GRH */ + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? 
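+ * For now sgid_index 0 (the port's default GID) is used; the reply's DGID is copied from the SGID of the received GRH, and the hop limit, flow label, and traffic class are echoed from the request.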
*/ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(ah_attr.grh.dgid)); + } + } + + agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(agent_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(agent_send_wr); + goto out; + } + + send_wr.wr.ud.ah = agent_send_wr->ah; + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + } else { /* for SMPs */ + send_wr.wr.ud.pkey_index = 0; + send_wr.wr.ud.remote_qkey = 0; + } + send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)agent_send_wr; + + pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); + } else { + list_add_tail(&agent_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); + return 1; + } + + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad.mad.mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; + } + + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); +} + +static void agent_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_agent_port_private *port_priv; + struct ib_agent_send_wr *agent_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_agent_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(agent_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(agent_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); + kfree(agent_send_wr); +} + +int ib_agent_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct 
ib_agent_port_private *port_priv; + struct ib_mad_reg_req reg_req; + unsigned long flags; + + /* First, check if port already open for SMI */ + port_priv = ib_get_agent_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + /* Obtain MAD agent for directed route SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; + reg_req.mgmt_class_version = 1; + + port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + + if (IS_ERR(port_priv->dr_smp_agent)) { + ret = PTR_ERR(port_priv->dr_smp_agent); + goto error2; + } + + /* Obtain MAD agent for LID routed SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->lr_smp_agent)) { + ret = PTR_ERR(port_priv->lr_smp_agent); + goto error3; + } + + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->perf_mgmt_agent)) { + ret = PTR_ERR(port_priv->perf_mgmt_agent); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_agent_port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return 0; + +error5: + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); +error4: + ib_unregister_mad_agent(port_priv->lr_smp_agent); +error3: + ib_unregister_mad_agent(port_priv->dr_smp_agent); +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_agent_port_close(struct ib_device *device, int port_num) +{ + struct ib_agent_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + port_priv = __ib_get_agent_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + ib_dereg_mr(port_priv->mr); + + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); + ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->dr_smp_agent); + kfree(port_priv); + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad.c 2004-12-19 22:04:13.315736888 -0800 @@ -0,0 +1,2632 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include + +#include + +#include "mad_priv.h" +#include "smi.h" +#include "agent.h" + + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("kernel IB MAD API"); +MODULE_AUTHOR("Hal Rosenstock"); +MODULE_AUTHOR("Sean Hefty"); + + +kmem_cache_t *ib_mad_cache; +static struct list_head ib_mad_port_list; +static u32 ib_mad_client_id = 0; + +/* Port list lock */ +static spinlock_t ib_mad_port_list_lock; + + +/* Forward declarations */ +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); +static void timeout_sends(void *data); +static void local_completions(void *data); +static int solicited_mad(struct ib_mad *mad); +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class); +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv); + +/* + * Returns a ib_mad_port_private structure or NULL for a device/port + * Assumes ib_mad_port_list_lock is being held + */ +static inline struct ib_mad_port_private * +__ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + + list_for_each_entry(entry, &ib_mad_port_list, port_list) { + if (entry->device == device && entry->port_num == port_num) + return entry; + } + return NULL; +} + +/* + * Wrapper function to return a ib_mad_port_private structure or NULL + * for a device/port + */ +static inline struct ib_mad_port_private * +ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + entry = __ib_get_mad_port(device, port_num); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + return entry; +} + +static inline u8 convert_mgmt_class(u8 mgmt_class) +{ + /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ + return 
mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? + 0 : mgmt_class; +} + +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + +static int vendor_class_index(u8 mgmt_class) +{ + return mgmt_class - IB_MGMT_CLASS_VENDOR_RANGE2_START; +} + +static int is_vendor_class(u8 mgmt_class) +{ + if ((mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START) || + (mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + return 1; +} + +static int is_vendor_oui(char *oui) +{ + if (oui[0] || oui[1] || oui[2]) + return 1; + return 0; +} + +static int is_vendor_method_in_use( + struct ib_mad_mgmt_vendor_class *vendor_class, + struct ib_mad_reg_req *mad_reg_req) +{ + struct ib_mad_mgmt_method_table *method; + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) { + if (!memcmp(vendor_class->oui[i], mad_reg_req->oui, 3)) { + method = vendor_class->method_table[i]; + if (method) { + if (method_in_use(&method, mad_reg_req)) + return 1; + else + break; + } + } + } + return 0; +} + +/* + * ib_register_mad_agent - Register to send/receive MADs + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret = ERR_PTR(-EINVAL); + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_reg_req *reg_req = NULL; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_mad_mgmt_method_table *method; + int ret2, qpn; + unsigned long flags; + u8 mgmt_class, vclass; + + /* Validate parameters */ + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) + goto error1; + + if (rmpp_version) + goto error1; /* XXX: until RMPP implemented */ + + /* Validate MAD registration request if supplied */ + if (mad_reg_req) { + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) + goto error1; + if (!recv_handler) + goto error1; + if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { + /* + * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only + * one in this range currently allowed + */ + if (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + goto error1; + } else if (mad_reg_req->mgmt_class == 0) { + /* + * Class 0 is reserved in IBA and is used for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ + goto error1; + } else if (is_vendor_class(mad_reg_req->mgmt_class)) { + /* + * If class is in "new" vendor range, + * ensure supplied OUI is not zero + */ + if (!is_vendor_oui(mad_reg_req->oui)) + goto error1; + } + /* Make sure class supplied is consistent with QP type */ + if (qp_type == IB_QPT_SMI) { + if ((mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_LID_ROUTED) && + (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } else { + if ((mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } + } else { + /* No registration request supplied */ + if (!send_handler) + goto error1; + } + + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + /* Allocate structures */ + mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + if 
(!mad_agent_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + if (mad_reg_req) { + reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); + if (!reg_req) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } + /* Make a copy of the MAD registration request */ + memcpy(reg_req, mad_reg_req, sizeof *reg_req); + } + + /* Now, fill in the various structures */ + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; + mad_agent_priv->reg_req = reg_req; + mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_agent_priv->agent.port_num = port_num; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) { + class = port_priv->version[mad_reg_req-> + mgmt_class_version].class; + if (class) { + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, + mad_reg_req)) + goto error3; + } + } + ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, + mgmt_class); + } else { + /* "New" vendor class range */ + vendor = port_priv->version[mad_reg_req-> + mgmt_class_version].vendor; + if (vendor) { + vclass = vendor_class_index(mgmt_class); + vendor_class = vendor->vendor_class[vclass]; + if (vendor_class) { + if (is_vendor_method_in_use( + vendor_class, + mad_reg_req)) + goto error3; + } + } + ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); + } + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; + } + } + + /* Add mad agent into port's agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); + INIT_LIST_HEAD(&mad_agent_priv->local_list); + INIT_WORK(&mad_agent_priv->local_work, local_completions, + mad_agent_priv); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + + return &mad_agent_priv->agent; + +error3: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + kfree(reg_req); +error2: + kfree(mad_agent_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_agent); + +static inline int is_snooping_sends(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (/*IB_MAD_SNOOP_POSTED_SENDS | + IB_MAD_SNOOP_RMPP_SENDS |*/ + IB_MAD_SNOOP_SEND_COMPLETIONS /*| + IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/)); +} + +static inline int is_snooping_recvs(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (IB_MAD_SNOOP_RECVS /*| + IB_MAD_SNOOP_RMPP_RECVS*/)); +} + +static int register_snoop_agent(struct ib_mad_qp_info *qp_info, + struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_snoop_private **new_snoop_table; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + /* Check for empty slot in array. */ + for (i = 0; i < qp_info->snoop_table_size; i++) + if (!qp_info->snoop_table[i]) + break; + + if (i == qp_info->snoop_table_size) { + /* Grow table. 
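+ * The replacement table needs room for snoop_table_size + 1 pointers; in the size expression below, sizeof binds more tightly than '+', so it computes sizeof(mad_snoop_priv) * snoop_table_size + 1 bytes rather than sizeof(mad_snoop_priv) * (snoop_table_size + 1).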
*/ + new_snoop_table = kmalloc(sizeof mad_snoop_priv * + qp_info->snoop_table_size + 1, + GFP_ATOMIC); + if (!new_snoop_table) { + i = -ENOMEM; + goto out; + } + if (qp_info->snoop_table) { + memcpy(new_snoop_table, qp_info->snoop_table, + sizeof mad_snoop_priv * + qp_info->snoop_table_size); + kfree(qp_info->snoop_table); + } + qp_info->snoop_table = new_snoop_table; + qp_info->snoop_table_size++; + } + qp_info->snoop_table[i] = mad_snoop_priv; + atomic_inc(&qp_info->snoop_count); +out: + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + return i; +} + +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_snoop_private *mad_snoop_priv; + int qpn; + + /* Validate parameters */ + if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) || + (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + /* Allocate structures */ + mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + if (!mad_snoop_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + /* Now, fill in the various structures */ + memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); + mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; + mad_snoop_priv->agent.device = device; + mad_snoop_priv->agent.recv_handler = recv_handler; + mad_snoop_priv->agent.snoop_handler = snoop_handler; + mad_snoop_priv->agent.context = context; + mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_snoop_priv->agent.port_num = port_num; + mad_snoop_priv->mad_snoop_flags = mad_snoop_flags; + init_waitqueue_head(&mad_snoop_priv->wait); + mad_snoop_priv->snoop_index = register_snoop_agent( + &port_priv->qp_info[qpn], + mad_snoop_priv); + if (mad_snoop_priv->snoop_index < 0) { + ret = ERR_PTR(mad_snoop_priv->snoop_index); + goto error2; + } + + atomic_set(&mad_snoop_priv->refcount, 1); + return &mad_snoop_priv->agent; + +error2: + kfree(mad_snoop_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_snoop); + +static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + /* Note that we could still be handling received MADs */ + + /* + * Canceling all sends results in dropping received response + * MADs, preventing us from queuing additional work + */ + cancel_mads(mad_agent_priv); + + port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->timed_work); + flush_workqueue(port_priv->wq); + + spin_lock_irqsave(&port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + /* XXX: Cleanup pending RMPP receives for this agent */ + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); +} + +static void unregister_mad_snoop(struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_qp_info *qp_info; + unsigned long flags; + + qp_info = 
mad_snoop_priv->qp_info; + spin_lock_irqsave(&qp_info->snoop_lock, flags); + qp_info->snoop_table[mad_snoop_priv->snoop_index] = NULL; + atomic_dec(&qp_info->snoop_count); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + + atomic_dec(&mad_snoop_priv->refcount); + wait_event(mad_snoop_priv->wait, + !atomic_read(&mad_snoop_priv->refcount)); + + kfree(mad_snoop_priv); +} + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_snoop_private *mad_snoop_priv; + + /* If the TID is zero, the agent can only snoop. */ + if (mad_agent->hi_tid) { + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + unregister_mad_agent(mad_agent_priv); + } else { + mad_snoop_priv = container_of(mad_agent, + struct ib_mad_snoop_private, + agent); + unregister_mad_snoop(mad_snoop_priv); + } + return 0; +} +EXPORT_SYMBOL(ib_unregister_mad_agent); + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void snoop_send(struct ib_mad_qp_info *qp_info, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, + send_wr, mad_send_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +static void snoop_recv(struct ib_mad_qp_info *qp_info, + struct ib_mad_recv_wc *mad_recv_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.recv_handler(&mad_snoop_priv->agent, + mad_recv_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent_private *mad_agent_priv, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret, alloc_flags; + unsigned long flags; + struct ib_mad_local_private *local; + struct ib_mad_private *mad_priv; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + + if 
(!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + ret = -EINVAL; + printk(KERN_ERR PFX "Invalid directed route\n"); + goto out; + } + /* Check to post send on QP or process locally */ + ret = smi_check_local_dr_smp(smp, device, port_num); + if (!ret || !device->process_mad) + goto out; + + if (in_atomic() || irqs_disabled()) + alloc_flags = GFP_ATOMIC; + else + alloc_flags = GFP_KERNEL; + local = kmalloc(sizeof *local, alloc_flags); + if (!local) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); + goto out; + } + local->mad_priv = NULL; + mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + kfree(local); + goto out; + } + ret = device->process_mad(device, 0, port_num, smp->dr_slid, + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) + local->mad_priv = mad_priv; + else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: + kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); + ret = 0; + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); + ret = -EINVAL; + goto out; + } + + local->send_wr = *send_wr; + local->send_wr.sg_list = local->sg_list; + memcpy(local->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + local->send_wr.next = NULL; + local->tid = send_wr->wr.ud.mad_hdr->tid; + local->wr_id = send_wr->wr_id; + /* Reference MAD agent until local completion handled */ + atomic_inc(&mad_agent_priv->refcount); + /* Queue local completion to local list */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&local->completion_list, &mad_agent_priv->local_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->local_work); + ret = 1; +out: + return ret; +} + +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if (qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; + } + return ret; +} + +/* + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client + */ +int ib_post_send_mad(struct 
ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; + + /* Validate supplied parameters */ + if (!bad_send_wr) + goto error1; + + if (!mad_agent || !send_wr) + goto error2; + + if (!mad_agent->send_handler) + goto error2; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + /* Walk list of send WRs and post each on send list */ + while (send_wr) { + unsigned long flags; + struct ib_send_wr *next_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + + /* Validate more parameters */ + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + + if (!send_wr->wr.ud.mad_hdr) { + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", send_wr); + goto error2; + } + + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion + */ + next_send_wr = (struct ib_send_wr *)send_wr->next; + + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent_priv, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ + goto next; + } + + /* Allocate MAD send WR tracking structure */ + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_send_wr) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_send_wr_private\n"); + ret = -ENOMEM; + goto error2; + } + + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; + mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; + mad_send_wr->agent = mad_agent; + /* Timeout will be updated after send completes */ + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. 
+ ud.timeout_ms); + mad_send_wr->retry = 0; + /* One reference for each work request to QP + response */ + mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes */ + atomic_inc(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + ret = ib_send_mad(mad_agent_priv, mad_send_wr); + if (ret) { + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + goto error2; + } +next: + send_wr = next_send_wr; + } + return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return ret; +} +EXPORT_SYMBOL(ib_post_send_mad); + +/* + * ib_free_recv_mad - Returns data buffers used to receive + * a MAD to the access layer + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *entry; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *priv; + + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, header); + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { + /* Free previous receive buffer */ + kmem_cache_free(ib_mad_cache, priv); + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + } + + /* Free last buffer */ + kmem_cache_free(ib_mad_cache, priv); +} +EXPORT_SYMBOL(ib_free_recv_mad); + +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf) +{ + printk(KERN_ERR PFX "ib_coalesce_recv_mad() not implemented yet\n"); +} +EXPORT_SYMBOL(ib_coalesce_recv_mad); + +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + return ERR_PTR(-EINVAL); /* XXX: for now */ +} +EXPORT_SYMBOL(ib_redirect_mad_qp); + +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc) +{ + printk(KERN_ERR PFX "ib_process_mad_wc() not implemented yet\n"); + return 0; +} +EXPORT_SYMBOL(ib_process_mad_wc); + +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req) +{ + int i; + + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + if ((*method)->agent[i]) { + printk(KERN_ERR PFX "Method %d already in use\n", i); + return -EINVAL; + } + } + return 0; +} + +static int allocate_method_table(struct ib_mad_mgmt_method_table **method) +{ + /* Allocate management method table */ + *method = kmalloc(sizeof **method, GFP_ATOMIC); + if (!*method) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); + return -ENOMEM; + } + /* Clear management method table */ + memset(*method, 0, sizeof **method); + + return 0; +} + +/* + * Check to see if there are any methods still in use + */ +static int check_method_table(struct ib_mad_mgmt_method_table *method) +{ + int i; + + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) + if (method->agent[i]) + return 
1; + return 0; +} + +/* + * Check to see if there are any method tables for this class still in use + */ +static int check_class_table(struct ib_mad_mgmt_class_table *class) +{ + int i; + + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; +} + +static int check_vendor_class(struct ib_mad_mgmt_vendor_class *vendor_class) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + if (vendor_class->method_table[i]) + return 1; + return 0; +} + +static int find_vendor_oui(struct ib_mad_mgmt_vendor_class *vendor_class, + char *oui) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + /* Is there matching OUI for this vendor class ? */ + if (!memcmp(vendor_class->oui[i], oui, 3)) + return i; + + return -1; +} + +static int check_vendor_table(struct ib_mad_mgmt_vendor_class_table *vendor) +{ + int i; + + for (i = 0; i < MAX_MGMT_VENDOR_RANGE2; i++) + if (vendor->vendor_class[i]) + return 1; + + return 0; +} + +static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, + struct ib_mad_agent_private *agent) +{ + int i; + + /* Remove any methods for this mad agent */ + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + if (method->agent[i] == agent) { + method->agent[i] = NULL; + } + } +} + +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table **class; + struct ib_mad_mgmt_method_table **method; + int i, ret; + + port_priv = agent_priv->qp_info->port_priv; + class = &port_priv->version[mad_reg_req->mgmt_class_version].class; + if (!*class) { + /* Allocate management class table for "new" class version */ + *class = kmalloc(sizeof **class, GFP_ATOMIC); + if (!*class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; + goto error1; + } + /* Clear management class table */ + memset(*class, 0, sizeof(**class)); + /* Allocate method table for this management class */ + method = &(*class)->method_table[mgmt_class]; + if ((ret = allocate_method_table(method))) + goto error2; + } else { + method = &(*class)->method_table[mgmt_class]; + if (!*method) { + /* Allocate method table for this management class */ + if ((ret = allocate_method_table(method))) + goto error1; + } + } + + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error3; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error3: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; + goto error1; +error2: + kfree(*class); + *class = NULL; +error1: + return ret; +} + +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_vendor_class_table **vendor_table; + struct ib_mad_mgmt_vendor_class_table *vendor = NULL; + struct ib_mad_mgmt_vendor_class *vendor_class = NULL; + struct ib_mad_mgmt_method_table **method; + int i, ret = -ENOMEM; + u8 vclass; + + /* "New" vendor (with OUI) 
class */ + vclass = vendor_class_index(mad_reg_req->mgmt_class); + port_priv = agent_priv->qp_info->port_priv; + vendor_table = &port_priv->version[ + mad_reg_req->mgmt_class_version].vendor; + if (!*vendor_table) { + /* Allocate mgmt vendor class table for "new" class version */ + vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + if (!vendor) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class_table\n"); + goto error1; + } + /* Clear management vendor class table */ + memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; + } + if (!(*vendor_table)->vendor_class[vclass]) { + /* Allocate table for this management vendor class */ + vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + if (!vendor_class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class\n"); + goto error2; + } + memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* Is there matching OUI for this vendor class ? */ + if (!memcmp((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3)) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(!*method); + goto check_in_use; + } + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* OUI slot available ? */ + if (!is_vendor_oui((*vendor_table)->vendor_class[ + vclass]->oui[i])) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(*method); + /* Allocate method table for this OUI */ + if ((ret = allocate_method_table(method))) + goto error3; + memcpy((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3); + goto check_in_use; + } + } + printk(KERN_ERR PFX "All OUI slots in use\n"); + goto error3; + +check_in_use: + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error4; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error4: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; +error3: + if (vendor_class) { + (*vendor_table)->vendor_class[vclass] = NULL; + kfree(vendor_class); + } +error2: + if (vendor) { + *vendor_table = NULL; + kfree(vendor); + } +error1: + return ret; +} + +static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + int index; + u8 mgmt_class; + + /* + * Was MAD registration request supplied + * with original registration ? 
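+ * An agent registered without a reg_req was never added to the class or vendor method tables, so there is nothing to remove.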
+ */ + if (!agent_priv->reg_req) { + goto out; + } + + port_priv = agent_priv->qp_info->port_priv; + class = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].class; + if (!class) + goto vendor_check; + + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* Now, check to see if there are any methods still in use */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + class->method_table[mgmt_class] = NULL; + /* Any management classes left ? */ + if (!check_class_table(class)) { + /* If not, release management class table */ + kfree(class); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version].class = NULL; + } + } + } + +vendor_check: + vendor = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) + goto out; + + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); + vendor_class = vendor->vendor_class[mgmt_class]; + if (vendor_class) { + index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* + * Now, check to see if there are + * any methods still in use + */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + vendor_class->method_table[index] = NULL; + memset(vendor_class->oui[index], 0, 3); + /* Any OUIs left ? */ + if (!check_vendor_class(vendor_class)) { + /* If not, release vendor class table */ + kfree(vendor_class); + vendor->vendor_class[mgmt_class] = NULL; + /* Any other vendor classes left ? */ + if (!check_vendor_table(vendor)) { + kfree(vendor); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version]. + vendor = NULL; + } + } + } + } + } + +out: + return; +} + +static int response_mad(struct ib_mad *mad) +{ + /* Trap represses are responses although response bit is reset */ + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); +} + +static int solicited_mad(struct ib_mad *mad) +{ + /* CM MADs are never solicited */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { + return 0; + } + + /* XXX: Determine whether MAD is using RMPP */ + + /* Not using RMPP */ + /* Is this MAD a response to a previous MAD ? */ + return response_mad(mad); +} + +static struct ib_mad_agent_private * +find_mad_agent(struct ib_mad_port_private *port_priv, + struct ib_mad *mad, + int solicited) +{ + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ + if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
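+ * Each agent gets a unique hi_tid at registration time, and a requester is expected to place it in the upper half of every TID it sends, e.g. (illustrative only) mad_hdr->tid = cpu_to_be64(((u64)agent->hi_tid << 32) | seq), where seq is any client-chosen low 32 bits; the lookup below matches the response back to that agent.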
+ */ + hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { + if (entry->agent.hi_tid == hi_tid) { + mad_agent = entry; + break; + } + } + } else { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_vendor_mad *vendor_mad; + int index; + + /* + * Routing is based on version, class, and method + * For "newer" vendor MADs, also based on OUI + */ + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) + goto out; + if (!is_vendor_class(mad->mad_hdr.mgmt_class)) { + class = port_priv->version[ + mad->mad_hdr.class_version].class; + if (!class) + goto out; + method = class->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (method) + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } else { + vendor = port_priv->version[ + mad->mad_hdr.class_version].vendor; + if (!vendor) + goto out; + vendor_class = vendor->vendor_class[vendor_class_index( + mad->mad_hdr.mgmt_class)]; + if (!vendor_class) + goto out; + /* Find matching OUI */ + vendor_mad = (struct ib_vendor_mad *)mad; + index = find_vendor_oui(vendor_class, vendor_mad->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + } + } + + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + printk(KERN_NOTICE PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; + } + } +out: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + return mad_agent; +} + +static int validate_mad(struct ib_mad *mad, u32 qp_num) +{ + int valid = 0; + + /* Make sure MAD base version is understood */ + if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { + printk(KERN_ERR PFX "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + goto out; + } + + /* Filter SMI packets sent to other than QP0 */ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { + if (qp_num == 0) + valid = 1; + } else { + /* Filter GSI packets sent to QP0 */ + if (qp_num != 0) + valid = 1; + } + +out: + return valid; +} + +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet + */ +static struct ib_mad_private * +reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->tid == tid) + return mad_send_wr; + } + + /* + * It's possible to receive the response before we've + * been notified that the send has completed + */ + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + /* Verify request has not been canceled */ + return (mad_send_wr->status == IB_WC_SUCCESS) ? 
+ mad_send_wr : NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + + /* Complete corresponding request */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + ib_free_recv_mad(&recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + /* Timeout = 0 means that we won't wait for a response */ + mad_send_wr->timeout = 0; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Defined behavior is to complete response before request */ + recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + +static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_qp_info *qp_info; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv, *response; + struct ib_mad_list_head *mad_list; + struct ib_mad_agent_private *mad_agent; + int solicited; + + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); + dma_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + + /* Setup MAD receive work completion from "normal" work completion */ + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); + recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; + recv->header.recv_wc.recv_buf.grh = &recv->grh; + + if (atomic_read(&qp_info->snoop_count)) + snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); + + /* Validate MAD */ + if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) + goto out; + + if (recv->mad.mad.mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) + goto out; + if (!smi_check_forward_dr_smp(&recv->mad.smp)) + goto local; + if (!smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(&recv->mad.smp, 
+ port_priv->device, + port_priv->port_num)) + goto out; + } + +local: + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + int ret; + + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* + * Is it better to assume that + * it wouldn't be processed ? + */ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + &recv->mad.mad, + &response->mad.mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; + if (ret & IB_MAD_RESULT_REPLY) { + /* Send response */ + if (!agent_send(response, &recv->grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; + goto out; + } + } + } + + /* Determine corresponding MAD agent for incoming receive MAD */ + solicited = solicited_mad(&recv->mad.mad); + mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); + if (mad_agent) { + ib_mad_complete_recv(mad_agent, recv, solicited); + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv() + */ + recv = NULL; + } + +out: + /* Post another receive request for this QP */ + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); +} + +static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long delay; + + if (list_empty(&mad_agent_priv->wait_list)) { + cancel_delayed_work(&mad_agent_priv->timed_work); + } else { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { + mad_agent_priv->timeout = mad_send_wr->timeout; + cancel_delayed_work(&mad_agent_priv->timed_work); + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + } + } +} + +static void wait_for_response(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr ) +{ + struct ib_mad_send_wr_private *temp_mad_send_wr; + struct list_head *list_item; + unsigned long delay; + + list_del(&mad_send_wr->agent_list); + + delay = mad_send_wr->timeout; + mad_send_wr->timeout += jiffies; + + list_for_each_prev(list_item, &mad_agent_priv->wait_list) { + temp_mad_send_wr = list_entry(list_item, + struct ib_mad_send_wr_private, + agent_list); + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) + break; + } + list_add(&mad_send_wr->agent_list, list_item); + + /* Reschedule a work item if we have a shorter timeout */ + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { + cancel_delayed_work(&mad_agent_priv->timed_work); + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->timed_work, delay); + } +} + +/* + * Process a send work completion + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = mad_send_wc->status; + 
mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + + if (--mad_send_wr->refcount > 0) { + if (mad_send_wr->refcount == 1 && mad_send_wr->timeout && + mad_send_wr->status == IB_WC_SUCCESS) { + wait_for_response(mad_agent_priv, mad_send_wr); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + return; + } + + /* Remove send from MAD agent and notify client of completion */ + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); + + /* Release reference on agent taken when sending */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + +static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send */ + wc->wr_id = mad_send_wr->wr_id; + if (atomic_read(&qp_info->snoop_count)) + snoop_send(qp_info, &mad_send_wr->send_wr, + (struct ib_mad_send_wc *)wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } +} + +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup 
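+ * (editor's note: a receive completion error moves the QP to the
+ * Error state, so nothing is retried here; the flushed receive
+ * buffers are reclaimed later by cleanup_recv_queue())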
+ */ + return; + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + struct ib_qp_attr *attr; + + /* Transition QP to RTS and fail offending send */ + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } + ib_mad_send_done_handler(port_priv, wc); + } +} + +/* + * IB MAD completion callback + */ +static void ib_mad_completion_handler(void *data) +{ + struct ib_mad_port_private *port_priv; + struct ib_wc wc; + + port_priv = (struct ib_mad_port_private *)data; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); + } +} + +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_list) { + if (mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + } + + /* Empty wait list to prevent receives from finding a request */ + list_splice_init(&mad_agent_priv->wait_list, &cancel_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Report all cancelled requests */ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_list) { + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_list); + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + } +} + +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = 
container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + + if (mad_send_wr->refcount != 0) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + +out: + return; +} +EXPORT_SYMBOL(ib_cancel_mad); + +static void local_completions(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_local_private *local; + unsigned long flags; + struct ib_wc wc; + struct ib_mad_send_wc mad_send_wc; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->local_list)) { + local = list_entry(mad_agent_priv->local_list.next, + struct ib_mad_local_private, + completion_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + if (local->mad_priv) { + /* + * Defined behavior is to complete response + * before request + */ + wc.wr_id = local->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + local->mad_priv->header.recv_wc.wc = &wc; + local->mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&local->mad_priv->header.recv_wc.recv_buf.list); + local->mad_priv->header.recv_wc.recv_buf.grh = NULL; + local->mad_priv->header.recv_wc.recv_buf.mad = + &local->mad_priv->mad.mad; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &local->mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &local->mad_priv->header.recv_wc); + } + + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = local->wr_id; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, &local->send_wr, + &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&local->completion_list); + atomic_dec(&mad_agent_priv->refcount); + kfree(local); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void timeout_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags, delay; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; + mad_send_wc.vendor_err = 0; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->wait_list)) { + mad_send_wr = 
list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_send_wr->timeout, jiffies)) { + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + break; + } + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void ib_mad_thread_completion_handler(struct ib_cq *cq) +{ + struct ib_mad_port_private *port_priv = cq->cq_context; + + queue_work(port_priv->wq, &port_priv->work); +} + +/* + * Allocate receive MADs and post receive WRs for them + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) +{ + unsigned long flags; + int post, ret; + struct ib_mad_private *mad_priv; + struct ib_sge sg_list; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; + + /* Initialize common scatter list fields */ + sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; + + /* Initialize common receive WR fields */ + recv_wr.next = NULL; + recv_wr.sg_list = &sg_list; + recv_wr.num_sge = 1; + recv_wr.recv_flags = IB_RECV_SIGNALED; + + do { + /* Allocate and map receive buffer */ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = dma_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + + /* Post receive WR */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); + if (ret) { + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); + break; + } + } while (post); + + return ret; +} + +/* + * Return all the posted receive MADs + */ +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; + + while (!list_empty(&qp_info->recv_queue.list)) { + + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + 
header); + + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); + + /* Undo PCI mapping */ + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, recv); + } + + qp_info->recv_queue.count = 0; +} + +/* + * Start the port + */ +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +{ + int ret, i; + struct ib_qp_attr *attr; + struct ib_qp *qp; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); + return -ENOMEM; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "INIT: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTR: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTS: %d\n", i, ret); + goto out; + } + } + + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion " + "notification: %d\n", ret); + goto out; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); + goto out; + } + } +out: + kfree(attr); + return ret; +} + +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
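(editor's note: no recovery is attempted for a fatal asynchronous QP event; it is only logged)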
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) +{ + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); + spin_lock_init(&qp_info->snoop_lock); + qp_info->snoop_table = NULL; + qp_info->snoop_table_size = 0; + atomic_set(&qp_info->snoop_count, 0); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = qp_info->port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + /* Use minimum queue sizes unless the CQ is resized */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); + if (qp_info->snoop_table) + kfree(qp_info->snoop_table); +} + +/* + * Open the port + * Create the QP, PD, MR, and CQ if needed + */ +static int ib_mad_port_open(struct ib_device *device, + int port_num) +{ + int ret, cq_size; + struct ib_mad_port_private *port_priv; + unsigned long flags; + char name[sizeof "ib_mad123"]; + + /* First, check if port already open at MAD layer */ + port_priv = ib_get_mad_port(device, port_num); + if (port_priv) { + printk(KERN_DEBUG PFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); + return -ENOMEM; + } + memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); + + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + port_priv->cq = ib_create_cq(port_priv->device, + (ib_comp_handler) + ib_mad_thread_completion_handler, + NULL, port_priv, cq_size); + if (IS_ERR(port_priv->cq)) { + printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); + ret = PTR_ERR(port_priv->cq); + goto error3; + } + + port_priv->pd = ib_alloc_pd(device); + if (IS_ERR(port_priv->pd)) { + printk(KERN_ERR PFX "Couldn't create ib_mad 
PD\n"); + ret = PTR_ERR(port_priv->pd); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR PFX "Couldn't get ib_mad DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; + + snprintf(name, sizeof name, "ib_mad%d", port_num); + port_priv->wq = create_singlethread_workqueue(name); + if (!port_priv->wq) { + ret = -ENOMEM; + goto error8; + } + INIT_WORK(&port_priv->work, ib_mad_completion_handler, port_priv); + + ret = ib_mad_port_start(port_priv); + if (ret) { + printk(KERN_ERR PFX "Couldn't start port\n"); + goto error9; + } + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_mad_port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + return 0; + +error9: + destroy_workqueue(port_priv->wq); +error8: + destroy_mad_qp(&port_priv->qp_info[1]); +error7: + destroy_mad_qp(&port_priv->qp_info[0]); +error6: + ib_dereg_mr(port_priv->mr); +error5: + ib_dealloc_pd(port_priv->pd); +error4: + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); +error3: + kfree(port_priv); + + return ret; +} + +/* + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure + */ +static int ib_mad_port_close(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + port_priv = __ib_get_mad_port(device, port_num); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + printk(KERN_ERR PFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + /* Stop processing completions. 
*/ + flush_workqueue(port_priv->wq); + destroy_workqueue(port_priv->wq); + destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); + ib_dereg_mr(port_priv->mr); + ib_dealloc_pd(port_priv->pd); + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); + /* XXX: Handle deallocation of MAD registration tables */ + + kfree(port_priv); + + return 0; +} + +static void ib_mad_init_device(struct ib_device *device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + ret = ib_agent_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", + device->name, cur_port); + goto error_device_open; + } + } + + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_mad_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client mad_client = { + .name = "mad", + .add = ib_mad_init_device, + .remove = ib_mad_remove_device +}; + +static int __init ib_mad_init_module(void) +{ + int ret; + + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + + ib_mad_cache = kmem_cache_create("ib_mad", + sizeof(struct ib_mad_private), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + if (!ib_mad_cache) { + printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); + ret = -ENOMEM; + goto error1; + } + + INIT_LIST_HEAD(&ib_mad_port_list); + + if (ib_register_client(&mad_client)) { + printk(KERN_ERR PFX "Couldn't register ib_mad client\n"); + ret = -EINVAL; + goto error2; + } + + return 0; + +error2: + kmem_cache_destroy(ib_mad_cache); +error1: + return ret; +} + +static void __exit ib_mad_cleanup_module(void) +{ + ib_unregister_client(&mad_client); + + if (kmem_cache_destroy(ib_mad_cache)) { + printk(KERN_DEBUG PFX "Failed to destroy ib_mad cache\n"); + } +} + +module_init(ib_mad_init_module); +module_exit(ib_mad_cleanup_module); From roland at topspin.com Sun Dec 19 22:14:57 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:57 -0800 Subject: [openib-general] [PATCH][v4][7/24] Add InfiniBand MAD SMI support In-Reply-To: <200412192214.OnqdUjlwc4A94uzk@topspin.com> Message-ID: 
<200412192214.ePDzwiq1jLcgVszE@topspin.com> Add MAD layer SMI (Subnet Management Interface) code. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.c 2004-12-19 22:04:13.901650545 -0800 @@ -0,0 +1,234 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include + + +/* + * Fixup a directed route SMP for sending + * Return 0 if the SMP should be discarded + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- Fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer */ + return 0; + } +} + +/* + * Adjust information for a received SMP + * Return 0 if the SMP should be dropped + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + 
smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM */ + /* C14-13:5 -- Check for unreasonable hop pointer */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue + * Return 0 if the SMP should be completed up the stack + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.h 2004-12-19 22:04:13.924647156 -0800 @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From roland at topspin.com Sun Dec 19 22:14:58 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:58 -0800 Subject: [openib-general] [PATCH][v4][8/24] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <200412192214.ePDzwiq1jLcgVszE@topspin.com> Message-ID: <200412192214.sRxfKUrqyRM7YANI@topspin.com> Add support for sending queries to the SA (Subnet Administration). In particular the PathRecord and MCMember (multicast group member) used by the IP-over-InfiniBand driver are implemented. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-19 22:04:13.291740424 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-19 22:04:14.163611941 -0800 @@ -1,8 +1,10 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o ib_mad-y := mad.o smi.o agent.o + +ib_sa-y := sa_query.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-12-19 22:04:14.187608404 -0800 @@ -0,0 +1,866 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); +MODULE_LICENSE("Dual BSD/GPL"); + +/* + * These two structures must be packed because they have 64-bit fields + * that are only 32-bit aligned. 64-bit architectures will lay them + * out wrong otherwise. (And unfortunately they are sent on the wire + * so we can't change the layout) + */ +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + +struct ib_sa_sm_ah { + struct ib_ah *ah; + struct kref ref; +}; + +struct ib_sa_port { + struct ib_mad_agent *agent; + struct ib_mr *mr; + struct ib_sa_sm_ah *sm_ah; + struct work_struct update_task; + spinlock_t ah_lock; + u8 port_num; +}; + +struct ib_sa_device { + int start_port, end_port; + struct ib_event_handler event_handler; + struct ib_sa_port port[0]; +}; + +struct ib_sa_query { + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); + void (*release)(struct ib_sa_query *); + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_sm_ah *sm_ah; + DECLARE_PCI_UNMAP_ADDR(mapping) + int id; +}; + +struct ib_sa_path_query { + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +struct ib_sa_mcmember_query { + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +static void ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device); + +static struct ib_client sa_client = { + .name = "sa", + .add = ib_sa_add_one, + .remove = ib_sa_remove_one +}; + +static spinlock_t idr_lock; +static DEFINE_IDR(query_idr); + +static spinlock_t tid_lock; +static u32 tid; + +enum { + IB_SA_ATTR_CLASS_PORTINFO = 0x01, + IB_SA_ATTR_NOTICE = 0x02, + IB_SA_ATTR_INFORM_INFO = 0x03, + IB_SA_ATTR_NODE_REC = 0x11, + IB_SA_ATTR_PORT_INFO_REC = 0x12, + IB_SA_ATTR_SL2VL_REC = 0x13, + IB_SA_ATTR_SWITCH_REC = 0x14, + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, + IB_SA_ATTR_MCAST_FDB_REC = 0x17, + IB_SA_ATTR_SM_INFO_REC = 0x18, + IB_SA_ATTR_LINK_REC = 0x20, + IB_SA_ATTR_GUID_INFO_REC = 0x30, + IB_SA_ATTR_SERVICE_REC = 0x31, + IB_SA_ATTR_PARTITION_REC = 0x33, + IB_SA_ATTR_RANGE_REC = 0x34, + IB_SA_ATTR_PATH_REC = 0x35, + IB_SA_ATTR_VL_ARB_REC = 0x36, + IB_SA_ATTR_MC_GROUP_REC = 0x37, + IB_SA_ATTR_MC_MEMBER_REC = 0x38, + IB_SA_ATTR_TRACE_REC = 0x39, + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b +}; + +#define PATH_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ + .field_name = "sa_path_rec:" #field + +static const struct ib_field path_rec_table[] = { + { RESERVED, + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 32 }, + { PATH_REC_FIELD(dgid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(sgid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(dlid), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { 
PATH_REC_FIELD(slid), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 16 }, + { PATH_REC_FIELD(raw_traffic), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 11, + .offset_bits = 1, + .size_bits = 3 }, + { PATH_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { PATH_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { PATH_REC_FIELD(traffic_class), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 8 }, + { PATH_REC_FIELD(reversible), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { PATH_REC_FIELD(numb_path), + .offset_words = 12, + .offset_bits = 9, + .size_bits = 7 }, + { PATH_REC_FIELD(pkey), + .offset_words = 12, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 13, + .offset_bits = 0, + .size_bits = 12 }, + { PATH_REC_FIELD(sl), + .offset_words = 13, + .offset_bits = 12, + .size_bits = 4 }, + { PATH_REC_FIELD(mtu_selector), + .offset_words = 13, + .offset_bits = 16, + .size_bits = 2 }, + { PATH_REC_FIELD(mtu), + .offset_words = 13, + .offset_bits = 18, + .size_bits = 6 }, + { PATH_REC_FIELD(rate_selector), + .offset_words = 13, + .offset_bits = 24, + .size_bits = 2 }, + { PATH_REC_FIELD(rate), + .offset_words = 13, + .offset_bits = 26, + .size_bits = 6 }, + { PATH_REC_FIELD(packet_life_time_selector), + .offset_words = 14, + .offset_bits = 0, + .size_bits = 2 }, + { PATH_REC_FIELD(packet_life_time), + .offset_words = 14, + .offset_bits = 2, + .size_bits = 6 }, + { PATH_REC_FIELD(preference), + .offset_words = 14, + .offset_bits = 8, + .size_bits = 8 }, + { RESERVED, + .offset_words = 14, + .offset_bits = 16, + .size_bits = 48 }, +}; + +#define MCMEMBER_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_mcmember_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_mcmember_rec *) 0)->field, \ + .field_name = "sa_mcmember_rec:" #field + +static const struct ib_field mcmember_rec_table[] = { + { MCMEMBER_REC_FIELD(mgid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(port_gid), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(qkey), + .offset_words = 8, + .offset_bits = 0, + .size_bits = 32 }, + { MCMEMBER_REC_FIELD(mlid), + .offset_words = 9, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(mtu_selector), + .offset_words = 9, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(mtu), + .offset_words = 9, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(traffic_class), + .offset_words = 9, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(pkey), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(rate_selector), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(rate), + .offset_words = 10, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(packet_life_time_selector), + .offset_words = 10, + .offset_bits = 24, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(packet_life_time), + .offset_words = 10, + .offset_bits = 26, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(sl), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { MCMEMBER_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(scope), + .offset_words = 12, + .offset_bits = 
0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(join_state), + .offset_words = 12, + .offset_bits = 4, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(proxy_join), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { RESERVED, + .offset_words = 12, + .offset_bits = 9, + .size_bits = 23 }, +}; + +static void free_sm_ah(struct kref *kref) +{ + struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); + + ib_destroy_ah(sm_ah->ah); + kfree(sm_ah); +} + +static void update_sm_ah(void *port_ptr) +{ + struct ib_sa_port *port = port_ptr; + struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + + if (ib_query_port(port->agent->device, port->port_num, &port_attr)) { + printk(KERN_WARNING "Couldn't query port\n"); + return; + } + + new_ah = kmalloc(sizeof *new_ah, GFP_KERNEL); + if (!new_ah) { + printk(KERN_WARNING "Couldn't allocate new SM AH\n"); + return; + } + + kref_init(&new_ah->ref); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + new_ah->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(new_ah->ah)) { + printk(KERN_WARNING "Couldn't create new SM AH\n"); + kfree(new_ah); + return; + } + + spin_lock_irq(&port->ah_lock); + old_ah = port->sm_ah; + port->sm_ah = new_ah; + spin_unlock_irq(&port->ah_lock); + + if (old_ah) + kref_put(&old_ah->ref, free_sm_ah); +} + +static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_sa_device *sa_dev = + ib_get_client_data(event->device, &sa_client); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } +} + +/** + * ib_sa_cancel_query - try to cancel an SA query + * @id:ID of query to cancel + * @query:query pointer to cancel + * + * Try to cancel an SA query. If the id and query don't match up or + * the query has already completed, nothing is done. Otherwise the + * query is canceled and will complete with a status of -EINTR. 
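+ *
+ * (editor's note, illustrative sketch only: a caller that started a
+ * query with
+ *	id = ib_sa_path_rec_get(device, port, &rec, mask, timeout_ms,
+ *				GFP_KERNEL, my_callback, my_context, &query);
+ * would cancel it with ib_sa_cancel_query(id, query); my_callback,
+ * my_context, rec and mask are hypothetical caller names, and the
+ * callback is still invoked, with status -EINTR)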
+ */ +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + unsigned long flags; + struct ib_mad_agent *agent; + + spin_lock_irqsave(&idr_lock, flags); + if (idr_find(&query_idr, id) != query) { + spin_unlock_irqrestore(&idr_lock, flags); + return; + } + agent = query->port->agent; + spin_unlock_irqrestore(&idr_lock, flags); + + ib_cancel_mad(agent, id); +} +EXPORT_SYMBOL(ib_sa_cancel_query); + +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +{ + unsigned long flags; + + memset(mad, 0, sizeof *mad); + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + + spin_lock_irqsave(&tid_lock, flags); + mad->mad_hdr.tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | tid++); + spin_unlock_irqrestore(&tid_lock, flags); +} + +static int send_mad(struct ib_sa_query *query, int timeout_ms) +{ + struct ib_sa_port *port = query->port; + unsigned long flags; + int ret; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .mad_hdr = &query->mad->mad_hdr, + .remote_qpn = 1, + .remote_qkey = IB_QP1_QKEY, + .timeout_ms = timeout_ms + } + } + }; + +retry: + if (!idr_pre_get(&query_idr, GFP_ATOMIC)) + return -ENOMEM; + spin_lock_irqsave(&idr_lock, flags); + ret = idr_get_new(&query_idr, query, &query->id); + spin_unlock_irqrestore(&idr_lock, flags); + if (ret == -EAGAIN) + goto retry; + if (ret) + return ret; + + wr.wr_id = query->id; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + wr.wr.ud.ah = port->sm_ah->ah; + spin_unlock_irqrestore(&port->ah_lock, flags); + + gather_list.addr = dma_map_single(port->agent->device->dma_device, + query->mad, + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + gather_list.length = sizeof (struct ib_sa_mad); + gather_list.lkey = port->mr->lkey; + pci_unmap_addr_set(query, mapping, gather_list.addr); + + ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(port->agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, query->id); + spin_unlock_irqrestore(&idr_lock, flags); + } + + return ret; +} + +static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_path_query *query = + container_of(sa_query, struct ib_sa_path_query, sa_query); + + if (mad) { + struct ib_sa_path_rec rec; + + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); +} + +/** + * ib_sa_path_rec_get - Start a Path get query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to 
cancel query + * + * Send a Path Record Get query to the SA to look up a path. The + * callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_path_rec_get() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. + */ +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_path_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_path_rec_callback; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_mcmember_query *query = + container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + + if (mad) { + struct ib_sa_mcmember_rec rec; + + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); +} + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_mcmember_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_mcmember_rec_callback; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = method; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (!query) + return; + + switch (mad_send_wc->status) { + case IB_WC_SUCCESS: + /* No callback -- already got recv */ + break; + case IB_WC_RESP_TIMEOUT_ERR: + query->callback(query, -ETIMEDOUT, NULL); + break; + case IB_WC_WR_FLUSH_ERR: + query->callback(query, -EINTR, NULL); + break; + default: + query->callback(query, -EIO, NULL); + break; + } + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + + query->release(query); + + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (query) { + if (mad_recv_wc->wc->status == IB_WC_SUCCESS) + query->callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + else + query->callback(query, -EIO, NULL); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void ib_sa_add_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + sa_dev = kmalloc(sizeof *sa_dev + + (e - s + 1) * sizeof (struct ib_sa_port), + GFP_KERNEL); + if (!sa_dev) + return; + + sa_dev->start_port = s; + sa_dev->end_port = e; + + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); + + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + NULL, 0, send_handler, + recv_handler, sa_dev); + if (IS_ERR(sa_dev->port[i].agent)) + goto err; + + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + goto err; + } + + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); + } + + ib_set_client_data(device, &sa_client, sa_dev); + + /* + * We register our event handler after everything is set up, + * and then update our cached info after the event handler is + * registered to avoid any problems if a port changes state + * during our initialization. 
+ */ + + INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); + if (ib_register_event_handler(&sa_dev->event_handler)) + goto err; + + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); + + return; + +err: + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + + kfree(sa_dev); + + return; +} + +static void ib_sa_remove_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i; + + if (!sa_dev) + return; + + ib_unregister_event_handler(&sa_dev->event_handler); + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } + + kfree(sa_dev); +} + +static int __init ib_sa_init(void) +{ + int ret; + + spin_lock_init(&idr_lock); + spin_lock_init(&tid_lock); + + get_random_bytes(&tid, sizeof tid); + + ret = ib_register_client(&sa_client); + if (ret) + printk(KERN_ERR "Couldn't register ib_sa client\n"); + + return ret; +} + +static void __exit ib_sa_cleanup(void) +{ + ib_unregister_client(&sa_client); +} + +module_init(ib_sa_init); +module_exit(ib_sa_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_sa.h 2004-12-19 22:04:14.212604721 -0800 @@ -0,0 +1,280 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +#include +#include + +enum { + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + + IB_SA_METHOD_DELETE = 0x15 +}; + +enum ib_sa_selector { + IB_SA_GTE = 0, + IB_SA_LTE = 1, + IB_SA_EQ = 2, + /* + * The meaning of "best" depends on the attribute: for + * example, for MTU best will return the largest available + * MTU, while for packet life time, best will return the + * smallest available life time. + */ + IB_SA_BEST = 3 +}; + +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +/* + * Structures for SA records are named "struct ib_sa_xxx_rec." 
No + * attempt is made to pack structures to match the physical layout of + * SA records in SA MADs; all packing and unpacking is handled by the + * SA query code. + * + * For a record with structure ib_sa_xxx_rec, the naming convention + * for the component mask value for field yyy is IB_SA_XXX_REC_YYY (we + * never use different abbreviations or otherwise change the spelling + * of xxx/yyy between ib_sa_xxx_rec.yyy and IB_SA_XXX_REC_YYY). + * + * Reserved rows are indicated with comments to help maintainability. + */ + +/* reserved: 0 */ +/* reserved: 1 */ +#define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) +#define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) +#define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) +#define IB_SA_PATH_REC_SLID IB_SA_COMP_MASK( 5) +#define IB_SA_PATH_REC_RAW_TRAFFIC IB_SA_COMP_MASK( 6) +/* reserved: 7 */ +#define IB_SA_PATH_REC_FLOW_LABEL IB_SA_COMP_MASK( 8) +#define IB_SA_PATH_REC_HOP_LIMIT IB_SA_COMP_MASK( 9) +#define IB_SA_PATH_REC_TRAFFIC_CLASS IB_SA_COMP_MASK(10) +#define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) +#define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) +#define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) +/* reserved: 14 */ +#define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) +#define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) +#define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) +#define IB_SA_PATH_REC_RATE_SELECTOR IB_SA_COMP_MASK(18) +#define IB_SA_PATH_REC_RATE IB_SA_COMP_MASK(19) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(20) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(21) +#define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ib_gid dgid; + union ib_gid sgid; + u16 dlid; + u16 slid; + int raw_traffic; + /* reserved */ + u32 flow_label; + u8 hop_limit; + u8 traffic_class; + int reversible; + u8 numb_path; + u16 pkey; + /* reserved */ + u8 sl; + u8 mtu_selector; + enum ib_mtu mtu; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 preference; +}; + +#define IB_SA_MCMEMBER_REC_MGID IB_SA_COMP_MASK( 0) +#define IB_SA_MCMEMBER_REC_PORT_GID IB_SA_COMP_MASK( 1) +#define IB_SA_MCMEMBER_REC_QKEY IB_SA_COMP_MASK( 2) +#define IB_SA_MCMEMBER_REC_MLID IB_SA_COMP_MASK( 3) +#define IB_SA_MCMEMBER_REC_MTU_SELECTOR IB_SA_COMP_MASK( 4) +#define IB_SA_MCMEMBER_REC_MTU IB_SA_COMP_MASK( 5) +#define IB_SA_MCMEMBER_REC_TRAFFIC_CLASS IB_SA_COMP_MASK( 6) +#define IB_SA_MCMEMBER_REC_PKEY IB_SA_COMP_MASK( 7) +#define IB_SA_MCMEMBER_REC_RATE_SELECTOR IB_SA_COMP_MASK( 8) +#define IB_SA_MCMEMBER_REC_RATE IB_SA_COMP_MASK( 9) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(10) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(11) +#define IB_SA_MCMEMBER_REC_SL IB_SA_COMP_MASK(12) +#define IB_SA_MCMEMBER_REC_FLOW_LABEL IB_SA_COMP_MASK(13) +#define IB_SA_MCMEMBER_REC_HOP_LIMIT IB_SA_COMP_MASK(14) +#define IB_SA_MCMEMBER_REC_SCOPE IB_SA_COMP_MASK(15) +#define IB_SA_MCMEMBER_REC_JOIN_STATE IB_SA_COMP_MASK(16) +#define IB_SA_MCMEMBER_REC_PROXY_JOIN IB_SA_COMP_MASK(17) + +struct ib_sa_mcmember_rec { + union ib_gid mgid; + union ib_gid port_gid; + u32 qkey; + u16 mlid; + u8 mtu_selector; + enum ib_mtu mtu; + u8 traffic_class; + u16 pkey; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 sl; + u32 flow_label; + u8 hop_limit; + u8 scope; + u8 join_state; + int proxy_join; +}; + +struct ib_sa_query; + +void ib_sa_cancel_query(int id, struct ib_sa_query *query); + +int 
ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +/** + * ib_sa_mcmember_rec_set - Start an MCMember set query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:MCMember Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send an MCMember Set query to the SA (eg to join a multicast + * group). The callback function will be called when the query + * completes (or fails); status is 0 for a successful response, -EINTR + * if the query is canceled, -ETIMEDOUT if the query timed out, or + * -EIO if an error occurred sending the query. The resp parameter of + * the callback is only valid if status is 0. + * + * If the return value of ib_sa_mcmember_rec_set() is negative, it is + * an error code. Otherwise it is a query ID that can be used to + * cancel the query. + */ +static inline int +ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + +/** + * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:MCMember Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send an MCMember Delete query to the SA (eg to leave a multicast + * group). The callback function will be called when the query + * completes (or fails); status is 0 for a successful response, -EINTR + * if the query is canceled, -ETIMEDOUT if the query timed out, or + * -EIO if an error occurred sending the query. The resp parameter of + * the callback is only valid if status is 0. + * + * If the return value of ib_sa_mcmember_rec_delete() is negative, it + * is an error code. Otherwise it is a query ID that can be used to + * cancel the query.
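To make the calling convention documented above concrete, here is a minimal, hypothetical consumer sketch (the names my_path_callback and start_path_query are invented for this illustration and are not part of the posted patch). It starts a path record query and keeps the returned ID so the query could later be canceled with ib_sa_cancel_query():

static void my_path_callback(int status, struct ib_sa_path_rec *resp,
                             void *context)
{
        /* resp is only valid while status == 0 */
        if (status)
                printk(KERN_WARNING "path query failed: %d\n", status);
        else
                printk(KERN_INFO "path resolved, dlid 0x%x\n", resp->dlid);
}

static int start_path_query(struct ib_device *device, u8 port_num,
                            union ib_gid *sgid, union ib_gid *dgid,
                            struct ib_sa_query **query)
{
        struct ib_sa_path_rec rec = {
                .sgid      = *sgid,
                .dgid      = *dgid,
                .numb_path = 1
        };
        int id;

        id = ib_sa_path_rec_get(device, port_num, &rec,
                                IB_SA_PATH_REC_DGID |
                                IB_SA_PATH_REC_SGID |
                                IB_SA_PATH_REC_NUMB_PATH,
                                1000, GFP_KERNEL,
                                my_path_callback, NULL, query);
        if (id < 0)
                return id;      /* error code */

        /* a non-negative id can later be passed to ib_sa_cancel_query(id, *query) */
        return 0;
}

Judging by the MCMember callback earlier in this patch, the record handed to the callback is a stack-local copy unpacked from the MAD, so a consumer should copy out whatever fields it needs before returning.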
+ */ +static inline int +ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + + +#endif /* IB_SA_H */ From roland at topspin.com Sun Dec 19 22:15:02 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:02 -0800 Subject: [openib-general] [PATCH][v4][10/24] Add Mellanox HCA low-level driver (midlayer interface) In-Reply-To: <200412192214.l0L0xI7jtoJS0zXS@topspin.com> Message-ID: <200412192215.tmAzGQeAfP77Jebj@topspin.com> Add midlayer interface code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-12-19 22:04:15.045481984 -0800 @@ -0,0 +1,626 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_provider.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err 
= mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = 
init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent <= entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent - 1; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = &dev->pdev->dev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-12-19 22:04:15.069478447 -0800 @@ -0,0 +1,225 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_provider.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_PROVIDER_H +#define MTHCA_PROVIDER_H + +#include +#include + +#define MTHCA_MPT_FLAG_ATOMIC (1 << 14) +#define MTHCA_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define MTHCA_MPT_FLAG_REMOTE_READ (1 << 12) +#define MTHCA_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define MTHCA_MPT_FLAG_LOCAL_READ (1 << 10) + +struct mthca_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct mthca_mr { + struct ib_mr ibmr; + int order; + u32 first_seg; +}; + +struct mthca_pd { + struct ib_pd ibpd; + u32 pd_num; + atomic_t sqp_count; + struct mthca_mr ntmr; +}; + +struct mthca_eq { + struct mthca_dev *dev; + int eqn; + u32 ecr_mask; + u16 msi_x_vector; + u16 msi_x_entry; + int have_irq; + int nent; + int cons_index; + struct mthca_buf_list *page_list; + struct mthca_mr mr; +}; + +struct mthca_av; + +struct mthca_ah { + struct ib_ah ibah; + int on_hca; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct mthca_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct mthca_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. + * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibility to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized.
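As a minimal sketch of the completion-event sequence listed above, using the cq_table and per-CQ fields defined in this header (the function name handle_cq_event is invented for illustration, and plain spin_lock stands in for whatever IRQ-safe locking a real event handler would need):

static void handle_cq_event(struct mthca_dev *dev, u32 cqn)
{
        struct mthca_cq *cq;

        /* lock cq_table and look up the struct */
        spin_lock(&dev->cq_table.lock);
        cq = mthca_array_get(&dev->cq_table.cq, cqn);
        if (cq)
                atomic_inc(&cq->refcount);      /* take a reference */
        spin_unlock(&dev->cq_table.lock);       /* drop cq_table lock */

        if (!cq)
                return;

        spin_lock(&cq->lock);                   /* lock the struct itself */
        /* ... do your thing: process the event, poll completions ... */
        spin_unlock(&cq->lock);

        /* drop the reference; if it hits zero, wake a sleeping destroy */
        if (atomic_dec_and_test(&cq->refcount))
                wake_up(&cq->wait);
}

The destroy path is the mirror image: remove the pointer from the table, drop that reference, and wait_event() until the count reaches zero before freeing.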
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ From roland at topspin.com Sun Dec 19 22:14:59 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:59 -0800 Subject: [openib-general] [PATCH][v4][9/24] Add Mellanox HCA low-level driver In-Reply-To: <200412192214.sRxfKUrqyRM7YANI@topspin.com> Message-ID: <200412192214.l0L0xI7jtoJS0zXS@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) is present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API) Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-19 22:04:11.935940222 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-19 22:04:14.496562875 -0800 @@ -7,4 +7,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. 
+source "drivers/infiniband/hw/mthca/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-19 22:04:11.960936538 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-19 22:04:14.472566412 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-12-19 22:04:14.520559339 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver to produce a bunch of debug + messages. Select this if you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-12-19 22:04:14.543555950 -0800 @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ + mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ + mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ + mthca_provider.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-12-19 22:04:14.567552414 -0800 @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ * + * $Id: mthca_allocator.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. 
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-12-19 22:04:14.591548878 -0800 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
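Since mthca_array leaves serialization of the _get, _set and _clear methods entirely to its callers, a rough usage sketch might look like the following (my_table, my_table_store and my_table_lookup are invented names for illustration, and the array is assumed to have been set up with mthca_array_init() beforehand):

struct my_table {
        spinlock_t lock;                /* serializes all mthca_array calls */
        struct mthca_array objs;
};

static int my_table_store(struct my_table *tbl, int index, void *obj)
{
        int err;

        spin_lock_irq(&tbl->lock);
        /* may allocate a leaf page; uses GFP_ATOMIC since the lock is held */
        err = mthca_array_set(&tbl->objs, index, obj);
        spin_unlock_irq(&tbl->lock);
        return err;
}

static void *my_table_lookup(struct my_table *tbl, int index)
{
        void *obj;

        spin_lock_irq(&tbl->lock);
        obj = mthca_array_get(&tbl->objs, index);
        spin_unlock_irq(&tbl->lock);
        return obj;
}

This mirrors how the driver's own CQ and QP tables pair a struct mthca_array with a spinlock.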
+ * + * $Id: mthca_config_reg.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-12-19 22:04:14.615545341 -0800 @@ -0,0 +1,391 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_dev.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; 
+ struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) \ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct 
mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-12-19 22:04:14.639541805 -0800 @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_doorbell.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-12-19 22:04:14.663538269 -0800 @@ -0,0 +1,933 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_main.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) +{ + int err; + u8 status; + + err = mthca_QUERY_DEV_LIM(mdev, dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + if (dev_lim->min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + 
"kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim->min_page_sz, PAGE_SIZE); + return -ENODEV; + } + if (dev_lim->num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim->num_ports, MTHCA_MAX_PORTS); + return -ENODEV; + } + + mdev->limits.num_ports = dev_lim->num_ports; + mdev->limits.vl_cap = dev_lim->max_vl; + mdev->limits.mtu_cap = dev_lim->max_mtu; + mdev->limits.gid_table_len = dev_lim->max_gids; + mdev->limits.pkey_table_len = dev_lim->max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; + mdev->limits.max_sg = dev_lim->max_sg; + mdev->limits.reserved_qps = dev_lim->reserved_qps; + mdev->limits.reserved_srqs = dev_lim->reserved_srqs; + mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + mdev->limits.reserved_cqs = dev_lim->reserved_cqs; + mdev->limits.reserved_eqs = dev_lim->reserved_eqs; + mdev->limits.reserved_mtts = dev_lim->reserved_mtts; + mdev->limits.reserved_mrws = dev_lim->reserved_mrws; + mdev->limits.reserved_uars = dev_lim->reserved_uars; + mdev->limits.reserved_pds = dev_lim->reserved_pds; + + if (dev_lim->flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_ent, num_sg, fw_pages, cur_order; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, 
aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + fw_pages = mdev->fw.arbel.fw_pages; + num_ent = 0; + /* + * We allocate in as big chunks as we can, up to a maximum of + * 256 KB per chunk. + */ + cur_order = get_order(1 << 18); + while (fw_pages > 0) { + while (1 << cur_order > fw_pages) + --cur_order; + + /* + * We allocate with GFP_HIGHUSER because only the + * firmware is going to touch these pages, so there's + * no need for a kernel virtual address. We use + * __GFP_NOWARN because we'll deal with any allocation + * failures ourselves. + */ + mdev->fw.arbel.mem[num_ent].page = alloc_pages(GFP_HIGHUSER | __GFP_NOWARN, + cur_order); + mdev->fw.arbel.mem[num_ent].length = PAGE_SIZE << cur_order; + if (!mdev->fw.arbel.mem[num_ent].page) { + --cur_order; + if (cur_order < 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } else { + ++num_ent; + fw_pages -= 1 << cur_order; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, num_ent, + PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + struct mthca_dev_lim dev_lim; + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + err = -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ 
+ if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. 
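+	 * (Concretely, the code below claims two sub-ranges of BAR 0 --
+	 * MTHCA_MAP_HCR_SIZE bytes at MTHCA_HCR_BASE for the command
+	 * interface and MTHCA_CLR_INT_SIZE bytes at MTHCA_CLR_INT_BASE for
+	 * the clear-interrupt register -- and leaves the MSI-X vector table
+	 * range unclaimed.)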
+ */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar2: + pci_release_region(pdev, 2); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. 
We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); From roland at topspin.com Sun Dec 19 22:15:05 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:05 -0800 Subject: [openib-general] [PATCH][v4][12/24] Add Mellanox HCA low-level driver (EQ) In-Reply-To: <200412192215.f5Aip1f85HbDxoU5@topspin.com> Message-ID: <200412192215.2kuiGerZWq2NkwHo@topspin.com> Add event queue code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-12-19 22:04:15.548407870 -0800 @@ -0,0 +1,689 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_eq.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. + */ +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 cqn; + u32 reserved1; + u8 reserved2[3]; + u8 syndrome; + } __attribute__((packed)) cq_err; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; 
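+	/* high bit of owner is the ownership flag (MTHCA_EQ_ENTRY_OWNER_SW/_HW below) */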
+} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? + MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? 
IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + + while (next_eqe_sw(eq)) { + int set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + /* cmd_event() may add more commands. + * The card will think the queue has overflowed if + * we don't tell it we've been processing events. + */ + set_ci = 1; + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + mthca_warn(dev, "CQ %s on CQN %08x\n", + eqe->event.cq_err.syndrome == 1 ? + "overrun" : "access violation", + be32_to_cpu(eqe->event.cq_err.cqn)); + break; + + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + mthca_warn(dev, "EQ overrun on EQN %d\n", eq->eqn); + break; + + case MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on EQ %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (set_ci) { + wmb(); /* see comment below */ + set_eq_ci(dev, eq->eqn, eq->cons_index); + set_ci = 0; + } + } + + /* + * This barrier makes sure that all updates to + * ownership bits done by set_eqe_hw() hit memory + * before the consumer index is updated. set_eq_ci() + * allows the HCA to possibly write more EQ entries, + * and we want to avoid the exceedingly unlikely + * possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. 
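+	 * (The same reasoning is behind the wmb() in the MTHCA_EVENT_TYPE_CMD
+	 * arm of the switch above, where the consumer index is pushed early so
+	 * that commands queued from cmd_event() don't make the EQ appear to
+	 * have overflowed.)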
+ */ + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + /* Make sure EQ size is aligned to a power of 2 size. */ + for (i = 1; i < nent; i <<= 1) + ; /* nothing */ + nent = i; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + 
mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} From roland at topspin.com Sun Dec 19 22:15:03 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:03 -0800 Subject: [openib-general] [PATCH][v4][11/24] Add Mellanox HCA low-level driver (FW commands) In-Reply-To: <200412192215.tmAzGQeAfP77Jebj@topspin.com> Message-ID: <200412192215.f5Aip1f85HbDxoU5@topspin.com> Add firmware command processing code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-12-19 22:04:15.289446032 -0800 @@ -0,0 +1,1573 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_cmd.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
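+ * (For reference, with HZ = 1000 the stricter per-class timeouts kept
+ * under "#if 0" below would work out to roughly 2, 11 and 101 jiffies;
+ * the values actually used are all 60 * HZ.)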
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* __raw_writel may not order writes. */ + wmb(); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = be32_to_cpu(__raw_readl(dev->hcr + HCR_STATUS_OFFSET)) >> 24; + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
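+ * (mthca_SYS_EN() below is one caller: on a DDR memory error the
+ * syndrome, socket and slave address are decoded from the immediate
+ * output parameter.)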
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
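+	 * For example, a chunk with DMA address 0x12340000 and length 0x40000
+	 * gives ffs(0x12340000 | 0x40000) - 1 = 18, so the whole chunk is
+	 * passed to the firmware as a single 256 KB (2^18 byte) page.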
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more significant bits than minor + * version, so swap here. + */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + /* + * Arbel page size is always 4 KB; round up number of + * system pages needed.
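The rounding mentioned in the comment above converts the firmware's 4 KB page count into whole system pages. A minimal sketch of the same arithmetic, assuming a hypothetical 64 KB system page size purely for illustration:

#include <stdio.h>

#define EXAMPLE_PAGE_SHIFT 16            /* e.g. a 64 KB-page system */

/* Round a count of 4 KB firmware pages up to whole system pages. */
static int fw_pages_to_sys_pages(int fw_pages_4k)
{
    int shift = EXAMPLE_PAGE_SHIFT - 12; /* log2 of 4 KB pages per system page */

    return (fw_pages_4k + (1 << shift) - 1) >> shift;
}

int main(void)
{
    /* 17 x 4 KB = 68 KB of firmware -> needs 2 x 64 KB system pages */
    printf("%d\n", fw_pages_to_sys_pages(17));
    /* 16 x 4 KB = 64 KB -> exactly 1 system page */
    printf("%d\n", fw_pages_to_sys_pages(16));
    return 0;
}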
+ */ + dev->fw.arbel.fw_pages = + (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> + (PAGE_SHIFT - 12); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_RSZ_SRQ_OFFSET 0x33 +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_SG_RQ_OFFSET 0x55 +#define QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET 0x56 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e +#define QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET 0x90 +#define QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET 0x92 +#define QUERY_DEV_LIM_PBL_SZ_OFFSET 0x96 +#define QUERY_DEV_LIM_BMME_FLAGS_OFFSET 0x97 +#define QUERY_DEV_LIM_RSVD_LKEY_OFFSET 0x98 +#define QUERY_DEV_LIM_LAMR_OFFSET 0x9f +#define QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET 0xa0 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, 
outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + 
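Most of the QUERY_DEV_LIM limits above come back log2-encoded in part of a byte and are decoded as 1 << (field & mask), or with a shift for high-nibble fields. A small illustrative sketch of that convention (the byte value and masks below are invented, not taken from the firmware spec):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t outbox_byte = 0x94;          /* hypothetical byte from the mailbox */

    /* low 5 bits: log2 of max QPs (the rest of the byte is used elsewhere) */
    int max_qps = 1 << (outbox_byte & 0x1f);

    /* high 4 bits of another field: log2 of reserved entries */
    int reserved = 1 << (outbox_byte >> 4);

    printf("max_qps = %d, reserved = %d\n", max_qps, reserved);
    return 0;
}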
MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); + dev_lim->hca.arbel.resize_srq = field & 1; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mtt_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mpt_entry_sz = size; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); + dev_lim->hca.arbel.max_pbl_sz = 1 << (field & 0x3f); + MTHCA_GET(dev_lim->hca.arbel.bmme_flags, outbox, + QUERY_DEV_LIM_BMME_FLAGS_OFFSET); + MTHCA_GET(dev_lim->hca.arbel.reserved_lkey, outbox, + QUERY_DEV_LIM_RSVD_LKEY_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_LAMR_OFFSET); + dev_lim->hca.arbel.lam_required = field & 1; + MTHCA_GET(dev_lim->hca.arbel.max_icm_sz, outbox, + QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET); + + if (dev_lim->hca.arbel.bmme_flags & 1) + mthca_dbg(dev, "Base MM extensions: yes " + "(flags %d, max PBL %d, rsvd L_Key %08x)\n", + dev_lim->hca.arbel.bmme_flags, + dev_lim->hca.arbel.max_pbl_sz, + dev_lim->hca.arbel.reserved_lkey); + else + mthca_dbg(dev, "Base MM extensions: no\n"); + + mthca_dbg(dev, "Max ICM size %lld MB\n", + (unsigned long long) dev_lim->hca.arbel.max_icm_sz >> 20); + } else { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); + } + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 
+#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + 
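The #if block above sets or clears the host-endianness bit in the big-endian INIT_HCA flags word through cpu_to_be32(). A userspace sketch of the same idea, using htonl() and a crude runtime probe as stand-ins (illustrative only, not the driver's mechanism):

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void)
{
    uint32_t flags = 0;                  /* stored big-endian, like the inbox words */

    /* set bit 0 ("check port for UD address vector") in big-endian form */
    flags |= htonl(1);

    /* bit 1 tells the HCA whether the host is big-endian: set it on a
     * big-endian host, keep it clear on a little-endian one */
    if (htonl(1) == 1)                   /* true only on big-endian hosts */
        flags |= htonl(1 << 1);
    else
        flags &= ~htonl(1 << 1);

    printf("flags word as stored: 0x%08x\n", flags);
    return 0;
}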
MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
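All of the command wrappers above follow the same shape: map the caller's buffer for DMA, issue the firmware command with a class timeout, then unmap. A self-contained sketch of that pattern, with stub functions standing in for pci_map_single()/mthca_cmd()/pci_unmap_single() (nothing below is real driver or PCI API):

#include <stdio.h>
#include <stdint.h>
#include <errno.h>

typedef uint64_t dma_addr_t;

static dma_addr_t map_buffer(void *buf, size_t len)
{
    printf("map   %zu bytes at %p\n", len, buf);
    return (dma_addr_t) (uintptr_t) buf;     /* pretend identity mapping */
}

static void unmap_buffer(dma_addr_t dma, size_t len)
{
    printf("unmap %zu bytes at 0x%llx\n", len, (unsigned long long) dma);
}

static int fw_command(dma_addr_t in, int index, int opcode, uint8_t *status)
{
    printf("cmd   opcode 0x%x, index %d, inbox 0x%llx\n",
           opcode, index, (unsigned long long) in);
    *status = 0;
    return 0;
}

/* Shape of mthca_SW2HW_MPT() and friends: map, command, unmap. */
static int sw2hw_entry(void *entry, size_t len, int index, uint8_t *status)
{
    dma_addr_t dma = map_buffer(entry, len);
    int err;

    if (!dma)
        return -ENOMEM;

    err = fw_command(dma, index, 0xd /* hypothetical opcode */, status);
    unmap_buffer(dma, len);
    return err;
}

int main(void)
{
    char entry[64];
    uint8_t status;

    return sw2hw_entry(entry, sizeof entry, 5, &status);
}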
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-12-19 22:04:15.312442643 -0800 @@ -0,0 +1,276 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_cmd.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocated resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, +
MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + +struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; + union { + struct { + int max_avs; + } tavor; + struct { + int resize_srq; + int mtt_entry_sz; + int mpt_entry_sz; + int max_pbl_sz; + u8 bmme_flags; + u32 reserved_lkey; + int lam_required; + u64 max_icm_sz; + } arbel; + } hca; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, u16 token, + u8 status, u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + 
int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) (x), MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ From roland at topspin.com Sun Dec 19 22:15:06 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:06 -0800 Subject: [openib-general] [PATCH][v4][13/24] Add Mellanox HCA low-level driver (initialization) In-Reply-To: <200412192215.2kuiGerZWq2NkwHo@topspin.com> Message-ID: <200412192215.Y6Xr7GqjaXLzJU2L@topspin.com> Add device initialization code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-12-19 22:04:15.759376780 -0800 @@ -0,0 +1,226 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ * + * $Id: mthca_profile.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
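The sort above relies on every region size being a power of two: laying the regions out in decreasing size order keeps each start offset aligned to that region's size with no padding. A small standalone check of that property (the sizes below are arbitrary, not real profile values):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t size[] = { 1 << 12, 1 << 20, 1 << 16, 1 << 12, 1 << 18 };
    int n = sizeof size / sizeof size[0];
    int i, j;

    /* simple descending bubble sort, like the loop in mthca_make_profile() */
    for (i = n; i > 0; --i)
        for (j = 1; j < i; ++j)
            if (size[j] > size[j - 1]) {
                uint64_t tmp = size[j];
                size[j] = size[j - 1];
                size[j - 1] = tmp;
            }

    uint64_t start = 0;
    for (i = 0; i < n; ++i) {
        printf("region %d: start 0x%08llx size 0x%08llx aligned=%s\n",
               i, (unsigned long long) start, (unsigned long long) size[i],
               (start % size[i]) ? "no" : "yes");
        start += size[i];
    }
    return 0;
}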
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-12-19 22:04:15.783373244 -0800 @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_profile.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-12-19 22:04:15.807369708 -0800 @@ -0,0 +1,232 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_reset.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. */ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? 
bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. + */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} From roland at topspin.com Sun Dec 19 22:15:07 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:07 -0800 Subject: [openib-general] [PATCH][v4][14/24] Add Mellanox HCA low-level driver (QP/CQ) In-Reply-To: <200412192215.Y6Xr7GqjaXLzJU2L@topspin.com> Message-ID: <200412192215.7jqz9W3Au4nr5j1b@topspin.com> Add CQ (completion queue) and QP (queue pair) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-12-19 22:04:16.047334345 -0800 @@ -0,0 +1,831 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_cq.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. + */ +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +}; + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +}; + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & + get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + 
atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & cq->ibcq.cqe); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & cq->ibcq.cqe); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + cq->ibcq.cqe), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & cq->ibcq.cqe; + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & cq->ibcq.cqe; + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + if (freed) { + wmb(); + while (freed--) + inc_cons_index(dev, cq, 1); + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto 
err_out_free_cq; + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < ((cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = 
mthca_alloc_init(&dev->cq_table.alloc, + dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-12-19 22:04:16.071330809 -0800 @@ -0,0 +1,1536 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_qp.c 1355 2004-12-17 15:23:43Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* 
immediate data */ +}; + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +}; + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +}; + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +}; + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +}; + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +}; + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +}; + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + 
}, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void 
init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(&param, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, &param, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + 
qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, "modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) + 
qp->state = new_state; + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. + * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if 
(qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + 
if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + switch (qp->transport) { + case RC: + switch (wr->opcode) { + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.atomic.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.atomic.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + + wqe += sizeof (struct mthca_raddr_seg); + + if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP) { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.swap); + ((struct mthca_atomic_seg *) wqe)->compare = + cpu_to_be64(wr->wr.atomic.compare_add); + } else { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.compare_add); + ((struct mthca_atomic_seg *) wqe)->compare = 0; + } + + wqe += sizeof (struct mthca_atomic_seg); + size += sizeof (struct mthca_raddr_seg) / 16 + + sizeof (struct mthca_atomic_seg); + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + case IB_WR_RDMA_READ: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.rdma.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.rdma.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + wqe += sizeof (struct mthca_raddr_seg); + size += sizeof (struct mthca_raddr_seg) / 16; + break; + + default: + /* No extra segments required for sends */ + break; + } + + case UD: + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + break; + + case MLX: + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + break; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} From roland at topspin.com Sun Dec 19 22:15:10 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:10 -0800 Subject: [openib-general] [PATCH][v4][15/24] Add Mellanox HCA low-level driver (last bits) In-Reply-To: <200412192215.7jqz9W3Au4nr5j1b@topspin.com> Message-ID: <200412192215.EI3HeB132RULe1EN@topspin.com> Add code for remaining InfiniBand objects (address vectors, multicast groups, memory regions and protection domains) Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-12-19 22:04:16.308295889 -0800 @@ -0,0 +1,219 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_av.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +}; + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } else { + /* Arbel workaround -- low byte of GID must be 2 */ + av->dgid[3] = cpu_to_be32(2); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + 
MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if (!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-12-19 22:04:16.332292352 -0800 @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mcg.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +}; + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
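 *
 * Illustrative example (derived from the rules above, not in the
 * original comment): with hash H and a chain MGM[H] -> AMGM entry 7 ->
 * AMGM entry 3, a lookup of a GID stored in entry 3 gives *index = 3,
 * *prev = 7; a lookup of a GID not in the chain gives *index = -1,
 * *prev = 3; and a GID stored in (or an empty) MGM[H] gives
 * *index = H, *prev = -1.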
+ */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + return -EINVAL; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + 
struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-12-19 22:04:16.356288816 -0800 @@ -0,0 +1,396 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mr.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* + * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits. + */ +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + 
MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + 
} + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-12-19 22:04:16.379285427 -0800 @@ -0,0 +1,80 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_pd.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} From roland at topspin.com Sun Dec 19 22:15:12 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:12 -0800 Subject: [openib-general] [PATCH][v4][16/24] Add Mellanox HCA low-level driver (MAD) In-Reply-To: <200412192215.EI3HeB132RULe1EN@topspin.com> Message-ID: <200412192215.l62Q9JXNhGg51wOf@topspin.com> Add MAD (management datagram) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-12-19 22:04:16.654244908 -0800 @@ -0,0 +1,320 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mad.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? 
IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = dma_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + DMA_TO_DEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). + */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. 
+ */ + if (in_mad->mad_hdr.attr_id == IB_SMP_ATTR_SM_INFO || + ((in_mad->mad_hdr.attr_id & IB_SMP_ATTR_VENDOR_MASK) == + IB_SMP_ATTR_VENDOR_MASK)) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][17/24] IPoIB IPv4 multicast In-Reply-To: <200412192215.l62Q9JXNhGg51wOf@topspin.com> Message-ID: <200412192215.vegmgBmv5xungHlQ@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-12-19 22:04:16.867213523 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-12-19 21:09:26.000000000 -0800 +++ linux-bk/include/net/ip.h 2004-12-19 22:04:16.868213376 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. + */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-12-19 21:09:46.000000000 -0800 +++ linux-bk/net/ipv4/arp.c 2004-12-19 22:04:16.868213376 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][18/24] IPoIB IPv6 support In-Reply-To: <200412192215.vegmgBmv5xungHlQ@topspin.com> Message-ID: <200412192215.69tnzAhGIT1vQGLF@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. 
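As a worked example (computed from the ip_ib_mc_map()/ipv6_ib_mc_map() helpers, not quoted from the drafts), the resulting 20-byte hardware addresses are:

    224.1.2.3  ->  00 | ff ff ff | ff12:401b:0000:0000:0000:0000:0001:0203
    ff02::1    ->  00 | ff ff ff | ff12:601b:0000:0000:0000:0000:0000:0001
                  rsvd  mcast QPN  16-byte MGID (P_Key bytes 8-9 left 0 for the driver to fill in)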
The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-12-19 21:09:54.000000000 -0800 +++ linux-bk/include/net/if_inet6.h 2004-12-19 22:04:17.213162542 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-12-19 21:09:51.000000000 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-12-19 22:04:17.215162248 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1095,6 +1096,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1794,7 +1801,8 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && - (dev->type != ARPHRD_ARCNET)) { + (dev->type != ARPHRD_ARCNET) && + (dev->type != ARPHRD_INFINIBAND)) { /* Alas, we support only Ethernet autoconfiguration. */ return; } --- linux-bk.orig/net/ipv6/ndisc.c 2004-12-19 21:09:20.000000000 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-12-19 22:04:17.216162100 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200412192215.69tnzAhGIT1vQGLF@topspin.com> Message-ID: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. 
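To make the hardware-address handling above concrete, here is a small sketch (an assumption based on the drafts and the mc_map helpers in the previous patches, not code from this patch): the 20-byte address (INFINIBAND_ALEN) carries a QPN and a GID, and the GID still has to be resolved to a full path (LID, SL, rate, ...) by a path record query before an address handle can be built for the send.

    /*
     * Sketch only -- ipoib_example_qpn() is a hypothetical helper.
     *
     *   byte  0      reserved/flags
     *   bytes 1-3    destination QPN
     *   bytes 4-19   destination port GID
     */
    static inline u32 ipoib_example_qpn(const u8 *ha)
    {
            return (((u32) ha[1]) << 16) | (((u32) ha[2]) << 8) | ha[3];
    }

This is why the neighbour's ha field alone is not enough to transmit: the QPN can be pulled straight out of it, but the GID portion feeds the IB-specific path lookup mentioned above.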
The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-19 22:04:14.496562875 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-19 22:04:17.510118781 -0800 @@ -9,4 +9,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-19 22:04:14.472566412 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-19 22:04:17.485122465 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-12-19 22:04:17.559111562 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. + +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. The output can be turned on via the + data_debug_level module parameter; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-12-19 22:04:17.534115245 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-12-19 22:04:17.584107878 -0800 @@ -0,0 +1,350 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib.h 1358 2004-12-17 22:00:11Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_OPER_UP = 0, + IPOIB_FLAG_ADMIN_UP = 1, + IPOIB_PKEY_ASSIGNED = 2, + IPOIB_PKEY_STOP = 3, + IPOIB_FLAG_SUBINTERFACE = 4, + IPOIB_MCAST_RUN = 5, + IPOIB_STOP_REAPER = 6, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). 
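 *
 * Illustration only (not part of the original comment) -- a path that
 * needs both locks takes them in this order:
 *
 *	spin_lock_irqsave(&priv->tx_lock, flags);
 *	spin_lock(&priv->lock);
 *	...
 *	spin_unlock(&priv->lock);
 *	spin_unlock_irqrestore(&priv->tx_lock, flags);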
+ */ +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct rb_root path_tree; + struct list_head path_list; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned int mcast_mtu; + + struct ipoib_buf *rx_ring; + + spinlock_t tx_lock; + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct neighbour *neighbour; + + struct list_head list; +}; + +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) +{ + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +void ipoib_flush_paths(struct net_device *dev); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct 
ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (data_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-12-19 22:04:17.608104342 -0800 @@ -0,0 +1,287 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations 
ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return __ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-12-19 22:04:17.633100658 -0800 @@ -0,0 +1,632 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib_ib.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +int data_debug_level; + +module_param(data_debug_level, int, 0644); +MODULE_PARM_DESC(data_debug_level, + "Enable data path debug tracing if > 0"); +#endif + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, &param, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + 
ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + unsigned int 
wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, &param, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len))) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dma_unmap_single(priv->ca->dma_device, addr, skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + 
+ set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. + */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + return 1; + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) + yield(); + + ipoib_dbg(priv, "All sends and receives done.\n"); + + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = 
netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assignment Interim Support + * + * The following is an initial implementation of a delayed P_Key assignment + * mechanism. It uses the same approach as the one implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. + */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assignment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-12-19 22:04:17.658096974 -0800 @@ -0,0 +1,1084 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ * + * $Id: ipoib_main.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(gid->raw, path->pathrec.dgid.raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, 
tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + LIST_HEAD(remove_list); + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + list_splice(&priv->path_list, &remove_list); + INIT_LIST_HEAD(&priv->path_list); + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &remove_list, list) { + if (path->query) + ib_sa_cancel_query(path->query_id, path->query); + wait_for_completion(&path->done); + __path_free(dev, path); + } +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct net_device *dev = path->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; + struct sk_buff *skb; + unsigned long flags; + + if (pathrec) + ipoib_dbg(priv, "PathRec LID 0x%04x for GID " IPOIB_GID_FMT "\n", + be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + else + ipoib_dbg(priv, "PathRec status %d for GID " IPOIB_GID_FMT "\n", + status, IPOIB_GID_ARG(path->pathrec.dgid)); + + skb_queue_head_init(&skqueue); + + if (!status) { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the path member record, compare it + * with the rate of our local port (calculated from + * the active link speed and link width) and set an + * inter-packet delay appropriately. 
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .static_rate = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(dev, priv->pd, &av); + } + + spin_lock_irqsave(&priv->lock, flags); + + path->ah = ah; + + if (ah) { + path->pathrec = *pathrec; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + while ((skb = __skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } +} + +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + path = kmalloc(sizeof *path, GFP_ATOMIC); + if (!path) + return NULL; + + path->dev = dev; + path->pathrec.dlid = 0; + + skb_queue_head_init(&path->queue); + + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); + + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + + __path_add(dev, path); + + return path; +} + +static int path_rec_start(struct net_device *dev, + struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(path->pathrec.dgid)); + + path->query_id = + ib_sa_path_rec_get(priv->ca, priv->port, + &path->pathrec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &path->query); + if (path->query_id < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + path->query = NULL; + return path->query_id; + } + + return 0; +} + +static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + return; + } + + skb_queue_head_init(&neigh->queue); + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. 
+ */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) + goto err; + } + + list_add_tail(&neigh->list, &path->neigh_list); + + if (path->pathrec.dlid) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + } else { + neigh->ah = NULL; + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + __skb_queue_tail(&neigh->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + if (!path->query && path_rec_start(dev, path)) + goto err; + } + + spin_unlock(&priv->lock); + return; + +err: + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + list_del(&neigh->list); + kfree(neigh); + neigh->neighbour->ops->destructor = NULL; + + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + spin_unlock(&priv->lock); +} + +static void path_lookup(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) { + neigh_add_path(skb, dev); + return; + } + + /* Add in the P_Key for multicasts */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); +} + +static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); + return; + } + + if (path->pathrec.dlid) { + ipoib_dbg(priv, "Send unicast ARP to %04x\n", + be16_to_cpu(path->pathrec.dlid)); + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); + } else if ((path->query || !path_rec_start(dev, path)) && + skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + unsigned long flags; + + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + + /* + * Check if our queue is full. Since we have the LLTX feature + * bit set, we can't rely on netif_stop_queue() preventing our + * xmit function from being called with a full queue. + * + * This is a temporary workaround until LLTX is fixed so that + * hard_start_xmit does not get called after netif_stop_queue(). 
+ */ + if (unlikely(priv->tx_head - priv->tx_tail >= IPOIB_TX_RING_SIZE)) { + ipoib_dbg(priv, "TX ring full in xmit, stopping kernel net queue\n"); + netif_stop_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + return NETDEV_TX_BUSY; + } + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + goto out; + } + + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } else { + /* unicast GID -- should be ARP reply */ + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + goto out; + } + + unicast_arp_send(skb, dev, phdr); + } + } + +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. 
+ */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *n) +{ + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; + + ipoib_dbg(priv, + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); + + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. + */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + 
dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. + */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->path_list); + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_fs: + ipoib_unregister_debugfs(); + +err_wq: + destroy_workqueue(ipoib_workqueue); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-12-19 22:04:17.682093438 -0800 @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in a INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_reg_phys_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq, + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_qp_destroy failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE || + 
record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE) { + ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +} From roland at topspin.com Sun Dec 19 22:15:16 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:16 -0800 Subject: [openib-general] [PATCH][v4][20/24] Add IPoIB multicast & partition code In-Reply-To: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Message-ID: <200412192215.0RFFzMyJ3V7jnOWs@topspin.com> Add functions for handling IPoIB multicast and multiple partitions. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-12-19 22:04:18.140025955 -0800 @@ -0,0 +1,981 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_multicast.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the MC member record, compare it with + * the rate of our local port (calculated from the + * active link speed and link width) and set an + * inter-packet delay appropriately. 
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if (!skb->dst || !skb->dst->neighbour) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof (struct ipoib_pseudoheader)); + } + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + 
IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if (!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, 
priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. 
+ */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. + */ + mcast = NULL; + } + +out: + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } + + spin_unlock(&priv->lock); +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + 
spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark all of the entries that are found or don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct 
net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-12-19 22:04:18.168021829 -0800 @@ -0,0 +1,177 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_vlan.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. + */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Sun Dec 19 22:15:17 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:17 -0800 Subject: [openib-general] [PATCH][v4][21/24] Add InfiniBand userspace MAD support In-Reply-To: <200412192215.0RFFzMyJ3V7jnOWs@topspin.com> Message-ID: <200412192215.96NE5c2UPDke8PZm@topspin.com> Add a driver that provides a character special device for each InfiniBand port. 
This device allows userspace to send and receive MADs via write() and read() (with some control operations implemented as ioctls). All operations are 32/64 clean and have been tested with 32-bit userspace running on a ppc64 kernel. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-19 22:04:14.163611941 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-19 22:04:18.415985288 -0800 @@ -1,6 +1,6 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_umad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -8,3 +8,5 @@ ib_mad-y := mad.o smi.o agent.o ib_sa-y := sa_query.o + +ib_umad-y := user_mad.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/user_mad.c 2004-12-19 22:04:18.442981310 -0800 @@ -0,0 +1,719 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device class_dev; + struct ib_device *ib_dev; + struct ib_umad_device *umad_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct kref ref; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + spinlock_t recv_lock; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct rw_semaphore agent_mutex; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *send_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet = + (void *) (unsigned long) send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf.mad, sizeof packet->mad.data); + packet->mad.status = 0; + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + if (queue_packet(file, agent, packet)) + kfree(packet); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file 
*file = filp->private_data; + struct ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + spin_lock_irq(&file->recv_lock); + + while (list_empty(&file->recv_list)) { + spin_unlock_irq(&file->recv_lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + spin_lock_irq(&file->recv_lock); + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + spin_unlock_irq(&file->recv_lock); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + down_read(&file->agent_mutex); + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + /* XXX handle GRH */ + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = dma_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + DMA_TO_DEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + goto err_up; + } + + up_read(&file->agent_mutex); + + return sizeof packet->mad; + +err_up: + up_read(&file->agent_mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct 
ib_umad_file *file, unsigned long arg) +{ + struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + down_write(&file->agent_mutex); + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + down_write(&file->agent_mutex); + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + spin_lock_init(&file->recv_lock); + init_rwsem(&file->agent_mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_dev(struct class_device *class_dev, char *buf) +{ + struct 
ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return print_dev_t(buf, port->dev.dev); +} +static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_release_dev(struct kref *ref) +{ + struct ib_umad_device *dev = + container_of(ref, struct ib_umad_device, ref); + + kfree(dev); +} + +static void ib_umad_release_port(struct class_device *class_dev) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + cdev_del(&port->dev); + clear_bit(port->devnum, dev_map); + kref_put(&port->umad_dev->ref, ib_umad_release_dev); +} + +static struct class umad_class = { + .name = "infiniband_mad", + .release = ib_umad_release_port +}; + +static ssize_t show_abi_version(struct class *class, char *buf) +{ + return sprintf(buf, "%d\n", IB_USER_MAD_ABI_VERSION); +} +static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + memset(umad_dev, 0, sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port)); + + kref_init(&umad_dev->ref); + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + umad_dev->port[i - s].umad_dev = umad_dev; + kref_get(&umad_dev->ref); + + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev.class = &umad_class; + umad_dev->port[i - s].class_dev.dev = device->dma_device; + snprintf(umad_dev->port[i - s].class_dev.class_id, + BUS_ID_SIZE, "umad%d", umad_dev->port[i - s].devnum); + if (class_device_register(&umad_dev->port[i - s].class_dev)) + goto err_class; + + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_dev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_port)) + goto err_class; + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + 
clear_bit(umad_dev->port[i - s].devnum, dev_map); + +err: + while (--i >= s) + class_device_unregister(&umad_dev->port[i - s].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) + class_device_unregister(&umad_dev->port[i].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static int __init ib_umad_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + ret = class_register(&umad_class); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create class infiniband_mad\n"); + goto out_chrdev; + } + + ret = class_create_file(&umad_class, &class_attr_abi_version); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create abi_version attribute\n"); + goto out_class; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + /* Our ioctls are 32/64 clean */ + ret = register_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT, NULL); + ret |= register_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT, NULL); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ioctl32 conversions\n"); + goto out_client; + } + + return 0; + +out_client: + ib_unregister_client(&umad_client); + +out_class: + class_unregister(&umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + unregister_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT); + unregister_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT); + ib_unregister_client(&umad_client); + class_unregister(&umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_user_mad.h 2004-12-19 22:04:18.470977184 -0800 @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include +#include + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IB_USER_MAD_ABI_VERSION 2 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + */ + +/** + * ib_user_mad - MAD packet + * @data - Contents of MAD + * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + * + * All multi-byte quantities are stored in network (big endian) byte order. + */ +struct ib_user_mad { + __u8 data[256]; + __u32 id; + __u32 status; + __u32 timeout_ms; + __u32 qpn; + __u32 qkey; + __u16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __u32 flow_label; +}; + +/** + * ib_user_mad_reg_req - MAD registration request + * @id - Set by the kernel; used to identify agent in future requests. + * @qpn - Queue pair number; must be 0 or 1. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + * @mgmt_class - Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + */ +struct ib_user_mad_reg_req { + __u32 id; + __u32 method_mask[4]; + __u8 qpn; + __u8 mgmt_class; + __u8 mgmt_class_version; + __u8 oui[3]; +}; + +#define IB_IOCTL_MAGIC 0x1b + +#define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ + struct ib_user_mad_reg_req) + +#define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) + +#endif /* IB_USER_MAD_H */ From roland at topspin.com Sun Dec 19 22:15:18 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:18 -0800 Subject: [openib-general] [PATCH][v4][22/24] Document InfiniBand ioctl use In-Reply-To: <200412192215.96NE5c2UPDke8PZm@topspin.com> Message-ID: <200412192215.XIYuLGzemoQZ9LxY@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. 
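For reference, a minimal userspace sketch (illustrative only, not part of the patch) of how the ioctls built from this 0x1b magic are used; it assumes the ib_user_mad.h header from the previous patch is available on the include path and that udev has created /dev/infiniband/umad0 as described in the user_mad.txt documentation:

    /*
     * Sketch: register a MAD agent for unsolicited Performance Management
     * Get MADs using IB_USER_MAD_REGISTER_AGENT (_IOWR(0x1b, 1, ...)), then
     * drop it again with IB_USER_MAD_UNREGISTER_AGENT (_IOW(0x1b, 2, __u32)).
     * The class/method numbers are standard IBA values chosen for illustration.
     */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include "ib_user_mad.h"        /* assumed to be on the include path */

    int main(void)
    {
            struct ib_user_mad_reg_req req;
            int fd = open("/dev/infiniband/umad0", O_RDWR);  /* node created by udev */

            if (fd < 0) {
                    perror("open umad device");
                    return 1;
            }

            memset(&req, 0, sizeof req);
            req.qpn                = 1;         /* GSI; must be 0 or 1 */
            req.mgmt_class         = 0x04;      /* Performance Management class */
            req.mgmt_class_version = 1;
            req.method_mask[0]     = 1 << 0x01; /* unsolicited Get(0x01) MADs */

            if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req) < 0)
                    perror("agent register");
            else if (ioctl(fd, IB_USER_MAD_UNREGISTER_AGENT, &req.id) < 0)
                    perror("agent unregister");

            close(fd);
            return 0;
    }

The register ioctl fills in req.id; that id is what identifies the agent in the unregister call and in the id field of MADs subsequently sent with write().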
Signed-off-by: Roland Dreier --- linux-bk.orig/Documentation/ioctl-number.txt 2004-12-19 21:09:51.000000000 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-12-19 22:04:19.307853857 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Sun Dec 19 22:15:19 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:19 -0800 Subject: [openib-general] [PATCH][v4][23/24] Add InfiniBand Documentation files In-Reply-To: <200412192215.XIYuLGzemoQZ9LxY@topspin.com> Message-ID: <200412192215.pKjErOfjUaT6gtSk@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-12-19 22:04:20.188724047 -0800 @@ -0,0 +1,56 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net/<interface name>/create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debugfs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in + the data path when data_debug_level is set to 1. However, even with + the output disabled, enabling this configuration option will affect + performance, because it adds tests to the fast path.
+ +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-12-19 22:04:20.415690600 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-12-19 22:04:20.627659363 -0800 @@ -0,0 +1,81 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. 
For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%k" + + can be used. This will create a device node named + + /dev/infiniband/umad0 + + for the first port, and so on. The InfiniBand device and port + associated with this device can be determined from the files + + /sys/class/infiniband_mad/umad0/ibdev + /sys/class/infiniband_mad/umad0/port From roland at topspin.com Sun Dec 19 22:15:20 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:20 -0800 Subject: [openib-general] [PATCH][v4][24/24] InfiniBand MAINTAINERS entry In-Reply-To: <200412192215.pKjErOfjUaT6gtSk@topspin.com> Message-ID: <200412192215.TKOr0u539NEw6MPp@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier --- linux-bk.orig/MAINTAINERS 2004-12-19 21:09:20.000000000 -0800 +++ linux-bk/MAINTAINERS 2004-12-19 22:04:20.988606172 -0800 @@ -1081,6 +1081,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From yoshfuji at linux-ipv6.org Sun Dec 19 22:38:45 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Mon, 20 Dec 2004 15:38:45 +0900 (JST) Subject: [openib-general] Re: [PATCH][v4][0/24] Second InfiniBand merge candidate patch set In-Reply-To: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> References: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> Message-ID: <20041220.153845.70996857.yoshfuji@linux-ipv6.org> In article <200412192214.KlDxQ7icOmxHYIf0 at topspin.com> (at Sun, 19 Dec 2004 22:14:43 -0800), Roland Dreier says: > The following series of patches is the latest version of the OpenIB > InfiniBand drivers. We believe that this version is suitable for > merging when 2.6.11 opens (or into -mm immediately), although of > course we are willing to go through as many more iterations as > required to fix any remaining issues. Maybe, via the net queue. David? 
--yoshfuji From yoshfuji at linux-ipv6.org Sun Dec 19 22:58:36 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Mon, 20 Dec 2004 15:58:36 +0900 (JST) Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> References: <200412192215.69tnzAhGIT1vQGLF@topspin.com> <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Message-ID: <20041220.155836.75677852.yoshfuji@linux-ipv6.org> In article <200412192215.fZX1ZQqQD4QGkKcF at topspin.com> (at Sun, 19 Dec 2004 22:15:14 -0800), Roland Dreier says: > +enum { > + IPOIB_PACKET_SIZE = 2048, > + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, > + > + IPOIB_ENCAP_LEN = 4, > + > + IPOIB_RX_RING_SIZE = 128, > + IPOIB_TX_RING_SIZE = 64, > + > + IPOIB_NUM_WC = 4, > + > + IPOIB_MAX_PATH_REC_QUEUE = 3, > + IPOIB_MAX_MCAST_QUEUE = 3, above entries does not seem to appropriate for enum (than #define). > + > + IPOIB_FLAG_OPER_UP = 0, > + IPOIB_FLAG_ADMIN_UP = 1, > + IPOIB_PKEY_ASSIGNED = 2, > + IPOIB_PKEY_STOP = 3, > + IPOIB_FLAG_SUBINTERFACE = 4, > + IPOIB_MCAST_RUN = 5, > + IPOIB_STOP_REAPER = 6, this seems ok, but are "xxx_FLAG_xxx" entries really flags? > + IPOIB_MAX_BACKOFF_SECONDS = 16, ditto, w/ first one. > + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ > + IPOIB_MCAST_FLAG_SENDONLY = 1, > + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ > + IPOIB_MCAST_FLAG_ATTACHED = 3, seems fine, but are these really flags? --yoshfuji From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][17/24] IPoIB IPv4 multicast In-Reply-To: <200412192215.l62Q9JXNhGg51wOf@topspin.com> Message-ID: <200412192215.vegmgBmv5xungHlQ@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-12-19 22:04:16.867213523 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-12-19 21:09:26.000000000 -0800 +++ linux-bk/include/net/ip.h 2004-12-19 22:04:16.868213376 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. + */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-12-19 21:09:46.000000000 -0800 +++ linux-bk/net/ipv4/arp.c 2004-12-19 22:04:16.868213376 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][18/24] IPoIB IPv6 support In-Reply-To: <200412192215.vegmgBmv5xungHlQ@topspin.com> Message-ID: <200412192215.69tnzAhGIT1vQGLF@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-12-19 21:09:54.000000000 -0800 +++ linux-bk/include/net/if_inet6.h 2004-12-19 22:04:17.213162542 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-12-19 21:09:51.000000000 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-12-19 22:04:17.215162248 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1095,6 +1096,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1794,7 +1801,8 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && - (dev->type != ARPHRD_ARCNET)) { + (dev->type != ARPHRD_ARCNET) && + (dev->type != ARPHRD_INFINIBAND)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; } --- linux-bk.orig/net/ipv6/ndisc.c 2004-12-19 21:09:20.000000000 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-12-19 22:04:17.216162100 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Sun Dec 19 22:15:16 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:16 -0800 Subject: [openib-general] [PATCH][v4][20/24] Add IPoIB multicast & partition code In-Reply-To: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Message-ID: <200412192215.0RFFzMyJ3V7jnOWs@topspin.com> Add functions for handling IPoIB multicast and multiple partitions. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-12-19 22:04:18.140025955 -0800 @@ -0,0 +1,981 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_multicast.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the MC member record, compare it with + * the rate of our local port (calculated from the + * active link speed and link width) and set an + * inter-packet delay appropriately. 
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if (!skb->dst || !skb->dst->neighbour) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof (struct ipoib_pseudoheader)); + } + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + 
IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if (!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, 
priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. 
+ */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. + */ + mcast = NULL; + } + +out: + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } + + spin_unlock(&priv->lock); +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + 
spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark all of the entries that are found or don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct 
net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-12-19 22:04:18.168021829 -0800 @@ -0,0 +1,177 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_vlan.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. + */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200412192215.69tnzAhGIT1vQGLF@topspin.com> Message-ID: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). 
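(As a rough sketch of what those 20 bytes carry, following the encapsulation draft referenced below:

    octet 0        reserved
    octets 1-3     destination QPN
    octets 4-19    port GID (16 bytes)

This is why the multicast mapping helpers in the earlier patches fill in a QPN of all ones and copy the GID starting at offset 4, and why the driver itself works with dev_addr + 4 whenever it needs the port GID.)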
The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-19 22:04:14.496562875 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-19 22:04:17.510118781 -0800 @@ -9,4 +9,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-19 22:04:14.472566412 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-19 22:04:17.485122465 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-12-19 22:04:17.559111562 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. + +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. The output can be turned on via the + data_debug_level module parameter; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-12-19 22:04:17.534115245 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-12-19 22:04:17.584107878 -0800 @@ -0,0 +1,350 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib.h 1358 2004-12-17 22:00:11Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_OPER_UP = 0, + IPOIB_FLAG_ADMIN_UP = 1, + IPOIB_PKEY_ASSIGNED = 2, + IPOIB_PKEY_STOP = 3, + IPOIB_FLAG_SUBINTERFACE = 4, + IPOIB_MCAST_RUN = 5, + IPOIB_STOP_REAPER = 6, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). 
+ */ +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct rb_root path_tree; + struct list_head path_list; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned int mcast_mtu; + + struct ipoib_buf *rx_ring; + + spinlock_t tx_lock; + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct neighbour *neighbour; + + struct list_head list; +}; + +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) +{ + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +void ipoib_flush_paths(struct net_device *dev); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct 
ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (data_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-12-19 22:04:17.608104342 -0800 @@ -0,0 +1,287 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations 
ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return __ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-12-19 22:04:17.633100658 -0800 @@ -0,0 +1,632 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib_ib.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +int data_debug_level; + +module_param(data_debug_level, int, 0644); +MODULE_PARM_DESC(data_debug_level, + "Enable data path debug tracing if > 0"); +#endif + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, ¶m, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + + 
ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + unsigned int 
wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, ¶m, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len))) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dma_unmap_single(priv->ca->dma_device, addr, skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + 
+ set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. + */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + return 1; + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) + yield(); + + ipoib_dbg(priv, "All sends and receives done.\n"); + + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = 
netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assigment Interim Support + * + * The following is initial implementation of delayed P_Key assigment + * mechanism. It is using the same approach implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. + */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assigment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-12-19 22:04:17.658096974 -0800 @@ -0,0 +1,1084 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_main.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(gid->raw, path->pathrec.dgid.raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, 
tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + LIST_HEAD(remove_list); + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + list_splice(&priv->path_list, &remove_list); + INIT_LIST_HEAD(&priv->path_list); + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &remove_list, list) { + if (path->query) + ib_sa_cancel_query(path->query_id, path->query); + wait_for_completion(&path->done); + __path_free(dev, path); + } +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct net_device *dev = path->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; + struct sk_buff *skb; + unsigned long flags; + + if (pathrec) + ipoib_dbg(priv, "PathRec LID 0x%04x for GID " IPOIB_GID_FMT "\n", + be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + else + ipoib_dbg(priv, "PathRec status %d for GID " IPOIB_GID_FMT "\n", + status, IPOIB_GID_ARG(path->pathrec.dgid)); + + skb_queue_head_init(&skqueue); + + if (!status) { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the path member record, compare it + * with the rate of our local port (calculated from + * the active link speed and link width) and set an + * inter-packet delay appropriately. 
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .static_rate = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(dev, priv->pd, &av); + } + + spin_lock_irqsave(&priv->lock, flags); + + path->ah = ah; + + if (ah) { + path->pathrec = *pathrec; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + while ((skb = __skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } +} + +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + path = kmalloc(sizeof *path, GFP_ATOMIC); + if (!path) + return NULL; + + path->dev = dev; + path->pathrec.dlid = 0; + + skb_queue_head_init(&path->queue); + + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); + + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + + __path_add(dev, path); + + return path; +} + +static int path_rec_start(struct net_device *dev, + struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(path->pathrec.dgid)); + + path->query_id = + ib_sa_path_rec_get(priv->ca, priv->port, + &path->pathrec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &path->query); + if (path->query_id < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + path->query = NULL; + return path->query_id; + } + + return 0; +} + +static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + return; + } + + skb_queue_head_init(&neigh->queue); + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. 
+ */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) + goto err; + } + + list_add_tail(&neigh->list, &path->neigh_list); + + if (path->pathrec.dlid) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + } else { + neigh->ah = NULL; + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + __skb_queue_tail(&neigh->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + if (!path->query && path_rec_start(dev, path)) + goto err; + } + + spin_unlock(&priv->lock); + return; + +err: + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + list_del(&neigh->list); + kfree(neigh); + neigh->neighbour->ops->destructor = NULL; + + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + spin_unlock(&priv->lock); +} + +static void path_lookup(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) { + neigh_add_path(skb, dev); + return; + } + + /* Add in the P_Key for multicasts */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); +} + +static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); + return; + } + + if (path->pathrec.dlid) { + ipoib_dbg(priv, "Send unicast ARP to %04x\n", + be16_to_cpu(path->pathrec.dlid)); + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); + } else if ((path->query || !path_rec_start(dev, path)) && + skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + unsigned long flags; + + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + + /* + * Check if our queue is full. Since we have the LLTX feature + * bit set, we can't rely on netif_stop_queue() preventing our + * xmit function from being called with a full queue. + * + * This is a temporary workaround until LLTX is fixed so that + * hard_start_xmit does not get called after netif_stop_queue(). 
+ */ + if (unlikely(priv->tx_head - priv->tx_tail >= IPOIB_TX_RING_SIZE)) { + ipoib_dbg(priv, "TX ring full in xmit, stopping kernel net queue\n"); + netif_stop_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + return NETDEV_TX_BUSY; + } + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + goto out; + } + + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } else { + /* unicast GID -- should be ARP reply */ + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + goto out; + } + + unicast_arp_send(skb, dev, phdr); + } + } + +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. 
+ */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *n) +{ + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; + + ipoib_dbg(priv, + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); + + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. + */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + 
dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. + */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->path_list); + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_fs: + ipoib_unregister_debugfs(); + +err_wq: + destroy_workqueue(ipoib_workqueue); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-12-19 22:04:17.682093438 -0800 @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in a INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_reg_phys_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq, + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_qp_destroy failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE || + 
record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE) { + ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +}

From shaharf at voltaire.com  Mon Dec 20 01:22:45 2004
From: shaharf at voltaire.com (shaharf)
Date: Mon, 20 Dec 2004 11:22:45 +0200
Subject: [openib-general] osm
Message-ID:

> shaharf> Hi all, I took the liberty to commit the user-mode tree
> shaharf> even though it is far from complete or stable. However,
> shaharf> the tree contains several applications and utilities that
> shaharf> may help all of us, even at this preliminary stage.
>
> That's great, better to commit early and get feedback than to work
> privately and end up going the wrong direction.
>
> A couple quick suggestions:
>  - Add copyright/license headers to all your source files
>  - Use GNU autotools instead of writing your own Makefiles
>
>  - Roland

As I am new to open source development, I am not sure what license I should
add. Is the BSD license what we want? The standard GNU license may be
problematic. I kept the Intel-based license in the new opensm files; it looks
good enough to me. Autotools will be used at a later stage. Again, I am not
really used to these tools, so I didn't want to delay opensm because of that.
The current makefiles should do for now.

Shahar

From shaharf at voltaire.com  Mon Dec 20 01:28:15 2004
From: shaharf at voltaire.com (shaharf)
Date: Mon, 20 Dec 2004 11:28:15 +0200
Subject: [openib-general] osm
Message-ID:

>
> One more suggestion: consider using libsysfs (which every modern
> distro will ship) instead of reimplementing it in libcommon/sysfs.c.
>
>  - R.

Thanks - I will do that. I was not aware of this library.

Shahar

From arnd at arndb.de  Mon Dec 20 04:14:35 2004
From: arnd at arndb.de (Arnd Bergmann)
Date: Mon, 20 Dec 2004 13:14:35 +0100
Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver
In-Reply-To: <20041220.155836.75677852.yoshfuji@linux-ipv6.org>
References: <200412192215.69tnzAhGIT1vQGLF@topspin.com>
	<200412192215.fZX1ZQqQD4QGkKcF@topspin.com>
	<20041220.155836.75677852.yoshfuji@linux-ipv6.org>
Message-ID: <200412201314.35502.arnd@arndb.de>

On Monday 20 December 2004 07:58, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> Roland Dreier says:
>
> > +enum {
> > +	IPOIB_PACKET_SIZE         = 2048,
> > +	IPOIB_BUF_SIZE            = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
> > +
> > +	IPOIB_ENCAP_LEN           = 4,
> > +
> > +	IPOIB_RX_RING_SIZE        = 128,
> > +	IPOIB_TX_RING_SIZE        = 64,
> > +
>
> the above entries do not seem appropriate for an enum (rather than #define).

According to Documentation/CodingStyle, it actually is preferred like this.
See also include/linux/ide.h for another example where this is done.
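To make the comparison concrete, here is a purely illustrative sketch of the two styles side by side, reusing the ring-size constants quoted above (the #if 0 is only there so both variants can sit in one compilable file):

	#if 0
	/* Style 1: preprocessor macros -- untyped, and the names are not
	 * visible to the compiler or debugger as real symbols. */
	#define IPOIB_RX_RING_SIZE	128
	#define IPOIB_TX_RING_SIZE	64
	#else
	/* Style 2: anonymous enum -- the enumerators are compile-time
	 * constants that obey C scoping and show up in debug info, which
	 * is the property Documentation/CodingStyle is pointing at for
	 * groups of related integer constants. */
	enum {
		IPOIB_RX_RING_SIZE	= 128,
		IPOIB_TX_RING_SIZE	= 64
	};
	#endif

Either form produces the same object code for these sizes; the difference is only in type information and debuggability.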
Arnd <><

From yoshfuji at linux-ipv6.org  Mon Dec 20 05:17:09 2004
From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / 吉藤英明)
Date: Mon, 20 Dec 2004 22:17:09 +0900 (JST)
Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver
In-Reply-To: <200412201314.35502.arnd@arndb.de>
References: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com>
	<20041220.155836.75677852.yoshfuji@linux-ipv6.org>
	<200412201314.35502.arnd@arndb.de>
Message-ID: <20041220.221709.99112884.yoshfuji@linux-ipv6.org>

In article <200412201314.35502.arnd at arndb.de> (at Mon, 20 Dec 2004 13:14:35 +0100), Arnd Bergmann says:

> On Monday 20 December 2004 07:58, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> > Roland Dreier says:
> >
> > > +enum {
> > > +	IPOIB_PACKET_SIZE         = 2048,
> > > +	IPOIB_BUF_SIZE            = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
> > > +
> > > +	IPOIB_ENCAP_LEN           = 4,
> > > +
> > > +	IPOIB_RX_RING_SIZE        = 128,
> > > +	IPOIB_TX_RING_SIZE        = 64,
> > > +
> >
> > the above entries do not seem appropriate for an enum (rather than #define).
>
> According to Documentation/CodingStyle, it actually is preferred like this.
> See also include/linux/ide.h for another example where this is done.

No, it is not a similar case.

--yoshfuji

From itamar at mellanox.co.il  Mon Dec 20 05:58:16 2004
From: itamar at mellanox.co.il (Itamar Rabenstein)
Date: Mon, 20 Dec 2004 15:58:16 +0200
Subject: [openib-general] [PATCH] initial CM module
Message-ID: <91DB792C7985D411BEC300B40080D29CB47A1A@mtvex01.mtv.mtl.com>

In the gen1 TS CM there was a function, ib_cm_service_assign(), that returns a
free service_id; you could then call ib_cm_listen with this service_id.
In the DAPL implementation we use this when we want to listen on a service_id
that has not been pre-assigned (DAT verb: Dat_Psp_Create_any).

I think the gen2 CM should include something similar. There are two ways to
implement it:

1) Add a new function, as in gen1:

Index: include/ib_cm.h
===================================================================
+/**
+ * ib_cm_service_assign() - return a free service_id
+ */
+u64 ib_cm_service_assign(void);
+

2) Change ib_cm_listen to understand that the request comes without a
service_id; it would then assign a service_id and only then do the listen.
This can be done in many ways, for example if the service_id == 0 (like a
sockets bind).

Itamar Rabenstein
Mellanox Technologies Ltd
mailto : itamar at mellanox.co.il
phone: 972-3-6259506

-----Original Message-----
From: Sean Hefty [ mailto:mshefty at ichips.intel.com ]
Sent: Thursday, December 16, 2004 9:56 PM
To: openib-general at openib.org; halr at voltaire.com
Subject: [openib-general] [PATCH] initial CM module

This patch adds in the initial CM API and module code. The module loads,
unloads, and allocates/deallocates connection structures, but that's about it.

This patch does not include changes needed to Kconfig or the Makefile, since
I'm not sure that it makes sense to change these yet.

I will commit this unless there are any objections.

- Sean

Index: include/ib_cm.h
===================================================================
--- include/ib_cm.h	(revision 0)
+++ include/ib_cm.h	(revision 0)
@@ -0,0 +1,387 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * < http://www.fsf.org/copyleft/gpl.html >, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * < http://openib.org/license.html >. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id$ + */ +#if !defined(IB_CM_H) +#define IB_CM_H + +#include +#include + +enum ib_cm_state { + IB_CM_IDLE, + IB_CM_LISTEN, + IB_CM_REQ_SENT, + IB_CM_REQ_RCVD, + IB_CM_MRA_REQ_SENT, + IB_CM_MRA_REQ_RCVD, + IB_CM_REP_SENT, + IB_CM_REP_RCVD, + IB_CM_MRA_REP_SENT, + IB_CM_MRA_REP_RCVD, + IB_CM_ESTABLISHED, + IB_CM_LAP_SENT, + IB_CM_LAP_RCVD, + IB_CM_MRA_LAP_SENT, + IB_CM_MRA_LAP_RCVD, + IB_CM_DREQ_SENT, + IB_CM_DREQ_RCVD, + IB_CM_TIMEWAIT, + IB_CM_SIDR_REQ_SENT, + IB_CM_SIDR_REQ_RCVD +}; + +enum ib_cm_event_type { + IB_CM_REQ_TIMEOUT, + IB_CM_REQ_RECEIVED, + IB_CM_REP_TIMEOUT, + IB_CM_REP_RECEIVED, + IB_CM_RTU_RECEIVED, + IB_CM_DREQ_TIMEOUT, + IB_CM_DREQ_RECEIVED, + IB_CM_DREP_RECEIVED, + IB_CM_MRA_RECEIVED, + IB_CM_LAP_TIMEOUT, + IB_CM_LAP_RECEIVED, + IB_CM_APR_RECEIVED +}; + +struct ib_cm_event { + /* a whole lot more stuff goes here */ + void *private_data; + u8 private_data_len; + enum ib_cm_event_type event; +}; + +struct ib_cm_id; + +typedef void (*ib_cm_handler)(struct ib_cm_id *cm_id, + struct ib_cm_event *event); + +struct ib_cm_id { + ib_cm_handler cm_handler; + void *context; + u64 service_id; + enum ib_cm_state state; +}; + +/** + * ib_create_cm_id - Allocate a connection identifier. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the connection + * identifier. + * + * Connection identifiers are used to track connection states and + * listen requests. + */ +struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, + void *context); + +/** + * ib_destroy_cm_id - Destroy a connection identifier. + * @cm_id: Connection identifier to destroy. + * + * This call blocks until the connection identifier is destroyed. + */ +int ib_destroy_cm_id(struct ib_cm_id *cm_id); +//*** TBD : add flags to allow calling routine from CM callback... + +/** + * ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. 
+ */ +int ib_cm_listen(struct ib_cm_id *cm_id, + u64 service_id, + u64 service_mask); + +struct ib_cm_req_param { + struct ib_qp *qp; + struct ib_path_record *primary_path; + struct ib_path_record *alternate_path; + u64 service_id; + int timeout_ms; + void *private_data; + u8 private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 remote_cm_response_timeout; + u8 flow_control; + u8 local_cm_response_timeout; + u8 retry_count; + u8 rnr_retry_count; + u8 max_cm_retries; +}; + +/** + * ib_send_cm_req - Sends a connection request to the remote node. + * @cm_id: Connection identifier that will be associated with the + * connection request. + * @param: Connection request information needed to establish the + * connection. + */ +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + +struct ib_cm_rep_param { + struct ib_qp *qp; + void *private_data; + u8 reply_private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 target_ack_delay; + u8 failover_accepted; + u8 flow_control; + u8 rnr_retry_count; +}; + +/** + * ib_send_cm_rep - Sends a connection reply in response to a connection + * request. + * @cm_id: Connection identifier that will be associated with the + * connection request. + * @param: Connection reply information needed to establish the + * connection. + */ +int ib_send_cm_rep(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + +/** + * ib_send_cm_rtu - Sends a connection ready to use message in response + * to a connection reply message. + * @cm_id: Connection identifier associated with the connection request. + * @private_data: Optional user-defined private data sent with the + * ready to use message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_rtu(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_dreq - Sends a disconnection request for an existing + * connection. + * @cm_id: Connection identifier associated with the connection being + * released. + * @private_data: Optional user-defined private data sent with the + * disconnection request message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_dreq(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_drep - Sends a disconnection reply to a disconnection request. + * @cm_id: Connection identifier associated with the connection being + * released. + * @private_data: Optional user-defined private data sent with the + * disconnection reply message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_drep(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_cm_establish - Forces a connection state to established. + * @cm_id: Connection identifier to transition to established. + * + * This routine should be invoked by users who receive messages on a + * connected QP before an RTU has been received. 
+ */ +int ib_cm_establish(struct ib_cm_id *id); + +enum ib_cm_rej_reason { + IB_CM_REJ_NO_QP = __constant_htons(1), + IB_CM_REJ_NO_EEC = __constant_htons(2), + IB_CM_REJ_NO_RESOURCES = __constant_htons(3), + IB_CM_REJ_TIMEOUT = __constant_htons(4), + IB_CM_REJ_UNSUPPORTED = __constant_htons(5), + IB_CM_REJ_INVALID_COMM_ID = __constant_htons(6), + IB_CM_REJ_INVALID_COMM_INSTANCE = __constant_htons(7), + IB_CM_REJ_INVALID_SERVICE_ID = __constant_htons(8), + IB_CM_REJ_INVALID_TRANSPORT_TYPE = __constant_htons(9), + IB_CM_REJ_STALE_CONN = __constant_htons(10), + IB_CM_REJ_RDC_NOT_EXIST = __constant_htons(11), + IB_CM_REJ_INVALID_GID = __constant_htons(12), + IB_CM_REJ_INVALID_LID = __constant_htons(13), + IB_CM_REJ_INVALID_SL = __constant_htons(14), + IB_CM_REJ_INVALID_TRAFFIC_CLASS = __constant_htons(15), + IB_CM_REJ_INVALID_HOP_LIMIT = __constant_htons(16), + IB_CM_REJ_INVALID_PACKET_RATE = __constant_htons(17), + IB_CM_REJ_INVALID_ALT_GID = __constant_htons(18), + IB_CM_REJ_INVALID_ALT_LID = __constant_htons(19), + IB_CM_REJ_INVALID_ALT_SL = __constant_htons(20), + IB_CM_REJ_INVALID_ALT_TRAFFIC_CLASS = __constant_htons(21), + IB_CM_REJ_INVALID_ALT_HOP_LIMIT = __constant_htons(22), + IB_CM_REJ_INVALID_ALT_PACKET_RATE = __constant_htons(23), + IB_CM_REJ_PORT_REDIRECT = __constant_htons(24), + IB_CM_REJ_INVALID_MTU = __constant_htons(26), + IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES = __constant_htons(27), + IB_CM_REJ_CONSUMER_DEFINED = __constant_htons(28), + IB_CM_REJ_INVALID_RNR_RETRY = __constant_htons(29), + IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = __constant_htons(30), + IB_CM_REJ_INVALID_CLASS_VERSION = __constant_htons(31), + IB_CM_REJ_INVALID_FLOW_LABEL = __constant_htons(32), + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = __constant_htons(33) +}; + +/** + * ib_send_cm_rej - Sends a connection rejection message to the + * remote node. + * @cm_id: Connection identifier associated with the connection being + * rejected. + * @reason: Reason for the connection request rejection. + * @ari: Optional additional rejection information. + * @ari_length: Size of the additional rejection information, in bytes. + * @private_data: Optional user-defined private data sent with the + * rejection message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_rej(struct ib_cm_id *cm_id, + enum ib_cm_rej_reason reason, + void *ari, + u8 ari_length, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_mra - Sends a message receipt acknowledgement to a connection + * message. + * @cm_id: Connection identifier associated with the connection message. + * @service_timeout: The maximum time required for the sender to reply to + * to the connection message. + * @private_data: Optional user-defined private data sent with the + * message receipt acknowledgement. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_mra(struct ib_cm_id *cm_id, + u8 service_timeout, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_lap - Sends a load alternate path request. + * @cm_id: Connection identifier associated with the load alternate path + * message. + * @alternate_path: A path record that identifies the alternate path to + * load. + * @private_data: Optional user-defined private data sent with the + * load alternate path message. + * @private_data_len: Size of the private data buffer, in bytes. 
+ */ +int ib_send_cm_lap(struct ib_cm_id *cm_id, + struct ib_path_record *alternate_path, + void *private_data, + u8 private_data_len); + +enum ib_cm_apr_status { + IB_CM_APR_SUCCESS, + IB_CM_APR_INVALID_COMM_ID, + IB_CM_APR_UNSUPPORTED, + IB_CM_APR_REJECT, + IB_CM_APR_REDIRECT, + IB_CM_APR_IS_CURRENT, + IB_CM_APR_INVALID_QPN_EECN, + IB_CM_APR_INVALID_LID, + IB_CM_APR_INVALID_GID, + IB_CM_APR_INVALID_FLOW_LABEL, + IB_CM_APR_INVALID_TCLASS, + IB_CM_APR_INVALID_HOP_LIMIT, + IB_CM_APR_INVALID_PACKET_RATE, + IB_CM_APR_INVALID_SL +}; + +/** + * ib_send_cm_apr - Sends an alternate path response message in response to + * a load alternate path request. + * @cm_id: Connection identifier associated with the alternate path response. + * @status: Reply status sent with the alternate path response. + * @info: Optional additional information sent with the alternate path + * response. + * @info_length: Size of the additional information, in bytes. + * @private_data: Optional user-defined private data sent with the + * alternate path response message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_apr(struct ib_cm_id *cm_id, + enum ib_cm_apr_status status, + void *info, + u8 info_length, + void *private_data, + u8 private_data_len); + +struct ib_cm_sidr_req_param { + struct ib_path_record *path; + u64 service_id; + int timeout_ms; + void *private_data; + u8 private_data_len; + u16 pkey; +}; + +/** + * ib_send_cm_sidr_req - Sends a service ID resolution request to the + * remote node. + * @cm_id: Communication identifier that will be associated with the + * service ID resolution request. + * @param: Service ID resolution request information. + */ +int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param); + +enum ib_cm_sidr_status { + IB_SIDR_SUCCESS, + IB_SIDR_UNSUPPORTED, + IB_SIDR_REJECT, + IB_SIDR_NO_QP, + IB_SIDR_REDIRECT, + IB_SIDR_UNSUPPORTED_VERSION +}; + +struct ib_cm_sidr_rep_param { + u32 qp_num; + u32 qkey; + enum ib_cm_sidr_status status; + void *info; + u8 info_length; + void *private_data; + u8 private_data_len; +}; + +/** + * ib_send_cm_sidr_rep - Sends a service ID resolution request to the + * remote node. + * @cm_id: Communication identifier associated with the received service ID + * resolution request. + * @param: Service ID resolution reply information. + */ +int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param); + +#endif /* IB_CM_H */ Index: core/cm.c =================================================================== --- core/cm.c (revision 0) +++ core/cm.c (revision 0) @@ -0,0 +1,292 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * < http://www.fsf.org/copyleft/gpl.html >, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * < http://openib.org/license.html >. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id$ + */ +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void cm_add_one(struct ib_device *device); +static void cm_remove_one(struct ib_device *device); + +static struct ib_client cm_client = { + .name = "cm", + .add = cm_add_one, + .remove = cm_remove_one +}; + +struct cm_port { + struct ib_mad_agent *mad_agent; +}; + +struct ib_cm_id_private { + struct ib_cm_id id; + + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; +}; + +struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, + void *context) +{ + struct ib_cm_id_private *cm_id_priv; + + cm_id_priv = kmalloc(sizeof *cm_id_priv, GFP_KERNEL); + if (!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->id.service_id = 0; + cm_id_priv->id.state = IB_CM_IDLE; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + + spin_lock_init(&cm_id_priv->lock); + init_waitqueue_head(&cm_id_priv->wait); + atomic_set(&cm_id_priv->refcount, 1); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(ib_create_cm_id); + +static void reset_cm_state(struct ib_cm_id_private *cm_id_priv) +{ + /* reject connections if establishing */ + /* disconnect established connections */ + /* update timewait info */ +} + +int ib_destroy_cm_id(struct ib_cm_id *cm_id) +{ + struct ib_cm_id_private *cm_id_priv; + unsigned long flags; + + cm_id_priv = container_of(cm_id, struct ib_cm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch(cm_id->state) { + case IB_CM_IDLE: + case IB_CM_LISTEN: + case IB_CM_TIMEWAIT: + break; /* Connection is ready to be destroyed. 
*/ + default: + reset_cm_state(cm_id_priv); + break; + } + cm_id->state = IB_CM_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + atomic_dec(&cm_id_priv->refcount); + wait_event(cm_id_priv->wait, + !atomic_read(&cm_id_priv->refcount)); + kfree(cm_id_priv); + return 0; +} +EXPORT_SYMBOL(ib_destroy_cm_id); + +int ib_cm_listen(struct ib_cm_id *cm_id, + u64 service_id, + u64 service_mask) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_cm_listen); + +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_req); + +int ib_send_cm_rep(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rep); + +int ib_send_cm_rtu(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rtu); + +int ib_send_cm_dreq(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_dreq); + +int ib_send_cm_drep(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_drep); + +int ib_cm_establish(struct ib_cm_id *id) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_cm_establish); + +int ib_send_cm_rej(struct ib_cm_id *cm_id, + enum ib_cm_rej_reason reason, + void *ari, + u8 ari_length, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rej); + +int ib_send_cm_mra(struct ib_cm_id *cm_id, + u8 service_timeout, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_mra); + +int ib_send_cm_lap(struct ib_cm_id *cm_id, + struct ib_path_record *alternate_path, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_lap); + +int ib_send_cm_apr(struct ib_cm_id *cm_id, + enum ib_cm_apr_status status, + void *info, + u8 info_length, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_apr); + +int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_sidr_req); + +int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_sidr_rep); + +static void send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct cm_port *port; + + port = (struct cm_port *)mad_agent->context; +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct cm_port *port; + + port = (struct cm_port *)mad_agent->context; +} + +static void cm_add_one(struct ib_device *device) +{ + struct cm_port *port; + struct ib_mad_reg_req reg_req; + u8 i; + + port = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); + if (!port) + goto out; + + memset(®_req, 0, sizeof reg_req); + reg_req.mgmt_class = IB_MGMT_CLASS_CM; + reg_req.mgmt_class_version = 1; + set_bit(IB_MGMT_METHOD_GET, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_SET, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_GET_RESP, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); + for (i = 1; i <= device->phys_port_cnt; i++) { + port[i].mad_agent = ib_register_mad_agent(device, i, + IB_QPT_GSI, + ®_req, + 0, + send_handler, + recv_handler, + &port[i]); + } + +out: + ib_set_client_data(device, &cm_client, port); +} + +static void cm_remove_one(struct ib_device *device) +{ + struct 
cm_port *port; + int i; + + port = (struct cm_port *)ib_get_client_data(device, &cm_client); + if (!port) + return; + + for (i = 1; i <= device->phys_port_cnt; i++) { + if (!IS_ERR(port[i].mad_agent)) + ib_unregister_mad_agent(port[i].mad_agent); + } + kfree(port); +} + +static int __init ib_cm_init(void) +{ + return ib_register_client(&cm_client); +} + +static void __exit ib_cm_cleanup(void) +{ + ib_unregister_client(&cm_client); +} + +module_init(ib_cm_init); +module_exit(ib_cm_cleanup); From sean.hefty at intel.com Mon Dec 20 06:42:40 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 20 Dec 2004 06:42:40 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <91DB792C7985D411BEC300B40080D29CB47A1A@mtvex01.mtv.mtl.com> Message-ID: In the gen1 TS CM there was a function, ib_cm_service_assign(), that returns a free service_id, and then you could call ib_cm_listen with this service_id. In the DAPL implementation we use this when we want to listen on a service_id that is not pre-assigned (DAT verb: Dat_Psp_Create_any). I think the gen2 CM should include something similar. There are 2 ways to implement it: 1) add a new function as it was in gen1 2) change ib_cm_listen to understand that the request comes without a service_id; it will then assign a service_id and only then do the listen. This can be done in many ways, for example: if the service_id == 0 (like sockets bind) Do others feel that this functionality is needed? This seems problematic in that the CM could allocate a service ID that is used/required by an application that has not yet started when this call is made. (I don't recall if service IDs have reserved/non-reserved values similar to sockets, but I don't think that they do.) If this functionality is needed, my initial preference is the second method mentioned. - Sean From mst at mellanox.co.il Mon Dec 20 07:29:24 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 17:29:24 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52u0qlpy1r.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> Message-ID: <20041220152924.GA3746@mellanox.co.il> Hello, Roland! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Re: IPoIB Failure CQ overrun": > Robert> 0000:04:00.0: CQ overrun on CQN 00000082 > > This appears to be an issue with the latest FW (I see it with Tavor FW > 3.3.1 but not 3.2.0). I am working with Mellanox on finding out > whether it's a FW bug or a problem with mthca. > > For now you can work around it by changing > > IPOIB_NUM_WC = 4, > > to > > IPOIB_NUM_WC = 1, > > in ipoib.h. > > - Roland In investigating this issue I discovered what I believe is a race condition in mthca: mthca_poll_one decrements the qp->cur counter. This will make it possible, once mthca_poll_one returns, to post new wqes on the qp. The qp lock is then dropped, and finally the cq consumer index is incremented. However, if you try to post send wqes after qp->cur was decremented but before the consumer index is updated, the post will succeed, and the cq will overrun. The simplest solution is to update the cq consumer index before the qp is unlocked.
A patch is attached. I also would like to suggest implementing CQ doorbell coalescing in mthca, to reduce the number of CQ doorbells. Unfortunately this patch does not seem to solve the overrun problem, so may be another problem. That will need more looking into. Thanks, MST -------------- next part -------------- Index: hw/mthca/mthca_cq.c =================================================================== --- hw/mthca/mthca_cq.c (revision 1362) +++ hw/mthca/mthca_cq.c (working copy) @@ -412,6 +412,11 @@ if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { if (*cur_qp) { + if (*freed) { + wmb(); + inc_cons_index(dev, cq, *freed); + *freed = 0; + } spin_unlock(&(*cur_qp)->lock); if (atomic_dec_and_test(&(*cur_qp)->refcount)) wake_up(&(*cur_qp)->wait); @@ -529,16 +534,17 @@ break; } + if (freed) { + wmb(); + inc_cons_index(dev, cq, freed); + } + if (qp) { spin_unlock(&qp->lock); if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); } - if (freed) { - wmb(); - inc_cons_index(dev, cq, freed); - } spin_unlock_irqrestore(&cq->lock, flags); From roland at topspin.com Mon Dec 20 07:36:47 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 07:36:47 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: (Sean Hefty's message of "Mon, 20 Dec 2004 06:42:40 -0800") References: Message-ID: <52ekhlkmls.fsf@topspin.com> Sean> Do others feel that this functionality is needed? This Sean> seems problematic in that the CM could allocate a service ID Sean> that is used/required by an application that has not yet Sean> started when this call is made. (I don't recall if service Sean> IDs have a reserved/non-reserved values similar to sockets, Sean> but I don't think that they do.) If this functionality is Sean> needed, my initial preference is the second method Sean> mentioned. Yes, some mechanism of allocating unique "local OS administered" service IDs (see section A.3.2.3.3 of the spec) is required. I don't see any problem with allocating IDs before an application is listening. This is no different from the fact that, say, the ROM repository service ID is always assigned, but there may not be any service listening for connections. All that is required for correct operation is for an application to start listening on its assigned service IDs before publishing those IDs. Or did I misunderstand and you were just worried about possible service ID collisions? In that case there is no problem, all service IDs starting with 0x02 are reserved for this "local OS" allocation method and may be freely assigned by the CM. - R. From roland at topspin.com Mon Dec 20 07:38:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 07:38:54 -0800 Subject: [openib-general] osm In-Reply-To: (shaharf@voltaire.com's message of "Mon, 20 Dec 2004 11:22:45 +0200") References: Message-ID: <52acs9kmi9.fsf@topspin.com> shaharf> As I am new to open source development so I am not sure shaharf> what license I should add. Is BSD license what we want? shaharf> Standard GNU license may be problematic. I kept the Intel shaharf> based license in the new opensm files. It looks good shaharf> enough for me. Yes, we should not change existing licenses for code such as opensm. For new code we should use the standard OpenIB dual GPL/BSD license. You can copy the copyright/license comment from the top of any kernel source file. 
- Roland From roland at topspin.com Mon Dec 20 07:42:47 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 07:42:47 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <20041220152924.GA3746@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Dec 2004 17:29:24 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> Message-ID: <52652xkmbs.fsf@topspin.com> Michael> In investigating this issue I discovered what I belive is Michael> a race condition in mthca: Thanks, good catch. I'll apply your patch. In the future can you add a Signed-off-by: line to your patches? Michael> I also would like to suggest implementing CQ doorbell Michael> coalescing in mthca, to reduce the number of CQ Michael> doorbells. Sounds like a good idea... Michael> Unfortunately this patch does not seem to solve the Michael> overrun problem, so may be another problem. That will Michael> need more looking into. OK. At this point do you think it's a FW problem or a driver problem? Thanks, Roland From mst at mellanox.co.il Mon Dec 20 08:01:46 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 18:01:46 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52652xkmbs.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> Message-ID: <20041220160146.GB3746@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Michael> In investigating this issue I discovered what I belive is > Michael> a race condition in mthca: > > Thanks, good catch. I'll apply your patch. In the future can you add > a Signed-off-by: line to your patches? Sorry,I forgot it. Here it is for the last patch: Signed-off-by: Michael S. Tsirkin > Michael> I also would like to suggest implementing CQ doorbell > Michael> coalescing in mthca, to reduce the number of CQ > Michael> doorbells. > > Sounds like a good idea... > > Michael> Unfortunately this patch does not seem to solve the > Michael> overrun problem, so may be another problem. That will > Michael> need more looking into. > > OK. At this point do you think it's a FW problem or a driver problem? > > Thanks, > Roland CQ consumer index doorbell FW is reasonably well tested with VAPI (and with directed tests). It is also relatively straight-forward code so I would suspect a driver problem first of all. Unfortunately once the overrun happends I can not bring the interface down nor unload the ip over ib module (both commands hang) so I have to reboot. This is slowing me down considerably. Do you have an idea why is that, and how to fix this problem? MST From roland at topspin.com Mon Dec 20 08:06:24 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 08:06:24 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <20041220160146.GB3746@mellanox.co.il> (Michael S. 
Tsirkin's message of "Mon, 20 Dec 2004 18:01:46 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> Message-ID: <52wtvdj6nz.fsf@topspin.com> Michael> CQ consumer index doorbell FW is reasonably well tested Michael> with VAPI (and with directed tests). It is also Michael> relatively straight-forward code so I would suspect a Michael> driver problem first of all. Fair enough but the behavior changed from FW version 3.2 to 3.3.1 which is interesting as well. Michael> Unfortunately once the overrun happends I can not bring Michael> the interface down nor unload the ip over ib module (both Michael> commands hang) so I have to reboot. This is slowing me Michael> down considerably. Do you have an idea why is that, and Michael> how to fix this problem? Probably IPoIB is stuck in the loop /* Wait for all sends and receives to complete */ while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) yield(); in ipoib_ib_dev_stop(), since some of completions it's waiting for are lost because of the CQ overrun. I'll add a timeout here where we give up and assume everything is done. - R. From mst at mellanox.co.il Mon Dec 20 08:16:28 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 18:16:28 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52wtvdj6nz.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> Message-ID: <20041220161628.GC3746@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Michael> CQ consumer index doorbell FW is reasonably well tested > Michael> with VAPI (and with directed tests). It is also > Michael> relatively straight-forward code so I would suspect a > Michael> driver problem first of all. > > Fair enough but the behavior changed from FW version 3.2 to 3.3.1 > which is interesting as well. I know but races are always tricky, could be just a timing issue. Its just that CI doorbells are routinely stressed here by QA. > Michael> Unfortunately once the overrun happends I can not bring > Michael> the interface down nor unload the ip over ib module (both > Michael> commands hang) so I have to reboot. This is slowing me > Michael> down considerably. Do you have an idea why is that, and > Michael> how to fix this problem? > > Probably IPoIB is stuck in the loop > > /* Wait for all sends and receives to complete */ > while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) > yield(); > > in ipoib_ib_dev_stop(), since some of completions it's waiting for are > lost because of the CQ overrun. I'll add a timeout here where we give > up and assume everything is done. > > - R. Let me know when you do. But why wait? Once you close the CQ, and get the command interface event of the hw2sw cq, it is guaranteed you wont get any new cqes or events on this cq. Alternatively, you can do a query cq to check its not in overrun, although it seems like working around a specific problem we see. 
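(For illustration only: the bounded wait Roland describes for ipoib_ib_dev_stop() might look roughly like the sketch below, reusing the priv->tx_head/tx_tail counters and the recvs_pending() helper from the loop quoted above; the exact form is whatever gets committed.)

	unsigned long begin = jiffies;

	/* Wait for all sends and receives to complete, but give up after
	 * a few seconds so a lost completion cannot hang module removal. */
	while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) {
		if (time_after(jiffies, begin + 5 * HZ)) {
			ipoib_warn(priv, "timing out; %d sends %d receives not completed\n",
				   priv->tx_head - priv->tx_tail, recvs_pending(dev));
			break;
		}
		yield();
	}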
MST From arnd at arndb.de Mon Dec 20 08:10:03 2004 From: arnd at arndb.de (Arnd Bergmann) Date: Mon, 20 Dec 2004 17:10:03 +0100 Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041220.221709.99112884.yoshfuji@linux-ipv6.org> References: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> <200412201314.35502.arnd@arndb.de> <20041220.221709.99112884.yoshfuji@linux-ipv6.org> Message-ID: <200412201710.04025.arnd@arndb.de> On Maandag 20 Dezember 2004 14:17, YOSHIFUJI Hideaki / 吉藤英明 wrote: > In article <200412201314.35502.arnd at arndb.de> (at Mon, 20 Dec 2004 13:14:35 +0100), Arnd Bergmann says: > > See also include/linux/ide.h for another example where this is done. > > No, it is not the similar case. Sorry, it was a typo on my side, I meant include/linux/ata.h. Arnd <>< From roland at topspin.com Mon Dec 20 08:41:17 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 08:41:17 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <20041220161628.GC3746@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Dec 2004 18:16:28 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <20041220161628.GC3746@mellanox.co.il> Message-ID: <52r7lkkjma.fsf@topspin.com> Michael> I know but races are always tricky, could be just a Michael> timing issue. Its just that CI doorbells are routinely Michael> stressed here by QA. The thing that really makes it hard for me to think of a potential driver problem is that changing from updating the CI all at once to updating it by 1 at a time in a loop fixes things for me. If anything this lengthens the amount of time during which the CQ has too little space. Also, adding 1000 extra entries to the CQ created by IPoIB -- i.e. changing the code in ipoib_verbs.c to priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1000); still has the same problem, so we're not just transiently overrunning by 1 or something like that -- it looks like we're systematically losing updates to the CQ. Michael> Let me know when you do. But why wait? Once you close Michael> the CQ, and get the command interface event of the hw2sw Michael> cq, it is guaranteed you wont get any new cqes or events Michael> on this cq. OK, it's done. The reason for the wait here is that we are actually cleaning up the QP and want to make sure that we don't leak any resources. First we transition the QP to error, wait for all work requests to complete, and then transition the QP to reset. - Roland From mst at mellanox.co.il Mon Dec 20 08:46:01 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 18:46:01 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52r7lkkjma.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <52r7lkkjma.fsf@topspin.com> Message-ID: <20041220164601.GD3746@mellanox.co.il> Hello! Quoting r.
Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Michael> I know but races are always tricky, could be just a > Michael> timing issue. Its just that CI doorbells are routinely > Michael> stressed here by QA. > > The thing that really makes it hard to for to think of a potential > driver problem is that changing from updating the CI all at once to > updating it by 1 at a time in a loop fixes things for me. If anything > this lengthens the amount of time during which the CQ has too little > space. > > Also adding a 1000 extra entries to the CQ created by IPoIB -- ie > changing the code in ipoib_verbs.c to > > priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, > IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1000); > > still has the same problem, so we're not just transiently overrunning > by 1 or something like that -- it looks like we're systematically > losing updates to the CQ. > > Michael> Let me know when you do. But why wait? Once you close > Michael> the CQ, and get the command interface event of the hw2sw > Michael> cq, it is guaranteed you wont get any new cqes or events > Michael> on this cq. > > OK, it's done. The reason for the wait here is that we are actually > cleaning up the QP and want to make sure that we don't leak any > resources. First we transition the QP to error, wait for all work > requests to complete, and then transition the QP to reset. > > - Roland But why wait for completion? Once QP is in error no new WQEs will be processed by hardware. You can close the CQ and free all of them. MST From roland at topspin.com Mon Dec 20 08:45:49 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 08:45:49 -0800 Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041220.155836.75677852.yoshfuji@linux-ipv6.org> (YOSHIFUJI Hideaki's message of "Mon, 20 Dec 2004 15:58:36 +0900 (JST)") References: <200412192215.69tnzAhGIT1vQGLF@topspin.com> <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> <20041220.155836.75677852.yoshfuji@linux-ipv6.org> Message-ID: <52is6wkjeq.fsf@topspin.com> YOSHIFUJI> above entries does not seem to appropriate for enum YOSHIFUJI> (than #define). As Arnd mentioned, I thought enum values were preferred to using the preprocessor. What's the advantage of converting to macros (which have no type, are invisible to the compiler, etc)? Thanks, Roland From roland at topspin.com Mon Dec 20 08:50:58 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 08:50:58 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <20041220164601.GD3746@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Dec 2004 18:46:01 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <20041220161628.GC3746@mellanox.co.il> <52r7lkkjma.fsf@topspin.com> <20041220164601.GD3746@mellanox.co.il> Message-ID: <52ekhkkj65.fsf@topspin.com> Michael> But why wait for completion? Once QP is in error no new Michael> WQEs will be processed by hardware. You can close the CQ Michael> and free all of them. We can't destroy the CQ before we destroy the associated QP. And we don't want to destroy the associated QP (for various reasons we prefer using the same QP if someone ifconfigs down and then up their IPoIB interface). 
In any case if we destroyed the QP without waiting for completions we wouldn't know which requests completed successfully and which were flushed. - R. From mst at mellanox.co.il Mon Dec 20 09:04:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 19:04:51 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52ekhkkj65.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <20041220161628.GC3746@mellanox.co.il> <52r7lkkjma.fsf@topspin.com> <20041220164601.GD3746@mellanox.co.il> <52ekhkkj65.fsf@topspin.com> Message-ID: <20041220170451.GA4011@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Michael> But why wait for completion? Once QP is in error no new > Michael> WQEs will be processed by hardware. You can close the CQ > Michael> and free all of them. > > We can't destroy the CQ before we destroy the associated QP. And we > don't want to destroy the associated QP (for various reasons we prefer > using the same QP if someone ifconfigs down and then up their IPoIB > interface). I see. Hmm. But not if you are bringing the module down, right? So if the module is going down, can't you do this then? Maybe we just need a way to do "reset cq" which will do hardware to software and back. > In any case if we destroyed the QP without waiting for completions we > wouldn't know which requests completed successfully and which were flushed. > > - R. You free them either way. I think it's only because you keep the old QP and the old CQ. mst From roland at topspin.com Mon Dec 20 09:55:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 09:55:34 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52r7lkkjma.fsf@topspin.com> (Roland Dreier's message of "Mon, 20 Dec 2004 08:41:17 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <20041220161628.GC3746@mellanox.co.il> <52r7lkkjma.fsf@topspin.com> Message-ID: <52brcoj1m1.fsf@topspin.com> From adding some more dumping of CQ state, what _may_ be happening is that under rare conditions the HCA's CQ consumer index gets incremented by 1 too many. Then when the CQ is completely empty it will look full to the HW and we'll get an overrun for the next CQE. (I saw it happen after ~300K increments of the CQ's CI, ~160K of which were for >1) I didn't see how the driver could be doing this, since the HCA ended up with a CI that was one more than the number of increments that the driver did. Also, converting all of the increment CI dbells to only increment by 1 fixes the problem, which is more evidence of a FW glitch. Thanks, Roland From jjengla at sandia.gov Mon Dec 20 16:21:58 2004 From: jjengla at sandia.gov (Josh England) Date: Mon, 20 Dec 2004 16:21:58 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <52ekhry2yy.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> Message-ID: <1103588518.22372.21.camel@localhost> Roland, Alright...I'm not sure what's going on.
I just now got a chance to test this out and it still isn't working. I've tried with the latest from SVN as well as the exact same kernel/rootFS you were using, and I still can't ping between nodes. I'm just doing 'modprobe ib_mthca; modprobe ib_ipoib; ifconfig ...; ping ...' on the same two nodes. Am I missing something? -JE On Wed, 2004-12-15 at 07:51 -0800, Roland Dreier wrote: > OK, I think the following change (already committed) should fix IPoIB > with PCI Express HCAs. Thanks to Josh England for giving me access to > a couple of his machines for debugging, and to Tziporet Koren for > pointing out this Arbel "feature." > > - Roland > > Index: infiniband/hw/mthca/mthca_av.c > =================================================================== > --- infiniband/hw/mthca/mthca_av.c (revision 1331) > +++ infiniband/hw/mthca/mthca_av.c (working copy) > @@ -97,6 +97,9 @@ > cpu_to_be32((ah_attr->grh.traffic_class << 20) | > ah_attr->grh.flow_label); > memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); > + } else { > + /* Arbel workaround -- low byte of GID must be 2 */ > + av->dgid[3] = cpu_to_be32(2); > } > > if (0) { > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Tue Dec 21 01:46:12 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 21 Dec 2004 11:46:12 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> I'm a bit ill, expect to work on it tomorrow. Could you post the patch with these dumps? > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Mon, December 20, 2004 7:56 PM > To: Michael S. Tsirkin > Cc: openib-general at openib.org > Subject: Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun > > > >From adding some more dumping of CQ state, what _may_ be happening is > that under rare conditions the HCA's CQ consumer index gets > incremented by 1 too many. Then when the CQ is completely empty it > will look full to the HW and we'll get an overrun for the next CQE. > (I saw it happen after ~300K increments of the CQ's CI, ~160K of which > were for >1) > > I didn't see how the driver could be doing this, since the HCA ended > up with a CI that was one more than the number of increments that the > driver did. Also, converting all of the increment CI dbells to only > increment by 1 fixes the problem, which is more evidence of a > FW glitch. > > Thanks, > Roland > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Dec 21 06:42:45 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 21 Dec 2004 16:42:45 +0200 Subject: [openib-general] PDR ARs and Notes Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B1A@taurus.voltaire.com> Hi, Attached are the ARs I recorded (categorized as related to PF contract or other) and my notes from yesterday's PDR. I took the liberty of assigning the ARs :-) Let me know if I missed anything or whether there are any other comments or questions. Thanks. -- Hal -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: PDR ARs.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: PDR Notes.txt URL: From roland at topspin.com Tue Dec 21 08:02:01 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 21 Dec 2004 08:02:01 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1103588518.22372.21.camel@localhost> (Josh England's message of "Mon, 20 Dec 2004 16:21:58 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103588518.22372.21.camel@localhost> Message-ID: <527jnbhc7a.fsf@topspin.com> Josh> Roland, Alright...I'm not sure whats going on. I just now Josh> got a chance to test this out and it still isn't working. Josh> I've tried with the latest from SVN as well as the exact Josh> same kernel/rootFS you were using, and I still can't ping Josh> between nodes. I'm just doing 'modprobe ib_mthca; modprobe Josh> ib_ipoib; ifconfig ...; ping ...' on the same two nodes. Am Josh> I missing something? That should work. I definitely was able to ping earlier with my kernel/rootFS, but maybe that was depending on some previous state. You could go through Hal's IPoIB debugging FAQ, or if you reenable my account I should be able to take a look today. - R. From halr at voltaire.com Tue Dec 21 09:07:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 21 Dec 2004 19:07:03 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B2B@taurus.voltaire.com> Hi Josh, [You wrote:] I've tried with the latest from SVN as well as the exact same kernel/rootFS you were using, and I still can't ping between nodes. I'm just doing 'modprobe ib_mthca; modprobe ib_ipoib; ifconfig ...; ping ...' on the same two nodes. Am I missing something? Anything in /var/log/messages on any of the nodes ? Is the ping directed at a specific IP address or to the subnet broadcast address ? Is this PCIe nodes or PCIX ones or both ? Are you at 4.6.1 now on all your PCIe HCAs ? Not sure whether 4.5.3 does indeed work but it sounds like something has changed for the worse since Roland had it working. Thanks. -- Hal From sean.hefty at intel.com Tue Dec 21 09:26:12 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 21 Dec 2004 09:26:12 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <52ekhlkmls.fsf@topspin.com> Message-ID: >Yes, some mechanism of allocating unique "local OS administered" >service IDs (see section A.3.2.3.3 of the spec) is required. Thanks, that's what I was looking for in the spec, but couldn't find it. >Or did I misunderstand and you were just worried about possible >service ID collisions? In that case there is no problem, all service >IDs starting with 0x02 are reserved for this "local OS" allocation >method and may be freely assigned by the CM. I was worried about possible service ID collisions. I will update the API to allow this, with the plan to use the existing listen call. - Sean From robert.j.woodruff at intel.com Tue Dec 21 09:32:23 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 21 Dec 2004 09:32:23 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00031BCB8B@orsmsx408> >Are you at 4.6.1 now on all your PCIe HCAs ? Not sure whether 4.5.3 does indeed work >but it sounds like something has changed for the worse since Roland had it working. >Thanks. >-- Hal I had problems with the 4.3.5 firmware and it seemed to work with my early version of the 4.6.0-rc4 firmware. 
I was able to successfully run MPI jobs over IPoIB over the week-end without any issues. I was using the 1359 version, plus the one tweak to ipoib.h that I reported last friday. I will also try the latest released 4.6.1 firmware when I get back in the office. woody From roland at topspin.com Tue Dec 21 11:24:22 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 21 Dec 2004 11:24:22 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> (Michael S. Tsirkin's message of "Tue, 21 Dec 2004 11:46:12 +0200") References: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> Message-ID: <52oegnfo9l.fsf@topspin.com> Michael> I'm a bit ill, expect to work on it tomorrow. Could you Michael> post the patch with these dumps? The patch is below. Testing on PCI Express/Arbel systems (with dual 3.2 GHz Xeons) with FW 4.5.3, I've seen somewhat different behavior. Even after setting IPOIB_NUM_WC to 1 so that IPoIB never polls more than 1 CQE at a time (and so we always inc the CI by exactly 1), I still get the CQ overrun. The debug patch produces this output: ib_mthca 0000:02:00.0: CQ overrun on CQN 00000082 ib0: timing out; 3 sends 128 receives not completed divert: no divert_blk to free, ib0 not ethernet context for CQN 82 cons_index 7f, nfrees 17d47f, ndfrees 0 [ 0] 90040000 [ 4] 00000000 [ 8] 00000000 [ c] e8000001 [10] 00000002 [14] 00000001 [18] 02000000 [1c] 00000023 [20] 0000007d [24] 7fffffff [28] 00000080 [2c] 0000007f [30] f8000082 [34] 002483b4 [38] 00000001 [3c] 00000000 You can see that the HW's consumer index (at offset 0x28) is 0x80, one more than the driver thinks it should be. Even more interesting is this dump from the other system on the other end of this netpipe run: context for CQN 82 cons_index 81, nfrees 17d481, ndfrees 0 [ 0] 00040100 [ 4] 00000000 [ 8] 00000000 [ c] e8000001 [10] 00000002 [14] 00000001 [18] 02000000 [1c] 00000023 [20] 00000080 [24] 00000080 [28] 000000a7 [2c] 00000081 [30] f8000082 [34] 002483b4 [38] 00000001 [3c] 00000000 Here the HW CI is 0xa7, way higher than the driver's value. It looks like we avoided the CQ overrun because the CI got bumped past the correct value more than once. 
Thanks, Roland Index: hw/mthca/mthca_provider.h =================================================================== --- hw/mthca/mthca_provider.h (revision 1370) +++ hw/mthca/mthca_provider.h (working copy) @@ -138,6 +138,8 @@ int cqn; int cons_index; int is_direct; + int nfrees; + int ndfrees; union { struct mthca_buf_list direct; struct mthca_buf_list *page_list; Index: hw/mthca/mthca_cq.c =================================================================== --- hw/mthca/mthca_cq.c (revision 1370) +++ hw/mthca/mthca_cq.c (working copy) @@ -535,6 +535,8 @@ } if (freed) { + cq->nfrees += freed; + cq->ndfrees += freed - 1; wmb(); inc_cons_index(dev, cq, freed); } @@ -706,6 +708,7 @@ spin_unlock_irq(&dev->cq_table.lock); cq->cons_index = 0; + cq->nfrees = cq->ndfrees = 0; kfree(dma_list); kfree(mailbox); @@ -764,11 +767,13 @@ mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", status); - if (0) { + if (1) { u32 *ctx = MAILBOX_ALIGN(mailbox); int j; printk(KERN_ERR "context for CQN %x\n", cq->cqn); + printk(KERN_ERR "cons_index %x, nfrees %x, ndfrees %x\n", + cq->cons_index, cq->nfrees, cq->ndfrees); for (j = 0; j < 16; ++j) printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); } From eric.lemoine at gmail.com Wed Dec 22 10:49:48 2004 From: eric.lemoine at gmail.com (Eric Lemoine) Date: Wed, 22 Dec 2004 19:49:48 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103484675.1050.158.camel@jzny.localdomain> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> Message-ID: <5cac192f04122210491d64d4b6@mail.gmail.com> On 19 Dec 2004 14:31:15 -0500, jamal wrote: > On Sat, 2004-12-18 at 00:44, David S. Miller wrote: > > > Perhaps one way to fix this is to add a pointer to a spinlock to > > the netdev struct, and have hold that the upper level grab that > > when NETIF_F_LLTX when doing queue state checks. Actually, that > > could end up being racy too. > > How about releasing the qlock only when the LLTX transmit lock is > grabbed? That should bring it to par with what it was originally. I dont like the idea of releasing inside the driver a lock taken outside. That might be just me... Instead, I would suggest to have LLTX drivers check whether queue is stopped after they grab their private tx lock and before they check tx ring fullness. That way we close the race window but keep the driver bug check around. See attached sungem patch. -- Eric -------------- next part -------------- A non-text attachment was scrubbed... Name: sungem-lltx.patch Type: application/octet-stream Size: 511 bytes Desc: not available URL: From roland at topspin.com Wed Dec 22 12:29:12 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 22 Dec 2004 12:29:12 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52oegnfo9l.fsf@topspin.com> (Roland Dreier's message of "Tue, 21 Dec 2004 11:24:22 -0800") References: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> <52oegnfo9l.fsf@topspin.com> Message-ID: <52r7licc13.fsf@topspin.com> Roland> Testing on PCI Express/Arbel systems (with dual 3.2 GHz Roland> Xeons) with FW 4.5.3, I've seen somewhat different Roland> behavior. Even after setting IPOIB_NUM_WC to 1 so that Roland> IPoIB never polls more than 1 CQE at a time (and so we Roland> always inc the CI by exactly 1), I still get the CQ Roland> overrun. 
FW 4.6.1 on these systems behaves the way Tavor FW 3.3.1 did on my systems: if I modify IPoIB to always poll 1 CQE at a time, it works fine. If IPoIB polls more than 1 CQE at a time then we see an overrun. - Roland From mst at mellanox.co.il Wed Dec 22 13:20:38 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 22 Dec 2004 23:20:38 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52r7licc13.fsf@topspin.com> References: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> <52oegnfo9l.fsf@topspin.com> <52r7licc13.fsf@topspin.com> Message-ID: <20041222212037.GB22913@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Roland> Testing on PCI Express/Arbel systems (with dual 3.2 GHz > Roland> Xeons) with FW 4.5.3, I've seen somewhat different > Roland> behavior. Even after setting IPOIB_NUM_WC to 1 so that > Roland> IPoIB never polls more than 1 CQE at a time (and so we > Roland> always inc the CI by exactly 1), I still get the CQ > Roland> overrun. > > FW 4.6.1 on these systems behaves the way Tavor FW 3.3.1 did on my > systems: if I modify IPoIB to always poll 1 CQE at a time, it works > fine. If IPoIB polls more than 1 CQE at a time then we see an > overrun. > > - Roland Thanks, Roland! I'll be working on it. mst From davem at davemloft.net Wed Dec 22 20:29:19 2004 From: davem at davemloft.net (David S. Miller) Date: Wed, 22 Dec 2004 20:29:19 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <5cac192f04122210491d64d4b6@mail.gmail.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> Message-ID: <20041222202919.057b8331.davem@davemloft.net> On Wed, 22 Dec 2004 19:49:48 +0100 Eric Lemoine wrote: > Instead, I would suggest to have LLTX drivers check whether queue is > stopped after they grab their private tx lock and before they check tx > ring fullness. That way we close the race window but keep the driver > bug check around. > > See attached sungem patch. That sounds about right. Nice idea. It solves the race, and retains the error state check. I'll apply Eric's patch, and do something similar in the other LLTX drivers (except loopback which has not "queue" per se so doesn't need this stuff). From roland at topspin.com Wed Dec 22 20:38:34 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 22 Dec 2004 20:38:34 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <20041222202919.057b8331.davem@davemloft.net> (David S. Miller's message of "Wed, 22 Dec 2004 20:29:19 -0800") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> Message-ID: <52pt11bpdh.fsf@topspin.com> David> That sounds about right. Nice idea. It solves the race, David> and retains the error state check. Great, I've made a similar change to the IP-over-IB driver. Thanks, Roland From tziporet at mellanox.co.il Thu Dec 23 01:07:14 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 23 Dec 2004 11:07:14 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <506C3D7B14CDD411A52C00025558DED6064BEC29@mtlex01.yok.mtl.com> Hi, We found an issue with the FW that causing this overrun. 
The bug happened when incrementing the CQ consumer index by more than 1 while a CQE is written at the same time on this CQ. This bug is the same in the 3.3.1 and 4.6.1 versions. A FW with a fix will be provided next week. Tziporet -----Original Message----- From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: Wednesday, December 22, 2004 11:21 PM To: Roland Dreier Cc: openib-general at openib.org Subject: Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Roland> Testing on PCI Express/Arbel systems (with dual 3.2 GHz > Roland> Xeons) with FW 4.5.3, I've seen somewhat different > Roland> behavior. Even after setting IPOIB_NUM_WC to 1 so that > Roland> IPoIB never polls more than 1 CQE at a time (and so we > Roland> always inc the CI by exactly 1), I still get the CQ > Roland> overrun. > > FW 4.6.1 on these systems behaves the way Tavor FW 3.3.1 did on my > systems: if I modify IPoIB to always poll 1 CQE at a time, it works > fine. If IPoIB polls more than 1 CQE at a time then we see an > overrun. > > - Roland Thanks, Roland! I'll be working on it. mst _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.lemoine at gmail.com Thu Dec 23 01:10:34 2004 From: eric.lemoine at gmail.com (Eric Lemoine) Date: Thu, 23 Dec 2004 10:10:34 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <20041222202919.057b8331.davem@davemloft.net> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> Message-ID: <5cac192f0412230110628749e3@mail.gmail.com> On Wed, 22 Dec 2004 20:29:19 -0800, David S. Miller wrote: > On Wed, 22 Dec 2004 19:49:48 +0100 > Eric Lemoine wrote: > > > Instead, I would suggest to have LLTX drivers check whether queue is > > stopped after they grab their private tx lock and before they check tx > > ring fullness. That way we close the race window but keep the driver > > bug check around. > > > > See attached sungem patch. > > That sounds about right. Nice idea. It solves the race, and retains > the error state check. > > I'll apply Eric's patch, and do something similar in the other LLTX > drivers (except loopback which has not "queue" per se so doesn't need > this stuff). Dave, I still have one concern with the LLTX code (and it may be that the correct patch is Jamal's): Without LLTX we do: lock(queue_lock), lock(xmit_lock), release(queue_lock), release(xmit_lock). With LLTX (without Jamal's patch) we do: lock(queue_lock), release(queue_lock), lock(tx_lock), release(tx_lock). LLTX doesn't look correct because it creates a race condition window between the two lock-protected sections. So you may want to reconsider Jamal's patch or pull out LLTX...
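Concretely, the suggested check would sit roughly like this in an LLTX driver's xmit routine -- a sketch only, not the attached sungem patch; the xxx_ names, priv->tx_lock and the elided ring handling are placeholder assumptions:

struct xxx_priv {
        spinlock_t tx_lock;     /* driver-private LLTX lock */
        /* ... tx ring state ... */
};

static int xxx_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct xxx_priv *priv = netdev_priv(dev);
        unsigned long flags;

        /* LLTX: the core no longer takes dev->xmit_lock around this call */
        spin_lock_irqsave(&priv->tx_lock, flags);

        /*
         * Re-check the queue state after taking the private lock, so a
         * CPU that raced past the qdisc-level check backs off instead
         * of tripping the "tx ring full" bug check.
         */
        if (unlikely(netif_queue_stopped(dev))) {
                spin_unlock_irqrestore(&priv->tx_lock, flags);
                return NETDEV_TX_BUSY;
        }

        /*
         * ... usual descriptor setup goes here; call netif_stop_queue(dev)
         * when the ring fills up ...
         */

        spin_unlock_irqrestore(&priv->tx_lock, flags);
        return NETDEV_TX_OK;
}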
Thanks, -- Eric From kaber at trash.net Thu Dec 23 08:37:24 2004 From: kaber at trash.net (Patrick McHardy) Date: Thu, 23 Dec 2004 17:37:24 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <5cac192f0412230110628749e3@mail.gmail.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> <5cac192f0412230110628749e3@mail.gmail.com> Message-ID: <41CAF444.3000305@trash.net> Eric Lemoine wrote: > I still have one concern with the LLTX code (and it may be that the > correct patch is Jamal's) : > > Without LLTX we do : lock(queue_lock), lock(xmit_lock), > release(queue_lock), release(xmit_lock). With LLTX (without Jamal's > patch) we do : lock(queue_lock), release(queue_lock), lock(tx_lock), > release(tx_lock). LLTX doesn't look correct because it creates a race > condition window between the the two lock-protected sections. So you > may want to reconsider Jamal's patch or pull out LLTX... You're right, it can cause packet reordering if something like this happens: CPU1 CPU2 lock(queue_lock) dequeue unlock(queue_lock) lock(queue_lock) dequeue unlock(queue_lock) lock(xmit_lock) hard_start_xmit unlock(xmit_lock) lock(xmit_lock) hard_start_xmit unlock(xmit_lock) Jamal's patch should fix this. Regards Patrick From roland at topspin.com Thu Dec 23 09:22:52 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 09:22:52 -0800 Subject: [openib-general] Problem with userspace tree structure Message-ID: <52llbp7wur.fsf@topspin.com> I started looking at the code that's being checked in under src/userspace/, and I think the current structure needs to be modified. As it stands now, the whole userspace/ tree is treated as a single source package with a Makefile at the userspace/ level. I think it would be far better to return to the older structure where the userspace/ directory just has subdirectories for various packages (such as userspace/tvflash). If we continue with a monolithic userspace/ tree, then it becomes much harder to do releases (and also harder for users to build what they're interested in). For example, it will become very annoying to have to do a full release of everything including opensm just to update ibstat. In fact it's probably already very confusing that one can do a make from userspace/, but to build tvflash one has to do a make from userspace/tvflash. Thanks, Roland From robert.j.woodruff at intel.com Thu Dec 23 09:33:51 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 23 Dec 2004 09:33:51 -0800 Subject: [openib-general] Problem with userspace tree structure Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> Roland> I started looking at the code that's being checked in under >src/userspace/, and I think the current structure needs to be >modified. As it stands now, the whole userspace/ tree is treated as a >single source package with a Makefile at the userspace/ level. I don't have any objection. My only suggestion would be to have a structure where someone can easily build all of the components from a top level Makefile or easily build each individual component. woody From mst at mellanox.co.il Thu Dec 23 09:39:02 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 23 Dec 2004 19:39:02 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52llbp7wur.fsf@topspin.com> References: <52llbp7wur.fsf@topspin.com> Message-ID: <20041223173902.GA29875@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "[openib-general] Problem with userspace tree structure": > I started looking at the code that's being checked in under > src/userspace/, and I think the current structure needs to be > modified. As it stands now, the whole userspace/ tree is treated as a > single source package with a Makefile at the userspace/ level. > > I think it would be far better to return to the older structure where > the userspace/ directory just has subdirectories for various packages > (such as userspace/tvflash). If we continue with a monolithic > userspace/ tree, then it becomes much harder to do releases (and also > harder for users to build what they're interested in). For example, > it will become very annoying to have to do a full release of > everything including opensm just to update ibstat. > > In fact it's probably already very confusing that one can do a make > from userspace/, but to build tvflash one has to do a make from > userspace/tvflash. > > Thanks, > Roland Apropos tvflash, I think we want to copy mstflint from https://openib.org/svn/trunk/contrib/mellanox/mstflint/ to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ As with tvflash, no driver is required, and Mellanox does not support flashing tvflash for Mellanox-produced boards, you really need mstflint for these. OK? MST From halr at voltaire.com Thu Dec 23 09:34:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 12:34:44 -0500 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52llbp7wur.fsf@topspin.com> References: <52llbp7wur.fsf@topspin.com> Message-ID: <1103823283.4054.85.camel@localhost.localdomain> On Thu, 2004-12-23 at 12:22, Roland Dreier wrote: > I started looking at the code that's being checked in under > src/userspace/, and I think the current structure needs to be > modified. As it stands now, the whole userspace/ tree is treated as a > single source package with a Makefile at the userspace/ level. I'm pretty sure this is temporary and just a quick way to get started. At least opensm will be moving to autotools. This would be done at the osm level (userspace subdirectory). I would expect diags and util to be similar. I think those are the subdirs tied together with the makefile at the userspace level now. -- Hal From halr at voltaire.com Thu Dec 23 09:45:15 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 12:45:15 -0500 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223173902.GA29875@mellanox.co.il> References: <52llbp7wur.fsf@topspin.com> <20041223173902.GA29875@mellanox.co.il> Message-ID: <1103823915.4054.88.camel@localhost.localdomain> On Thu, 2004-12-23 at 12:39, Michael S. Tsirkin wrote: > Apropos tvflash, I think we want to copy mstflint from > https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ > > As with tvflash, no driver is required, > and Mellanox does not support flashing tvflash for Mellanox-produced boards, > you really need mstflint for these. > > OK? A userspace subdir for mstflint is fine with me. 
-- Hal From eric.lemoine at gmail.com Thu Dec 23 10:11:50 2004 From: eric.lemoine at gmail.com (Eric Lemoine) Date: Thu, 23 Dec 2004 19:11:50 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <41CAF444.3000305@trash.net> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> <5cac192f0412230110628749e3@mail.gmail.com> <41CAF444.3000305@trash.net> Message-ID: <5cac192f0412231011471763f3@mail.gmail.com> On Thu, 23 Dec 2004 17:37:24 +0100, Patrick McHardy wrote: > Eric Lemoine wrote: > > I still have one concern with the LLTX code (and it may be that the > > correct patch is Jamal's) : > > > > Without LLTX we do : lock(queue_lock), lock(xmit_lock), > > release(queue_lock), release(xmit_lock). With LLTX (without Jamal's > > patch) we do : lock(queue_lock), release(queue_lock), lock(tx_lock), > > release(tx_lock). LLTX doesn't look correct because it creates a race > > condition window between the the two lock-protected sections. So you > > may want to reconsider Jamal's patch or pull out LLTX... > > You're right, it can cause packet reordering That's exactly what I was thinking of. > if something like this > happens: > > CPU1 CPU2 > lock(queue_lock) > dequeue > unlock(queue_lock) > lock(queue_lock) > dequeue > unlock(queue_lock) > lock(xmit_lock) > hard_start_xmit > unlock(xmit_lock) > lock(xmit_lock) > hard_start_xmit > unlock(xmit_lock) > > Jamal's patch should fix this. -- Eric From roland at topspin.com Thu Dec 23 10:17:26 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 10:17:26 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> (Robert J. Woodruff's message of "Thu, 23 Dec 2004 09:33:51 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> Message-ID: <52hdmc98w9.fsf@topspin.com> Robert> I don't have any objection. My only suggestion would be Robert> to have a structure where someone can easily build all of Robert> the components from a top level Makefile or easily build Robert> each individual component. Yes, each subdirectory of userspace should be set up with autotools so a standard "./configure && make && make install" is all that's needed. - R. From mst at mellanox.co.il Thu Dec 23 10:40:04 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 20:40:04 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52hdmc98w9.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> Message-ID: <20041223184004.GA30030@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Problem with userspace tree structure": > Robert> I don't have any objection. My only suggestion would be > Robert> to have a structure where someone can easily build all of > Robert> the components from a top level Makefile or easily build > Robert> each individual component. > > Yes, each subdirectory of userspace should be set up with autotools so > a standard "./configure && make && make install" is all that's needed. > > - R. The problem is that the autotools files are generated, so having them in svn is a problem - conflicts cant be resolved properly - while not having them in svn is a problem - you require all users to have autotools to build. 
mst From tduffy at sun.com Thu Dec 23 10:48:42 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 10:48:42 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223184004.GA30030@mellanox.co.il> References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> <20041223184004.GA30030@mellanox.co.il> Message-ID: <1103827723.10426.3.camel@duffman> On Thu, 2004-12-23 at 20:40 +0200, Michael S. Tsirkin wrote: > The problem is that the autotools files are generated, so having > them in svn is a problem - conflicts cant be resolved properly - > while not having them in svn is a problem - you require all users > to have autotools to build. Yes, and any distribution of Linux includes them. You also require all users to have a compiler... -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Dec 23 10:51:15 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 10:51:15 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223184004.GA30030@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 23 Dec 2004 20:40:04 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> <20041223184004.GA30030@mellanox.co.il> Message-ID: <52brck97bw.fsf@topspin.com> Michael> The problem is that the autotools files are generated, so Michael> having them in svn is a problem - conflicts cant be Michael> resolved properly - while not having them in svn is a Michael> problem - you require all users to have autotools to Michael> build. Good point. What I really meant is that every subdirectory of userspace/ should have it's own autotools setup that generates configure etc. with a run of "autogen.sh." That makes it easy to make tarballs for users to build. Developers or others who want to run from subversion need autotools (which are trivial to install on any distribution modern enough to run 2.6 kernels). - R. From tduffy at sun.com Thu Dec 23 10:52:14 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 10:52:14 -0800 Subject: [openib-general] [OT] out of office replies Message-ID: <1103827939.10426.10.camel@duffman> Can people please fix their out-of-office scripts to not respond to mailing lists. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Thu Dec 23 11:33:38 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 11:33:38 -0800 Subject: [openib-general] [PATCH][TRIVIAL] Fix ipoib build Message-ID: <1103830418.16121.1.camel@duffman> Signed-off-by: Tom Duffy Index: drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- drivers/infiniband/ulp/ipoib/ipoib_main.c (revision 1376) +++ drivers/infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -543,7 +543,7 @@ * set, we can't rely on netif_stop_queue() preventing our * xmit function from being called with a full queue. */ - if (unlikely(netif_queue_stopped(dev))) + if (unlikely(netif_queue_stopped(dev))) { spin_unlock_irqrestore(&priv->tx_lock, flags); return NETDEV_TX_BUSY; } -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Dec 23 11:41:35 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 11:41:35 -0800 Subject: [openib-general] [PATCH][TRIVIAL] Fix ipoib build In-Reply-To: <1103830418.16121.1.camel@duffman> (Tom Duffy's message of "Thu, 23 Dec 2004 11:33:38 -0800") References: <1103830418.16121.1.camel@duffman> Message-ID: <527jn89500.fsf@topspin.com> Oops, sorry. Applied... - R. From roland at topspin.com Thu Dec 23 11:47:46 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 11:47:46 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52brck97bw.fsf@topspin.com> (Roland Dreier's message of "Thu, 23 Dec 2004 10:51:15 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> <20041223184004.GA30030@mellanox.co.il> <52brck97bw.fsf@topspin.com> Message-ID: <52y8fo7q59.fsf@topspin.com> By the way, I don't think we have to mandate autotools for every package under the userspace/ directory (although using autotools is the friendliest for users, packagers and developers). My real objection was just to the recent checkin of Makefiles etc. directly under the userspace/ directory. I think userspace/ should just have subdirectories for each OpenIB package and nothing else. eg take a look at http://cvs.gnome.org/viewcvs/ to see what I think our userspace/ directory should look like. - R. From tduffy at sun.com Thu Dec 23 11:52:28 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 11:52:28 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52brck97bw.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> <20041223184004.GA30030@mellanox.co.il> <52brck97bw.fsf@topspin.com> Message-ID: <1103831548.16121.5.camel@duffman> On Thu, 2004-12-23 at 10:51 -0800, Roland Dreier wrote: > Michael> The problem is that the autotools files are generated, so > Michael> having them in svn is a problem - conflicts cant be > Michael> resolved properly - while not having them in svn is a > Michael> problem - you require all users to have autotools to > Michael> build. > > Good point. What I really meant is that every subdirectory of > userspace/ should have it's own autotools setup that generates > configure etc. with a run of "autogen.sh." That makes it easy to make > tarballs for users to build. Developers or others who want to run > from subversion need autotools (which are trivial to install on any > distribution modern enough to run 2.6 kernels). I have seen a lot of projects who have only autogen.sh in the source control, but then have a tarball available that has a generated configure script for the user to run. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Thu Dec 23 12:03:57 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 22:03:57 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <1103827723.10426.3.camel@duffman> References: <1103827723.10426.3.camel@duffman> Message-ID: <20041223200357.GA30207@mellanox.co.il> Hello! Quoting r. 
Tom Duffy (tduffy at sun.com) "Re: [openib-general] Problem with userspace tree structure": > On Thu, 2004-12-23 at 20:40 +0200, Michael S. Tsirkin wrote: > > The problem is that the autotools files are generated, so having > > them in svn is a problem - conflicts cant be resolved properly - > > while not having them in svn is a problem - you require all users > > to have autotools to build. > > Yes, and any distribution of Linux includes them. You also require all > users to have a compiler... > > -tduffy There's a difference. There are no real standards, and the language seems to be not well defined and not stable between autotools versions. At least I see that many tools that went this way require specific autogen revisions. MST From tduffy at sun.com Thu Dec 23 12:14:27 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 12:14:27 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223200357.GA30207@mellanox.co.il> References: <1103827723.10426.3.camel@duffman> <20041223200357.GA30207@mellanox.co.il> Message-ID: <1103832867.16121.11.camel@duffman> On Thu, 2004-12-23 at 22:03 +0200, Michael S. Tsirkin wrote: > There's a difference. > > There are no real standards, and the language seems to be not well > defined and not stable between autotools versions. At least I see > that many tools that went this way require specific autogen revisions. Hrm, sounds like you are describing gcc ;) Kidding aside, most distributions package a whole slew of versions for this very reason. On my Fedora 3 system, I have automake 1.4, 1.5, 1.6, 1.7 and 1.9; autoconf 2.13 and 2.59. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From johannes at erdfelt.com Thu Dec 23 12:50:45 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Thu, 23 Dec 2004 12:50:45 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223173902.GA29875@mellanox.co.il> References: <52llbp7wur.fsf@topspin.com> <20041223173902.GA29875@mellanox.co.il> Message-ID: <20041223205045.GJ1107@sventech.com> On Thu, Dec 23, 2004, Michael S. Tsirkin wrote: > Apropos tvflash, I think we want to copy mstflint from > https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ > > As with tvflash, no driver is required, > and Mellanox does not support flashing tvflash for Mellanox-produced boards, > you really need mstflint for these. I have no problem with adding mstflint. tvflash should flash firmware images according to the specs I have and in fact, I have burned binary images generated by infiniburn with tvflash. Is there something wrong with the way tvflash burns images? JE From tduffy at sun.com Thu Dec 23 13:10:14 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 13:10:14 -0800 Subject: [openib-general] osm In-Reply-To: References: Message-ID: <1103836214.16121.20.camel@duffman> On Sun, 2004-12-19 at 20:26 +0200, shaharf wrote: > Hopefully everything will compile and install and then you > can run opensm by > > $OPENIB_ROOT/bin/opensm & > > > > It will run on the first existing port. Plese verify that > it is active. You can override that by using “-g portguid_in_hex”. On > any case of problems that you want me to assist, please run the opensm > with –V and send me the log file (/tmp/osm.log). 
How is a port supposed to be active without an SM running? I am getting: [root at flopteron2 bin]# ./opensm -V ------------------------------------------------- OpenSM Rev:1.6.0-rc2 Command Line Arguments: Big V selected Log File: /tmp/osm.log ------------------------------------------------- using default guid 0x2c9010a99e031 return umad_open_port -5 return umad_port_id -5 Error from osm_opensm_bind (0x2A) [root at flopteron2 bin]# ./ibstat CA 'mthca0': CA type: MT25208 (MT23108 coma0 Numer of ports: 2 Firmware version: 4.5.3 Hardware version: a0 Node GUID: 0x0002c9010a99e030 System image GUID: 0x0002c9010a99e033 Port 1: State: Down Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00500a68 Port GUID: 0x0002c9010a99e031 Port 2: State: Down Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00500a68 Port GUID: 0x0002c9010a99e032 > Don’t forget to modprobe ib_umad and configure udev before > using the userspace staff. When you say configure udev, what do you mean? I have /dev/umad[01] -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: osm.log Type: text/x-log Size: 17209 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Thu Dec 23 13:27:41 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 23:27:41 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223205045.GJ1107@sventech.com> References: <52llbp7wur.fsf@topspin.com> <20041223173902.GA29875@mellanox.co.il> <20041223205045.GJ1107@sventech.com> Message-ID: <20041223212741.GA30425@mellanox.co.il> Hello! Quoting r. Johannes Erdfelt (johannes at erdfelt.com) "Re: [openib-general] Problem with userspace tree structure": > On Thu, Dec 23, 2004, Michael S. Tsirkin wrote: > > Apropos tvflash, I think we want to copy mstflint from > > https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > > to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ > > > > As with tvflash, no driver is required, > > and Mellanox does not support flashing tvflash for Mellanox-produced boards, > > you really need mstflint for these. > > I have no problem with adding mstflint. > > tvflash should flash firmware images according to the specs I have and > in fact, I have burned binary images generated by infiniburn with > tvflash. Is there something wrong with the way tvflash burns images? > > JE I didnt say that. I didnt look at tvflash in any detail. It's midnight here' I'll try to clarify on Sunday. mst From shaharf at voltaire.com Thu Dec 23 13:27:44 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 23 Dec 2004 23:27:44 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: >By the way, I don't think we have to mandate autotools for every >package under the userspace/ directory (although using autotools is >the friendliest for users, packagers and developers). > >My real objection was just to the recent checkin of Makefiles >etc. directly under the userspace/ directory. I think userspace/ >should just have subdirectories for each OpenIB package and nothing >else. eg take a look at http://cvs.gnome.org/viewcvs/ to see what I >think our userspace/ directory should look like. > > - R. Autotools will be used sometime in the near future. 
As for the userspace directory itself, I really see no harm in having a simple makefile that will build everything below it. It doesn't collide with treating each of the sub-trees as a different package, having its autotools setup in it. As I see it, the usermode tools are one set of tools sharing some libraries. I saw several distributions (such as an SVN distribution) that had many packages, even from completely different sources under one root, and the root's make file did build all of them. This can be very convenient, and free the user from a complicated install procedure where you have to manually build many libraries and tools in a specific order. Make is the tool to do it for you and it is much more common than autotools.... As a matter of fact, this is similar to other tools packages that have many tools under one roof (most tool chains). I am not sure http://cvs.gnome.org/viewcvs/ is a good example for us - it contains hundreds of related and unrelated projects that there is really no sense in building all of them. We are not there. What we have now is a set of tools. If or when the number of usermode projects grows, and there are sets of tools that are dedicated to some environment or task, I would separate the usermode root into some sub-trees such as switch, host, MPI, ... Right now it is unnecessary overhead. BTW, the reason I didn't include tvflash in my make file is because it doesn't build at all on my SUSE9.1 (opteron). If it is only a local problem of mine, I will be glad to include it. Same with any other utilities. Shahar From shaharf at voltaire.com Thu Dec 23 13:29:07 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 23 Dec 2004 23:29:07 +0200 Subject: [openib-general] Gen2 OpenSM Message-ID: OK, I have good news and bad news: The good news is that the committed OpenSM can bring up OpenIB (at least I could do it once or twice... ;-). It is also merged with Mellanox 1.6.1 OpenSM. The bad news is that I am pretty sure that the committed openib stack will not work - there is a bug in user_mad that has to be fixed. I forgot to check it in, and I can't do it now (I have no svn at home ;-). If anybody needs it now I will simply send my version of the file and you can use it. Hal has it too if I will not be available. Please remember that this is a *very* preliminary version, and the fact that it may work sometimes should be treated as a Christmas miracle.... Retries and many other things are still missing. Shahar From mst at mellanox.co.il Thu Dec 23 13:31:29 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 23:31:29 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: References: Message-ID: <20041223213129.GC30425@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Problem with userspace tree structure": > BTW, the reason I didn't include tvflash in my make file is because it doesn't build at all on my SUSE9.1 (opteron). If it is only a local problem of mine, I will be glad to include it. Same with any other utilities. > > Shahar > I guess that's why one needs ./configure :) From mst at mellanox.co.il Thu Dec 23 13:33:30 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 23:33:30 +0200 Subject: [openib-general] Gen2 OpenSM In-Reply-To: References: Message-ID: <20041223213330.GD30425@mellanox.co.il> Hello! Quoting r.
shaharf (shaharf at voltaire.com) "[openib-general] Gen2 OpenSM": > OK, I have good news and bad news: > > The good news are that the committed OpenSM can bring up OpenIB (at least I could do it once or twice... ;-). > > It is also merged with Mellanox 1.6.1 OpenSM. > > The bad news are that I am pretty sure that the committed openib stack will not work - there is a bug in user_mad that has to be fixed. I forgot to check it in, and I can't do it now (I have no svn at home ;-). You can install it even on win98. > If anybody needs it now it will simply send my version of the file and you can use it. Hal has it too if I will not be available. Cant Hal commit it then? mst From shaharf at voltaire.com Thu Dec 23 13:32:26 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 23 Dec 2004 23:32:26 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: >Hello! >Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Problem with userspace tree structure": >> BTW, the reason I didn't include tvflash in my make file, because it doesn't build at all on my SUSE9.1 (option). If it is only a local problem of mine, I will be glad to include it. Same with any other utilities. >> >> Shahar >> > I guess thats why one needs ./configure :) It has one - or to be exact you can generate one. It doesn't help if the relevent headers files are missing ;-( Shahar From tduffy at sun.com Thu Dec 23 13:46:23 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 13:46:23 -0800 Subject: [openib-general] [PATCH] fix CA type name truncation Message-ID: <1103838383.16121.28.camel@duffman> Increase the size of ca_type to accommodate string printed out of ARBEL in TAVOR compat mode. Before: [root at flopteron2 bin]# ./ibstat CA 'mthca0': CA type: MT25208 (MT23108 coma0 After: [root at flopteron2 bin]# ./ibstat CA 'mthca0': CA type: MT25208 (MT23108 compat mode) Note: there is still a bug (probably in sys_read_string()) that does not zero out the last character, thus the a0 is printed (which is the next string in the struct). Index: gen2/trunk/src/userspace/include/umad.h =================================================================== --- gen2/trunk/src/userspace/include/umad.h (revision 1376) +++ gen2/trunk/src/userspace/include/umad.h (working copy) @@ -68,7 +68,7 @@ char ca_name[UMAD_CA_NAME_LEN]; int numports; char fw_ver[20]; - char ca_type[20]; + char ca_type[40]; char hw_ver[20]; uint64_t node_guid; uint64_t system_guid; -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From shaharf at voltaire.com Thu Dec 23 13:48:30 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 23 Dec 2004 23:48:30 +0200 Subject: [openib-general] Gen2 OpenSM Message-ID: Ok, I will send the relevant function from user_mad file. Please forgive me not to send you a proper patch. I am already at home and I don't want to mess up my Wife's computer ;-) Aha - BTW, what I really meant is that OpenSM can bring up IPoIB (small mistake for me, big progress for the openib...;-). I will send a proper patch on Sunday. Happy Christmas everybody! 
Shahar _________________________________ static int mad_is_solicit(struct ib_mad_hdr *madhdr) { int method = madhdr->method; // filter only request methods according to IB spec V1.2 13.4.5 and C13-6 return (method < IB_MGMT_METHOD_RESP && method != IB_MGMT_METHOD_SEND && method != IB_MGMT_METHOD_TRAP && method != IB_MGMT_METHOD_TRAP_REPRESS); } static ssize_t ib_umad_write(struct file *filp, const char __user *buf, size_t count, loff_t *pos) { struct ib_umad_file *file = filp->private_data; struct ib_umad_packet *packet; struct ib_mad_agent *agent; struct ib_ah_attr ah_attr; struct ib_sge gather_list; struct ib_send_wr *bad_wr, wr = { .opcode = IB_WR_SEND, .sg_list = &gather_list, .num_sge = 1, .send_flags = IB_SEND_SIGNALED, }; int ret; if (count < sizeof (struct ib_user_mad)) return -EINVAL; packet = kmalloc(sizeof *packet, GFP_KERNEL); if (!packet) return -ENOMEM; if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { kfree(packet); return -EFAULT; } if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { ret = -EINVAL; goto err; } down_read(&file->agent_mutex); agent = file->agent[packet->mad.id]; if (!agent) { ret = -EINVAL; goto err_up; } if (mad_is_solicit((struct ib_mad_hdr *)packet->mad.data)) { ((struct ib_mad_hdr *) packet->mad.data)->tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & 0xffffffff)); } memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); ah_attr.sl = packet->mad.sl; ah_attr.src_path_bits = packet->mad.path_bits; ah_attr.port_num = file->port->port_num; /* XXX handle GRH */ packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); if (IS_ERR(packet->ah)) { ret = PTR_ERR(packet->ah); goto err_up; } gather_list.addr = dma_map_single(agent->device->dma_device, packet->mad.data, sizeof packet->mad.data, DMA_TO_DEVICE); gather_list.length = sizeof packet->mad.data; gather_list.lkey = file->mr[packet->mad.id]->lkey; pci_unmap_addr_set(packet, mapping, gather_list.addr); wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; wr.wr.ud.ah = packet->ah; wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); wr.wr.ud.timeout_ms = packet->mad.timeout_ms; wr.wr_id = (unsigned long) packet; ret = ib_post_send_mad(agent, &wr, &bad_wr); if (ret) { dma_unmap_single(agent->device->dma_device, pci_unmap_addr(packet, mapping), sizeof packet->mad.data, DMA_TO_DEVICE); goto err_up; } up_read(&file->agent_mutex); return sizeof packet->mad; err_up: up_read(&file->agent_mutex); err: kfree(packet); return ret; } From shaharf at voltaire.com Thu Dec 23 14:02:26 2004 From: shaharf at voltaire.com (shaharf) Date: Fri, 24 Dec 2004 00:02:26 +0200 Subject: [openib-general] osm Message-ID: ________________________________ From: Tom Duffy [mailto:tduffy at sun.com] Sent: Thu 12/23/2004 11:10 PM To: shaharf Cc: openib-general at openib.org Subject: Re: [openib-general] osm On Sun, 2004-12-19 at 20:26 +0200, shaharf wrote: > Hopefully everything will compile and install and then you > can run opensm by > > $OPENIB_ROOT/bin/opensm & > > > > It will run on the first existing port. Plese verify that > it is active. You can override that by using "-g portguid_in_hex". On > any case of problems that you want me to assist, please run the opensm > with -V and send me the log file (/tmp/osm.log). How is a port supposed to be active without an SM running? -- You are completely right - I feel foolish.... 
What I really should meant is that the port should be Physically UP. Unfortunately, there is no sys file for that right now. I hope that it will be fixed soon. > Don't forget to modprobe ib_umad and configure udev before > using the userspace staff. When you say configure udev, what do you mean? I have /dev/umad[01] -tduffy -- Please read Roland's doc on user mad (linux-kernel/doc or something similar). If it won't help I will send you my configuraion next Sunday. I am not sure that this is your problem. I will look into it on Sunday. In the meanwhile, can you recheck with the latest version? Shahar From johannes at erdfelt.com Thu Dec 23 14:06:18 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Thu, 23 Dec 2004 14:06:18 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223212741.GA30425@mellanox.co.il> References: <52llbp7wur.fsf@topspin.com> <20041223173902.GA29875@mellanox.co.il> <20041223205045.GJ1107@sventech.com> <20041223212741.GA30425@mellanox.co.il> Message-ID: <20041223220618.GK1107@sventech.com> On Thu, Dec 23, 2004, Michael S. Tsirkin wrote: > Hello! > Quoting r. Johannes Erdfelt (johannes at erdfelt.com) "Re: [openib-general] Problem with userspace tree structure": > > On Thu, Dec 23, 2004, Michael S. Tsirkin wrote: > > > Apropos tvflash, I think we want to copy mstflint from > > > https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > > > to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ > > > > > > As with tvflash, no driver is required, > > > and Mellanox does not support flashing tvflash for Mellanox-produced boards, > > > you really need mstflint for these. > > > > I have no problem with adding mstflint. > > > > tvflash should flash firmware images according to the specs I have and > > in fact, I have burned binary images generated by infiniburn with > > tvflash. Is there something wrong with the way tvflash burns images? > > I didnt say that. I didnt look at tvflash in any detail. > It's midnight here' I'll try to clarify on Sunday. Oh, I misunderstood you when you said "really need". If you find any issues, let me or Roland know and we can take care of it. Thanks. JE From roland at topspin.com Thu Dec 23 14:10:45 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:10:45 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: (shaharf@voltaire.com's message of "Thu, 23 Dec 2004 23:27:44 +0200") References: Message-ID: <52pt107jiy.fsf@topspin.com> shaharf> BTW, the reason I didn't include tvflash in my make file, shaharf> because it doesn't build at all on my SUSE9.1 shaharf> (opteron). If it is only a local problem of mine, I will shaharf> be glad to inclue it. Same with any other utilities. If you give a hint as to the problem you saw trying to build it, I can probably help. - Roland From tduffy at sun.com Thu Dec 23 14:20:11 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 14:20:11 -0800 Subject: [openib-general] [PATCH] Gen2 OpenSM In-Reply-To: References: Message-ID: <1103840411.1886.5.camel@duffman> On Thu, 2004-12-23 at 23:48 +0200, shaharf wrote: > I will send a proper patch on Sunday. BTW, your mailer munged the whitespace. > _________________________________ > > static int mad_is_solicit(struct ib_mad_hdr *madhdr) > { > int method = madhdr->method; > > // filter only request methods according to IB spec V1.2 13.4.5 and C13-6 Please don't use C++ isms in the kernel code. 
Anyways, here is the patch from the whitespace-fixed, comment-fixed version: Index: drivers/infiniband/core/user_mad.c =================================================================== --- drivers/infiniband/core/user_mad.c (revision 1377) +++ drivers/infiniband/core/user_mad.c (working copy) @@ -222,6 +222,21 @@ return ret; } +static int mad_is_solicit(struct ib_mad_hdr *madhdr) +{ + int method = madhdr->method; + + /* + * filter only request methods according to + * IB spec V1.2 13.4.5 and C13-6 + */ + return (method < IB_MGMT_METHOD_RESP && + method != IB_MGMT_METHOD_SEND && + method != IB_MGMT_METHOD_TRAP && + method != IB_MGMT_METHOD_TRAP_REPRESS); +} + + static ssize_t ib_umad_write(struct file *filp, const char __user *buf, size_t count, loff_t *pos) { @@ -263,10 +278,12 @@ goto err_up; } - ((struct ib_mad_hdr *) packet->mad.data)->tid = - cpu_to_be64(((u64) agent->hi_tid) << 32 | - (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & - 0xffffffff)); + if (mad_is_solicit((struct ib_mad_hdr *)packet->mad.data)) { + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) + packet->mad.data)->tid) & 0xffffffff)); + } memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Dec 23 14:22:11 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:22:11 -0800 Subject: [openib-general] Gen2 OpenSM In-Reply-To: (shaharf@voltaire.com's message of "Thu, 23 Dec 2004 23:48:30 +0200") References: Message-ID: <52llbo7izw.fsf@topspin.com> shaharf> Ok, I will send the relevant function from user_mad shaharf> file. Please forgive me not to send you a proper patch. I shaharf> am already at home and I don't want to mess up my Wife's shaharf> computer ;-) Aha - BTW, what I really meant is that shaharf> OpenSM can bring up IPoIB (small mistake for me, big shaharf> progress for the openib...;-). Here's the patch that I committed, I think I got the idea. By the way, I had the umad module set the TID when sending traps, since we do expect to get a response in the form of a trap repress. - Roland Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1375) +++ infiniband/core/user_mad.c (working copy) @@ -236,6 +236,8 @@ .num_sge = 1, .send_flags = IB_SEND_SIGNALED, }; + u8 method; + u64 *tid; int ret; if (count < sizeof (struct ib_user_mad)) @@ -263,11 +265,22 @@ goto err_up; } - ((struct ib_mad_hdr *) packet->mad.data)->tid = - cpu_to_be64(((u64) agent->hi_tid) << 32 | - (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & - 0xffffffff)); + /* + * If userspace is generating a request that will generate a + * response, we need to make sure the high-order part of the + * transaction ID matches the agent being used to send the + * MAD. 
+ */ + method = ((struct ib_mad_hdr *) packet->mad.data)->method; + if (!(method & IB_MGMT_METHOD_RESP) && + method != IB_MGMT_METHOD_TRAP_REPRESS && + method != IB_MGMT_METHOD_SEND) { + tid = &((struct ib_mad_hdr *) packet->mad.data)->tid; + *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpup(tid) & 0xffffffff)); + } + memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); ah_attr.sl = packet->mad.sl; From halr at voltaire.com Thu Dec 23 14:21:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 17:21:13 -0500 Subject: [openib-general] [PATCH] user_mad: Fix handling of non request TIDs in ib_umad_write Message-ID: <1103840473.4054.346.camel@localhost.localdomain> user_mad: Fix handling of non request TIDs in ib_umad_write Add "mad_is_solicit" function and its invocation from ib_umad_write. This is required to avoid trashing non request TIDs. Index: user_mad.c =================================================================== --- user_mad.c (revision 1377) +++ user_mad.c (working copy) @@ -183,6 +183,17 @@ ib_free_recv_mad(mad_recv_wc); } +static int mad_is_solicit(struct ib_mad_hdr *madhdr) +{ + int method = madhdr->method; + + // filter only request methods according to IB spec V1.2 13.4.5 and C13-6 + return (method < IB_MGMT_METHOD_RESP && + method != IB_MGMT_METHOD_SEND && + method != IB_MGMT_METHOD_TRAP && + method != IB_MGMT_METHOD_TRAP_REPRESS); +} + static ssize_t ib_umad_read(struct file *filp, char __user *buf, size_t count, loff_t *pos) { @@ -263,10 +274,12 @@ goto err_up; } - ((struct ib_mad_hdr *) packet->mad.data)->tid = - cpu_to_be64(((u64) agent->hi_tid) << 32 | - (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & - 0xffffffff)); + if (mad_is_solicit((struct ib_mad_hdr *)packet->mad.data)) { + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + } memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); From tduffy at sun.com Thu Dec 23 14:28:49 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 14:28:49 -0800 Subject: [openib-general] osm In-Reply-To: References: Message-ID: <1103840929.1886.8.camel@duffman> On Fri, 2004-12-24 at 00:02 +0200, shaharf wrote: > -- Please read Roland's doc on user mad (linux-kernel/doc or something > similar). If it won't help I will send you my configuraion next > Sunday. I am not sure that this is your problem. I will look into it > on Sunday. In the meanwhile, can you recheck with the latest version? Does your code specifically open /dev/infiniband/umad0? Or will it just open the device based off of major minor from /sys/class/infiniband_mad/umad0/dev? Roland, do you absolutely need the KERNEL="umad*", NAME="infiniband/%k" rule in udev? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From shaharf at voltaire.com Thu Dec 23 14:27:15 2004 From: shaharf at voltaire.com (shaharf) Date: Fri, 24 Dec 2004 00:27:15 +0200 Subject: [openib-general] Gen2 OpenSM Message-ID: Thanks. You are probably right about the trap. I wonder why it has different scheme in the spec. Did this work for anybody else then me? Shahar ________________________________ From: Roland Dreier [mailto:roland at topspin.com] Sent: Fri 12/24/2004 12:22 AM To: shaharf Cc: Michael S. 
Tsirkin; openib-general at openib.org Subject: Re: [openib-general] Gen2 OpenSM shaharf> Ok, I will send the relevant function from user_mad shaharf> file. Please forgive me not to send you a proper patch. I shaharf> am already at home and I don't want to mess up my Wife's shaharf> computer ;-) Aha - BTW, what I really meant is that shaharf> OpenSM can bring up IPoIB (small mistake for me, big shaharf> progress for the openib...;-). Here's the patch that I committed, I think I got the idea. By the way, I had the umad module set the TID when sending traps, since we do expect to get a response in the form of a trap repress. - Roland Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1375) +++ infiniband/core/user_mad.c (working copy) @@ -236,6 +236,8 @@ .num_sge = 1, .send_flags = IB_SEND_SIGNALED, }; + u8 method; + u64 *tid; int ret; if (count < sizeof (struct ib_user_mad)) @@ -263,11 +265,22 @@ goto err_up; } - ((struct ib_mad_hdr *) packet->mad.data)->tid = - cpu_to_be64(((u64) agent->hi_tid) << 32 | - (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & - 0xffffffff)); + /* + * If userspace is generating a request that will generate a + * response, we need to make sure the high-order part of the + * transaction ID matches the agent being used to send the + * MAD. + */ + method = ((struct ib_mad_hdr *) packet->mad.data)->method; + if (!(method & IB_MGMT_METHOD_RESP) && + method != IB_MGMT_METHOD_TRAP_REPRESS && + method != IB_MGMT_METHOD_SEND) { + tid = &((struct ib_mad_hdr *) packet->mad.data)->tid; + *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpup(tid) & 0xffffffff)); + } + memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); ah_attr.sl = packet->mad.sl; From roland at topspin.com Thu Dec 23 14:33:24 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:33:24 -0800 Subject: [openib-general] osm In-Reply-To: (shaharf@voltaire.com's message of "Fri, 24 Dec 2004 00:02:26 +0200") References: Message-ID: <52hdmc7ih7.fsf@topspin.com> shaharf> -- Please read Roland's doc on user mad (linux-kernel/doc shaharf> or something similar). If it won't help I will send you shaharf> my configuraion next Sunday. I am not sure that this is shaharf> your problem. I will look into it on Sunday. In the shaharf> meanwhile, can you recheck with the latest version? Actually it seems libumad/umad.c is written against the previous version of that doc. Based on Greg K-H's input, we changed the names of the device files to /dev/infiniband/umad0 etc. If I change libumad/umad.c to open /dev/infiniband/umad0, then osm + OpenIB IPoIB work for me. - Roland From roland at topspin.com Thu Dec 23 14:35:12 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:35:12 -0800 Subject: [openib-general] osm In-Reply-To: <1103840929.1886.8.camel@duffman> (Tom Duffy's message of "Thu, 23 Dec 2004 14:28:49 -0800") References: <1103840929.1886.8.camel@duffman> Message-ID: <52d5x07ie7.fsf@topspin.com> Tom> Or will it just open the device based off of major minor from Tom> /sys/class/infiniband_mad/umad0/dev? Looks like right now it is hard coded in libumad/umad.c. However there's no way for userspace to open a device just based off major/minor -- you need a device node. Tom> Roland, do you absolutely need the KERNEL="umad*", Tom> NAME="infiniband/%k" rule in udev? 
I'm not a udev expert by any stretch of the imagination, but how else does udev know to create devices? - R. From roland at topspin.com Thu Dec 23 14:35:42 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:35:42 -0800 Subject: [openib-general] [PATCH] user_mad: Fix handling of non request TIDs in ib_umad_write In-Reply-To: <1103840473.4054.346.camel@localhost.localdomain> (Hal Rosenstock's message of "23 Dec 2004 17:21:13 -0500") References: <1103840473.4054.346.camel@localhost.localdomain> Message-ID: <528y7o7idd.fsf@topspin.com> Thanks to you and Tom for making a patch -- however I already did it myself and comitted it. - r. From Tom.Duffy at Sun.COM Thu Dec 23 14:41:46 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 23 Dec 2004 14:41:46 -0800 Subject: [openib-general] osm In-Reply-To: <52d5x07ie7.fsf@topspin.com> References: <1103840929.1886.8.camel@duffman> <52d5x07ie7.fsf@topspin.com> Message-ID: <1103841706.1886.11.camel@duffman> On Thu, 2004-12-23 at 14:35 -0800, Roland Dreier wrote: > Tom> Or will it just open the device based off of major minor from > Tom> /sys/class/infiniband_mad/umad0/dev? > > Looks like right now it is hard coded in libumad/umad.c. OK, I see now... > However > there's no way for userspace to open a device just based off > major/minor -- you need a device node. I didn't know that. OK, I will change my local copy of umad.c to open my devices. > Tom> Roland, do you absolutely need the KERNEL="umad*", > Tom> NAME="infiniband/%k" rule in udev? > > I'm not a udev expert by any stretch of the imagination, but how else > does udev know to create devices? I think it defaults to put them in /dev/. When I modprobe ib_umad, I get them installed: [root at flopteron2 bin]# rmmod ib_umad [root at flopteron2 bin]# ls -l /dev/umad* ls: /dev/umad*: No such file or directory [root at flopteron2 bin]# modprobe ib_umad [root at flopteron2 bin]# ls -l /dev/umad* crw------- 1 root root 253, 0 Dec 23 23:35 /dev/umad0 crw------- 1 root root 253, 1 Dec 23 23:35 /dev/umad1 -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Dec 23 14:40:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 17:40:06 -0500 Subject: [openib-general] osm In-Reply-To: <1103836214.16121.20.camel@duffman> References: <1103836214.16121.20.camel@duffman> Message-ID: <1103841606.4054.382.camel@localhost.localdomain> On Thu, 2004-12-23 at 16:10, Tom Duffy wrote: > How is a port supposed to be active without an SM running? The instructions weren't quite in the right order. I think the README file I added earlier has this right: opensm will run on the first existing port. You can override that by using "-g ". Verify that the first port is active. > I am getting: > > [root at flopteron2 bin]# ./opensm -V > ------------------------------------------------- > OpenSM Rev:1.6.0-rc2 > Command Line Arguments: > Big V selected > Log File: /tmp/osm.log > ------------------------------------------------- > using default guid 0x2c9010a99e031 > return umad_open_port -5 > return umad_port_id -5 > > Error from osm_opensm_bind (0x2A) I think the reason for this has been answered. 
> [root at flopteron2 bin]# ./ibstat > CA 'mthca0': > CA type: MT25208 (MT23108 coma0 > Numer of ports: 2 > Firmware version: 4.5.3 > Hardware version: a0 > Node GUID: 0x0002c9010a99e030 > System image GUID: 0x0002c9010a99e033 > Port 1: > State: Down > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x00500a68 > Port GUID: 0x0002c9010a99e031 > Port 2: > State: Down > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x00500a68 > Port GUID: 0x0002c9010a99e032 Is port 1 plugged into anything ? It should get to INIT without the help of a SM. -- Hal From Tom.Duffy at Sun.COM Thu Dec 23 14:48:13 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 23 Dec 2004 14:48:13 -0800 Subject: [openib-general] osm In-Reply-To: <1103841606.4054.382.camel@localhost.localdomain> References: <1103836214.16121.20.camel@duffman> <1103841606.4054.382.camel@localhost.localdomain> Message-ID: <1103842093.1886.14.camel@duffman> On Thu, 2004-12-23 at 17:40 -0500, Hal Rosenstock wrote: > I think the reason for this has been answered. Yeah, I am about to try with modified libumad to open the right device. > Is port 1 plugged into anything ? It should get to INIT without the help > of a SM. Right, that was my first problem. I installed this new machine yesterday and forgot to plug the cable in. That is fixed now. It turns out that opensm will loop trying to open the port if the port is DOWN printing something like this: OpenSM[9009]: SM port is down. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Dec 23 15:04:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 18:04:51 -0500 Subject: [openib-general] New Headers for OpenIB Message-ID: <1103843091.4054.406.camel@localhost.localdomain> Hi, Didn't someone on lkml complain about things like $Id$ being in the header or is this ok ? -- Hal From tduffy at sun.com Thu Dec 23 15:16:56 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 15:16:56 -0800 Subject: [openib-general] progress...but opensm crashed Message-ID: <1103843816.1886.23.camel@duffman> The good news was that my port went to active on the node running opensm (way to go Shahar!). The bad news is that there was no xmas miracle when I brought up another node on the subnet. 
(gdb) bt #0 0x0000002a95994f1e in stack_dump () at stack.c:33 #1 0x0000002a959953e1 in handler (x=11) at stack.c:112 #2 #3 0x0000000000410121 in osm_port_share_pkey (p_log=0x559af8, p_port_1=0x5913c0, p_port_2=0x0) at osm_port.h:1616 #4 0x0000000000408a5c in __match_notice_to_inf_rec (p_list_item=0x559b00, context=0x0) at osm_inform.c:599 #5 0x0000002a9588b044 in cl_qlist_apply_func (p_list=0x557ce0, pfn_func=0x4086d9 <__match_notice_to_inf_rec>, context=0x43004f90) at cl_list.c:387 #6 0x0000000000408c8a in osm_report_notice (p_log=0x559af8, p_subn=0x557940, p_ntc=0x430050d0) at osm_inform.c:705 #7 0x000000000042bfa2 in __osm_trap_rcv_process_request (p_rcv=0x558848, p_madw=0x5b95a8) at osm_trap_rcv.c:681 #8 0x000000000042c128 in osm_trap_rcv_process (p_rcv=0x558848, p_madw=0x5b95a8) at osm_trap_rcv.c:759 #9 0x000000000042c158 in __osm_trap_rcv_ctrl_disp_callback (context=0x559b00, p_data=0x0) at osm_trap_rcv_ctrl.c:99 #10 0x00000000004035ce in __cl_disp_worker (context=0x559b00) at cl_dispatcher.c:138 #11 0x0000002a9588f5ef in __cl_thread_pool_routine (context=0x559b00) at cl_threadpool.c:111 #12 0x0000002a9588f4be in __cl_thread_wrapper (arg=0x0) at cl_thread.c:94 ---Type to continue, or q to quit--- #13 0x0000002a9567213a in start_thread () from /lib64/tls/libpthread.so.0 #14 0x0000002a95ce33c3 in clone () from /lib64/tls/libc.so.6 #15 0x0000000000000000 in ?? () -- Tom Duffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Dec 23 17:54:44 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 17:54:44 -0800 Subject: [openib-general] New Headers for OpenIB In-Reply-To: <1103843091.4054.406.camel@localhost.localdomain> (Hal Rosenstock's message of "23 Dec 2004 18:04:51 -0500") References: <1103843091.4054.406.camel@localhost.localdomain> Message-ID: <52vfas5ul7.fsf@topspin.com> Hal> Didn't someone on lkml complain about things like $Id$ Hal> being in the header or is this ok ? Someone may have complained but a quick grep of existing kernel sources shows more than a thousand files with /* $Id$ */ so I don't think it's a real issue. - R. From roland at topspin.com Thu Dec 23 18:11:36 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 18:11:36 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: (shaharf@voltaire.com's message of "Thu, 23 Dec 2004 23:27:44 +0200") References: Message-ID: <52r7lg5tt3.fsf@topspin.com> shaharf> As a matter of fact, this is similar to other tools shaharf> packages that have many tools under one roof (most tool shaharf> chains). I am not sure http://cvs.gnome.org/viewcvs/ is a shaharf> good example for us - it contains hundreds of related and shaharf> non related projects that there is really no sense to shaharf> build them all. We are not there. What we have now is a shaharf> set of tools. If or when the number of usermode project shaharf> will grow, and there will be sets of tools that dedicated shaharf> to some environment or task, I would separate the shaharf> usermode root to some sub-trees such as switch, host, shaharf> MPI, ... Right now it is unnecasarily overhead. I think gnome is a pretty good parallel, because the contents of userspace/ are really separate projects, with different sets of developers and different maintainers. 
Having one main makefile is confusing for users and packagers, and the current scheme is also error prone, since it's hard to maintain correct cross-package dependencies (for example, touch include/common.h, run make, and watch binaries that depend on it such as mad_test fail to get rebuilt). - Roland From shaharf at voltaire.com Thu Dec 23 23:30:54 2004 From: shaharf at voltaire.com (shaharf) Date: Fri, 24 Dec 2004 09:30:54 +0200 Subject: [openib-general] progress...but opensm crashed Message-ID: Hi Tom, It seems that one of nodes in your subnet registered itself to be informed on traps. This is one of the rare cases that require GRH headers which are not really supported yet. I didn't think GRH has a high priority, but as I see that it is really used, I will have to deal with that sooner then I thought. Thanks for the information. I would be glad to have your osm.log file (probably in /tmp) if you have run it with -V. This could let me check my theory. Shahar ________________________________ From: openib-general-bounces at openib.org on behalf of Tom Duffy Sent: Fri 12/24/2004 1:16 AM To: openib-general at openib.org Subject: [openib-general] progress...but opensm crashed The good news was that my port went to active on the node running opensm (way to go Shahar!). The bad news is that there was no xmas miracle when I brought up another node on the subnet. (gdb) bt #0 0x0000002a95994f1e in stack_dump () at stack.c:33 #1 0x0000002a959953e1 in handler (x=11) at stack.c:112 #2 #3 0x0000000000410121 in osm_port_share_pkey (p_log=0x559af8, p_port_1=0x5913c0, p_port_2=0x0) at osm_port.h:1616 #4 0x0000000000408a5c in __match_notice_to_inf_rec (p_list_item=0x559b00, context=0x0) at osm_inform.c:599 #5 0x0000002a9588b044 in cl_qlist_apply_func (p_list=0x557ce0, pfn_func=0x4086d9 <__match_notice_to_inf_rec>, context=0x43004f90) at cl_list.c:387 #6 0x0000000000408c8a in osm_report_notice (p_log=0x559af8, p_subn=0x557940, p_ntc=0x430050d0) at osm_inform.c:705 #7 0x000000000042bfa2 in __osm_trap_rcv_process_request (p_rcv=0x558848, p_madw=0x5b95a8) at osm_trap_rcv.c:681 #8 0x000000000042c128 in osm_trap_rcv_process (p_rcv=0x558848, p_madw=0x5b95a8) at osm_trap_rcv.c:759 #9 0x000000000042c158 in __osm_trap_rcv_ctrl_disp_callback (context=0x559b00, p_data=0x0) at osm_trap_rcv_ctrl.c:99 #10 0x00000000004035ce in __cl_disp_worker (context=0x559b00) at cl_dispatcher.c:138 #11 0x0000002a9588f5ef in __cl_thread_pool_routine (context=0x559b00) at cl_threadpool.c:111 #12 0x0000002a9588f4be in __cl_thread_wrapper (arg=0x0) at cl_thread.c:94 ---Type to continue, or q to quit--- #13 0x0000002a9567213a in start_thread () from /lib64/tls/libpthread.so.0 #14 0x0000002a95ce33c3 in clone () from /lib64/tls/libc.so.6 #15 0x0000000000000000 in ?? () -- Tom Duffy From halr at voltaire.com Fri Dec 24 05:10:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Dec 2004 08:10:14 -0500 Subject: [openib-general] progress...but opensm crashed In-Reply-To: References: Message-ID: <1103893814.4054.420.camel@localhost.localdomain> On Fri, 2004-12-24 at 02:30, shaharf wrote: > Hi Tom, > It seems that one of nodes in your subnet registered itself to be informed on traps. This is one of the rare cases that require GRH headers which are not really supported yet. > > I didn't think GRH has a high priority, but as I see that it is really used, > I will have to deal with that sooner then I thought. 
GRH support is currently lacking from user_mad.c and I think there is some support needed in the OpenSM OpenIB "vendor" interface for this. > Thanks for the information. I would be glad to have your osm.log file (probably in /tmp) if you have run it with -V. This could let me check my theory. Also, would it be possible to turn this off for the time being (and see if the subnet comes up) ? -- Hal From roland at topspin.com Fri Dec 24 07:31:50 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 24 Dec 2004 07:31:50 -0800 Subject: [openib-general] progress...but opensm crashed In-Reply-To: (shaharf@voltaire.com's message of "Fri, 24 Dec 2004 09:30:54 +0200") References: Message-ID: <523bxv67bt.fsf@topspin.com> shaharf> Hi Tom, It seems that one of nodes in your subnet shaharf> registered itself to be informed on traps. This is one of shaharf> the rare cases that require GRH headers which are not shaharf> really supported yet. I'm confused -- I thought only MADs sent between subnets need GRHs? - Roland From eric.lemoine at gmail.com Fri Dec 24 08:10:38 2004 From: eric.lemoine at gmail.com (Eric Lemoine) Date: Fri, 24 Dec 2004 17:10:38 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <41CAF444.3000305@trash.net> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> <5cac192f0412230110628749e3@mail.gmail.com> <41CAF444.3000305@trash.net> Message-ID: <5cac192f04122408102129af43@mail.gmail.com> On Thu, 23 Dec 2004 17:37:24 +0100, Patrick McHardy wrote: > Eric Lemoine wrote: > > I still have one concern with the LLTX code (and it may be that the > > correct patch is Jamal's) : > > > > Without LLTX we do : lock(queue_lock), lock(xmit_lock), > > release(queue_lock), release(xmit_lock). With LLTX (without Jamal's > > patch) we do : lock(queue_lock), release(queue_lock), lock(tx_lock), > > release(tx_lock). LLTX doesn't look correct because it creates a race > > condition window between the the two lock-protected sections. So you > > may want to reconsider Jamal's patch or pull out LLTX... > > You're right, it can cause packet reordering if something like this > happens: > > CPU1 CPU2 > lock(queue_lock) > dequeue > unlock(queue_lock) > lock(queue_lock) > dequeue > unlock(queue_lock) > lock(xmit_lock) > hard_start_xmit > unlock(xmit_lock) > lock(xmit_lock) > hard_start_xmit > unlock(xmit_lock) > > Jamal's patch should fix this. Yes but requiring drivers to release a lock that they should not even be aware of doesn't sound good. Another way would be to keep dev->queue_lock grabbed when entering start_xmit() and let the driver drop it (and re-acquire it before it returns) only if it wishes so. Although I don't like this too much either, that's the best way I can think of up to now... -- Eric From halr at voltaire.com Fri Dec 24 08:57:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Dec 2004 11:57:29 -0500 Subject: [openib-general] progress...but opensm crashed In-Reply-To: <523bxv67bt.fsf@topspin.com> References: <523bxv67bt.fsf@topspin.com> Message-ID: <1103907449.4054.431.camel@localhost.localdomain> On Fri, 2004-12-24 at 10:31, Roland Dreier wrote: > shaharf> Hi Tom, It seems that one of nodes in your subnet > shaharf> registered itself to be informed on traps. This is one of > shaharf> the rare cases that require GRH headers which are not > shaharf> really supported yet. 
> > I'm confused -- I thought only MADs sent between subnets need GRHs? In general that is true. There is nothing that precludes sending with a GRH inside a subnet (other than it is slightly more overhead). In terms of Report(Notice), it depends on whether the Set(InformInfo) included a GRH or not. I think it may be used with redirection as well (ClassPortInfo::RedirectGID non 0). -- Hal From shaharf at voltaire.com Mon Dec 27 02:55:10 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 27 Dec 2004 12:55:10 +0200 Subject: [openib-general] progress...but opensm crashed Message-ID: Tom, I debugged the opensm a little, and it seems like a simple opensm bug. Somehow the destination lid of the report target was not found, and as there were no checks for that, the opensm faulted. I added the missing checks. I don't know how this happened - i.e. why was the dest port invalid. It may happen during reconfiguration phases. Again, log files would help me very much. Please update your opensm and try again. Shahar > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Tom Duffy > Sent: Friday, December 24, 2004 1:17 AM > To: openib-general at openib.org > Subject: [openib-general] progress...but opensm crashed > > The good news was that my port went to active on the node running opensm > (way to go Shahar!). > > The bad news is that there was no xmas miracle when I brought up another > node on the subnet. > > (gdb) bt > #0 0x0000002a95994f1e in stack_dump () at stack.c:33 > #1 0x0000002a959953e1 in handler (x=11) at stack.c:112 > #2 > #3 0x0000000000410121 in osm_port_share_pkey (p_log=0x559af8, > p_port_1=0x5913c0, p_port_2=0x0) at osm_port.h:1616 > #4 0x0000000000408a5c in __match_notice_to_inf_rec (p_list_item=0x559b00, > context=0x0) at osm_inform.c:599 > #5 0x0000002a9588b044 in cl_qlist_apply_func (p_list=0x557ce0, > pfn_func=0x4086d9 <__match_notice_to_inf_rec>, context=0x43004f90) > at cl_list.c:387 > #6 0x0000000000408c8a in osm_report_notice (p_log=0x559af8, > p_subn=0x557940, > p_ntc=0x430050d0) at osm_inform.c:705 > #7 0x000000000042bfa2 in __osm_trap_rcv_process_request (p_rcv=0x558848, > p_madw=0x5b95a8) at osm_trap_rcv.c:681 > #8 0x000000000042c128 in osm_trap_rcv_process (p_rcv=0x558848, > p_madw=0x5b95a8) at osm_trap_rcv.c:759 > #9 0x000000000042c158 in __osm_trap_rcv_ctrl_disp_callback > (context=0x559b00, > p_data=0x0) at osm_trap_rcv_ctrl.c:99 > #10 0x00000000004035ce in __cl_disp_worker (context=0x559b00) > at cl_dispatcher.c:138 > #11 0x0000002a9588f5ef in __cl_thread_pool_routine (context=0x559b00) > at cl_threadpool.c:111 > #12 0x0000002a9588f4be in __cl_thread_wrapper (arg=0x0) at cl_thread.c:94 > ---Type to continue, or q to quit--- > #13 0x0000002a9567213a in start_thread () from /lib64/tls/libpthread.so.0 > #14 0x0000002a95ce33c3 in clone () from /lib64/tls/libc.so.6 > #15 0x0000000000000000 in ?? () > > -- > Tom Duffy From ido at mellanox.co.il Mon Dec 27 06:06:33 2004 From: ido at mellanox.co.il (Ido Bukspan) Date: Mon, 27 Dec 2004 16:06:33 +0200 Subject: [openib-general] IPoIB performance. Message-ID: <91DB792C7985D411BEC300B40080D29C711C5A@mtvex01.mtv.mtl.com> Hello I have been investigating the performance of the IPoIB for a while and I have discovered 3 interesting points which I think are worth implementing in gen2. 1. We can divide the single CQ into two separate completion queues: one for the RQ and the other for SQ. 
Then we can change the CQ policy affiliated with the SQ into IB_CQ_CONSUMER_REARM and in mainstream not arm the CQ. In such case the poll_cq_tq will be called from the send_packet method, and will reap completions without any need for interrupts/events. Obviously in cases when we have to stop the queue ( e.g. no more room available) , we need to arm the CQ until completions arrive. In general, this change reduces the interrupt rate. It may also help when posting and polling on the SQ, happens from two different processors (e.g. spinlock clash). 2. The current IPoIB driver is signaling every transmitting packet. We can improve performance by selective signaling )e.g. every 5 to 10 packets(. Note that we did notice several problems when doing it. This approach can have a problem in case of for example ping (ICMP) which do not allocate new buffers before the first ones are released. I can think of some WA to this problem such as send a dummy packet every now or then (which won't go out of the device). this includes changing the send policy to be IB_WQ_SIGNAL_SELECTABLE. When a selective WQE completes, the ipoib driver has to internally complete the rest of WQEs that weren't signaled. This change mainly reduces the overhead required by the HW driver to poll on the completion queue. 3. I think we should call netif_wake_queue only if the queue is actually stopped because as far as I understand, the kernel can schedule another process after calling wake queue. I have tried these changes and got ~20% improvement in performance )2 Tavor machines with dual CPU*3.1 GH (Xeon), AS3.0 U3) If you find it interesting, I can work on a patch for gen2. What do you think? -Ido Ido Bukspan Mellanox Technologies Ltd. Phone : (972)-3-6259500 ,Ext 518. Fax : (972)-3-5614943 mailto:ido at mellanox.co.il http://www.mellanox.com No play No game From roland at topspin.com Mon Dec 27 08:21:13 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 08:21:13 -0800 Subject: [openib-general] IPoIB performance. In-Reply-To: <91DB792C7985D411BEC300B40080D29C711C5A@mtvex01.mtv.mtl.com> (Ido Bukspan's message of "Mon, 27 Dec 2004 16:06:33 +0200") References: <91DB792C7985D411BEC300B40080D29C711C5A@mtvex01.mtv.mtl.com> Message-ID: <528y7jlnk6.fsf@topspin.com> Ido> 1. We can divide the single CQ into two separate completion Ido> queues: one for the RQ and the other for SQ. Then we can Ido> change the CQ policy affiliated with the SQ into Ido> IB_CQ_CONSUMER_REARM and in mainstream not arm the CQ. In Ido> such case the poll_cq_tq will be called from the send_packet Ido> method, and will reap completions without any need for Ido> interrupts/events. Obviously in cases when we have to stop Ido> the queue ( e.g. no more room available) , we need to arm the Ido> CQ until completions arrive. In general, this change reduces Ido> the interrupt rate. It may also help when posting and polling Ido> on the SQ, happens from two different processors Ido> (e.g. spinlock clash). I guess you also need to set a timer to poll the send queue so that you eventually get a completion for all sends, even when a packet is sent and another one isn't sent for a while. Also we would need to try a variety of workloads, because splitting the send and receive CQs means that we will always have to lock the CQ twice to poll sends and completions. For example "NPtcp -2" would be a useful test. Ido> 2. The current IPoIB driver is signaling every transmitting Ido> packet. We can improve performance by selective signaling Ido> )e.g. 
every 5 to 10 packets(. Note that we did notice Ido> several problems when doing it. This approach can have a Ido> problem in case of for example ping (ICMP) which do not Ido> allocate new buffers before the first ones are released. I Ido> can think of some WA to this problem such as send a dummy Ido> packet every now or then (which won't go out of the device). Ido> this includes changing the send policy to be Ido> IB_WQ_SIGNAL_SELECTABLE. When a selective WQE completes, the Ido> ipoib driver has to internally complete the rest of WQEs that Ido> weren't signaled. This change mainly reduces the overhead Ido> required by the HW driver to poll on the completion queue. I've looked at this in the past as well. As you point out, the kernel needs the destructor of an skb being sent to be called before it can free space in a socket buffer, so you would need to be very clever here. I'll be curious to see your approach. Ido> 3. I think we should call netif_wake_queue only if the queue Ido> is actually stopped because as far as I understand, the Ido> kernel can schedule another process after calling wake queue. Thanks for pointing this out (although of course netif_wake_queue and __netif_schedule don't actually schedule a new process -- since we're calling netif_wake_queue from interrupt context they couldn't in any case). The expensive thing seems to be the clearing and restoring interrupts in __netif_schedule. In any case I committed the change below, which seems to be about a 3% improvement. Ido> I have tried these changes and got ~20% improvement in Ido> performance )2 Tavor machines with dual CPU*3.1 GH (Xeon), Ido> AS3.0 U3) If you find it interesting, I can work on a patch Ido> for gen2. If you write a patch and do some benchmarking, that would be great. Thanks, Roland Index: ulp/ipoib/ipoib_ib.c =================================================================== --- ulp/ipoib/ipoib_ib.c (revision 1383) +++ ulp/ipoib/ipoib_ib.c (working copy) @@ -249,7 +249,8 @@ spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; - if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + if (netif_queue_stopped(dev) && + priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); From shaharf at voltaire.com Mon Dec 27 08:29:40 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 27 Dec 2004 18:29:40 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: > > I think gnome is a pretty good parallel, because the contents of > userspace/ are really separate projects, with different sets of > developers and different maintainers. Having one main makefile is > confusing for users and packagers, and the current scheme is also > error prone, since it's hard to maintain correct cross-package > dependencies (for example, touch include/common.h, run make, and watch > binaries that depend on it such as mad_test fail to get rebuilt). > > - Roland I am not sure everything in usermode is separate project. For example, osm, util, lib* do belong to the same package. The dependencies argument is for handling everything under the same root. You are correct about the remake problem, but this is just a bug - forgot to include the .depend file - it is fixed now. You have a point that there may be some directories that should be separate projects (even if not independent). When we will have such projects, we should re-structure the tree. 
If you think it is very important to have such multi trees environment right from the start, I don't really care (it is just one more directory name to type) as long as everything required by the opensm can be made by a single make. Shahar From roland at topspin.com Mon Dec 27 08:36:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 08:36:59 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: (shaharf@voltaire.com's message of "Mon, 27 Dec 2004 18:29:40 +0200") References: Message-ID: <524qi7lmtw.fsf@topspin.com> shaharf> I am not sure everything in usermode is separate shaharf> project. For example, osm, util, lib* do belong to the shaharf> same package. The dependencies argument is for handling shaharf> everything under the same root. You are correct about the shaharf> remake problem, but this is just a bug - forgot to shaharf> include the .depend file - it is fixed now. You have a shaharf> point that there may be some directories that should be shaharf> separate projects (even if not independent). When we shaharf> will have such projects, we should re-structure the shaharf> tree. If you think it is very important to have such shaharf> multi trees environment right from the start, I don't shaharf> really care (it is just one more directory name to type) shaharf> as long as everything required by the opensm can be made shaharf> by a single make. We already have osm, diags, and tvflash which are all independent. I'll commit directories for libibvers and libmthca, which will be two more independent projects. Presumably MPI and possibly DAPL/ITAPI will start work soon. That makes at least 6 or 7 independent packages under userspace, so I'd prefer to get the tree structure correct now. - Roland From shaharf at voltaire.com Mon Dec 27 08:49:35 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 27 Dec 2004 18:49:35 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, December 27, 2004 6:37 PM > To: shaharf > Cc: openib-general at openib.org > Subject: Re: [openib-general] Problem with userspace tree structure > > We already have osm, diags, and tvflash which are all independent. > I'll commit directories for libibvers and libmthca, which will be two > more independent projects. Presumably MPI and possibly DAPL/ITAPI > will start work soon. That makes at least 6 or 7 independent packages > under userspace, so I'd prefer to get the tree structure correct now. > > - Roland > OK. How about the following tree? Usermode Host-tools Tvflash Switch-tools ... Management Libcommon Libumad Diag ... osm Complib opensm Access-libs Libibverbs Libmthca MPI ... ... From roland at topspin.com Mon Dec 27 09:07:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 09:07:54 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: (shaharf@voltaire.com's message of "Mon, 27 Dec 2004 18:49:35 +0200") References: Message-ID: <52zmzzk6tx.fsf@topspin.com> shaharf> OK. How about the following tree? That's OK, but why do we need a hierarchy? Why not just put every package in a top-level directory like: tvflash management libibverbs libmthca mpi and so on. - Roland From mst at mellanox.co.il Mon Dec 27 09:14:08 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 27 Dec 2004 19:14:08 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <524qi7lmtw.fsf@topspin.com> References: <524qi7lmtw.fsf@topspin.com> Message-ID: <20041227171408.GA10124@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Problem with userspace tree structure": > We already have osm, diags, and tvflash which are all independent. > I'll commit directories for libibvers and libmthca, which will be two > more independent projects. Could you add mstflint? Copy from contrib/mellanox. MST From shaharf at voltaire.com Mon Dec 27 09:21:11 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 27 Dec 2004 19:21:11 +0200 Subject: [openib-general] Opensm - retries added Message-ID: Hi all, I added very simple retries mechanism to the SM. It is not really checked, because I don't have a controlled way to loose packets. It is committed. Please let me know if it you encounter related problems (or non-related problems ;-) Shahar -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: retries-diff Type: application/octet-stream Size: 1532 bytes Desc: retries-diff URL: From mst at mellanox.co.il Mon Dec 27 13:32:08 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Dec 2004 23:32:08 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: References: Message-ID: <20041227213208.GA13309@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Problem with userspace tree structure": > > > > -----Original Message----- > > From: Roland Dreier [mailto:roland at topspin.com] > > Sent: Monday, December 27, 2004 6:37 PM > > To: shaharf > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] Problem with userspace tree structure > > > > We already have osm, diags, and tvflash which are all independent. > > I'll commit directories for libibvers and libmthca, which will be two > > more independent projects. Presumably MPI and possibly DAPL/ITAPI > > will start work soon. That makes at least 6 or 7 independent packages > > under userspace, so I'd prefer to get the tree structure correct now. > > > > - Roland > > > > OK. How about the following tree? > > Usermode > Host-tools > Tvflash > Switch-tools > ... > Management > Libcommon > Libumad > Diag > ... > osm > Complib > opensm > Access-libs > Libibverbs > Libmthca > MPI > ... > ... Could you please remove openib/src/userspace/osm.old as a first step? Its really painful to get two copies of opensm each time. If you thjink you need it for reference, you could always just move it somewhere ... mst From roland at topspin.com Mon Dec 27 14:19:03 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 14:19:03 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041227213208.GA13309@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 27 Dec 2004 23:32:08 +0200") References: <20041227213208.GA13309@mellanox.co.il> Message-ID: <52ekhbjsfc.fsf@topspin.com> Michael> Could you please remove openib/src/userspace/osm.old as a Michael> first step? Its really painful to get two copies of Michael> opensm each time. If you thjink you need it for Michael> reference, you could always just move it somewhere ... I agree osm.old should be deleted. 
No need to move it -- since we're using a revision control system, one can always check out any specific revision from the past that's of interest. - R. From roland at topspin.com Mon Dec 27 18:01:53 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 18:01:53 -0800 Subject: [openib-general] Start of userspace verbs support Message-ID: <52oegfji3y.fsf@topspin.com> I've just committed skeleton versions of userspace/libibverbs and userspace/libmthca. This is mostly to get early feedback on the structure of the code, etc. -- what's there doesn't do much that's useful. libibverbs is the device-independent part of userspace verbs, and is what applications actually link to. libmthca is device-dependent support for the same set of Mellanox HCAs as mthca support in the kernel. Right now, about all you can do is build the libraries, set OPENIB_DRIVER_PATH to a path where mthca.so can be found, and run ib_devices to get a list of devices and node GUIDs. All comments and suggestions welcome as usual. Thanks, Roland From davem at davemloft.net Mon Dec 27 21:00:07 2004 From: davem at davemloft.net (David S. Miller) Date: Mon, 27 Dec 2004 21:00:07 -0800 Subject: [openib-general] Re: [PATCH][v4][0/24] Second InfiniBand merge candidate patch set In-Reply-To: <20041220.153845.70996857.yoshfuji@linux-ipv6.org> References: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> <20041220.153845.70996857.yoshfuji@linux-ipv6.org> Message-ID: <20041227210007.398734cd.davem@davemloft.net> On Mon, 20 Dec 2004 15:38:45 +0900 (JST) YOSHIFUJI Hideaki / $B5HF#1QL@(B wrote: > In article <200412192214.KlDxQ7icOmxHYIf0 at topspin.com> (at Sun, 19 Dec 2004 22:14:43 -0800), Roland Dreier says: > > > The following series of patches is the latest version of the OpenIB > > InfiniBand drivers. We believe that this version is suitable for > > merging when 2.6.11 opens (or into -mm immediately), although of > > course we are willing to go through as many more iterations as > > required to fix any remaining issues. > > Maybe, via the net queue. David? If Roland can resubmit his patch queue to me with the fixes folks have recommended to him, I can start merging this stuff in. I leave for vacation Wednesday morning (PST time), so if it is submitted after that I'll get to it at the beginning of the new year. From roland at topspin.com Mon Dec 27 21:19:26 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:19:26 -0800 Subject: [openib-general] Re: [PATCH][v4][0/24] Second InfiniBand merge candidate patch set In-Reply-To: <20041227210007.398734cd.davem@davemloft.net> (David S. Miller's message of "Mon, 27 Dec 2004 21:00:07 -0800") References: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> <20041220.153845.70996857.yoshfuji@linux-ipv6.org> <20041227210007.398734cd.davem@davemloft.net> Message-ID: <52k6r3j8yp.fsf@topspin.com> David> If Roland can resubmit his patch queue to me with the fixes David> folks have recommended to him, I can start merging this David> stuff in. Andrew has indicated to me that he has merged the whole InfiniBand patch into -mm now. Dave, did you want to handle the entire merge of the whole IB stack, or just the net/ parts, which are pretty trivial and stand alone, since AF_INFINIBAND is already defined in the tree? I'm happy to do whatever it takes to get IB merged as expeditiously as possible so Dave & Andrew, please let me know what seems easiest and best to you. Thanks, Roland From davem at davemloft.net Mon Dec 27 21:26:15 2004 From: davem at davemloft.net (David S. 
Miller) Date: Mon, 27 Dec 2004 21:26:15 -0800 Subject: [openib-general] Re: [PATCH][v4][0/24] Second InfiniBand merge candidate patch set In-Reply-To: <52k6r3j8yp.fsf@topspin.com> References: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> <20041220.153845.70996857.yoshfuji@linux-ipv6.org> <20041227210007.398734cd.davem@davemloft.net> <52k6r3j8yp.fsf@topspin.com> Message-ID: <20041227212615.0536c99f.davem@davemloft.net> On Mon, 27 Dec 2004 21:19:26 -0800 Roland Dreier wrote: > Dave, did you want to handle the entire merge of the whole IB stack, > or just the net/ parts, which are pretty trivial and stand alone, > since AF_INFINIBAND is already defined in the tree? Send it all over. From roland at topspin.com Mon Dec 27 21:40:06 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:40:06 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/speed Message-ID: <52fz1rj809.fsf@topspin.com> Here's a patch that adds active_width and active_speed members to the port attributes structure. This is one step towards having IPoIB set the static rate of its address handles properly. I also added a "rate" sysfs attribute for each port, which shows the data rate. I don't have any 12X or DDR ports to test with, but 1X and 4X display correctly for me. While I was at it, I killed off the ib_static_rate enum, because the 1.2 spec adds a ton more IPD values (due to the addition of DDR and QDR as well as 8X), and it seemed like there wasn't really a good way to name all of them. For example, should the IPD value 1 be called IB_STATIC_RATE_8X_TO_4X? IB_STATIC_RATE_DDR_TO_SDR? IB_STATIC_RATE_QDR_TO_DDR? Should it have three different names? The enum wasn't used anywhere so it didn't seem worth struggling with this issue. In actual code it seems better to compute the ratio of data rates and derive the IPD value from that. Look OK to commit? 
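(To make the arithmetic concrete -- this is an illustration only, not part of the patch: the rate attribute below computes 25 * width * speed in units of 0.1 Gb/sec, so a 1X SDR port reads "2.5 Gb/sec (1X)" and a 4X SDR port reads "10 Gb/sec (4X)". Deriving an IPD from the ratio of data rates could then look roughly like the hypothetical helper below, which happens to reproduce the old enum values -- 12X-to-4X gives 2, 4X-to-1X gives 3, 12X-to-1X gives 11.)

static int rate_to_ipd(int local_rate, int path_rate)
{
	/* No throttling needed if the path is at least as fast as this port. */
	if (path_rate <= 0 || path_rate >= local_rate)
		return 0;
	/* Round up so the resulting rate never exceeds the path rate. */
	return (local_rate + path_rate - 1) / path_rate - 1;
}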
Thanks, Roland Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1387) +++ infiniband/include/ib_verbs.h (working copy) @@ -144,13 +144,6 @@ } } -enum ib_static_rate { - IB_STATIC_RATE_FULL = 0, - IB_STATIC_RATE_12X_TO_4X = 2, - IB_STATIC_RATE_4X_TO_1X = 3, - IB_STATIC_RATE_12X_TO_1X = 11 -}; - enum ib_port_state { IB_PORT_NOP = 0, IB_PORT_DOWN = 1, @@ -182,6 +175,24 @@ IB_PORT_BOOT_MGMT_SUP = (1<<9) }; +enum ib_port_width { + IB_WIDTH_1X = 1, + IB_WIDTH_4X = 2, + IB_WIDTH_8X = 4, + IB_WIDTH_12X = 8 +}; + +static inline int ib_width_enum_to_int(enum ib_port_width width) +{ + switch (width) { + case IB_WIDTH_1X: return 1; + case IB_WIDTH_4X: return 4; + case IB_WIDTH_8X: return 8; + case IB_WIDTH_12X: return 12; + default: return -1; + } +} + struct ib_port_attr { enum ib_port_state state; enum ib_mtu max_mtu; @@ -199,6 +210,8 @@ u8 sm_sl; u8 subnet_timeout; u8 init_type_reply; + u8 active_width; + u8 active_speed; }; enum ib_device_modify_flags { Index: infiniband/core/sysfs.c =================================================================== --- infiniband/core/sysfs.c (revision 1387) +++ infiniband/core/sysfs.c (working copy) @@ -171,12 +171,41 @@ return sprintf(buf, "0x%08x\n", attr.port_cap_flags); } +static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + char *speed = ""; + int rate; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + switch (attr.active_speed) { + case 2: speed = " DDR"; break; + case 4: speed = " QDR"; break; + } + + printk(KERN_ERR "width %d speed %d\n", attr.active_width, attr.active_speed); + + rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed; + if (rate < 0) + return -EINVAL; + + return sprintf(buf, "%d%s Gb/sec (%dX%s)\n", + rate / 10, rate % 10 ? 
".5" : "", + ib_width_enum_to_int(attr.active_width), speed); +} + static PORT_ATTR_RO(state); static PORT_ATTR_RO(lid); static PORT_ATTR_RO(lid_mask_count); static PORT_ATTR_RO(sm_lid); static PORT_ATTR_RO(sm_sl); static PORT_ATTR_RO(cap_mask); +static PORT_ATTR_RO(rate); static struct attribute *port_default_attrs[] = { &port_attr_state.attr, @@ -185,6 +214,7 @@ &port_attr_sm_lid.attr, &port_attr_sm_sl.attr, &port_attr_cap_mask.attr, + &port_attr_rate.attr, NULL }; Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 1397) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -115,14 +115,16 @@ } props->lid = be16_to_cpup((u16 *) (out_mad->data + 16)); - props->lmc = (*(u8 *) (out_mad->data + 34)) & 0x7; + props->lmc = out_mad->data[34] & 0x7; props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 18)); - props->sm_sl = (*(u8 *) (out_mad->data + 36)) & 0xf; - props->state = (*(u8 *) (out_mad->data + 32)) & 0xf; + props->sm_sl = out_mad->data[36] & 0xf; + props->state = out_mad->data[32] & 0xf; props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 20)); props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 48)); + props->active_width = out_mad->data[31] & 0xf; + props->active_speed = out_mad->data[35] >> 4; out: kfree(in_mad); Index: docs/sysfs.txt =================================================================== --- docs/sysfs.txt (revision 1387) +++ docs/sysfs.txt (working copy) @@ -21,6 +21,7 @@ cap_mask - Port capability mask lid - Port LID lid_mask_count - Port LID mask count + rate - Port data rate (active width * active speed) sm_lid - Subnet manager LID for port's subnet sm_sl - Subnet manager SL for port's subnet state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) From roland at topspin.com Mon Dec 27 21:50:47 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:47 -0800 Subject: [openib-general] [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: 20041227212615.0536c99f.davem@davemloft.net Message-ID: <200412272150.IBRnA4AvjendsF8x@topspin.com> >>>>> "David" == David S Miller writes: David> Send it all over. OK, you asked for it... here's our latest tree, which should incorporate all the feedback I've seen. (Individuals trimmed from CC list, since they probably don't want to get all 24 patches over again) Thanks, Roland Dreier From roland at topspin.com Mon Dec 27 21:50:48 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:48 -0800 Subject: [openib-general] [PATCH][v5][1/24] Add core InfiniBand support (public headers) In-Reply-To: <200412272150.IBRnA4AvjendsF8x@topspin.com> Message-ID: <200412272150.khYObxkGxtPP9Oju@topspin.com> Add public headers for core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_cache.h 2004-12-27 21:48:17.561381253 -0800 @@ -0,0 +1,53 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef _IB_CACHE_H +#define _IB_CACHE_H + +#include + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid); +int ib_cached_pkey_get(struct ib_device *device_handle, + u8 port, + int index, + u16 *pkey); +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index); + +#endif /* _IB_CACHE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_fmr_pool.h 2004-12-27 21:48:17.586377574 -0800 @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_fmr_pool.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#if !defined(IB_FMR_POOL_H) +#define IB_FMR_POOL_H + +#include + +struct ib_fmr_pool; + +/** + * struct ib_fmr_pool_param - Parameters for creating FMR pool + * @max_pages_per_fmr:Maximum number of pages per map request. 
+ * @access:Access flags for FMRs in pool. + * @pool_size:Number of FMRs to allocate for pool. + * @dirty_watermark:Flush is triggered when @dirty_watermark dirty + * FMRs are present. + * @flush_function:Callback called when unmapped FMRs are flushed and + * more FMRs are possibly available for mapping + * @flush_arg:Context passed to user's flush function. + * @cache:If set, FMRs may be reused after unmapping for identical map + * requests. + */ +struct ib_fmr_pool_param { + int max_pages_per_fmr; + enum ib_access_flags access; + int pool_size; + int dirty_watermark; + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + unsigned cache:1; +}; + +struct ib_pool_fmr { + struct ib_fmr *fmr; + struct ib_fmr_pool *pool; + struct list_head list; + struct hlist_node cache_node; + int ref_count; + int remap_count; + u64 io_virtual_address; + int page_list_len; + u64 page_list[0]; +}; + +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr); + +#endif /* IB_FMR_POOL_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_pack.h 2004-12-27 21:48:17.640369627 -0800 @@ -0,0 +1,245 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_pack.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef IB_PACK_H +#define IB_PACK_H + +#include + +enum { + IB_LRH_BYTES = 8, + IB_GRH_BYTES = 40, + IB_BTH_BYTES = 12, + IB_DETH_BYTES = 8 +}; + +struct ib_field { + size_t struct_offset_bytes; + size_t struct_size_bytes; + int offset_words; + int offset_bits; + int size_bits; + char *field_name; +}; + +#define RESERVED \ + .field_name = "reserved" + +/* + * This macro cleans up the definitions of constants for BTH opcodes. + * It is used to define constants such as IB_OPCODE_UD_SEND_ONLY, + * which becomes IB_OPCODE_UD + IB_OPCODE_SEND_ONLY, and this gives + * the correct value. 
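+ * (With the values defined below this means, for instance, that + * IB_OPCODE_UD_SEND_ONLY works out to 0x60 + 0x04 = 0x64.)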
+ * + * In short, user code should use the constants defined using the + * macro rather than worrying about adding together other constants. +*/ +#define IB_OPCODE(transport, op) \ + IB_OPCODE_ ## transport ## _ ## op = \ + IB_OPCODE_ ## transport + IB_OPCODE_ ## op + +enum { + /* transport types -- just used to define real constants */ + IB_OPCODE_RC = 0x00, + IB_OPCODE_UC = 0x20, + IB_OPCODE_RD = 0x40, + IB_OPCODE_UD = 0x60, + + /* operations -- just used to define real constants */ + IB_OPCODE_SEND_FIRST = 0x00, + IB_OPCODE_SEND_MIDDLE = 0x01, + IB_OPCODE_SEND_LAST = 0x02, + IB_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + IB_OPCODE_SEND_ONLY = 0x04, + IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + IB_OPCODE_RDMA_WRITE_FIRST = 0x06, + IB_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + IB_OPCODE_RDMA_WRITE_LAST = 0x08, + IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + IB_OPCODE_RDMA_WRITE_ONLY = 0x0a, + IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + IB_OPCODE_RDMA_READ_REQUEST = 0x0c, + IB_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + IB_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + IB_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + IB_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + IB_OPCODE_ACKNOWLEDGE = 0x11, + IB_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + IB_OPCODE_COMPARE_SWAP = 0x13, + IB_OPCODE_FETCH_ADD = 0x14, + + /* real constants follow -- see comment about above IB_OPCODE() + macro for more details */ + + /* RC */ + IB_OPCODE(RC, SEND_FIRST), + IB_OPCODE(RC, SEND_MIDDLE), + IB_OPCODE(RC, SEND_LAST), + IB_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, SEND_ONLY), + IB_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_FIRST), + IB_OPCODE(RC, RDMA_WRITE_MIDDLE), + IB_OPCODE(RC, RDMA_WRITE_LAST), + IB_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_ONLY), + IB_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_READ_REQUEST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RC, ACKNOWLEDGE), + IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RC, COMPARE_SWAP), + IB_OPCODE(RC, FETCH_ADD), + + /* UC */ + IB_OPCODE(UC, SEND_FIRST), + IB_OPCODE(UC, SEND_MIDDLE), + IB_OPCODE(UC, SEND_LAST), + IB_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, SEND_ONLY), + IB_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_FIRST), + IB_OPCODE(UC, RDMA_WRITE_MIDDLE), + IB_OPCODE(UC, RDMA_WRITE_LAST), + IB_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_ONLY), + IB_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + + /* RD */ + IB_OPCODE(RD, SEND_FIRST), + IB_OPCODE(RD, SEND_MIDDLE), + IB_OPCODE(RD, SEND_LAST), + IB_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, SEND_ONLY), + IB_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_FIRST), + IB_OPCODE(RD, RDMA_WRITE_MIDDLE), + IB_OPCODE(RD, RDMA_WRITE_LAST), + IB_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_ONLY), + IB_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_READ_REQUEST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RD, ACKNOWLEDGE), + IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RD, COMPARE_SWAP), + IB_OPCODE(RD, FETCH_ADD), + + /* UD */ + IB_OPCODE(UD, SEND_ONLY), + IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) +}; + +enum { + IB_LNH_RAW = 0, + IB_LNH_IP = 1, + 
IB_LNH_IBA_LOCAL = 2, + IB_LNH_IBA_GLOBAL = 3 +}; + +struct ib_unpacked_lrh { + u8 virtual_lane; + u8 link_version; + u8 service_level; + u8 link_next_header; + __be16 destination_lid; + __be16 packet_length; + __be16 source_lid; +}; + +struct ib_unpacked_grh { + u8 ip_version; + u8 traffic_class; + __be32 flow_label; + __be16 payload_length; + u8 next_header; + u8 hop_limit; + union ib_gid source_gid; + union ib_gid destination_gid; +}; + +struct ib_unpacked_bth { + u8 opcode; + u8 solicited_event; + u8 mig_req; + u8 pad_count; + u8 transport_header_version; + __be16 pkey; + __be32 destination_qpn; + u8 ack_req; + __be32 psn; +}; + +struct ib_unpacked_deth { + __be32 qkey; + __be32 source_qpn; +}; + +struct ib_ud_header { + struct ib_unpacked_lrh lrh; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf); + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure); + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header); + +#endif /* IB_PACK_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_verbs.h 2004-12-27 21:48:17.684363151 -0800 @@ -0,0 +1,1249 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#if !defined(IB_VERBS_H) +#define IB_VERBS_H + +#include +#include +#include + +union ib_gid { + u8 raw[16]; + struct { + u64 subnet_prefix; + u64 interface_id; + } global; +}; + +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + +enum ib_device_cap_flags { + IB_DEVICE_RESIZE_MAX_WR = 1, + IB_DEVICE_BAD_PKEY_CNTR = (1<<1), + IB_DEVICE_BAD_QKEY_CNTR = (1<<2), + IB_DEVICE_RAW_MULTI = (1<<3), + IB_DEVICE_AUTO_PATH_MIG = (1<<4), + IB_DEVICE_CHANGE_PHY_PORT = (1<<5), + IB_DEVICE_UD_AV_PORT_ENFORCE = (1<<6), + IB_DEVICE_CURR_QP_STATE_MOD = (1<<7), + IB_DEVICE_SHUTDOWN_PORT = (1<<8), + IB_DEVICE_INIT_TYPE = (1<<9), + IB_DEVICE_PORT_ACTIVE_EVENT = (1<<10), + IB_DEVICE_SYS_IMAGE_GUID = (1<<11), + IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), + IB_DEVICE_SRQ_RESIZE = (1<<13), + IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_RQ_SIG_TYPE = (1<<15) +}; + +enum ib_atomic_cap { + IB_ATOMIC_NONE, + IB_ATOMIC_HCA, + IB_ATOMIC_GLOB +}; + +struct ib_device_attr { + u64 fw_ver; + u64 node_guid; + u64 sys_image_guid; + u64 max_mr_size; + u64 page_size_cap; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + int max_qp; + int max_qp_wr; + int device_cap_flags; + int max_sge; + int max_sge_rd; + int max_cq; + int max_cqe; + int max_mr; + int max_pd; + int max_qp_rd_atom; + int max_ee_rd_atom; + int max_res_rd_atom; + int max_qp_init_rd_atom; + int max_ee_init_rd_atom; + enum ib_atomic_cap atomic_cap; + int max_ee; + int max_rdd; + int max_mw; + int max_raw_ipv6_qp; + int max_raw_ethy_qp; + int max_mcast_grp; + int max_mcast_qp_attach; + int max_total_mcast_qp_attach; + int max_ah; + int max_fmr; + int max_map_per_fmr; + int max_srq; + int max_srq_wr; + int max_srq_sge; + u16 max_pkeys; + u8 local_ca_ack_delay; +}; + +enum ib_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +enum ib_port_state { + IB_PORT_NOP = 0, + IB_PORT_DOWN = 1, + IB_PORT_INIT = 2, + IB_PORT_ARMED = 3, + IB_PORT_ACTIVE = 4, + IB_PORT_ACTIVE_DEFER = 5 +}; + +enum ib_port_cap_flags { + IB_PORT_SM = (1<<31), + IB_PORT_NOTICE_SUP = (1<<30), + IB_PORT_TRAP_SUP = (1<<29), + IB_PORT_AUTO_MIGR_SUP = (1<<27), + IB_PORT_SL_MAP_SUP = (1<<26), + IB_PORT_MKEY_NVRAM = (1<<25), + IB_PORT_PKEY_NVRAM = (1<<24), + IB_PORT_LED_INFO_SUP = (1<<23), + IB_PORT_SM_DISABLED = (1<<22), + IB_PORT_SYS_IMAGE_GUID_SUP = (1<<21), + IB_PORT_PKEY_SW_EXT_PORT_TRAP_SUP = (1<<20), + IB_PORT_CM_SUP = (1<<16), + IB_PORT_SNMP_TUNNEL_SUP = (1<<15), + IB_PORT_REINIT_SUP = (1<<14), + IB_PORT_DEVICE_MGMT_SUP = (1<<13), + IB_PORT_VENDOR_CLASS_SUP = (1<<12), + IB_PORT_DR_NOTICE_SUP = (1<<11), + IB_PORT_PORT_NOTICE_SUP = (1<<10), + IB_PORT_BOOT_MGMT_SUP = (1<<9) +}; + +enum ib_port_width { + IB_WIDTH_1X = 1, + IB_WIDTH_4X = 2, + IB_WIDTH_8X = 4, + IB_WIDTH_12X = 8 +}; + +static inline int ib_width_enum_to_int(enum ib_port_width width) +{ + switch (width) { + case IB_WIDTH_1X: return 1; + case IB_WIDTH_4X: return 4; + case IB_WIDTH_8X: return 8; + case IB_WIDTH_12X: return 12; + default: return -1; + } +} + +struct ib_port_attr { + enum ib_port_state state; + enum ib_mtu max_mtu; + enum ib_mtu active_mtu; + int gid_tbl_len; + u32 port_cap_flags; + u32 max_msg_sz; + u32 bad_pkey_cntr; + 
u32 qkey_viol_cntr; + u16 pkey_tbl_len; + u16 lid; + u16 sm_lid; + u8 lmc; + u8 max_vl_num; + u8 sm_sl; + u8 subnet_timeout; + u8 init_type_reply; + u8 active_width; + u8 active_speed; +}; + +enum ib_device_modify_flags { + IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 +}; + +struct ib_device_modify { + u64 sys_image_guid; +}; + +enum ib_port_modify_flags { + IB_PORT_SHUTDOWN = 1, + IB_PORT_INIT_TYPE = (1<<2), + IB_PORT_RESET_QKEY_CNTR = (1<<3) +}; + +struct ib_port_modify { + u32 set_port_cap_mask; + u32 clr_port_cap_mask; + u8 init_type; +}; + +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + +struct ib_global_route { + union ib_gid dgid; + u32 flow_label; + u8 sgid_index; + u8 hop_limit; + u8 traffic_class; +}; + +enum { + IB_MULTICAST_QPN = 0xffffff +}; + +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + +struct ib_ah_attr { + struct ib_global_route grh; + u16 dlid; + u8 sl; + u8 src_path_bits; + u8 static_rate; + u8 ah_flags; + u8 port_num; +}; + +enum ib_wc_status { + IB_WC_SUCCESS, + IB_WC_LOC_LEN_ERR, + IB_WC_LOC_QP_OP_ERR, + IB_WC_LOC_EEC_OP_ERR, + IB_WC_LOC_PROT_ERR, + IB_WC_WR_FLUSH_ERR, + IB_WC_MW_BIND_ERR, + IB_WC_BAD_RESP_ERR, + IB_WC_LOC_ACCESS_ERR, + IB_WC_REM_INV_REQ_ERR, + IB_WC_REM_ACCESS_ERR, + IB_WC_REM_OP_ERR, + IB_WC_RETRY_EXC_ERR, + IB_WC_RNR_RETRY_EXC_ERR, + IB_WC_LOC_RDD_VIOL_ERR, + IB_WC_REM_INV_RD_REQ_ERR, + IB_WC_REM_ABORT_ERR, + IB_WC_INV_EECN_ERR, + IB_WC_INV_EEC_STATE_ERR, + IB_WC_FATAL_ERR, + IB_WC_RESP_TIMEOUT_ERR, + IB_WC_GENERAL_ERR +}; + +enum ib_wc_opcode { + IB_WC_SEND, + IB_WC_RDMA_WRITE, + IB_WC_RDMA_READ, + IB_WC_COMP_SWAP, + IB_WC_FETCH_ADD, + IB_WC_BIND_MW, +/* + * Set value of IB_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & IB_WC_RECV). + */ + IB_WC_RECV = 1 << 7, + IB_WC_RECV_RDMA_WITH_IMM +}; + +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + +struct ib_wc { + u64 wr_id; + enum ib_wc_status status; + enum ib_wc_opcode opcode; + u32 vendor_err; + u32 byte_len; + __be32 imm_data; + u32 src_qp; + int wc_flags; + u16 pkey_index; + u16 slid; + u8 sl; + u8 dlid_path_bits; + u8 port_num; /* valid only for DR SMPs on switches */ +}; + +enum ib_cq_notify { + IB_CQ_SOLICITED, + IB_CQ_NEXT_COMP +}; + +struct ib_qp_cap { + u32 max_send_wr; + u32 max_recv_wr; + u32 max_send_sge; + u32 max_recv_sge; + u32 max_inline_data; +}; + +enum ib_sig_type { + IB_SIGNAL_ALL_WR, + IB_SIGNAL_REQ_WR +}; + +enum ib_qp_type { + /* + * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries + * here (and in that order) since the MAD layer uses them as + * indices into a 2-entry table. 
+ */ + IB_QPT_SMI, + IB_QPT_GSI, + + IB_QPT_RC, + IB_QPT_UC, + IB_QPT_UD, + IB_QPT_RAW_IPV6, + IB_QPT_RAW_ETY +}; + +struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + struct ib_qp_cap cap; + enum ib_sig_type sq_sig_type; + enum ib_sig_type rq_sig_type; + enum ib_qp_type qp_type; + u8 port_num; /* special QP types only */ +}; + +enum ib_rnr_timeout { + IB_RNR_TIMER_655_36 = 0, + IB_RNR_TIMER_000_01 = 1, + IB_RNR_TIMER_000_02 = 2, + IB_RNR_TIMER_000_03 = 3, + IB_RNR_TIMER_000_04 = 4, + IB_RNR_TIMER_000_06 = 5, + IB_RNR_TIMER_000_08 = 6, + IB_RNR_TIMER_000_12 = 7, + IB_RNR_TIMER_000_16 = 8, + IB_RNR_TIMER_000_24 = 9, + IB_RNR_TIMER_000_32 = 10, + IB_RNR_TIMER_000_48 = 11, + IB_RNR_TIMER_000_64 = 12, + IB_RNR_TIMER_000_96 = 13, + IB_RNR_TIMER_001_28 = 14, + IB_RNR_TIMER_001_92 = 15, + IB_RNR_TIMER_002_56 = 16, + IB_RNR_TIMER_003_84 = 17, + IB_RNR_TIMER_005_12 = 18, + IB_RNR_TIMER_007_68 = 19, + IB_RNR_TIMER_010_24 = 20, + IB_RNR_TIMER_015_36 = 21, + IB_RNR_TIMER_020_48 = 22, + IB_RNR_TIMER_030_72 = 23, + IB_RNR_TIMER_040_96 = 24, + IB_RNR_TIMER_061_44 = 25, + IB_RNR_TIMER_081_92 = 26, + IB_RNR_TIMER_122_88 = 27, + IB_RNR_TIMER_163_84 = 28, + IB_RNR_TIMER_245_76 = 29, + IB_RNR_TIMER_327_68 = 30, + IB_RNR_TIMER_491_52 = 31 +}; + +enum ib_qp_attr_mask { + IB_QP_STATE = 1, + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), + IB_QP_ACCESS_FLAGS = (1<<3), + IB_QP_PKEY_INDEX = (1<<4), + IB_QP_PORT = (1<<5), + IB_QP_QKEY = (1<<6), + IB_QP_AV = (1<<7), + IB_QP_PATH_MTU = (1<<8), + IB_QP_TIMEOUT = (1<<9), + IB_QP_RETRY_CNT = (1<<10), + IB_QP_RNR_RETRY = (1<<11), + IB_QP_RQ_PSN = (1<<12), + IB_QP_MAX_QP_RD_ATOMIC = (1<<13), + IB_QP_ALT_PATH = (1<<14), + IB_QP_MIN_RNR_TIMER = (1<<15), + IB_QP_SQ_PSN = (1<<16), + IB_QP_MAX_DEST_RD_ATOMIC = (1<<17), + IB_QP_PATH_MIG_STATE = (1<<18), + IB_QP_CAP = (1<<19), + IB_QP_DEST_QPN = (1<<20) +}; + +enum ib_qp_state { + IB_QPS_RESET, + IB_QPS_INIT, + IB_QPS_RTR, + IB_QPS_RTS, + IB_QPS_SQD, + IB_QPS_SQE, + IB_QPS_ERR +}; + +enum ib_mig_state { + IB_MIG_MIGRATED, + IB_MIG_REARM, + IB_MIG_ARMED +}; + +struct ib_qp_attr { + enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; + enum ib_mtu path_mtu; + enum ib_mig_state path_mig_state; + u32 qkey; + u32 rq_psn; + u32 sq_psn; + u32 dest_qp_num; + int qp_access_flags; + struct ib_qp_cap cap; + struct ib_ah_attr ah_attr; + struct ib_ah_attr alt_ah_attr; + u16 pkey_index; + u16 alt_pkey_index; + u8 en_sqd_async_notify; + u8 sq_draining; + u8 max_rd_atomic; + u8 max_dest_rd_atomic; + u8 min_rnr_timer; + u8 port_num; + u8 timeout; + u8 retry_cnt; + u8 rnr_retry; + u8 alt_port_num; + u8 alt_timeout; +}; + +enum ib_wr_opcode { + IB_WR_RDMA_WRITE, + IB_WR_RDMA_WRITE_WITH_IMM, + IB_WR_SEND, + IB_WR_SEND_WITH_IMM, + IB_WR_RDMA_READ, + IB_WR_ATOMIC_CMP_AND_SWP, + IB_WR_ATOMIC_FETCH_AND_ADD +}; + +enum ib_send_flags { + IB_SEND_FENCE = 1, + IB_SEND_SIGNALED = (1<<1), + IB_SEND_SOLICITED = (1<<2), + IB_SEND_INLINE = (1<<3) +}; + +enum ib_recv_flags { + IB_RECV_SIGNALED = 1 +}; + +struct ib_sge { + u64 addr; + u32 length; + u32 lkey; +}; + +struct ib_send_wr { + struct ib_send_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + enum ib_wr_opcode opcode; + int send_flags; + u32 imm_data; + union { + struct { + u64 remote_addr; + u32 rkey; + } rdma; + struct { + u64 remote_addr; + u64 compare_add; + u64 swap; + u32 rkey; + } atomic; + struct { + struct ib_ah *ah; + struct ib_mad_hdr 
*mad_hdr; + u32 remote_qpn; + u32 remote_qkey; + int timeout_ms; /* valid for MADs only */ + u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ + } ud; + } wr; +}; + +struct ib_recv_wr { + struct ib_recv_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + int recv_flags; +}; + +enum ib_access_flags { + IB_ACCESS_LOCAL_WRITE = 1, + IB_ACCESS_REMOTE_WRITE = (1<<1), + IB_ACCESS_REMOTE_READ = (1<<2), + IB_ACCESS_REMOTE_ATOMIC = (1<<3), + IB_ACCESS_MW_BIND = (1<<4) +}; + +struct ib_phys_buf { + u64 addr; + u64 size; +}; + +struct ib_mr_attr { + struct ib_pd *pd; + u64 device_virt_addr; + u64 size; + int mr_access_flags; + u32 lkey; + u32 rkey; +}; + +enum ib_mr_rereg_flags { + IB_MR_REREG_TRANS = 1, + IB_MR_REREG_PD = (1<<1), + IB_MR_REREG_ACCESS = (1<<2) +}; + +struct ib_mw_bind { + struct ib_mr *mr; + u64 wr_id; + u64 addr; + u32 length; + int send_flags; + int mw_access_flags; +}; + +struct ib_fmr_attr { + int max_pages; + int max_maps; + u8 page_size; +}; + +struct ib_pd { + struct ib_device *device; + atomic_t usecnt; /* count all resources */ +}; + +struct ib_ah { + struct ib_device *device; + struct ib_pd *pd; +}; + +typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); + +struct ib_cq { + struct ib_device *device; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ +}; + +struct ib_srq { + struct ib_device *device; + struct ib_pd *pd; + void *srq_context; + atomic_t usecnt; +}; + +struct ib_qp { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + u32 qp_num; +}; + +struct ib_mr { + struct ib_device *device; + struct ib_pd *pd; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ +}; + +struct ib_mw { + struct ib_device *device; + struct ib_pd *pd; + u32 rkey; +}; + +struct ib_fmr { + struct ib_device *device; + struct ib_pd *pd; + struct list_head list; + u32 lkey; + u32 rkey; +}; + +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + +#define IB_DEVICE_NAME_MAX 64 + +struct ib_cache { + rwlock_t lock; + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct ib_gid_cache **gid_cache; +}; + +struct ib_device { + struct device *dma_device; + + char name[IB_DEVICE_NAME_MAX]; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + + struct ib_cache cache; + + u32 flags; + + int (*query_device)(struct ib_device *device, + struct ib_device_attr *device_attr); + int (*query_port)(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr); + int (*query_gid)(struct ib_device *device, + u8 port_num, int index, + union ib_gid *gid); + int (*query_pkey)(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + int (*modify_device)(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + int (*modify_port)(struct 
ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + struct ib_pd * (*alloc_pd)(struct ib_device *device); + int (*dealloc_pd)(struct ib_pd *pd); + struct ib_ah * (*create_ah)(struct ib_pd *pd, + struct ib_ah_attr *ah_attr); + int (*modify_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*query_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*destroy_ah)(struct ib_ah *ah); + struct ib_qp * (*create_qp)(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + int (*modify_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + int (*query_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int (*destroy_qp)(struct ib_qp *qp); + int (*post_send)(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + int (*post_recv)(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + struct ib_cq * (*create_cq)(struct ib_device *device, + int cqe); + int (*destroy_cq)(struct ib_cq *cq); + int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*poll_cq)(struct ib_cq *cq, int num_entries, + struct ib_wc *wc); + int (*peek_cq)(struct ib_cq *cq, int wc_cnt); + int (*req_notify_cq)(struct ib_cq *cq, + enum ib_cq_notify cq_notify); + int (*req_ncomp_notif)(struct ib_cq *cq, + int wc_cnt); + struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, + int mr_access_flags); + struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + int (*query_mr)(struct ib_mr *mr, + struct ib_mr_attr *mr_attr); + int (*dereg_mr)(struct ib_mr *mr); + int (*rereg_phys_mr)(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + struct ib_mw * (*alloc_mw)(struct ib_pd *pd); + int (*bind_mw)(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + int (*dealloc_mw)(struct ib_mw *mw); + struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + int (*map_phys_fmr)(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova); + int (*unmap_fmr)(struct list_head *fmr_list); + int (*dealloc_fmr)(struct ib_fmr *fmr); + int (*attach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*detach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); + + struct class_device class_dev; + struct kobject ports_parent; + struct list_head port_list; + + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + + u8 node_type; + u8 phys_port_cnt; +}; + +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); + +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int 
ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr); + +int ib_query_port(struct ib_device *device, + u8 port_num, struct ib_port_attr *port_attr); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + +/** + * ib_alloc_pd - Allocates an unused protection domain. + * @device: The device on which to allocate the protection domain. + * + * A protection domain object provides an association between QPs, shared + * receive queues, address handles, memory regions, and memory windows. + */ +struct ib_pd *ib_alloc_pd(struct ib_device *device); + +/** + * ib_dealloc_pd - Deallocates a protection domain. + * @pd: The protection domain to deallocate. + */ +int ib_dealloc_pd(struct ib_pd *pd); + +/** + * ib_create_ah - Creates an address handle for the given address vector. + * @pd: The protection domain associated with the address handle. + * @ah_attr: The attributes of the address vector. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. + */ +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); + +/** + * ib_modify_ah - Modifies the address vector associated with an address + * handle. + * @ah: The address handle to modify. + * @ah_attr: The new address vector attributes to associate with the + * address handle. + */ +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_query_ah - Queries the address vector associated with an address + * handle. + * @ah: The address handle to query. + * @ah_attr: The address vector attributes associated with the address + * handle. + */ +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_destroy_ah - Destroys an address handle. + * @ah: The address handle to destroy. + */ +int ib_destroy_ah(struct ib_ah *ah); + +/** + * ib_create_qp - Creates a QP associated with the specified protection + * domain. + * @pd: The protection domain associated with the QP. + * @qp_init_attr: A list of initial attributes required to create the QP. + */ +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_modify_qp - Modifies the attributes for the specified QP and then + * transitions the QP to the given state. + * @qp: The QP to modify. + * @qp_attr: On input, specifies the QP attributes to modify. On output, + * the current values of selected QP attributes are returned. + * @qp_attr_mask: A bit-mask used to specify which attributes of the QP + * are being modified. + */ +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + +/** + * ib_query_qp - Returns the attribute list and current values for the + * specified QP. + * @qp: The QP to query. + * @qp_attr: The attributes of the specified QP. + * @qp_attr_mask: A bit-mask used to select specific attributes to query. + * @qp_init_attr: Additional attributes of the selected QP. 
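For illustration, a caller using the ib_modify_qp() interface above to move a freshly created RC QP into the INIT state might issue something like the following sketch (the pkey index, port number and access flags are arbitrary example values):

        struct ib_qp_attr attr = {
                .qp_state        = IB_QPS_INIT,
                .pkey_index      = 0,
                .port_num        = 1,
                .qp_access_flags = IB_ACCESS_REMOTE_WRITE,
        };
        int ret;

        ret = ib_modify_qp(qp, &attr,
                           IB_QP_STATE | IB_QP_PKEY_INDEX |
                           IB_QP_PORT | IB_QP_ACCESS_FLAGS);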
+ * + * The qp_attr_mask may be used to limit the query to gathering only the + * selected attributes. + */ +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_destroy_qp - Destroys the specified QP. + * @qp: The QP to destroy. + */ +int ib_destroy_qp(struct ib_qp *qp); + +/** + * ib_post_send - Posts a list of work requests to the send queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @send_wr: A list of work requests to post on the send queue. + * @bad_send_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + return qp->device->post_send(qp, send_wr, bad_send_wr); +} + +/** + * ib_post_recv - Posts a list of work requests to the receive queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @recv_wr: A list of work requests to post on the receive queue. + * @bad_recv_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return qp->device->post_recv(qp, recv_wr, bad_recv_wr); +} + +/** + * ib_create_cq - Creates a CQ on the specified device. + * @device: The device on which to create the CQ. + * @comp_handler: A user-specified callback that is invoked when a + * completion event occurs on the CQ. + * @event_handler: A user-specified callback that is invoked when an + * asynchronous event not associated with a completion occurs on the CQ. + * @cq_context: Context associated with the CQ returned to the user via + * the associated completion and event handlers. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe); + +/** + * ib_resize_cq - Modifies the capacity of the CQ. + * @cq: The CQ to resize. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +int ib_resize_cq(struct ib_cq *cq, int cqe); + +/** + * ib_destroy_cq - Destroys the specified CQ. + * @cq: The CQ to destroy. + */ +int ib_destroy_cq(struct ib_cq *cq); + +/** + * ib_poll_cq - poll a CQ for completion(s) + * @cq:the CQ being polled + * @num_entries:maximum number of completions to return + * @wc:array of at least @num_entries &struct ib_wc where completions + * will be returned + * + * Poll a CQ for (possibly multiple) completions. If the return value + * is < 0, an error occurred. If the return value is >= 0, it is the + * number of completions returned. If the return value is + * non-negative and < num_entries, then the CQ was emptied. + */ +static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, + struct ib_wc *wc) +{ + return cq->device->poll_cq(cq, num_entries, wc); +} + +/** + * ib_peek_cq - Returns the number of unreaped completions currently + * on the specified CQ. + * @cq: The CQ to peek. + * @wc_cnt: A minimum number of unreaped completions to check for. 
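As a sketch of the ib_post_send() interface above, a signalled send of a single buffer could be built like this (dma_addr, len, mr and qp are assumed to have been set up already; the wr_id cookie is arbitrary):

        struct ib_sge sge = {
                .addr   = dma_addr,
                .length = len,
                .lkey   = mr->lkey,
        };
        struct ib_send_wr wr = {
                .wr_id      = 1,
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IB_WR_SEND,
                .send_flags = IB_SEND_SIGNALED,
        }, *bad_wr;

        if (ib_post_send(qp, &wr, &bad_wr))
                printk(KERN_WARNING "post_send failed, bad_wr %p\n", bad_wr);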
+ * + * If the number of unreaped completions is greater than or equal to wc_cnt, + * this function returns wc_cnt, otherwise, it returns the actual number of + * unreaped completions. + */ +int ib_peek_cq(struct ib_cq *cq, int wc_cnt); + +/** + * ib_req_notify_cq - Request completion notification on a CQ. + * @cq: The CQ to generate an event for. + * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will + * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP, + * notification will occur on the next completion. + */ +static inline int ib_req_notify_cq(struct ib_cq *cq, + enum ib_cq_notify cq_notify) +{ + return cq->device->req_notify_cq(cq, cq_notify); +} + +/** + * ib_req_ncomp_notif - Request completion notification when there are + * at least the specified number of unreaped completions on the CQ. + * @cq: The CQ to generate an event for. + * @wc_cnt: The number of unreaped completions that should be on the + * CQ before an event is generated. + */ +static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) +{ + return cq->device->req_ncomp_notif ? + cq->device->req_ncomp_notif(cq, wc_cnt) : + -ENOSYS; +} + +/** + * ib_get_dma_mr - Returns a memory region for system memory that is + * usable for DMA. + * @pd: The protection domain associated with the memory region. + * @mr_access_flags: Specifies the memory access rights. + */ +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +/** + * ib_reg_phys_mr - Prepares a virtually addressed memory region for use + * by an HCA. + * @pd: The protection domain associated assigned to the registered region. + * @phys_buf_array: Specifies a list of physical buffers to use in the + * memory region. + * @num_phys_buf: Specifies the size of the phys_buf_array. + * @mr_access_flags: Specifies the memory access rights. + * @iova_start: The offset of the region's starting I/O virtual address. + */ +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_rereg_phys_mr - Modifies the attributes of an existing memory region. + * Conceptually, this call performs the functions deregister memory region + * followed by register physical memory region. Where possible, + * resources are reused instead of deallocated and reallocated. + * @mr: The memory region to modify. + * @mr_rereg_mask: A bit-mask used to indicate which of the following + * properties of the memory region are being modified. + * @pd: If %IB_MR_REREG_PD is set in mr_rereg_mask, this field specifies + * the new protection domain to associated with the memory region, + * otherwise, this parameter is ignored. + * @phys_buf_array: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies a list of physical buffers to use in the new + * translation, otherwise, this parameter is ignored. + * @num_phys_buf: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies the size of the phys_buf_array, otherwise, this + * parameter is ignored. + * @mr_access_flags: If %IB_MR_REREG_ACCESS is set in mr_rereg_mask, this + * field specifies the new memory access rights, otherwise, this + * parameter is ignored. + * @iova_start: The offset of the region's starting I/O virtual address. 
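Putting ib_create_cq(), ib_req_notify_cq() and ib_poll_cq() together, an event-driven consumer typically re-arms notification before draining the queue, roughly as in this sketch (my_comp_handler, process_completion, my_context and the CQ size of 128 are all made-up names and values):

        static void my_comp_handler(struct ib_cq *cq, void *cq_context)
        {
                struct ib_wc wc;

                /* re-arm first so completions that arrive while we
                 * poll still generate another event */
                ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);

                while (ib_poll_cq(cq, 1, &wc) > 0)
                        process_completion(&wc);
        }

        /* at setup time; no async CQ event handler in this sketch: */
        cq = ib_create_cq(device, my_comp_handler, NULL, my_context, 128);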
+ */ +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_query_mr - Retrieves information about a specific memory region. + * @mr: The memory region to retrieve information about. + * @mr_attr: The attributes of the specified memory region. + */ +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +/** + * ib_dereg_mr - Deregisters a memory region and removes it from the + * HCA translation table. + * @mr: The memory region to deregister. + */ +int ib_dereg_mr(struct ib_mr *mr); + +/** + * ib_alloc_mw - Allocates a memory window. + * @pd: The protection domain associated with the memory window. + */ +struct ib_mw *ib_alloc_mw(struct ib_pd *pd); + +/** + * ib_bind_mw - Posts a work request to the send queue of the specified + * QP, which binds the memory window to the given address range and + * remote access attributes. + * @qp: QP to post the bind work request on. + * @mw: The memory window to bind. + * @mw_bind: Specifies information about the memory window, including + * its address range, remote access rights, and associated memory region. + */ +static inline int ib_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + /* XXX reference counting in corresponding MR? */ + return mw->device->bind_mw ? + mw->device->bind_mw(qp, mw, mw_bind) : + -ENOSYS; +} + +/** + * ib_dealloc_mw - Deallocates a memory window. + * @mw: The memory window to deallocate. + */ +int ib_dealloc_mw(struct ib_mw *mw); + +/** + * ib_alloc_fmr - Allocates a unmapped fast memory region. + * @pd: The protection domain associated with the unmapped region. + * @mr_access_flags: Specifies the memory access rights. + * @fmr_attr: Attributes of the unmapped region. + * + * A fast memory region must be mapped before it can be used as part of + * a work request. + */ +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +/** + * ib_map_phys_fmr - Maps a list of physical pages to a fast memory region. + * @fmr: The fast memory region to associate with the pages. + * @page_list: An array of physical pages to map to the fast memory region. + * @list_len: The number of pages in page_list. + * @iova: The I/O virtual address to use with the mapped region. + */ +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova) +{ + return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova); +} + +/** + * ib_unmap_fmr - Removes the mapping from a list of fast memory regions. + * @fmr_list: A linked list of fast memory regions to unmap. + */ +int ib_unmap_fmr(struct list_head *fmr_list); + +/** + * ib_dealloc_fmr - Deallocates a fast memory region. + * @fmr: The fast memory region to deallocate. + */ +int ib_dealloc_fmr(struct ib_fmr *fmr); + +/** + * ib_attach_mcast - Attaches the specified QP to a multicast group. + * @qp: QP to attach to the multicast group. The QP must be type + * IB_QPT_UD. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + * + * In order to send and receive multicast packets, subnet + * administration must have created the multicast group and configured + * the fabric appropriately. The port associated with the specified + * QP must also be a member of the multicast group. 
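As a brief sketch of the multicast calls described above, a UD QP joins and later leaves a group as follows (mcast_gid and mcast_lid would normally come back from the subnet administrator; error handling is omitted):

        union ib_gid mcast_gid;         /* group GID from the SA */
        u16 mcast_lid;                  /* group LID, host byte order */

        ib_attach_mcast(qp, &mcast_gid, mcast_lid);
        /* ... send and receive on the group ... */
        ib_detach_mcast(qp, &mcast_gid, mcast_lid);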
+ */ +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +/** + * ib_detach_mcast - Detaches the specified QP from a multicast group. + * @qp: QP to detach from the multicast group. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + */ +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +#endif /* IB_VERBS_H */ From roland at topspin.com Mon Dec 27 21:50:55 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:55 -0800 Subject: [openib-general] [PATCH][v5][3/24] Hook up drivers/infiniband In-Reply-To: <200412272150.vKuRYXlCFl5x8NAo@topspin.com> Message-ID: <200412272150.BzlME8aULSGdgnS3@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/Kconfig 2004-12-27 21:47:59.198084242 -0800 +++ linux-bk/drivers/Kconfig 2004-12-27 21:48:19.194140917 -0800 @@ -56,4 +56,6 @@ source "drivers/mmc/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu --- linux-bk.orig/drivers/Makefile 2004-12-27 21:48:10.314447971 -0800 +++ linux-bk/drivers/Makefile 2004-12-27 21:48:19.194140917 -0800 @@ -59,5 +59,6 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ obj-$(CONFIG_CRYPTO) += crypto/ From roland at topspin.com Mon Dec 27 21:50:56 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:56 -0800 Subject: [openib-general] [PATCH][v5][4/24] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <200412272150.BzlME8aULSGdgnS3@topspin.com> Message-ID: <200412272150.7LBVS92XE77zrCiS@topspin.com> Add public headers for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_mad.h 2004-12-27 21:48:19.513093969 -0800 @@ -0,0 +1,404 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_mad.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30 +#define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 __constant_htonl(1) +#define IB_QP1_QKEY 0x80010000 + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +} __attribute__ ((packed)); + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +} __attribute__ ((packed)); + +struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +} __attribute__ ((packed)); + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +} __attribute__ ((packed)); + +struct ib_vendor_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 reserved; + u8 oui[3]; + u8 data[216]; +} __attribute__ ((packed)); + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent: MAD agent that sent the MAD. + * @mad_send_wc: Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ +typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent: MAD agent requesting the received MAD. + * @mad_recv_wc: Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. 
All data buffers given + * to registered agents through this routine are owned by the receiving + * client, except for snooping agents. Clients snooping MADs should not + * modify the data referenced by @mad_recv_wc. + */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device: Reference to device registration is on. + * @qp: Reference to QP used for sending and receiving MADs. + * @recv_handler: Callback handler for a received MAD. + * @send_handler: Callback handler for a sent MAD. + * @snoop_handler: Callback handler for snooped sent MADs. + * @context: User-specified context associated with this registration. + * @hi_tid: Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + * @port_num: Port number on which QP is registered + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + ib_mad_snoop_handler snoop_handler; + void *context; + u32 hi_tid; + u8 port_num; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id: Work request identifier associated with the send MAD request. + * @status: Completion status. + * @vendor_err: Optional vendor error information returned with a failed + * request. + */ +struct ib_mad_send_wc { + u64 wr_id; + enum ib_wc_status status; + u32 vendor_err; +}; + +/** + * ib_mad_recv_buf - received MAD buffer information. + * @list: Reference to next data buffer for a received RMPP MAD. + * @grh: References a data buffer containing the global route header. + * The data refereced by this buffer is only valid if the GRH is + * valid. + * @mad: References the start of the received MAD. + */ +struct ib_mad_recv_buf { + struct list_head list; + struct ib_grh *grh; + struct ib_mad *mad; +}; + +/** + * ib_mad_recv_wc - received MAD information. + * @wc: Completion information for the received data. + * @recv_buf: Specifies the location of the received data buffer(s). + * @mad_len: The length of the received MAD, without duplicated headers. + * + * For received response, the wr_id field of the wc is set to the wr_id + * for the corresponding send request. + */ +struct ib_mad_recv_wc { + struct ib_wc *wc; + struct ib_mad_recv_buf recv_buf; + int mad_len; +}; + +/** + * ib_mad_reg_req - MAD registration request + * @mgmt_class: Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version: Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + * @method_mask: The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + */ +struct ib_mad_reg_req { + u8 mgmt_class; + u8 mgmt_class_version; + u8 oui[3]; + DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); +}; + +/** + * ib_register_mad_agent - Register to send/receive MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP to access. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_reg_req: Specifies which unsolicited MADs should be received + * by the caller. 
This parameter may be NULL if the caller only + * wishes to receive solicited responses. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +enum ib_mad_snoop_flags { + /*IB_MAD_SNOOP_POSTED_SENDS = 1,*/ + /*IB_MAD_SNOOP_RMPP_SENDS = (1<<1),*/ + IB_MAD_SNOOP_SEND_COMPLETIONS = (1<<2), + /*IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS = (1<<3),*/ + IB_MAD_SNOOP_RECVS = (1<<4) + /*IB_MAD_SNOOP_RMPP_RECVS = (1<<5),*/ + /*IB_MAD_SNOOP_REDIRECTED_QPS = (1<<6)*/ +}; + +/** + * ib_register_mad_snoop - Register to snoop sent and received MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP traffic to snoop. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_snoop_flags: Specifies information where snooping occurs. + * @send_handler: The callback routine invoked for a snooped send. + * @recv_handler: The callback routine invoked for a snooped receive. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_unregister_mad_agent - Unregisters a client from using MAD services. + * @mad_agent: Corresponding MAD registration request to deregister. + * + * After invoking this routine, MAD services are no longer usable by the + * client on the associated QP. + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); + +/** + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client. + * @mad_agent: Specifies the associated registration to post the send to. + * @send_wr: Specifies the information needed to send the MAD(s). + * @bad_send_wr: Specifies the MAD on which an error was encountered. + * + * Sent MADs are not guaranteed to complete in the order that they were posted. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc: Work completion information for a received MAD. + * @buf: User-provided data buffer to receive the coalesced buffers. The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc: Work completion information for a received MAD. 
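For illustration, a client that wants unsolicited Performance Management Get MADs on the GSI QP might register roughly like this (my_send_handler, my_recv_handler and my_context are placeholders, and the returned agent pointer should be checked for an error before use):

        struct ib_mad_reg_req req = {
                .mgmt_class         = IB_MGMT_CLASS_PERF_MGMT,
                .mgmt_class_version = 1,
        };
        struct ib_mad_agent *agent;

        set_bit(IB_MGMT_METHOD_GET, req.method_mask);

        agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI,
                                      &req, 0 /* no RMPP */,
                                      my_send_handler, my_recv_handler,
                                      my_context);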
+ * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_cancel_mad - Cancels an outstanding send MAD operation. + * @mad_agent: Specifies the registration associated with sent MAD. + * @wr_id: Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp: Reference to a QP that requires MAD services. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_mad_post_send. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent: Specifies the registered MAD service using the redirected QP. + * @wc: References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request. + */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_smi.h 2004-12-27 21:48:19.539090142 -0800 @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_smi.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION __constant_htons(0x8000) + +/* Subnet management attributes */ +#define IB_SMP_ATTR_NOTICE __constant_htons(0x0002) +#define IB_SMP_ATTR_NODE_DESC __constant_htons(0x0010) +#define IB_SMP_ATTR_NODE_INFO __constant_htons(0x0011) +#define IB_SMP_ATTR_SWITCH_INFO __constant_htons(0x0012) +#define IB_SMP_ATTR_GUID_INFO __constant_htons(0x0014) +#define IB_SMP_ATTR_PORT_INFO __constant_htons(0x0015) +#define IB_SMP_ATTR_PKEY_TABLE __constant_htons(0x0016) +#define IB_SMP_ATTR_SL_TO_VL_TABLE __constant_htons(0x0017) +#define IB_SMP_ATTR_VL_ARB_TABLE __constant_htons(0x0018) +#define IB_SMP_ATTR_LINEAR_FORWARD_TABLE __constant_htons(0x0019) +#define IB_SMP_ATTR_RANDOM_FORWARD_TABLE __constant_htons(0x001A) +#define IB_SMP_ATTR_MCAST_FORWARD_TABLE __constant_htons(0x001B) +#define IB_SMP_ATTR_SM_INFO __constant_htons(0x0020) +#define IB_SMP_ATTR_VENDOR_DIAG __constant_htons(0x0030) +#define IB_SMP_ATTR_LED_INFO __constant_htons(0x0031) +#define IB_SMP_ATTR_VENDOR_MASK __constant_htons(0xFF00) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + +#endif /* IB_SMI_H */ From roland at topspin.com Mon Dec 27 21:50:52 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:52 -0800 Subject: [openib-general] [PATCH][v5][2/24] Add core InfiniBand support In-Reply-To: <200412272150.khYObxkGxtPP9Oju@topspin.com> Message-ID: <200412272150.vKuRYXlCFl5x8NAo@topspin.com> Add implementation of core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-27 21:48:18.185289416 -0800 @@ -0,0 +1,10 @@ +menu "InfiniBand support" + +config INFINIBAND + tristate "InfiniBand support" + ---help--- + Core support for InfiniBand (IB). Make sure to also select + any protocols you wish to use as well as drivers for your + InfiniBand hardware. 
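(With this entry and the Makefile hook from the previous patch, the core gets built once the new option is selected, e.g. with

        CONFIG_INFINIBAND=m

in the kernel .config.)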
+ +endmenu --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Makefile 2004-12-27 21:48:18.216284854 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND) += core/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-27 21:48:18.262278084 -0800 @@ -0,0 +1,6 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND) += ib_core.o + +ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ + device.o fmr_pool.o cache.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/cache.c 2004-12-27 21:48:18.576231871 -0800 @@ -0,0 +1,328 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "core_priv.h" + +struct ib_pkey_cache { + int table_len; + u16 table[0]; +}; + +struct ib_gid_cache { + int table_len; + union ib_gid table[0]; +}; + +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; + +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} + +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 
0 : device->phys_port_cnt; +} + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid) +{ + struct ib_gid_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.gid_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_gid_get); + +int ib_cached_pkey_get(struct ib_device *device, + u8 port, + int index, + u16 *pkey) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_get); + +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int i; + int ret = -ENOENT; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; + } + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_find); + +static void ib_cache_update(struct ib_device *device, + u8 port) +{ + struct ib_port_attr *tprops = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; + int i; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return; + + ret = ib_query_port(device, port, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", + ret, device->name); + goto err; + } + + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; + + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + for (i = 0; i < gid_cache->table_len; ++i) { + ret = ib_query_gid(device, port, i, gid_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + write_lock_irq(&device->cache.lock); + + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; + + device->cache.pkey_cache[port - start_port(device)] = pkey_cache; + device->cache.gid_cache [port - start_port(device)] = gid_cache; + + write_unlock_irq(&device->cache.lock); + + kfree(old_pkey_cache); + 
kfree(old_gid_cache); + kfree(tprops); + return; + +err: + kfree(pkey_cache); + kfree(gid_cache); + kfree(tprops); +} + +static void ib_cache_task(void *work_ptr) +{ + struct ib_update_work *work = work_ptr; + + ib_cache_update(work->device, work->port_num); + kfree(work); +} + +static void ib_cache_event(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct ib_update_work *work; + + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (work) { + INIT_WORK(&work->work, ib_cache_task, work); + work->device = event->device; + work->port_num = event->element.port_num; + schedule_work(&work->work); + } + } +} + +void ib_cache_setup_one(struct ib_device *device) +{ + int p; + + rwlock_init(&device->cache.lock); + + device->cache.pkey_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + device->cache.gid_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache) { + printk(KERN_WARNING "Couldn't allocate cache " + "for %s\n", device->name); + goto err; + } + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + device->cache.pkey_cache[p] = NULL; + device->cache.gid_cache [p] = NULL; + ib_cache_update(device, p + start_port(device)); + } + + INIT_IB_EVENT_HANDLER(&device->cache.event_handler, + device, ib_cache_event); + if (ib_register_event_handler(&device->cache.event_handler)) + goto err_cache; + + return; + +err_cache: + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + +err: + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +void ib_cache_cleanup_one(struct ib_device *device) +{ + int p; + + ib_unregister_event_handler(&device->cache.event_handler); + flush_scheduled_work(); + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +struct ib_client cache_client = { + .name = "cache", + .add = ib_cache_setup_one, + .remove = ib_cache_cleanup_one +}; + +int __init ib_cache_setup(void) +{ + return ib_register_client(&cache_client); +} + +void __exit ib_cache_cleanup(void) +{ + ib_unregister_client(&cache_client); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/core_priv.h 2004-12-27 21:48:18.600228339 -0800 @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: core_priv.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef _CORE_PRIV_H +#define _CORE_PRIV_H + +#include +#include + +#include + +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); + +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + +int ib_cache_setup(void); +void ib_cache_cleanup(void); + +#endif /* _CORE_PRIV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/device.c 2004-12-27 21:48:18.525239377 -0800 @@ -0,0 +1,614 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("core kernel InfiniBand API"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + +static LIST_HEAD(device_list); +static LIST_HEAD(client_list); + +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. 
+ */ +static DECLARE_MUTEX(device_sem); + +static int ib_device_check_mandatory(struct ib_device *device) +{ +#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } + static const struct { + size_t offset; + char *name; + } mandatory_table[] = { + IB_MANDATORY_FUNC(query_device), + IB_MANDATORY_FUNC(query_port), + IB_MANDATORY_FUNC(query_pkey), + IB_MANDATORY_FUNC(query_gid), + IB_MANDATORY_FUNC(alloc_pd), + IB_MANDATORY_FUNC(dealloc_pd), + IB_MANDATORY_FUNC(create_ah), + IB_MANDATORY_FUNC(destroy_ah), + IB_MANDATORY_FUNC(create_qp), + IB_MANDATORY_FUNC(modify_qp), + IB_MANDATORY_FUNC(destroy_qp), + IB_MANDATORY_FUNC(post_send), + IB_MANDATORY_FUNC(post_recv), + IB_MANDATORY_FUNC(create_cq), + IB_MANDATORY_FUNC(destroy_cq), + IB_MANDATORY_FUNC(poll_cq), + IB_MANDATORY_FUNC(req_notify_cq), + IB_MANDATORY_FUNC(get_dma_mr), + IB_MANDATORY_FUNC(dereg_mr) + }; + int i; + + for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { + if (!*(void **) ((void *) device + mandatory_table[i].offset)) { + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); + return -EINVAL; + } + } + + return 0; +} + +static struct ib_device *__ib_device_get_by_name(const char *name) +{ + struct ib_device *device; + + list_for_each_entry(device, &device_list, core_list) + if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) + return device; + + return NULL; +} + + +static int alloc_name(char *name) +{ + long *inuse; + char buf[IB_DEVICE_NAME_MAX]; + struct ib_device *device; + int i; + + inuse = (long *) get_zeroed_page(GFP_KERNEL); + if (!inuse) + return -ENOMEM; + + list_for_each_entry(device, &device_list, core_list) { + if (!sscanf(device->name, name, &i)) + continue; + if (i < 0 || i >= PAGE_SIZE * 8) + continue; + snprintf(buf, sizeof buf, name, i); + if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) + set_bit(i, inuse); + } + + i = find_first_zero_bit(inuse, PAGE_SIZE * 8); + free_page((unsigned long) inuse); + snprintf(buf, sizeof buf, name, i); + + if (__ib_device_get_by_name(buf)) + return -ENFILE; + + strlcpy(name, buf, IB_DEVICE_NAME_MAX); + return 0; +} + +/** + * ib_alloc_device - allocate an IB device struct + * @size:size of structure to allocate + * + * Low-level drivers should use ib_alloc_device() to allocate &struct + * ib_device. @size is the size of the structure to be allocated, + * including any private data used by the low-level driver. + * ib_dealloc_device() must be used to free structures allocated with + * ib_alloc_device(). + */ +struct ib_device *ib_alloc_device(size_t size) +{ + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +/** + * ib_dealloc_device - free an IB device struct + * @device:structure to free + * + * Free a structure allocated with ib_alloc_device(). 
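[Editorial aside: to make the ib_alloc_device() usage concrete, here is a minimal sketch of how a low-level driver might carve its private state out of the single allocation. The names (mydrv_dev, mydrv_alloc, the regs member) are made up for illustration and the sketch assumes the ib_verbs definitions from this patch series.]

struct mydrv_dev {
	struct ib_device ib_dev;	/* handed to ib_register_device() later */
	void __iomem    *regs;		/* driver-private state lives after it */
};

static struct mydrv_dev *mydrv_alloc(void)
{
	struct mydrv_dev *dev;

	/* one zeroed allocation covering ib_dev plus our private part */
	dev = (struct mydrv_dev *) ib_alloc_device(sizeof *dev);
	if (!dev)
		return NULL;

	/* a "%d" in the name asks the core to pick a free unit number
	   when ib_register_device() runs */
	strlcpy(dev->ib_dev.name, "mydrv%d", IB_DEVICE_NAME_MAX);

	return dev;
}

ib_register_device(&dev->ib_dev) would then fire the add callbacks of every client already registered, and ib_dealloc_device() gives the memory back once the device has been unregistered.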
+ */ +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_unregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + +/** + * ib_register_device - Register an IB device with IB core + * @device:Device to register + * + * Low-level drivers use ib_register_device() to register their + * devices with the IB core. All registered clients will receive a + * callback for each device that is added. @device must be allocated + * with ib_alloc_device(). + */ +int ib_register_device(struct ib_device *device) +{ + int ret; + + down(&device_sem); + + if (strchr(device->name, '%')) { + ret = alloc_name(device->name); + if (ret) + goto out; + } + + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + + INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); + + ret = ib_device_register_sysfs(device); + if (ret) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out; + } + + list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + + { + struct ib_client *client; + + list_for_each_entry(client, &client_list, list) + if (client->add && !add_client_context(device, client)) + client->add(device); + } + + out: + up(&device_sem); + return ret; +} +EXPORT_SYMBOL(ib_register_device); + +/** + * ib_unregister_device - Unregister an IB device + * @device:Device to unregister + * + * Unregister an IB device. All clients will receive a remove callback. + */ +void ib_unregister_device(struct ib_device *device) +{ + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + device->reg_state = IB_DEV_UNREGISTERED; +} +EXPORT_SYMBOL(ib_unregister_device); + +/** + * ib_register_client - Register an IB client + * @client:Client to register + * + * Upper level users of the IB drivers can use ib_register_client() to + * register callbacks for IB device addition and removal. When an IB + * device is added, each registered client's add method will be called + * (in the order the clients were registered), and when a device is + * removed, each client's remove method will be called (in the reverse + * order that clients were registered). 
In addition, when + * ib_register_client() is called, the client will receive an add + * callback for all devices already registered. + */ +int ib_register_client(struct ib_client *client) +{ + struct ib_device *device; + + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add && !add_client_context(device, client)) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +/** + * ib_unregister_client - Unregister an IB client + * @client:Client to unregister + * + * Upper level users use ib_unregister_client() to remove their client + * registration. When ib_unregister_client() is called, the client + * will receive a remove callback for each IB device still registered. + */ +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + } + list_del(&client->list); + + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +/** + * ib_get_client_data - Get IB client context + * @device:Device to get context for + * @client:Client to get context for + * + * ib_get_client_data() returns client context set with + * ib_set_client_data(). + */ +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +/** + * ib_set_client_data - Get IB client context + * @device:Device to set context for + * @client:Client to set context for + * @data:Context to set + * + * ib_set_client_data() sets client context that can be retrieved with + * ib_get_client_data(). + */ +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + goto out; + } + + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); + +out: + spin_unlock_irqrestore(&device->client_data_lock, flags); +} +EXPORT_SYMBOL(ib_set_client_data); + +/** + * ib_register_event_handler - Register an IB event handler + * @event_handler:Handler to register + * + * ib_register_event_handler() registers an event handler that will be + * called back when asynchronous IB events occur (as defined in + * chapter 11 of the InfiniBand Architecture Specification). This + * callback may occur in interrupt context. 
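[Editorial aside: the client registration and per-device context calls above are easiest to see end to end. A minimal sketch of a hypothetical ULP ("myulp") follows; the per-device structure is only a placeholder.]

static void myulp_add_one(struct ib_device *device);
static void myulp_remove_one(struct ib_device *device);

static struct ib_client myulp_client = {
	.name   = "myulp",
	.add    = myulp_add_one,
	.remove = myulp_remove_one
};

struct myulp_device {
	struct ib_device *ib_dev;
};

static void myulp_add_one(struct ib_device *device)
{
	struct myulp_device *mdev = kmalloc(sizeof *mdev, GFP_KERNEL);

	if (!mdev)
		return;
	mdev->ib_dev = device;
	/* remembered per (device, client); remove_one pulls it back out */
	ib_set_client_data(device, &myulp_client, mdev);
}

static void myulp_remove_one(struct ib_device *device)
{
	kfree(ib_get_client_data(device, &myulp_client));
}

static int __init myulp_init(void)
{
	/* add() fires immediately for devices that are already registered */
	return ib_register_client(&myulp_client);
}

The module exit path would just call ib_unregister_client(&myulp_client), which triggers remove() for every device still present before dropping the registration.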
+ */ +int ib_register_event_handler (struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +/** + * ib_unregister_event_handler - Unregister an event handler + * @event_handler:Handler to unregister + * + * Unregister an event handler registered with + * ib_register_event_handler(). + */ +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +/** + * ib_dispatch_event - Dispatch an asynchronous event + * @event:Event to dispatch + * + * Low-level drivers must call ib_dispatch_event() to dispatch the + * event to all registered event handlers when an asynchronous event + * occurs. + */ +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + +/** + * ib_query_device - Query IB device attributes + * @device:Device to query + * @device_attr:Device attributes + * + * ib_query_device() returns the attributes of a device through the + * @device_attr pointer. + */ +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr) +{ + return device->query_device(device, device_attr); +} +EXPORT_SYMBOL(ib_query_device); + +/** + * ib_query_port - Query IB port attributes + * @device:Device to query + * @port_num:Port number to query + * @port_attr:Port attributes + * + * ib_query_port() returns the attributes of a port through the + * @port_attr pointer. + */ +int ib_query_port(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr) +{ + return device->query_port(device, port_num, port_attr); +} +EXPORT_SYMBOL(ib_query_port); + +/** + * ib_query_gid - Get GID table entry + * @device:Device to query + * @port_num:Port number to query + * @index:GID table index to query + * @gid:Returned GID + * + * ib_query_gid() fetches the specified GID table entry. + */ +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid) +{ + return device->query_gid(device, port_num, index, gid); +} +EXPORT_SYMBOL(ib_query_gid); + +/** + * ib_query_pkey - Get P_Key table entry + * @device:Device to query + * @port_num:Port number to query + * @index:P_Key table index to query + * @pkey:Returned P_Key + * + * ib_query_pkey() fetches the specified P_Key table entry. 
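[Editorial aside: the query wrappers here are thin, but for completeness this is roughly what a consumer's probe path might do with them. The function name, the use of port 1 (an HCA's first port) and the printk are only for illustration.]

static void dump_port_one(struct ib_device *device)
{
	struct ib_port_attr port_attr;
	union ib_gid gid;
	u16 pkey;

	if (ib_query_port(device, 1, &port_attr) ||
	    ib_query_gid(device, 1, 0, &gid)     ||
	    ib_query_pkey(device, 1, 0, &pkey))
		return;

	printk(KERN_INFO "%s port 1: state %d, LID 0x%x, pkey[0] 0x%04x\n",
	       device->name, port_attr.state, port_attr.lid, pkey);
}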
+ */ +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey) +{ + return device->query_pkey(device, port_num, index, pkey); +} +EXPORT_SYMBOL(ib_query_pkey); + +/** + * ib_modify_device - Change IB device attributes + * @device:Device to modify + * @device_modify_mask:Mask of attributes to change + * @device_modify:New attribute values + * + * ib_modify_device() changes a device's attributes as specified by + * the @device_modify_mask and @device_modify structure. + */ +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device(device, device_modify_mask, + device_modify); +} +EXPORT_SYMBOL(ib_modify_device); + +/** + * ib_modify_port - Modifies the attributes for the specified port. + * @device: The device to modify. + * @port_num: The number of the port to modify. + * @port_modify_mask: Mask used to specify which attributes of the port + * to change. + * @port_modify: New attribute values for the port. + * + * ib_modify_port() changes a port's attributes as specified by the + * @port_modify_mask and @port_modify structure. + */ +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port(device, port_num, port_modify_mask, + port_modify); +} +EXPORT_SYMBOL(ib_modify_port); + +static int __init ib_core_init(void) +{ + int ret; + + ret = ib_sysfs_setup(); + if (ret) + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + + return ret; +} + +static void __exit ib_core_cleanup(void) +{ + ib_cache_cleanup(); + ib_sysfs_cleanup(); +} + +module_init(ib_core_init); +module_exit(ib_core_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/fmr_pool.c 2004-12-27 21:48:18.551235551 -0800 @@ -0,0 +1,507 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: fmr_pool.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +enum { + IB_FMR_MAX_REMAPS = 32, + + IB_FMR_HASH_BITS = 8, + IB_FMR_HASH_SIZE = 1 << IB_FMR_HASH_BITS, + IB_FMR_HASH_MASK = IB_FMR_HASH_SIZE - 1 +}; + +/* + * If an FMR is not in use, then the list member will point to either + * its pool's free_list (if the FMR can be mapped again; that is, + * remap_count < IB_FMR_MAX_REMAPS) or its pool's dirty_list (if the + * FMR needs to be unmapped before being remapped). In either of + * these cases it is a bug if the ref_count is not 0. In other words, + * if ref_count is > 0, then the list member must not be linked into + * either free_list or dirty_list. + * + * The cache_node member is used to link the FMR into a cache bucket + * (if caching is enabled). This is independent of the reference + * count of the FMR. When a valid FMR is released, its ref_count is + * decremented, and if ref_count reaches 0, the FMR is placed in + * either free_list or dirty_list as appropriate. However, it is not + * removed from the cache and may be "revived" if a call to + * ib_fmr_register_physical() occurs before the FMR is remapped. In + * this case we just increment the ref_count and remove the FMR from + * free_list/dirty_list. + * + * Before we remap an FMR from free_list, we remove it from the cache + * (to prevent another user from obtaining a stale FMR). When an FMR + * is released, we add it to the tail of the free list, so that our + * cache eviction policy is "least recently used." + * + * All manipulation of ref_count, list and cache_node is protected by + * pool_lock to maintain consistency. + */ + +struct ib_fmr_pool { + spinlock_t pool_lock; + + int pool_size; + int max_pages; + int dirty_watermark; + int dirty_len; + struct list_head free_list; + struct list_head dirty_list; + struct hlist_head *cache_bucket; + + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + + struct task_struct *thread; + + atomic_t req_ser; + atomic_t flush_ser; + + wait_queue_head_t force_wait; +}; + +static inline u32 ib_fmr_hash(u64 first_page) +{ + return jhash_2words((u32) first_page, + (u32) (first_page >> 32), + 0); +} + +/* Caller must hold pool_lock */ +static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool, + u64 *page_list, + int page_list_len, + u64 io_virtual_address) +{ + struct hlist_head *bucket; + struct ib_pool_fmr *fmr; + struct hlist_node *pos; + + if (!pool->cache_bucket) + return NULL; + + bucket = pool->cache_bucket + ib_fmr_hash(*page_list); + + hlist_for_each_entry(fmr, pos, bucket, cache_node) + if (io_virtual_address == fmr->io_virtual_address && + page_list_len == fmr->page_list_len && + !memcmp(page_list, fmr->page_list, + page_list_len * sizeof *page_list)) + return fmr; + + return NULL; +} + +static void ib_fmr_batch_release(struct ib_fmr_pool *pool) +{ + int ret; + struct ib_pool_fmr *fmr; + LIST_HEAD(unmap_list); + LIST_HEAD(fmr_list); + + spin_lock_irq(&pool->pool_lock); + + list_for_each_entry(fmr, &pool->dirty_list, list) { + hlist_del_init(&fmr->cache_node); + fmr->remap_count = 0; + list_add_tail(&fmr->fmr->list, &fmr_list); + +#ifdef DEBUG + if (fmr->ref_count !=0) { + printk(KERN_WARNING "Unmapping FMR 0x%08x with ref count %d", + fmr, fmr->ref_count); + } +#endif + } + + list_splice(&pool->dirty_list, &unmap_list); + INIT_LIST_HEAD(&pool->dirty_list); + pool->dirty_len = 0; + + spin_unlock_irq(&pool->pool_lock); + + if 
(list_empty(&unmap_list)) { + return; + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "ib_unmap_fmr returned %d", ret); + + spin_lock_irq(&pool->pool_lock); + list_splice(&unmap_list, &pool->free_list); + spin_unlock_irq(&pool->pool_lock); +} + +static int ib_fmr_cleanup_thread(void *pool_ptr) +{ + struct ib_fmr_pool *pool = pool_ptr; + + do { + if (pool->dirty_len >= pool->dirty_watermark || + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) < 0) { + ib_fmr_batch_release(pool); + + atomic_inc(&pool->flush_ser); + wake_up_interruptible(&pool->force_wait); + + if (pool->flush_function) + pool->flush_function(pool, pool->flush_arg); + } + + set_current_state(TASK_INTERRUPTIBLE); + if (pool->dirty_len < pool->dirty_watermark && + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) >= 0 && + !kthread_should_stop()) + schedule(); + __set_current_state(TASK_RUNNING); + } while (!kthread_should_stop()); + + return 0; +} + +/** + * ib_create_fmr_pool - Create an FMR pool + * @pd:Protection domain for FMRs + * @params:FMR pool parameters + * + * Create a pool of FMRs. Return value is pointer to new pool or + * error code if creation failed. + */ +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params) +{ + struct ib_device *device; + struct ib_fmr_pool *pool; + int i; + int ret; + + if (!params) + return ERR_PTR(-EINVAL); + + device = pd->device; + if (!device->alloc_fmr || !device->dealloc_fmr || + !device->map_phys_fmr || !device->unmap_fmr) { + printk(KERN_WARNING "Device %s does not support fast memory regions", + device->name); + return ERR_PTR(-ENOSYS); + } + + pool = kmalloc(sizeof *pool, GFP_KERNEL); + if (!pool) { + printk(KERN_WARNING "couldn't allocate pool struct"); + return ERR_PTR(-ENOMEM); + } + + pool->cache_bucket = NULL; + + pool->flush_function = params->flush_function; + pool->flush_arg = params->flush_arg; + + INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->dirty_list); + + if (params->cache) { + pool->cache_bucket = + kmalloc(IB_FMR_HASH_SIZE * sizeof *pool->cache_bucket, + GFP_KERNEL); + if (!pool->cache_bucket) { + printk(KERN_WARNING "Failed to allocate cache in pool"); + ret = -ENOMEM; + goto out_free_pool; + } + + for (i = 0; i < IB_FMR_HASH_SIZE; ++i) + INIT_HLIST_HEAD(pool->cache_bucket + i); + } + + pool->pool_size = 0; + pool->max_pages = params->max_pages_per_fmr; + pool->dirty_watermark = params->dirty_watermark; + pool->dirty_len = 0; + spin_lock_init(&pool->pool_lock); + atomic_set(&pool->req_ser, 0); + atomic_set(&pool->flush_ser, 0); + init_waitqueue_head(&pool->force_wait); + + pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); + if (IS_ERR(pool->thread)) { + printk(KERN_WARNING "couldn't start cleanup thread"); + ret = PTR_ERR(pool->thread); + goto out_free_pool; + } + + { + struct ib_pool_fmr *fmr; + struct ib_fmr_attr attr = { + .max_pages = params->max_pages_per_fmr, + .max_maps = IB_FMR_MAX_REMAPS, + .page_size = PAGE_SHIFT + }; + + for (i = 0; i < params->pool_size; ++i) { + fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), + GFP_KERNEL); + if (!fmr) { + printk(KERN_WARNING "failed to allocate fmr struct " + "for FMR %d", i); + goto out_fail; + } + + fmr->pool = pool; + fmr->remap_count = 0; + fmr->ref_count = 0; + INIT_HLIST_NODE(&fmr->cache_node); + + fmr->fmr = ib_alloc_fmr(pd, params->access, &attr); + if (IS_ERR(fmr->fmr)) { + printk(KERN_WARNING "fmr_create failed for FMR %d", i); + kfree(fmr); + 
goto out_fail; + } + + list_add_tail(&fmr->list, &pool->free_list); + ++pool->pool_size; + } + } + + return pool; + + out_free_pool: + kfree(pool->cache_bucket); + kfree(pool); + + return ERR_PTR(ret); + + out_fail: + ib_destroy_fmr_pool(pool); + + return ERR_PTR(-ENOMEM); +} +EXPORT_SYMBOL(ib_create_fmr_pool); + +/** + * ib_destroy_fmr_pool - Free FMR pool + * @pool:FMR pool to free + * + * Destroy an FMR pool and free all associated resources. + */ +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool) +{ + struct ib_pool_fmr *fmr; + struct ib_pool_fmr *tmp; + int i; + + kthread_stop(pool->thread); + ib_fmr_batch_release(pool); + + i = 0; + list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) { + ib_dealloc_fmr(fmr->fmr); + list_del(&fmr->list); + kfree(fmr); + ++i; + } + + if (i < pool->pool_size) + printk(KERN_WARNING "pool still has %d regions registered", + pool->pool_size - i); + + kfree(pool->cache_bucket); + kfree(pool); + + return 0; +} +EXPORT_SYMBOL(ib_destroy_fmr_pool); + +/** + * ib_flush_fmr_pool - Invalidate all unmapped FMRs + * @pool:FMR pool to flush + * + * Ensure that all unmapped FMRs are fully invalidated. + */ +int ib_flush_fmr_pool(struct ib_fmr_pool *pool) +{ + int serial; + + atomic_inc(&pool->req_ser); + /* + * It's OK if someone else bumps req_ser again here -- we'll + * just wait a little longer. + */ + serial = atomic_read(&pool->req_ser); + + wake_up_process(pool->thread); + + if (wait_event_interruptible(pool->force_wait, + atomic_read(&pool->flush_ser) - + atomic_read(&pool->req_ser) >= 0)) + return -EINTR; + + return 0; +} +EXPORT_SYMBOL(ib_flush_fmr_pool); + +/** + * ib_fmr_pool_map_phys - + * @pool:FMR pool to allocate FMR from + * @page_list:List of pages to map + * @list_len:Number of pages in @page_list + * @io_virtual_address:I/O virtual address for new FMR + * + * Map an FMR from an FMR pool. 
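[Editorial aside: since the pool API is spread over several entry points, here is a rough end-to-end sketch of how a consumer (SDP, SRP and so on) might drive it. The pool sizes, access flags, page_list contents and io_addr value are placeholders, the flush callback is left unused, and error unwinding is trimmed.]

static int myulp_fmr_demo(struct ib_pd *pd, u64 *page_list, int npages)
{
	struct ib_fmr_pool_param params = {
		.max_pages_per_fmr = 64,
		.access            = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ,
		.pool_size         = 32,
		.dirty_watermark   = 8,
		.cache             = 1
		/* flush_function/flush_arg left zero: no flush notification */
	};
	struct ib_fmr_pool *pool;
	struct ib_pool_fmr *fmr;
	u64 io_addr = 0;	/* desired I/O virtual address for the mapping */

	pool = ib_create_fmr_pool(pd, &params);
	if (IS_ERR(pool))
		return PTR_ERR(pool);

	fmr = ib_fmr_pool_map_phys(pool, page_list, npages, &io_addr);
	if (IS_ERR(fmr)) {
		/* -EAGAIN here means the free list is empty; retry later */
		ib_destroy_fmr_pool(pool);
		return PTR_ERR(fmr);
	}

	/* ... post work requests that reference this mapping ... */

	ib_fmr_pool_unmap(fmr);		/* mapping may stay valid until reuse */
	ib_flush_fmr_pool(pool);	/* force unmapped FMRs to be invalidated */
	ib_destroy_fmr_pool(pool);

	return 0;
}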
+ */ +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address) +{ + struct ib_fmr_pool *pool = pool_handle; + struct ib_pool_fmr *fmr; + unsigned long flags; + int result; + + if (list_len < 1 || list_len > pool->max_pages) + return ERR_PTR(-EINVAL); + + spin_lock_irqsave(&pool->pool_lock, flags); + fmr = ib_fmr_cache_lookup(pool, + page_list, + list_len, + *io_virtual_address); + if (fmr) { + /* found in cache */ + ++fmr->ref_count; + if (fmr->ref_count == 1) { + list_del(&fmr->list); + } + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return fmr; + } + + if (list_empty(&pool->free_list)) { + spin_unlock_irqrestore(&pool->pool_lock, flags); + return ERR_PTR(-EAGAIN); + } + + fmr = list_entry(pool->free_list.next, struct ib_pool_fmr, list); + list_del(&fmr->list); + hlist_del_init(&fmr->cache_node); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + result = ib_map_phys_fmr(fmr->fmr, page_list, list_len, + *io_virtual_address); + + if (result) { + spin_lock_irqsave(&pool->pool_lock, flags); + list_add(&fmr->list, &pool->free_list); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + printk(KERN_WARNING "fmr_map returns %d", + result); + + return ERR_PTR(result); + } + + ++fmr->remap_count; + fmr->ref_count = 1; + + if (pool->cache_bucket) { + fmr->io_virtual_address = *io_virtual_address; + fmr->page_list_len = list_len; + memcpy(fmr->page_list, page_list, list_len * sizeof(*page_list)); + + spin_lock_irqsave(&pool->pool_lock, flags); + hlist_add_head(&fmr->cache_node, + pool->cache_bucket + ib_fmr_hash(fmr->page_list[0])); + spin_unlock_irqrestore(&pool->pool_lock, flags); + } + + return fmr; +} +EXPORT_SYMBOL(ib_fmr_pool_map_phys); + +/** + * ib_fmr_pool_unmap - Unmap FMR + * @fmr:FMR to unmap + * + * Unmap an FMR. The FMR mapping may remain valid until the FMR is + * reused (or until ib_flush_fmr_pool() is called). + */ +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr) +{ + struct ib_fmr_pool *pool; + unsigned long flags; + + pool = fmr->pool; + + spin_lock_irqsave(&pool->pool_lock, flags); + + --fmr->ref_count; + if (!fmr->ref_count) { + if (fmr->remap_count < IB_FMR_MAX_REMAPS) { + list_add_tail(&fmr->list, &pool->free_list); + } else { + list_add_tail(&fmr->list, &pool->dirty_list); + ++pool->dirty_len; + wake_up_process(pool->thread); + } + } + +#ifdef DEBUG + if (fmr->ref_count < 0) + printk(KERN_WARNING "FMR %p has ref count %d < 0", + fmr, fmr->ref_count); +#endif + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_fmr_pool_unmap); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/packer.c 2004-12-27 21:48:18.385259982 -0800 @@ -0,0 +1,201 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: packer.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +static u64 value_read(int offset, int size, void *structure) +{ + switch (size) { + case 1: return *(u8 *) (structure + offset); + case 2: return be16_to_cpup((__be16 *) (structure + offset)); + case 4: return be32_to_cpup((__be32 *) (structure + offset)); + case 8: return be64_to_cpup((__be64 *) (structure + offset)); + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + return 0; + } +} + +/** + * ib_pack - Pack a structure into a buffer + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @structure:Structure to pack from + * @buf:Buffer to pack into + * + * ib_pack() packs a list of structure fields into a buffer, + * controlled by the array of fields in @desc. + */ +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + __be32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be32(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be32 *) buf + desc[i].offset_words; + *addr = (*addr & ~mask) | (cpu_to_be32(val) & mask); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + __be64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be64(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be64 *) ((__be32 *) buf + desc[i].offset_words); + *addr = (*addr & ~mask) | (cpu_to_be64(val) & mask); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + if (desc[i].struct_size_bytes) + memcpy(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + structure + desc[i].struct_offset_bytes, + desc[i].size_bits / 8); + else + memset(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + 0, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_pack); + +static void value_write(int offset, int size, u64 val, void *structure) +{ + switch (size * 8) { + case 8: *( u8 *) (structure + offset) = val; break; + case 16: *(__be16 *) (structure + offset) = cpu_to_be16(val); break; + case 32: *(__be32 *) (structure + offset) = cpu_to_be32(val); break; + case 64: *(__be64 *) (structure + offset) = cpu_to_be64(val); break; + default: + 
printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + } +} + +/** + * ib_unpack - Unpack a buffer into a structure + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @buf:Buffer to unpack from + * @structure:Structure to unpack into + * + * ib_pack() unpacks a list of structure fields from a buffer, + * controlled by the array of fields in @desc. + */ +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (!desc[i].struct_size_bytes) + continue; + + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + u32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be32 *) buf + desc[i].offset_words; + val = (be32_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + u64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be64 *) buf + desc[i].offset_words; + val = (be64_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + memcpy(structure + desc[i].struct_offset_bytes, + buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sysfs.c 2004-12-27 21:48:18.498243351 -0800 @@ -0,0 +1,725 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
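[Editorial aside, going back to packer.c for a moment: as a worked example of the field tables ib_pack()/ib_unpack() consume, here is how a hypothetical two-field header (eight bits of version, then sixteen bits of length in the second half of the first word) could be described. my_hdr and its table are made up for illustration.]

struct my_hdr {
	u8  version;
	u16 length;
};

static const struct ib_field my_hdr_table[] = {
	{ .struct_offset_bytes = offsetof(struct my_hdr, version),
	  .struct_size_bytes   = sizeof ((struct my_hdr *) 0)->version,
	  .field_name          = "my_hdr:version",
	  .offset_words        = 0,
	  .offset_bits         = 0,
	  .size_bits           = 8 },
	{ .struct_offset_bytes = offsetof(struct my_hdr, length),
	  .struct_size_bytes   = sizeof ((struct my_hdr *) 0)->length,
	  .field_name          = "my_hdr:length",
	  .offset_words        = 0,
	  .offset_bits         = 16,
	  .size_bits           = 16 }
};

static void my_hdr_to_wire(struct my_hdr *hdr, void *buf)
{
	/* bits 8-15 of word 0 are not listed, so whatever buf holds there is
	   preserved; multi-byte struct fields are read as big-endian values */
	ib_pack(my_hdr_table, ARRAY_SIZE(my_hdr_table), hdr, buf);
}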
+ * + * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include "core_priv.h" + +#include + +struct ib_port { + struct kobject kobj; + struct ib_device *ibdev; + struct attribute_group gid_group; + struct attribute **gid_attr; + struct attribute_group pkey_group; + struct attribute **pkey_attr; + u8 port_num; +}; + +struct port_attribute { + struct attribute attr; + ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf); + ssize_t (*store)(struct ib_port *, struct port_attribute *, + const char *buf, size_t count); +}; + +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) + +#define PORT_ATTR_RO(_name) \ +struct port_attribute port_attr_##_name = __ATTR_RO(_name) + +struct port_table_attribute { + struct port_attribute attr; + int index; +}; + +static ssize_t port_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + + if (!port_attr->show) + return 0; + + return port_attr->show(p, port_attr, buf); +} + +static struct sysfs_ops port_sysfs_ops = { + .show = port_attr_show +}; + +static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + static const char *state_name[] = { + [IB_PORT_NOP] = "NOP", + [IB_PORT_DOWN] = "DOWN", + [IB_PORT_INIT] = "INIT", + [IB_PORT_ARMED] = "ARMED", + [IB_PORT_ACTIVE] = "ACTIVE", + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" + }; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && attr.state <= ARRAY_SIZE(state_name) ? 
+ state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + char *speed = ""; + int rate; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + switch (attr.active_speed) { + case 2: speed = " DDR"; break; + case 4: speed = " QDR"; break; + } + + printk(KERN_ERR "width %d speed %d\n", attr.active_width, attr.active_speed); + + rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed; + if (rate < 0) + return -EINVAL; + + return sprintf(buf, "%d%s Gb/sec (%dX%s)\n", + rate / 10, rate % 10 ? 
".5" : "", + ib_width_enum_to_int(attr.active_width), speed); +} + +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); +static PORT_ATTR_RO(rate); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + &port_attr_rate.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +#define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + if (!in_mad || !out_mad) { + ret = -ENOMEM; + goto out; + } + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + + in_mad->data[41] = p->port_num; /* PortSelect field */ + + if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff, + in_mad, out_mad) & + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { + ret = -EINVAL; + goto out; + } + + switch (width) { + case 4: + ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >> + (offset % 4)) & 0xf); + break; + case 8: + ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]); + break; + case 16: + ret = sprintf(buf, "%u\n", + be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8))); + break; + case 32: + ret = sprintf(buf, "%u\n", + be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8))); + break; + default: + ret = 0; + } + +out: + kfree(in_mad); + kfree(out_mad); + + return ret; +} + +static 
PORT_PMA_ATTR(symbol_error , 0, 16, 32); +static PORT_PMA_ATTR(link_error_recovery , 1, 8, 48); +static PORT_PMA_ATTR(link_downed , 2, 8, 56); +static PORT_PMA_ATTR(port_rcv_errors , 3, 16, 64); +static PORT_PMA_ATTR(port_rcv_remote_physical_errors, 4, 16, 80); +static PORT_PMA_ATTR(port_rcv_switch_relay_errors , 5, 16, 96); +static PORT_PMA_ATTR(port_xmit_discards , 6, 16, 112); +static PORT_PMA_ATTR(port_xmit_constraint_errors , 7, 8, 128); +static PORT_PMA_ATTR(port_rcv_constraint_errors , 8, 8, 136); +static PORT_PMA_ATTR(local_link_integrity_errors , 9, 4, 152); +static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10, 4, 156); +static PORT_PMA_ATTR(VL15_dropped , 11, 16, 176); +static PORT_PMA_ATTR(port_xmit_data , 12, 32, 192); +static PORT_PMA_ATTR(port_rcv_data , 13, 32, 224); +static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); +static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); + +static struct attribute *pma_attrs[] = { + &port_pma_attr_symbol_error.attr.attr, + &port_pma_attr_link_error_recovery.attr.attr, + &port_pma_attr_link_downed.attr.attr, + &port_pma_attr_port_rcv_errors.attr.attr, + &port_pma_attr_port_rcv_remote_physical_errors.attr.attr, + &port_pma_attr_port_rcv_switch_relay_errors.attr.attr, + &port_pma_attr_port_xmit_discards.attr.attr, + &port_pma_attr_port_xmit_constraint_errors.attr.attr, + &port_pma_attr_port_rcv_constraint_errors.attr.attr, + &port_pma_attr_local_link_integrity_errors.attr.attr, + &port_pma_attr_excessive_buffer_overrun_errors.attr.attr, + &port_pma_attr_VL15_dropped.attr.attr, + &port_pma_attr_port_xmit_data.attr.attr, + &port_pma_attr_port_rcv_data.attr.attr, + &port_pma_attr_port_xmit_packets.attr.attr, + &port_pma_attr_port_rcv_packets.attr.attr, + NULL +}; + +static struct attribute_group pma_group = { + .name = "counters", + .attrs = pma_attrs +}; + +static void ib_port_release(struct kobject *kobj) +{ + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + struct attribute *a; + int i; + + for (i = 0; (a = p->gid_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + for (i = 0; (a = p->pkey_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + kfree(p->gid_attr); + kfree(p); +} + +static struct kobj_type port_type = { + .release = ib_port_release, + .sysfs_ops = &port_sysfs_ops, + .default_attrs = port_default_attrs +}; + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *cdev, char **envp, + int num_envp, char *buf, int size) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + int i = 0, len = 0; + + if (add_hotplug_env_var(envp, num_envp, &i, buf, size, &len, + "NAME=%s", dev->name)) + return -ENOMEM; + + /* + * It might be nice to pass the node GUID to hotplug, but + * right now the only way to get it is to query the device + * provider, and this can crash during device removal because + * we are will be running after driver removal has started. + * We could add a node_guid field to struct ib_device, or we + * could just let the hotplug script read the node GUID from + * sysfs when devices are added. 
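[Editorial aside, back on the PMA counter attributes defined just above: to decode the PORT_PMA_ATTR() encoding by example, port_rcv_data is counter 13, 32 bits wide at bit offset 224 of the PortCounters attribute, so its .index is 224 | (32 << 16) | (13 << 24). show_pma_counter() then recovers offset = index & 0xffff = 224 and width = (index >> 16) & 0xff = 32, and returns the big-endian 32-bit value it finds at out_mad->data[40 + 224 / 8].]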
+ */ + + envp[i] = NULL; + return 0; +} + +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = sysfs_create_group(&p->kobj, &pma_group); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) 
&attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + int ret; + int i; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + INIT_LIST_HEAD(&device->port_list); + + ret = class_device_register(class_dev); + if (ret) + goto err; + + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; +} + +void ib_device_unregister_sysfs(struct ib_device *device) +{ + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/ud_header.c 2004-12-27 21:48:18.428253653 -0800 @@ -0,0 +1,365 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
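[Editorial aside: putting the pieces of sysfs.c together, the resulting layout for a registered device looks roughly like the tree below. The device name mthca0 and the table sizes are only examples; a switch would get a single port directory named 0 instead of 1..phys_port_cnt.]

	/sys/class/infiniband/mthca0/
		node_guid
		sys_image_guid
		ports/
			1/
				state  lid  lid_mask_count  sm_lid  sm_sl  cap_mask  rate
				counters/   symbol_error, link_downed, port_rcv_data, ...
				gids/       0 .. gid_tbl_len - 1
				pkeys/      0 .. pkey_tbl_len - 1
			2/
				(same layout for each additional port)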
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ud_header.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include + +#define STRUCT_FIELD(header, field) \ + .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ + .struct_size_bytes = sizeof ((struct ib_unpacked_ ## header *) 0)->field, \ + .field_name = #header ":" #field + +static const struct ib_field lrh_table[] = { + { STRUCT_FIELD(lrh, virtual_lane), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, link_version), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, service_level), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 4 }, + { RESERVED, + .offset_words = 0, + .offset_bits = 12, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, link_next_header), + .offset_words = 0, + .offset_bits = 14, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, destination_lid), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 5 }, + { STRUCT_FIELD(lrh, packet_length), + .offset_words = 1, + .offset_bits = 5, + .size_bits = 11 }, + { STRUCT_FIELD(lrh, source_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 } +}; + +static const struct ib_field grh_table[] = { + { STRUCT_FIELD(grh, ip_version), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(grh, traffic_class), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 8 }, + { STRUCT_FIELD(grh, flow_label), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 20 }, + { STRUCT_FIELD(grh, payload_length), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(grh, next_header), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 8 }, + { STRUCT_FIELD(grh, hop_limit), + .offset_words = 1, + .offset_bits = 24, + .size_bits = 8 }, + { STRUCT_FIELD(grh, source_gid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { STRUCT_FIELD(grh, destination_gid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 } +}; + +static const struct ib_field bth_table[] = { + { STRUCT_FIELD(bth, opcode), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, solicited_event), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 1 }, + { 
STRUCT_FIELD(bth, mig_req), + .offset_words = 0, + .offset_bits = 9, + .size_bits = 1 }, + { STRUCT_FIELD(bth, pad_count), + .offset_words = 0, + .offset_bits = 10, + .size_bits = 2 }, + { STRUCT_FIELD(bth, transport_header_version), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 4 }, + { STRUCT_FIELD(bth, pkey), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, destination_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 }, + { STRUCT_FIELD(bth, ack_req), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 2, + .offset_bits = 1, + .size_bits = 7 }, + { STRUCT_FIELD(bth, psn), + .offset_words = 2, + .offset_bits = 8, + .size_bits = 24 } +}; + +static const struct ib_field deth_table[] = { + { STRUCT_FIELD(deth, qkey), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(deth, source_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 } +}; + +/** + * ib_ud_header_init - Initialize UD header structure + * @payload_bytes:Length of packet payload + * @grh_present:GRH flag (if non-zero, GRH will be included) + * @header:Structure to initialize + * + * ib_ud_header_init() initializes the lrh.link_version, lrh.link_next_header, + * lrh.packet_length, grh.ip_version, grh.payload_length, + * grh.next_header, bth.opcode, bth.pad_count and + * bth.transport_header_version fields of a &struct ib_ud_header given + * the payload length and whether a GRH will be included. + */ +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) { + header_len += IB_GRH_BYTES; + } + + header->lrh.link_version = 0; + header->lrh.link_next_header = + grh_present ? IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL; + header->lrh.packet_length = (IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) / 4; /* round up */ + + header->grh_present = grh_present; + if (grh_present) { + header->lrh.packet_length += IB_GRH_BYTES / 4; + + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + cpu_to_be16s(&header->lrh.packet_length); + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_ud_header_init); + +/** + * ib_ud_header_pack - Pack UD header struct into wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() packs the UD header structure @header into wire + * format in the buffer @buf. 
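[Editorial aside: for anyone wiring this into a driver's special QP send path, a rough sketch of building and packing a UD send header follows. The LID, P_Key, QPN, PSN and Q_Key values are placeholder parameters, no GRH is used, and only the caller-owned fields are shown (with a GRH, the grh.* addresses would be filled the same way). Note that, per the packing tables, multi-byte fields are kept big-endian inside the unpacked struct.]

static int build_ud_send_header(void *buf, int payload_bytes,
				u16 dlid, u16 slid, u16 pkey,
				u32 remote_qpn, u32 local_qpn,
				u32 psn, u32 qkey)
{
	struct ib_ud_header header;

	/* buf needs IB_LRH_BYTES + IB_BTH_BYTES + IB_DETH_BYTES of room */
	ib_ud_header_init(payload_bytes, 0 /* no GRH */, &header);

	header.lrh.service_level   = 0;
	header.lrh.destination_lid = cpu_to_be16(dlid);
	header.lrh.source_lid      = cpu_to_be16(slid);

	header.bth.solicited_event = 1;
	header.bth.pkey            = cpu_to_be16(pkey);
	header.bth.destination_qpn = cpu_to_be32(remote_qpn);
	header.bth.psn             = cpu_to_be32(psn);

	header.deth.qkey           = cpu_to_be32(qkey);
	header.deth.source_qpn     = cpu_to_be32(local_qpn);

	/* returns the number of header bytes written into buf */
	return ib_ud_header_pack(&header, buf);
}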
+ */ +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(lrh_table, ARRAY_SIZE(lrh_table), + &header->lrh, buf); + len += IB_LRH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(ib_ud_header_pack); + +/** + * ib_ud_header_unpack - Unpack UD header struct from wire format + * @header:UD header struct + * @buf:Buffer to unpack from + * + * ib_ud_header_unpack() unpacks the UD header structure @header from wire + * format in the buffer @buf. + */ +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header) +{ + ib_unpack(lrh_table, ARRAY_SIZE(lrh_table), + buf, &header->lrh); + buf += IB_LRH_BYTES; + + if (header->lrh.link_version != 0) { + printk(KERN_WARNING "Invalid LRH.link_version %d\n", + header->lrh.link_version); + return -EINVAL; + } + + switch (header->lrh.link_next_header) { + case IB_LNH_IBA_LOCAL: + header->grh_present = 0; + break; + + case IB_LNH_IBA_GLOBAL: + header->grh_present = 1; + ib_unpack(grh_table, ARRAY_SIZE(grh_table), + buf, &header->grh); + buf += IB_GRH_BYTES; + + if (header->grh.ip_version != 6) { + printk(KERN_WARNING "Invalid GRH.ip_version %d\n", + header->grh.ip_version); + return -EINVAL; + } + if (header->grh.next_header != 0x1b) { + printk(KERN_WARNING "Invalid GRH.next_header 0x%02x\n", + header->grh.next_header); + return -EINVAL; + } + break; + + default: + printk(KERN_WARNING "Invalid LRH.link_next_header %d\n", + header->lrh.link_next_header); + return -EINVAL; + } + + ib_unpack(bth_table, ARRAY_SIZE(bth_table), + buf, &header->bth); + buf += IB_BTH_BYTES; + + switch (header->bth.opcode) { + case IB_OPCODE_UD_SEND_ONLY: + header->immediate_present = 0; + break; + case IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE: + header->immediate_present = 1; + break; + default: + printk(KERN_WARNING "Invalid BTH.opcode 0x%02x\n", + header->bth.opcode); + return -EINVAL; + } + + if (header->bth.transport_header_version != 0) { + printk(KERN_WARNING "Invalid BTH.transport_header_version %d\n", + header->bth.transport_header_version); + return -EINVAL; + } + + ib_unpack(deth_table, ARRAY_SIZE(deth_table), + buf, &header->deth); + buf += IB_DETH_BYTES; + + if (header->immediate_present) + memcpy(&header->immediate_data, buf, sizeof header->immediate_data); + + return 0; +} +EXPORT_SYMBOL(ib_ud_header_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/verbs.c 2004-12-27 21:48:18.453249974 -0800 @@ -0,0 +1,433 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include + +/* Protection domains */ + +struct ib_pd *ib_alloc_pd(struct ib_device *device) +{ + struct ib_pd *pd; + + pd = device->alloc_pd(device); + + if (!IS_ERR(pd)) { + pd->device = device; + atomic_set(&pd->usecnt, 0); + } + + return pd; +} +EXPORT_SYMBOL(ib_alloc_pd); + +int ib_dealloc_pd(struct ib_pd *pd) +{ + if (atomic_read(&pd->usecnt)) + return -EBUSY; + + return pd->device->dealloc_pd(pd); +} +EXPORT_SYMBOL(ib_dealloc_pd); + +/* Address handles */ + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct ib_ah *ah; + + ah = pd->device->create_ah(pd, ah_attr); + + if (!IS_ERR(ah)) { + ah->device = pd->device; + ah->pd = pd; + atomic_inc(&pd->usecnt); + } + + return ah; +} +EXPORT_SYMBOL(ib_create_ah); + +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_ah); + +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? 
+ ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_ah); + +int ib_destroy_ah(struct ib_ah *ah) +{ + struct ib_pd *pd; + int ret; + + pd = ah->pd; + ret = ah->device->destroy_ah(ah); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_destroy_ah); + +/* Queue pairs */ + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct ib_qp *qp; + + qp = pd->device->create_qp(pd, qp_init_attr); + + if (!IS_ERR(qp)) { + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; + atomic_inc(&pd->usecnt); + atomic_inc(&qp_init_attr->send_cq->usecnt); + atomic_inc(&qp_init_attr->recv_cq->usecnt); + if (qp_init_attr->srq) + atomic_inc(&qp_init_attr->srq->usecnt); + } + + return qp; +} +EXPORT_SYMBOL(ib_create_qp); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask) +{ + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); +} +EXPORT_SYMBOL(ib_modify_qp); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? + qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_qp); + +int ib_destroy_qp(struct ib_qp *qp) +{ + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_srq *srq; + int ret; + + pd = qp->pd; + scq = qp->send_cq; + rcq = qp->recv_cq; + srq = qp->srq; + + ret = qp->device->destroy_qp(qp); + if (!ret) { + atomic_dec(&pd->usecnt); + atomic_dec(&scq->usecnt); + atomic_dec(&rcq->usecnt); + if (srq) + atomic_dec(&srq->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_destroy_qp); + +/* Completion queues */ + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe) +{ + struct ib_cq *cq; + + cq = device->create_cq(device, cqe); + + if (!IS_ERR(cq)) { + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->cq_context = cq_context; + atomic_set(&cq->usecnt, 0); + } + + return cq; +} +EXPORT_SYMBOL(ib_create_cq); + +int ib_destroy_cq(struct ib_cq *cq) +{ + if (atomic_read(&cq->usecnt)) + return -EBUSY; + + return cq->device->destroy_cq(cq); +} +EXPORT_SYMBOL(ib_destroy_cq); + +int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + int ret; + + if (!cq->device->resize_cq) + return -ENOSYS; + + ret = cq->device->resize_cq(cq, &cqe); + if (!ret) + cq->cqe = cqe; + + return ret; +} +EXPORT_SYMBOL(ib_resize_cq); + +/* Memory regions */ + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *mr; + + mr = pd->device->get_dma_mr(pd, mr_access_flags); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_get_dma_mr); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *mr; + + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_reg_phys_mr); + 
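/*
 * Editorial sketch, not part of this patch: a minimal consumer of the
 * wrappers above.  The wrappers only maintain the usecnt reference counts,
 * so teardown must destroy the QP before its CQ and PD, otherwise
 * ib_destroy_cq() and ib_dealloc_pd() return -EBUSY.  The device pointer,
 * CQE count and QP attributes are placeholders; the remaining
 * ib_qp_init_attr fields (QP type, capabilities) are omitted here.
 */
static int example_verbs_bringup(struct ib_device *device)
{
	struct ib_pd *pd;
	struct ib_cq *cq;
	struct ib_qp *qp;
	struct ib_qp_init_attr init_attr;
	int ret = 0;

	pd = ib_alloc_pd(device);
	if (IS_ERR(pd))
		return PTR_ERR(pd);

	cq = ib_create_cq(device, NULL, NULL, NULL, 64);	/* no handlers, 64 CQEs */
	if (IS_ERR(cq)) {
		ret = PTR_ERR(cq);
		goto err_pd;
	}

	memset(&init_attr, 0, sizeof init_attr);
	init_attr.send_cq = cq;
	init_attr.recv_cq = cq;

	qp = ib_create_qp(pd, &init_attr);	/* takes usecnt references on PD and CQ */
	if (IS_ERR(qp)) {
		ret = PTR_ERR(qp);
		goto err_cq;
	}

	/* ... post work requests here ... */

	ib_destroy_qp(qp);		/* drops the PD and CQ references */
err_cq:
	ib_destroy_cq(cq);		/* -EBUSY while a QP still uses it */
err_pd:
	ib_dealloc_pd(pd);		/* -EBUSY while anything still uses it */
	return ret;
}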
+int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_pd *old_pd; + int ret; + + if (!mr->device->rereg_phys_mr) + return -ENOSYS; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + old_pd = mr->pd; + + ret = mr->device->rereg_phys_mr(mr, mr_rereg_mask, pd, + phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!ret && (mr_rereg_mask & IB_MR_REREG_PD)) { + atomic_dec(&old_pd->usecnt); + atomic_inc(&pd->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_rereg_phys_mr); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? + mr->device->query_mr(mr, mr_attr) : -ENOSYS; +} +EXPORT_SYMBOL(ib_query_mr); + +int ib_dereg_mr(struct ib_mr *mr) +{ + struct ib_pd *pd; + int ret; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + pd = mr->pd; + ret = mr->device->dereg_mr(mr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dereg_mr); + +/* Memory windows */ + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *mw; + + if (!pd->device->alloc_mw) + return ERR_PTR(-ENOSYS); + + mw = pd->device->alloc_mw(pd); + if (!IS_ERR(mw)) { + mw->device = pd->device; + mw->pd = pd; + atomic_inc(&pd->usecnt); + } + + return mw; +} +EXPORT_SYMBOL(ib_alloc_mw); + +int ib_dealloc_mw(struct ib_mw *mw) +{ + struct ib_pd *pd; + int ret; + + pd = mw->pd; + ret = mw->device->dealloc_mw(mw); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_mw); + +/* "Fast" memory regions */ + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *fmr; + + if (!pd->device->alloc_fmr) + return ERR_PTR(-ENOSYS); + + fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr); + if (!IS_ERR(fmr)) { + fmr->device = pd->device; + fmr->pd = pd; + atomic_inc(&pd->usecnt); + } + + return fmr; +} +EXPORT_SYMBOL(ib_alloc_fmr); + +int ib_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + + if (list_empty(fmr_list)) + return 0; + + fmr = list_entry(fmr_list->next, struct ib_fmr, list); + return fmr->device->unmap_fmr(fmr_list); +} +EXPORT_SYMBOL(ib_unmap_fmr); + +int ib_dealloc_fmr(struct ib_fmr *fmr) +{ + struct ib_pd *pd; + int ret; + + pd = fmr->pd; + ret = fmr->device->dealloc_fmr(fmr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_fmr); + +/* Multicast groups */ + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_attach_mcast); + +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->detach_mcast ? + qp->device->detach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_detach_mcast); From roland at topspin.com Mon Dec 27 21:50:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:59 -0800 Subject: [openib-general] [PATCH][v5][6/24] Add InfiniBand MAD (management datagram) support (private headers) In-Reply-To: <200412272150.vDVch42vu9imXkVM@topspin.com> Message-ID: <200412272150.vaYM2cWv3KsWVYPV@topspin.com> Add MAD layer private implementation headers. 
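For orientation, the tables declared in mad_priv.h below are what the MAD receive path walks to dispatch an incoming packet to its registered agent: the class version indexes an ib_mad_mgmt_version_table, the (converted) management class selects a method table, and the method selects the owning agent. A rough sketch of that lookup follows; it ignores the vendor-class branch and the locking that the real dispatch code in mad.c performs, and the mad_hdr field names and the IB_MGMT_METHOD_RESP mask are assumed from ib_mad.h rather than shown in this patch.

/* Sketch only: resolve a non-vendor-class MAD to its registered agent. */
static struct ib_mad_agent_private *
find_agent_sketch(struct ib_mad_port_private *port_priv, struct ib_mad *mad)
{
	struct ib_mad_mgmt_class_table *class;
	struct ib_mad_mgmt_method_table *method;
	u8 mgmt_class = convert_mgmt_class(mad->mad_hdr.mgmt_class);

	if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION)
		return NULL;

	class = port_priv->version[mad->mad_hdr.class_version].class;
	if (!class)
		return NULL;

	/* vendor-range classes use the separate vendor tables, not shown */
	method = class->method_table[mgmt_class];
	if (!method)
		return NULL;

	return method->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP];
}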
Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.h 2004-12-27 21:48:20.224989180 -0800 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: agent.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern spinlock_t ib_agent_port_list_lock; + +extern int ib_agent_port_open(struct ib_device *device, + int port_num); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent_priv.h 2004-12-27 21:48:20.250985354 -0800 @@ -0,0 +1,64 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: agent_priv.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef __IB_AGENT_PRIV_H__ +#define __IB_AGENT_PRIV_H__ + +#include + +#define SPFX "ib_agent: " + +struct ib_agent_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_agent_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *dr_smp_agent; /* DR SM class */ + struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ + struct ib_mr *mr; +}; + +#endif /* __IB_AGENT_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad_priv.h 2004-12-27 21:48:20.321974904 -0800 @@ -0,0 +1,194 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mad_priv.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef __IB_MAD_PRIV_H__ +#define __IB_MAD_PRIV_H__ + +#include +#include +#include +#include +#include + + +#define PFX "ib_mad: " + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ + +/* QP and CQ parameters */ +#define IB_MAD_QP_SEND_SIZE 128 +#define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_SEND_REQ_MAX_SG 2 +#define IB_MAD_RECV_REQ_MAX_SG 1 + +#define IB_MAD_SEND_Q_PSN 0 + +/* Registration table sizes */ +#define MAX_MGMT_CLASS 80 +#define MAX_MGMT_VERSION 8 +#define MAX_MGMT_OUI 8 +#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 + +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; +}; + +struct ib_mad_private_header { + struct ib_mad_list_head mad_list; + struct ib_mad_recv_wc recv_wc; + DECLARE_PCI_UNMAP_ADDR(mapping) +} __attribute__ ((packed)); + +struct ib_mad_private { + struct ib_mad_private_header header; + struct ib_grh grh; + union { + struct ib_mad mad; + struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; + } mad; +} __attribute__ ((packed)); + +struct ib_mad_agent_private { + struct list_head agent_list; + struct ib_mad_agent agent; + struct ib_mad_reg_req *reg_req; + struct ib_mad_qp_info *qp_info; + + spinlock_t lock; + struct list_head send_list; + struct list_head wait_list; + struct work_struct timed_work; + unsigned long timeout; + struct list_head local_list; + struct work_struct local_work; + + atomic_t refcount; + wait_queue_head_t wait; + u8 rmpp_version; +}; + +struct ib_mad_snoop_private { + struct ib_mad_agent agent; + struct ib_mad_qp_info *qp_info; + int snoop_index; + int mad_snoop_flags; + atomic_t refcount; + wait_queue_head_t wait; +}; + +struct ib_mad_send_wr_private { + struct ib_mad_list_head mad_list; + struct list_head agent_list; + struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; + unsigned long timeout; + int retry; + int refcount; + enum ib_wc_status status; +}; + +struct ib_mad_local_private { + struct list_head completion_list; + struct ib_mad_private *mad_priv; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; +}; + +struct ib_mad_mgmt_method_table { + struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; +}; + +struct ib_mad_mgmt_class_table { + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; +}; + +struct ib_mad_mgmt_vendor_class { + u8 oui[MAX_MGMT_OUI][3]; + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_OUI]; +}; + +struct ib_mad_mgmt_vendor_class_table { + struct ib_mad_mgmt_vendor_class *vendor_class[MAX_MGMT_VENDOR_RANGE2]; +}; + +struct ib_mad_mgmt_version_table { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; +}; + +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + int max_active; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + struct list_head overflow_list; + spinlock_t snoop_lock; + struct ib_mad_snoop_private **snoop_table; + int snoop_table_size; + atomic_t snoop_count; +}; + +struct ib_mad_port_private { + struct list_head port_list; + struct ib_device *device; + int port_num; + struct ib_cq *cq; + struct ib_pd *pd; + struct ib_mr *mr; + + 
spinlock_t reg_lock; + struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; + struct list_head agent_list; + struct workqueue_struct *wq; + struct work_struct work; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; +}; + +#endif /* __IB_MAD_PRIV_H__ */ From roland at topspin.com Mon Dec 27 21:50:57 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:57 -0800 Subject: [openib-general] [PATCH][v5][5/24] Add InfiniBand MAD (management datagram) support In-Reply-To: <200412272150.7LBVS92XE77zrCiS@topspin.com> Message-ID: <200412272150.vDVch42vu9imXkVM@topspin.com> Add support for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. This is required for an SM (subnet manager) to discover and configure the host, since the SM's query MADs must be passed to the local SMA (subnet management agent). In addition, this support is used by upper level protocols to send queries to and receive responses from the SM. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-27 21:48:18.262278084 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-27 21:48:19.838046137 -0800 @@ -1,6 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o + +ib_mad-y := mad.o smi.o agent.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.c 2004-12-27 21:48:19.916034657 -0800 @@ -0,0 +1,399 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: agent.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include + +#include + +#include + +#include "smi.h" +#include "agent_priv.h" +#include "mad_priv.h" + + +spinlock_t ib_agent_port_list_lock; +static LIST_HEAD(ib_agent_port_list); + +extern kmem_cache_t *ib_mad_cache; + + +/* + * Caller must hold ib_agent_port_list_lock + */ +static inline struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->dr_smp_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if ((entry->dr_smp_agent == mad_agent) || + (entry->lr_smp_agent == mad_agent) || + (entry->perf_mgmt_agent == mad_agent)) + return entry; + } + } + return NULL; +} + +static inline struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + entry = __ib_get_agent_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return entry; +} + +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, + struct ib_mad_private *mad_priv, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_agent_send_wr *agent_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); + if (!agent_send_wr) + goto out; + agent_send_wr->mad = mad_priv; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad_priv->mad, + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad_priv->mad); + gather_list.lkey = (*port_priv->mr).lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + ah_attr.ah_flags = 0; /* No GRH */ + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? 
*/ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(ah_attr.grh.dgid)); + } + } + + agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(agent_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(agent_send_wr); + goto out; + } + + send_wr.wr.ud.ah = agent_send_wr->ah; + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + } else { /* for SMPs */ + send_wr.wr.ud.pkey_index = 0; + send_wr.wr.ud.remote_qkey = 0; + } + send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)agent_send_wr; + + pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); + } else { + list_add_tail(&agent_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); + return 1; + } + + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad.mad.mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; + } + + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); +} + +static void agent_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_agent_port_private *port_priv; + struct ib_agent_send_wr *agent_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_agent_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(agent_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(agent_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); + kfree(agent_send_wr); +} + +int ib_agent_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct 
ib_agent_port_private *port_priv; + struct ib_mad_reg_req reg_req; + unsigned long flags; + + /* First, check if port already open for SMI */ + port_priv = ib_get_agent_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + /* Obtain MAD agent for directed route SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; + reg_req.mgmt_class_version = 1; + + port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + + if (IS_ERR(port_priv->dr_smp_agent)) { + ret = PTR_ERR(port_priv->dr_smp_agent); + goto error2; + } + + /* Obtain MAD agent for LID routed SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->lr_smp_agent)) { + ret = PTR_ERR(port_priv->lr_smp_agent); + goto error3; + } + + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->perf_mgmt_agent)) { + ret = PTR_ERR(port_priv->perf_mgmt_agent); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_agent_port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return 0; + +error5: + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); +error4: + ib_unregister_mad_agent(port_priv->lr_smp_agent); +error3: + ib_unregister_mad_agent(port_priv->dr_smp_agent); +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_agent_port_close(struct ib_device *device, int port_num) +{ + struct ib_agent_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + port_priv = __ib_get_agent_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + ib_dereg_mr(port_priv->mr); + + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); + ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->dr_smp_agent); + kfree(port_priv); + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad.c 2004-12-27 21:48:19.890038484 -0800 @@ -0,0 +1,2632 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mad.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include +#include + +#include + +#include "mad_priv.h" +#include "smi.h" +#include "agent.h" + + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("kernel IB MAD API"); +MODULE_AUTHOR("Hal Rosenstock"); +MODULE_AUTHOR("Sean Hefty"); + + +kmem_cache_t *ib_mad_cache; +static struct list_head ib_mad_port_list; +static u32 ib_mad_client_id = 0; + +/* Port list lock */ +static spinlock_t ib_mad_port_list_lock; + + +/* Forward declarations */ +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); +static void timeout_sends(void *data); +static void local_completions(void *data); +static int solicited_mad(struct ib_mad *mad); +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class); +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv); + +/* + * Returns a ib_mad_port_private structure or NULL for a device/port + * Assumes ib_mad_port_list_lock is being held + */ +static inline struct ib_mad_port_private * +__ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + + list_for_each_entry(entry, &ib_mad_port_list, port_list) { + if (entry->device == device && entry->port_num == port_num) + return entry; + } + return NULL; +} + +/* + * Wrapper function to return a ib_mad_port_private structure or NULL + * for a device/port + */ +static inline struct ib_mad_port_private * +ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + entry = __ib_get_mad_port(device, port_num); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + return entry; +} + +static inline u8 convert_mgmt_class(u8 mgmt_class) +{ + /* Alias 
IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ + return mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? + 0 : mgmt_class; +} + +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + +static int vendor_class_index(u8 mgmt_class) +{ + return mgmt_class - IB_MGMT_CLASS_VENDOR_RANGE2_START; +} + +static int is_vendor_class(u8 mgmt_class) +{ + if ((mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START) || + (mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + return 1; +} + +static int is_vendor_oui(char *oui) +{ + if (oui[0] || oui[1] || oui[2]) + return 1; + return 0; +} + +static int is_vendor_method_in_use( + struct ib_mad_mgmt_vendor_class *vendor_class, + struct ib_mad_reg_req *mad_reg_req) +{ + struct ib_mad_mgmt_method_table *method; + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) { + if (!memcmp(vendor_class->oui[i], mad_reg_req->oui, 3)) { + method = vendor_class->method_table[i]; + if (method) { + if (method_in_use(&method, mad_reg_req)) + return 1; + else + break; + } + } + } + return 0; +} + +/* + * ib_register_mad_agent - Register to send/receive MADs + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret = ERR_PTR(-EINVAL); + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_reg_req *reg_req = NULL; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_mad_mgmt_method_table *method; + int ret2, qpn; + unsigned long flags; + u8 mgmt_class, vclass; + + /* Validate parameters */ + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) + goto error1; + + if (rmpp_version) + goto error1; /* XXX: until RMPP implemented */ + + /* Validate MAD registration request if supplied */ + if (mad_reg_req) { + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) + goto error1; + if (!recv_handler) + goto error1; + if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { + /* + * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only + * one in this range currently allowed + */ + if (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + goto error1; + } else if (mad_reg_req->mgmt_class == 0) { + /* + * Class 0 is reserved in IBA and is used for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ + goto error1; + } else if (is_vendor_class(mad_reg_req->mgmt_class)) { + /* + * If class is in "new" vendor range, + * ensure supplied OUI is not zero + */ + if (!is_vendor_oui(mad_reg_req->oui)) + goto error1; + } + /* Make sure class supplied is consistent with QP type */ + if (qp_type == IB_QPT_SMI) { + if ((mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_LID_ROUTED) && + (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } else { + if ((mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } + } else { + /* No registration request supplied */ + if (!send_handler) + goto error1; + } + + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + /* Allocate structures */ + mad_agent_priv = 
kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + if (!mad_agent_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + if (mad_reg_req) { + reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); + if (!reg_req) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } + /* Make a copy of the MAD registration request */ + memcpy(reg_req, mad_reg_req, sizeof *reg_req); + } + + /* Now, fill in the various structures */ + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; + mad_agent_priv->reg_req = reg_req; + mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_agent_priv->agent.port_num = port_num; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) { + class = port_priv->version[mad_reg_req-> + mgmt_class_version].class; + if (class) { + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, + mad_reg_req)) + goto error3; + } + } + ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, + mgmt_class); + } else { + /* "New" vendor class range */ + vendor = port_priv->version[mad_reg_req-> + mgmt_class_version].vendor; + if (vendor) { + vclass = vendor_class_index(mgmt_class); + vendor_class = vendor->vendor_class[vclass]; + if (vendor_class) { + if (is_vendor_method_in_use( + vendor_class, + mad_reg_req)) + goto error3; + } + } + ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); + } + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; + } + } + + /* Add mad agent into port's agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); + INIT_LIST_HEAD(&mad_agent_priv->local_list); + INIT_WORK(&mad_agent_priv->local_work, local_completions, + mad_agent_priv); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + + return &mad_agent_priv->agent; + +error3: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + kfree(reg_req); +error2: + kfree(mad_agent_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_agent); + +static inline int is_snooping_sends(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (/*IB_MAD_SNOOP_POSTED_SENDS | + IB_MAD_SNOOP_RMPP_SENDS |*/ + IB_MAD_SNOOP_SEND_COMPLETIONS /*| + IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/)); +} + +static inline int is_snooping_recvs(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (IB_MAD_SNOOP_RECVS /*| + IB_MAD_SNOOP_RMPP_RECVS*/)); +} + +static int register_snoop_agent(struct ib_mad_qp_info *qp_info, + struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_snoop_private **new_snoop_table; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + /* Check for empty slot in array. 
*/ + for (i = 0; i < qp_info->snoop_table_size; i++) + if (!qp_info->snoop_table[i]) + break; + + if (i == qp_info->snoop_table_size) { + /* Grow table. */ + new_snoop_table = kmalloc(sizeof mad_snoop_priv * + qp_info->snoop_table_size + 1, + GFP_ATOMIC); + if (!new_snoop_table) { + i = -ENOMEM; + goto out; + } + if (qp_info->snoop_table) { + memcpy(new_snoop_table, qp_info->snoop_table, + sizeof mad_snoop_priv * + qp_info->snoop_table_size); + kfree(qp_info->snoop_table); + } + qp_info->snoop_table = new_snoop_table; + qp_info->snoop_table_size++; + } + qp_info->snoop_table[i] = mad_snoop_priv; + atomic_inc(&qp_info->snoop_count); +out: + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + return i; +} + +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_snoop_private *mad_snoop_priv; + int qpn; + + /* Validate parameters */ + if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) || + (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + /* Allocate structures */ + mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + if (!mad_snoop_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + /* Now, fill in the various structures */ + memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); + mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; + mad_snoop_priv->agent.device = device; + mad_snoop_priv->agent.recv_handler = recv_handler; + mad_snoop_priv->agent.snoop_handler = snoop_handler; + mad_snoop_priv->agent.context = context; + mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_snoop_priv->agent.port_num = port_num; + mad_snoop_priv->mad_snoop_flags = mad_snoop_flags; + init_waitqueue_head(&mad_snoop_priv->wait); + mad_snoop_priv->snoop_index = register_snoop_agent( + &port_priv->qp_info[qpn], + mad_snoop_priv); + if (mad_snoop_priv->snoop_index < 0) { + ret = ERR_PTR(mad_snoop_priv->snoop_index); + goto error2; + } + + atomic_set(&mad_snoop_priv->refcount, 1); + return &mad_snoop_priv->agent; + +error2: + kfree(mad_snoop_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_snoop); + +static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + /* Note that we could still be handling received MADs */ + + /* + * Canceling all sends results in dropping received response + * MADs, preventing us from queuing additional work + */ + cancel_mads(mad_agent_priv); + + port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->timed_work); + flush_workqueue(port_priv->wq); + + spin_lock_irqsave(&port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + /* XXX: Cleanup pending RMPP receives for this agent */ + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); +} + 
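/*
 * Editorial sketch, not part of this patch: what a typical caller of
 * ib_register_mad_agent() above looks like.  It registers for PerfMgmt
 * class GET methods on the GSI QP.  IB_MGMT_METHOD_GET and the handler
 * prototypes come from ib_mad.h; example_send_handler and
 * example_recv_handler are placeholders the caller would supply.
 * rmpp_version must be 0 for now, as enforced above.
 */
static struct ib_mad_agent *example_register(struct ib_device *device,
					     u8 port_num)
{
	struct ib_mad_reg_req reg_req;

	memset(&reg_req, 0, sizeof reg_req);
	reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT;
	reg_req.mgmt_class_version = 1;
	set_bit(IB_MGMT_METHOD_GET, reg_req.method_mask);

	return ib_register_mad_agent(device, port_num, IB_QPT_GSI,
				     &reg_req, 0 /* no RMPP */,
				     example_send_handler,
				     example_recv_handler,
				     NULL /* context */);
}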
+static void unregister_mad_snoop(struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_qp_info *qp_info; + unsigned long flags; + + qp_info = mad_snoop_priv->qp_info; + spin_lock_irqsave(&qp_info->snoop_lock, flags); + qp_info->snoop_table[mad_snoop_priv->snoop_index] = NULL; + atomic_dec(&qp_info->snoop_count); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + + atomic_dec(&mad_snoop_priv->refcount); + wait_event(mad_snoop_priv->wait, + !atomic_read(&mad_snoop_priv->refcount)); + + kfree(mad_snoop_priv); +} + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_snoop_private *mad_snoop_priv; + + /* If the TID is zero, the agent can only snoop. */ + if (mad_agent->hi_tid) { + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + unregister_mad_agent(mad_agent_priv); + } else { + mad_snoop_priv = container_of(mad_agent, + struct ib_mad_snoop_private, + agent); + unregister_mad_snoop(mad_snoop_priv); + } + return 0; +} +EXPORT_SYMBOL(ib_unregister_mad_agent); + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void snoop_send(struct ib_mad_qp_info *qp_info, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, + send_wr, mad_send_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +static void snoop_recv(struct ib_mad_qp_info *qp_info, + struct ib_mad_recv_wc *mad_recv_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.recv_handler(&mad_snoop_priv->agent, + mad_recv_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent_private *mad_agent_priv, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret, alloc_flags; + unsigned long flags; + struct ib_mad_local_private *local; + 
struct ib_mad_private *mad_priv; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + + if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + ret = -EINVAL; + printk(KERN_ERR PFX "Invalid directed route\n"); + goto out; + } + /* Check to post send on QP or process locally */ + ret = smi_check_local_dr_smp(smp, device, port_num); + if (!ret || !device->process_mad) + goto out; + + if (in_atomic() || irqs_disabled()) + alloc_flags = GFP_ATOMIC; + else + alloc_flags = GFP_KERNEL; + local = kmalloc(sizeof *local, alloc_flags); + if (!local) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); + goto out; + } + local->mad_priv = NULL; + mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + kfree(local); + goto out; + } + ret = device->process_mad(device, 0, port_num, smp->dr_slid, + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) + local->mad_priv = mad_priv; + else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: + kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); + ret = 0; + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); + ret = -EINVAL; + goto out; + } + + local->send_wr = *send_wr; + local->send_wr.sg_list = local->sg_list; + memcpy(local->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + local->send_wr.next = NULL; + local->tid = send_wr->wr.ud.mad_hdr->tid; + local->wr_id = send_wr->wr_id; + /* Reference MAD agent until local completion handled */ + atomic_inc(&mad_agent_priv->refcount); + /* Queue local completion to local list */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&local->completion_list, &mad_agent_priv->local_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->local_work); + ret = 1; +out: + return ret; +} + +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if (qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; + } + return ret; +} + +/* + * 
ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; + + /* Validate supplied parameters */ + if (!bad_send_wr) + goto error1; + + if (!mad_agent || !send_wr) + goto error2; + + if (!mad_agent->send_handler) + goto error2; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + /* Walk list of send WRs and post each on send list */ + while (send_wr) { + unsigned long flags; + struct ib_send_wr *next_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + + /* Validate more parameters */ + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + + if (!send_wr->wr.ud.mad_hdr) { + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", send_wr); + goto error2; + } + + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion + */ + next_send_wr = (struct ib_send_wr *)send_wr->next; + + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent_priv, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ + goto next; + } + + /* Allocate MAD send WR tracking structure */ + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_send_wr) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_send_wr_private\n"); + ret = -ENOMEM; + goto error2; + } + + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; + mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; + mad_send_wr->agent = mad_agent; + /* Timeout will be updated after send completes */ + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. 
+ ud.timeout_ms); + mad_send_wr->retry = 0; + /* One reference for each work request to QP + response */ + mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes */ + atomic_inc(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + ret = ib_send_mad(mad_agent_priv, mad_send_wr); + if (ret) { + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + goto error2; + } +next: + send_wr = next_send_wr; + } + return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return ret; +} +EXPORT_SYMBOL(ib_post_send_mad); + +/* + * ib_free_recv_mad - Returns data buffers used to receive + * a MAD to the access layer + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *entry; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *priv; + + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, header); + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { + /* Free previous receive buffer */ + kmem_cache_free(ib_mad_cache, priv); + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + } + + /* Free last buffer */ + kmem_cache_free(ib_mad_cache, priv); +} +EXPORT_SYMBOL(ib_free_recv_mad); + +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf) +{ + printk(KERN_ERR PFX "ib_coalesce_recv_mad() not implemented yet\n"); +} +EXPORT_SYMBOL(ib_coalesce_recv_mad); + +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + return ERR_PTR(-EINVAL); /* XXX: for now */ +} +EXPORT_SYMBOL(ib_redirect_mad_qp); + +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc) +{ + printk(KERN_ERR PFX "ib_process_mad_wc() not implemented yet\n"); + return 0; +} +EXPORT_SYMBOL(ib_process_mad_wc); + +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req) +{ + int i; + + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + if ((*method)->agent[i]) { + printk(KERN_ERR PFX "Method %d already in use\n", i); + return -EINVAL; + } + } + return 0; +} + +static int allocate_method_table(struct ib_mad_mgmt_method_table **method) +{ + /* Allocate management method table */ + *method = kmalloc(sizeof **method, GFP_ATOMIC); + if (!*method) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); + return -ENOMEM; + } + /* Clear management method table */ + memset(*method, 0, sizeof **method); + + return 0; +} + +/* + * Check to see if there are any methods still in use + */ +static int check_method_table(struct ib_mad_mgmt_method_table *method) +{ + int i; + + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) + if (method->agent[i]) + return 
1; + return 0; +} + +/* + * Check to see if there are any method tables for this class still in use + */ +static int check_class_table(struct ib_mad_mgmt_class_table *class) +{ + int i; + + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; +} + +static int check_vendor_class(struct ib_mad_mgmt_vendor_class *vendor_class) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + if (vendor_class->method_table[i]) + return 1; + return 0; +} + +static int find_vendor_oui(struct ib_mad_mgmt_vendor_class *vendor_class, + char *oui) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + /* Is there matching OUI for this vendor class ? */ + if (!memcmp(vendor_class->oui[i], oui, 3)) + return i; + + return -1; +} + +static int check_vendor_table(struct ib_mad_mgmt_vendor_class_table *vendor) +{ + int i; + + for (i = 0; i < MAX_MGMT_VENDOR_RANGE2; i++) + if (vendor->vendor_class[i]) + return 1; + + return 0; +} + +static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, + struct ib_mad_agent_private *agent) +{ + int i; + + /* Remove any methods for this mad agent */ + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + if (method->agent[i] == agent) { + method->agent[i] = NULL; + } + } +} + +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table **class; + struct ib_mad_mgmt_method_table **method; + int i, ret; + + port_priv = agent_priv->qp_info->port_priv; + class = &port_priv->version[mad_reg_req->mgmt_class_version].class; + if (!*class) { + /* Allocate management class table for "new" class version */ + *class = kmalloc(sizeof **class, GFP_ATOMIC); + if (!*class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; + goto error1; + } + /* Clear management class table */ + memset(*class, 0, sizeof(**class)); + /* Allocate method table for this management class */ + method = &(*class)->method_table[mgmt_class]; + if ((ret = allocate_method_table(method))) + goto error2; + } else { + method = &(*class)->method_table[mgmt_class]; + if (!*method) { + /* Allocate method table for this management class */ + if ((ret = allocate_method_table(method))) + goto error1; + } + } + + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error3; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error3: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; + goto error1; +error2: + kfree(*class); + *class = NULL; +error1: + return ret; +} + +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_vendor_class_table **vendor_table; + struct ib_mad_mgmt_vendor_class_table *vendor = NULL; + struct ib_mad_mgmt_vendor_class *vendor_class = NULL; + struct ib_mad_mgmt_method_table **method; + int i, ret = -ENOMEM; + u8 vclass; + + /* "New" vendor (with OUI) 
class */ + vclass = vendor_class_index(mad_reg_req->mgmt_class); + port_priv = agent_priv->qp_info->port_priv; + vendor_table = &port_priv->version[ + mad_reg_req->mgmt_class_version].vendor; + if (!*vendor_table) { + /* Allocate mgmt vendor class table for "new" class version */ + vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + if (!vendor) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class_table\n"); + goto error1; + } + /* Clear management vendor class table */ + memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; + } + if (!(*vendor_table)->vendor_class[vclass]) { + /* Allocate table for this management vendor class */ + vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + if (!vendor_class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class\n"); + goto error2; + } + memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* Is there matching OUI for this vendor class ? */ + if (!memcmp((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3)) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(!*method); + goto check_in_use; + } + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* OUI slot available ? */ + if (!is_vendor_oui((*vendor_table)->vendor_class[ + vclass]->oui[i])) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(*method); + /* Allocate method table for this OUI */ + if ((ret = allocate_method_table(method))) + goto error3; + memcpy((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3); + goto check_in_use; + } + } + printk(KERN_ERR PFX "All OUI slots in use\n"); + goto error3; + +check_in_use: + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error4; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error4: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; +error3: + if (vendor_class) { + (*vendor_table)->vendor_class[vclass] = NULL; + kfree(vendor_class); + } +error2: + if (vendor) { + *vendor_table = NULL; + kfree(vendor); + } +error1: + return ret; +} + +static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + int index; + u8 mgmt_class; + + /* + * Was MAD registration request supplied + * with original registration ? 
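+ * (If it was not, the agent was never added to the class or vendor + * method tables, so there is nothing to remove here.)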
+ */ + if (!agent_priv->reg_req) { + goto out; + } + + port_priv = agent_priv->qp_info->port_priv; + class = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].class; + if (!class) + goto vendor_check; + + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* Now, check to see if there are any methods still in use */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + class->method_table[mgmt_class] = NULL; + /* Any management classes left ? */ + if (!check_class_table(class)) { + /* If not, release management class table */ + kfree(class); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version].class = NULL; + } + } + } + +vendor_check: + vendor = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) + goto out; + + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); + vendor_class = vendor->vendor_class[mgmt_class]; + if (vendor_class) { + index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* + * Now, check to see if there are + * any methods still in use + */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + vendor_class->method_table[index] = NULL; + memset(vendor_class->oui[index], 0, 3); + /* Any OUIs left ? */ + if (!check_vendor_class(vendor_class)) { + /* If not, release vendor class table */ + kfree(vendor_class); + vendor->vendor_class[mgmt_class] = NULL; + /* Any other vendor classes left ? */ + if (!check_vendor_table(vendor)) { + kfree(vendor); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version]. + vendor = NULL; + } + } + } + } + } + +out: + return; +} + +static int response_mad(struct ib_mad *mad) +{ + /* Trap represses are responses although response bit is reset */ + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); +} + +static int solicited_mad(struct ib_mad *mad) +{ + /* CM MADs are never solicited */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { + return 0; + } + + /* XXX: Determine whether MAD is using RMPP */ + + /* Not using RMPP */ + /* Is this MAD a response to a previous MAD ? */ + return response_mad(mad); +} + +static struct ib_mad_agent_private * +find_mad_agent(struct ib_mad_port_private *port_priv, + struct ib_mad *mad, + int solicited) +{ + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ + if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
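+ * Each registered agent owns a unique hi_tid value, so a solicited + * response is routed back to the agent that sent the original request.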
+ */ + hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { + if (entry->agent.hi_tid == hi_tid) { + mad_agent = entry; + break; + } + } + } else { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_vendor_mad *vendor_mad; + int index; + + /* + * Routing is based on version, class, and method + * For "newer" vendor MADs, also based on OUI + */ + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) + goto out; + if (!is_vendor_class(mad->mad_hdr.mgmt_class)) { + class = port_priv->version[ + mad->mad_hdr.class_version].class; + if (!class) + goto out; + method = class->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (method) + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } else { + vendor = port_priv->version[ + mad->mad_hdr.class_version].vendor; + if (!vendor) + goto out; + vendor_class = vendor->vendor_class[vendor_class_index( + mad->mad_hdr.mgmt_class)]; + if (!vendor_class) + goto out; + /* Find matching OUI */ + vendor_mad = (struct ib_vendor_mad *)mad; + index = find_vendor_oui(vendor_class, vendor_mad->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + } + } + + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + printk(KERN_NOTICE PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; + } + } +out: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + return mad_agent; +} + +static int validate_mad(struct ib_mad *mad, u32 qp_num) +{ + int valid = 0; + + /* Make sure MAD base version is understood */ + if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { + printk(KERN_ERR PFX "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + goto out; + } + + /* Filter SMI packets sent to other than QP0 */ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { + if (qp_num == 0) + valid = 1; + } else { + /* Filter GSI packets sent to QP0 */ + if (qp_num != 0) + valid = 1; + } + +out: + return valid; +} + +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet + */ +static struct ib_mad_private * +reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->tid == tid) + return mad_send_wr; + } + + /* + * It's possible to receive the response before we've + * been notified that the send has completed + */ + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + /* Verify request has not been canceled */ + return (mad_send_wr->status == IB_WC_SUCCESS) ? 
+ mad_send_wr : NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + + /* Complete corresponding request */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + ib_free_recv_mad(&recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + /* Timeout = 0 means that we won't wait for a response */ + mad_send_wr->timeout = 0; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Defined behavior is to complete response before request */ + recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + +static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_qp_info *qp_info; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv, *response; + struct ib_mad_list_head *mad_list; + struct ib_mad_agent_private *mad_agent; + int solicited; + + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); + dma_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + + /* Setup MAD receive work completion from "normal" work completion */ + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); + recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; + recv->header.recv_wc.recv_buf.grh = &recv->grh; + + if (atomic_read(&qp_info->snoop_count)) + snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); + + /* Validate MAD */ + if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) + goto out; + + if (recv->mad.mad.mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) + goto out; + if (!smi_check_forward_dr_smp(&recv->mad.smp)) + goto local; + if (!smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(&recv->mad.smp, 
+ port_priv->device, + port_priv->port_num)) + goto out; + } + +local: + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + int ret; + + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* + * Is it better to assume that + * it wouldn't be processed ? + */ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + &recv->mad.mad, + &response->mad.mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; + if (ret & IB_MAD_RESULT_REPLY) { + /* Send response */ + if (!agent_send(response, &recv->grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; + goto out; + } + } + } + + /* Determine corresponding MAD agent for incoming receive MAD */ + solicited = solicited_mad(&recv->mad.mad); + mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); + if (mad_agent) { + ib_mad_complete_recv(mad_agent, recv, solicited); + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv() + */ + recv = NULL; + } + +out: + /* Post another receive request for this QP */ + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); +} + +static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long delay; + + if (list_empty(&mad_agent_priv->wait_list)) { + cancel_delayed_work(&mad_agent_priv->timed_work); + } else { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { + mad_agent_priv->timeout = mad_send_wr->timeout; + cancel_delayed_work(&mad_agent_priv->timed_work); + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + } + } +} + +static void wait_for_response(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr ) +{ + struct ib_mad_send_wr_private *temp_mad_send_wr; + struct list_head *list_item; + unsigned long delay; + + list_del(&mad_send_wr->agent_list); + + delay = mad_send_wr->timeout; + mad_send_wr->timeout += jiffies; + + list_for_each_prev(list_item, &mad_agent_priv->wait_list) { + temp_mad_send_wr = list_entry(list_item, + struct ib_mad_send_wr_private, + agent_list); + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) + break; + } + list_add(&mad_send_wr->agent_list, list_item); + + /* Reschedule a work item if we have a shorter timeout */ + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { + cancel_delayed_work(&mad_agent_priv->timed_work); + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->timed_work, delay); + } +} + +/* + * Process a send work completion + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = mad_send_wc->status; + 
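/* A failed send will never receive a response, so drop the reference that was held for one */ +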
mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + + if (--mad_send_wr->refcount > 0) { + if (mad_send_wr->refcount == 1 && mad_send_wr->timeout && + mad_send_wr->status == IB_WC_SUCCESS) { + wait_for_response(mad_agent_priv, mad_send_wr); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + return; + } + + /* Remove send from MAD agent and notify client of completion */ + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); + + /* Release reference on agent taken when sending */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + +static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send */ + wc->wr_id = mad_send_wr->wr_id; + if (atomic_read(&qp_info->snoop_count)) + snoop_send(qp_info, &mad_send_wr->send_wr, + (struct ib_mad_send_wc *)wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } +} + +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup 
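+ * (posted receive buffers are reclaimed by cleanup_recv_queue() + * when the port is closed)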
+ */ + return; + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + struct ib_qp_attr *attr; + + /* Transition QP to RTS and fail offending send */ + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } + ib_mad_send_done_handler(port_priv, wc); + } +} + +/* + * IB MAD completion callback + */ +static void ib_mad_completion_handler(void *data) +{ + struct ib_mad_port_private *port_priv; + struct ib_wc wc; + + port_priv = (struct ib_mad_port_private *)data; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); + } +} + +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_list) { + if (mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + } + + /* Empty wait list to prevent receives from finding a request */ + list_splice_init(&mad_agent_priv->wait_list, &cancel_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Report all cancelled requests */ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_list) { + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_list); + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + } +} + +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = 
container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + + if (mad_send_wr->refcount != 0) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + +out: + return; +} +EXPORT_SYMBOL(ib_cancel_mad); + +static void local_completions(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_local_private *local; + unsigned long flags; + struct ib_wc wc; + struct ib_mad_send_wc mad_send_wc; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->local_list)) { + local = list_entry(mad_agent_priv->local_list.next, + struct ib_mad_local_private, + completion_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + if (local->mad_priv) { + /* + * Defined behavior is to complete response + * before request + */ + wc.wr_id = local->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + local->mad_priv->header.recv_wc.wc = &wc; + local->mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&local->mad_priv->header.recv_wc.recv_buf.list); + local->mad_priv->header.recv_wc.recv_buf.grh = NULL; + local->mad_priv->header.recv_wc.recv_buf.mad = + &local->mad_priv->mad.mad; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &local->mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &local->mad_priv->header.recv_wc); + } + + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = local->wr_id; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, &local->send_wr, + &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&local->completion_list); + atomic_dec(&mad_agent_priv->refcount); + kfree(local); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void timeout_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags, delay; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; + mad_send_wc.vendor_err = 0; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->wait_list)) { + mad_send_wr = 
list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_send_wr->timeout, jiffies)) { + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + break; + } + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void ib_mad_thread_completion_handler(struct ib_cq *cq) +{ + struct ib_mad_port_private *port_priv = cq->cq_context; + + queue_work(port_priv->wq, &port_priv->work); +} + +/* + * Allocate receive MADs and post receive WRs for them + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) +{ + unsigned long flags; + int post, ret; + struct ib_mad_private *mad_priv; + struct ib_sge sg_list; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; + + /* Initialize common scatter list fields */ + sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; + + /* Initialize common receive WR fields */ + recv_wr.next = NULL; + recv_wr.sg_list = &sg_list; + recv_wr.num_sge = 1; + recv_wr.recv_flags = IB_RECV_SIGNALED; + + do { + /* Allocate and map receive buffer */ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = dma_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + + /* Post receive WR */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); + if (ret) { + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); + break; + } + } while (post); + + return ret; +} + +/* + * Return all the posted receive MADs + */ +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; + + while (!list_empty(&qp_info->recv_queue.list)) { + + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + 
header); + + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); + + /* Undo PCI mapping */ + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, recv); + } + + qp_info->recv_queue.count = 0; +} + +/* + * Start the port + */ +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +{ + int ret, i; + struct ib_qp_attr *attr; + struct ib_qp *qp; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); + return -ENOMEM; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "INIT: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTR: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTS: %d\n", i, ret); + goto out; + } + } + + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion " + "notification: %d\n", ret); + goto out; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); + goto out; + } + } +out: + kfree(attr); + return ret; +} + +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
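- a fatal event on a MAD QP is only logged here; the QP is not recovered 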
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) +{ + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); + spin_lock_init(&qp_info->snoop_lock); + qp_info->snoop_table = NULL; + qp_info->snoop_table_size = 0; + atomic_set(&qp_info->snoop_count, 0); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = qp_info->port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + /* Use minimum queue sizes unless the CQ is resized */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); + if (qp_info->snoop_table) + kfree(qp_info->snoop_table); +} + +/* + * Open the port + * Create the QP, PD, MR, and CQ if needed + */ +static int ib_mad_port_open(struct ib_device *device, + int port_num) +{ + int ret, cq_size; + struct ib_mad_port_private *port_priv; + unsigned long flags; + char name[sizeof "ib_mad123"]; + + /* First, check if port already open at MAD layer */ + port_priv = ib_get_mad_port(device, port_num); + if (port_priv) { + printk(KERN_DEBUG PFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); + return -ENOMEM; + } + memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); + + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + port_priv->cq = ib_create_cq(port_priv->device, + (ib_comp_handler) + ib_mad_thread_completion_handler, + NULL, port_priv, cq_size); + if (IS_ERR(port_priv->cq)) { + printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); + ret = PTR_ERR(port_priv->cq); + goto error3; + } + + port_priv->pd = ib_alloc_pd(device); + if (IS_ERR(port_priv->pd)) { + printk(KERN_ERR PFX "Couldn't create ib_mad 
PD\n"); + ret = PTR_ERR(port_priv->pd); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR PFX "Couldn't get ib_mad DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; + + snprintf(name, sizeof name, "ib_mad%d", port_num); + port_priv->wq = create_singlethread_workqueue(name); + if (!port_priv->wq) { + ret = -ENOMEM; + goto error8; + } + INIT_WORK(&port_priv->work, ib_mad_completion_handler, port_priv); + + ret = ib_mad_port_start(port_priv); + if (ret) { + printk(KERN_ERR PFX "Couldn't start port\n"); + goto error9; + } + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_mad_port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + return 0; + +error9: + destroy_workqueue(port_priv->wq); +error8: + destroy_mad_qp(&port_priv->qp_info[1]); +error7: + destroy_mad_qp(&port_priv->qp_info[0]); +error6: + ib_dereg_mr(port_priv->mr); +error5: + ib_dealloc_pd(port_priv->pd); +error4: + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); +error3: + kfree(port_priv); + + return ret; +} + +/* + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure + */ +static int ib_mad_port_close(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + port_priv = __ib_get_mad_port(device, port_num); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + printk(KERN_ERR PFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + /* Stop processing completions. 
*/ + flush_workqueue(port_priv->wq); + destroy_workqueue(port_priv->wq); + destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); + ib_dereg_mr(port_priv->mr); + ib_dealloc_pd(port_priv->pd); + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); + /* XXX: Handle deallocation of MAD registration tables */ + + kfree(port_priv); + + return 0; +} + +static void ib_mad_init_device(struct ib_device *device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + ret = ib_agent_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", + device->name, cur_port); + goto error_device_open; + } + } + + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_mad_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client mad_client = { + .name = "mad", + .add = ib_mad_init_device, + .remove = ib_mad_remove_device +}; + +static int __init ib_mad_init_module(void) +{ + int ret; + + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + + ib_mad_cache = kmem_cache_create("ib_mad", + sizeof(struct ib_mad_private), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + if (!ib_mad_cache) { + printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); + ret = -ENOMEM; + goto error1; + } + + INIT_LIST_HEAD(&ib_mad_port_list); + + if (ib_register_client(&mad_client)) { + printk(KERN_ERR PFX "Couldn't register ib_mad client\n"); + ret = -EINVAL; + goto error2; + } + + return 0; + +error2: + kmem_cache_destroy(ib_mad_cache); +error1: + return ret; +} + +static void __exit ib_mad_cleanup_module(void) +{ + ib_unregister_client(&mad_client); + + if (kmem_cache_destroy(ib_mad_cache)) { + printk(KERN_DEBUG PFX "Failed to destroy ib_mad cache\n"); + } +} + +module_init(ib_mad_init_module); +module_exit(ib_mad_cleanup_module); From roland at topspin.com Mon Dec 27 21:51:00 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:00 -0800 Subject: [openib-general] [PATCH][v5][7/24] Add InfiniBand MAD SMI support In-Reply-To: <200412272150.vaYM2cWv3KsWVYPV@topspin.com> Message-ID: 
<200412272151.7VvrtyAsRwgOK3mP@topspin.com> Add MAD layer SMI (Subnet Management Interface) code. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.c 2004-12-27 21:48:20.566938847 -0800 @@ -0,0 +1,234 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include + + +/* + * Fixup a directed route SMP for sending + * Return 0 if the SMP should be discarded + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- Fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer */ + return 0; + } +} + +/* + * Adjust information for a received SMP + * Return 0 if the SMP should be dropped + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* 
giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM */ + /* C14-13:5 -- Check for unreasonable hop pointer */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue + * Return 0 if the SMP should be completed up the stack + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.h 2004-12-27 21:48:20.592935020 -0800 @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: smi.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From roland at topspin.com Mon Dec 27 21:51:01 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:01 -0800 Subject: [openib-general] [PATCH][v5][8/24] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <200412272151.7VvrtyAsRwgOK3mP@topspin.com> Message-ID: <200412272151.h64HpFQTg9SpyhAM@topspin.com> Add support for sending queries to the SA (Subnet Administration). In particular the PathRecord and MCMember (multicast group member) used by the IP-over-InfiniBand driver are implemented. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-27 21:48:19.838046137 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-27 21:48:20.847897490 -0800 @@ -1,8 +1,10 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o ib_mad-y := mad.o smi.o agent.o + +ib_sa-y := sa_query.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-12-27 21:48:20.896890279 -0800 @@ -0,0 +1,866 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
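Before getting into sa_query.c itself, it may help to see roughly how a consumer such as IPoIB is expected to drive the PathRecord interface added below. This is only a sketch, not code from the patch: my_path_complete(), my_start_path_query(), the GID arguments and the one-second timeout are all made up for illustration.

        static void my_path_complete(int status, struct ib_sa_path_rec *resp,
                                     void *context)
        {
                /* resp is only valid when status == 0 */
                if (status)
                        printk(KERN_WARNING "path record query failed (%d)\n",
                               status);
                else
                        printk(KERN_INFO "path found: dlid 0x%x sl %d\n",
                               resp->dlid, resp->sl);
        }

        static int my_start_path_query(struct ib_device *device, u8 port_num,
                                       union ib_gid *sgid, union ib_gid *dgid,
                                       struct ib_sa_query **query)
        {
                struct ib_sa_path_rec rec = {
                        .sgid      = *sgid,
                        .dgid      = *dgid,
                        .numb_path = 1,
                };

                /* returns a query id (>= 0) or a negative errno; the
                   (id, *query) pair can later be fed to ib_sa_cancel_query() */
                return ib_sa_path_rec_get(device, port_num, &rec,
                                          IB_SA_PATH_REC_SGID |
                                          IB_SA_PATH_REC_DGID |
                                          IB_SA_PATH_REC_NUMB_PATH,
                                          1000, GFP_KERNEL,
                                          my_path_complete, NULL, query);
        }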
+ * + * $Id: sa_query.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); +MODULE_LICENSE("Dual BSD/GPL"); + +/* + * These two structures must be packed because they have 64-bit fields + * that are only 32-bit aligned. 64-bit architectures will lay them + * out wrong otherwise. (And unfortunately they are sent on the wire + * so we can't change the layout) + */ +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + +struct ib_sa_sm_ah { + struct ib_ah *ah; + struct kref ref; +}; + +struct ib_sa_port { + struct ib_mad_agent *agent; + struct ib_mr *mr; + struct ib_sa_sm_ah *sm_ah; + struct work_struct update_task; + spinlock_t ah_lock; + u8 port_num; +}; + +struct ib_sa_device { + int start_port, end_port; + struct ib_event_handler event_handler; + struct ib_sa_port port[0]; +}; + +struct ib_sa_query { + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); + void (*release)(struct ib_sa_query *); + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_sm_ah *sm_ah; + DECLARE_PCI_UNMAP_ADDR(mapping) + int id; +}; + +struct ib_sa_path_query { + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +struct ib_sa_mcmember_query { + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +static void ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device); + +static struct ib_client sa_client = { + .name = "sa", + .add = ib_sa_add_one, + .remove = ib_sa_remove_one +}; + +static spinlock_t idr_lock; +static DEFINE_IDR(query_idr); + +static spinlock_t tid_lock; +static u32 tid; + +enum { + IB_SA_ATTR_CLASS_PORTINFO = 0x01, + IB_SA_ATTR_NOTICE = 0x02, + IB_SA_ATTR_INFORM_INFO = 0x03, + IB_SA_ATTR_NODE_REC = 0x11, + IB_SA_ATTR_PORT_INFO_REC = 0x12, + IB_SA_ATTR_SL2VL_REC = 0x13, + IB_SA_ATTR_SWITCH_REC = 0x14, + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, + IB_SA_ATTR_MCAST_FDB_REC = 0x17, + IB_SA_ATTR_SM_INFO_REC = 0x18, + IB_SA_ATTR_LINK_REC = 0x20, + IB_SA_ATTR_GUID_INFO_REC = 0x30, + IB_SA_ATTR_SERVICE_REC = 0x31, + IB_SA_ATTR_PARTITION_REC = 0x33, + IB_SA_ATTR_RANGE_REC = 0x34, + IB_SA_ATTR_PATH_REC = 0x35, + IB_SA_ATTR_VL_ARB_REC = 0x36, + IB_SA_ATTR_MC_GROUP_REC = 0x37, + IB_SA_ATTR_MC_MEMBER_REC = 0x38, + IB_SA_ATTR_TRACE_REC = 0x39, + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b +}; + +#define PATH_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ + .field_name = "sa_path_rec:" #field + +static const struct ib_field path_rec_table[] = { + { RESERVED, + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 32 }, + { PATH_REC_FIELD(dgid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(sgid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(dlid), + .offset_words = 10, + .offset_bits 
= 0, + .size_bits = 16 }, + { PATH_REC_FIELD(slid), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 16 }, + { PATH_REC_FIELD(raw_traffic), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 11, + .offset_bits = 1, + .size_bits = 3 }, + { PATH_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { PATH_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { PATH_REC_FIELD(traffic_class), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 8 }, + { PATH_REC_FIELD(reversible), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { PATH_REC_FIELD(numb_path), + .offset_words = 12, + .offset_bits = 9, + .size_bits = 7 }, + { PATH_REC_FIELD(pkey), + .offset_words = 12, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 13, + .offset_bits = 0, + .size_bits = 12 }, + { PATH_REC_FIELD(sl), + .offset_words = 13, + .offset_bits = 12, + .size_bits = 4 }, + { PATH_REC_FIELD(mtu_selector), + .offset_words = 13, + .offset_bits = 16, + .size_bits = 2 }, + { PATH_REC_FIELD(mtu), + .offset_words = 13, + .offset_bits = 18, + .size_bits = 6 }, + { PATH_REC_FIELD(rate_selector), + .offset_words = 13, + .offset_bits = 24, + .size_bits = 2 }, + { PATH_REC_FIELD(rate), + .offset_words = 13, + .offset_bits = 26, + .size_bits = 6 }, + { PATH_REC_FIELD(packet_life_time_selector), + .offset_words = 14, + .offset_bits = 0, + .size_bits = 2 }, + { PATH_REC_FIELD(packet_life_time), + .offset_words = 14, + .offset_bits = 2, + .size_bits = 6 }, + { PATH_REC_FIELD(preference), + .offset_words = 14, + .offset_bits = 8, + .size_bits = 8 }, + { RESERVED, + .offset_words = 14, + .offset_bits = 16, + .size_bits = 48 }, +}; + +#define MCMEMBER_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_mcmember_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_mcmember_rec *) 0)->field, \ + .field_name = "sa_mcmember_rec:" #field + +static const struct ib_field mcmember_rec_table[] = { + { MCMEMBER_REC_FIELD(mgid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(port_gid), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(qkey), + .offset_words = 8, + .offset_bits = 0, + .size_bits = 32 }, + { MCMEMBER_REC_FIELD(mlid), + .offset_words = 9, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(mtu_selector), + .offset_words = 9, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(mtu), + .offset_words = 9, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(traffic_class), + .offset_words = 9, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(pkey), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(rate_selector), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(rate), + .offset_words = 10, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(packet_life_time_selector), + .offset_words = 10, + .offset_bits = 24, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(packet_life_time), + .offset_words = 10, + .offset_bits = 26, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(sl), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { MCMEMBER_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(scope), + 
.offset_words = 12, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(join_state), + .offset_words = 12, + .offset_bits = 4, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(proxy_join), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { RESERVED, + .offset_words = 12, + .offset_bits = 9, + .size_bits = 23 }, +}; + +static void free_sm_ah(struct kref *kref) +{ + struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); + + ib_destroy_ah(sm_ah->ah); + kfree(sm_ah); +} + +static void update_sm_ah(void *port_ptr) +{ + struct ib_sa_port *port = port_ptr; + struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + + if (ib_query_port(port->agent->device, port->port_num, &port_attr)) { + printk(KERN_WARNING "Couldn't query port\n"); + return; + } + + new_ah = kmalloc(sizeof *new_ah, GFP_KERNEL); + if (!new_ah) { + printk(KERN_WARNING "Couldn't allocate new SM AH\n"); + return; + } + + kref_init(&new_ah->ref); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + new_ah->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(new_ah->ah)) { + printk(KERN_WARNING "Couldn't create new SM AH\n"); + kfree(new_ah); + return; + } + + spin_lock_irq(&port->ah_lock); + old_ah = port->sm_ah; + port->sm_ah = new_ah; + spin_unlock_irq(&port->ah_lock); + + if (old_ah) + kref_put(&old_ah->ref, free_sm_ah); +} + +static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_sa_device *sa_dev = + ib_get_client_data(event->device, &sa_client); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } +} + +/** + * ib_sa_cancel_query - try to cancel an SA query + * @id:ID of query to cancel + * @query:query pointer to cancel + * + * Try to cancel an SA query. If the id and query don't match up or + * the query has already completed, nothing is done. Otherwise the + * query is canceled and will complete with a status of -EINTR. 
+ */ +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + unsigned long flags; + struct ib_mad_agent *agent; + + spin_lock_irqsave(&idr_lock, flags); + if (idr_find(&query_idr, id) != query) { + spin_unlock_irqrestore(&idr_lock, flags); + return; + } + agent = query->port->agent; + spin_unlock_irqrestore(&idr_lock, flags); + + ib_cancel_mad(agent, id); +} +EXPORT_SYMBOL(ib_sa_cancel_query); + +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +{ + unsigned long flags; + + memset(mad, 0, sizeof *mad); + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + + spin_lock_irqsave(&tid_lock, flags); + mad->mad_hdr.tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | tid++); + spin_unlock_irqrestore(&tid_lock, flags); +} + +static int send_mad(struct ib_sa_query *query, int timeout_ms) +{ + struct ib_sa_port *port = query->port; + unsigned long flags; + int ret; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .mad_hdr = &query->mad->mad_hdr, + .remote_qpn = 1, + .remote_qkey = IB_QP1_QKEY, + .timeout_ms = timeout_ms + } + } + }; + +retry: + if (!idr_pre_get(&query_idr, GFP_ATOMIC)) + return -ENOMEM; + spin_lock_irqsave(&idr_lock, flags); + ret = idr_get_new(&query_idr, query, &query->id); + spin_unlock_irqrestore(&idr_lock, flags); + if (ret == -EAGAIN) + goto retry; + if (ret) + return ret; + + wr.wr_id = query->id; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + wr.wr.ud.ah = port->sm_ah->ah; + spin_unlock_irqrestore(&port->ah_lock, flags); + + gather_list.addr = dma_map_single(port->agent->device->dma_device, + query->mad, + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + gather_list.length = sizeof (struct ib_sa_mad); + gather_list.lkey = port->mr->lkey; + pci_unmap_addr_set(query, mapping, gather_list.addr); + + ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(port->agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, query->id); + spin_unlock_irqrestore(&idr_lock, flags); + } + + return ret; +} + +static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_path_query *query = + container_of(sa_query, struct ib_sa_path_query, sa_query); + + if (mad) { + struct ib_sa_path_rec rec; + + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); +} + +/** + * ib_sa_path_rec_get - Start a Path get query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to 
cancel query + * + * Send a Path Record Get query to the SA to look up a path. The + * callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_path_rec_get() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. + */ +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_path_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_path_rec_callback; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_mcmember_query *query = + container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + + if (mad) { + struct ib_sa_mcmember_rec rec; + + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); +} + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_mcmember_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_mcmember_rec_callback; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = method; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (!query) + return; + + switch (mad_send_wc->status) { + case IB_WC_SUCCESS: + /* No callback -- already got recv */ + break; + case IB_WC_RESP_TIMEOUT_ERR: + query->callback(query, -ETIMEDOUT, NULL); + break; + case IB_WC_WR_FLUSH_ERR: + query->callback(query, -EINTR, NULL); + break; + default: + query->callback(query, -EIO, NULL); + break; + } + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + + query->release(query); + + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (query) { + if (mad_recv_wc->wc->status == IB_WC_SUCCESS) + query->callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + else + query->callback(query, -EIO, NULL); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void ib_sa_add_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + sa_dev = kmalloc(sizeof *sa_dev + + (e - s + 1) * sizeof (struct ib_sa_port), + GFP_KERNEL); + if (!sa_dev) + return; + + sa_dev->start_port = s; + sa_dev->end_port = e; + + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); + + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + NULL, 0, send_handler, + recv_handler, sa_dev); + if (IS_ERR(sa_dev->port[i].agent)) + goto err; + + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + goto err; + } + + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); + } + + ib_set_client_data(device, &sa_client, sa_dev); + + /* + * We register our event handler after everything is set up, + * and then update our cached info after the event handler is + * registered to avoid any problems if a port changes state + * during our initialization. 
+ */ + + INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); + if (ib_register_event_handler(&sa_dev->event_handler)) + goto err; + + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); + + return; + +err: + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + + kfree(sa_dev); + + return; +} + +static void ib_sa_remove_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i; + + if (!sa_dev) + return; + + ib_unregister_event_handler(&sa_dev->event_handler); + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } + + kfree(sa_dev); +} + +static int __init ib_sa_init(void) +{ + int ret; + + spin_lock_init(&idr_lock); + spin_lock_init(&tid_lock); + + get_random_bytes(&tid, sizeof tid); + + ret = ib_register_client(&sa_client); + if (ret) + printk(KERN_ERR "Couldn't register ib_sa client\n"); + + return ret; +} + +static void __exit ib_sa_cleanup(void) +{ + ib_unregister_client(&sa_client); +} + +module_init(ib_sa_init); +module_exit(ib_sa_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_sa.h 2004-12-27 21:48:20.923886305 -0800 @@ -0,0 +1,280 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_sa.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +#include +#include + +enum { + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + + IB_SA_METHOD_DELETE = 0x15 +}; + +enum ib_sa_selector { + IB_SA_GTE = 0, + IB_SA_LTE = 1, + IB_SA_EQ = 2, + /* + * The meaning of "best" depends on the attribute: for + * example, for MTU best will return the largest available + * MTU, while for packet life time, best will return the + * smallest available life time. + */ + IB_SA_BEST = 3 +}; + +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +/* + * Structures for SA records are named "struct ib_sa_xxx_rec." 
No + * attempt is made to pack structures to match the physical layout of + * SA records in SA MADs; all packing and unpacking is handled by the + * SA query code. + * + * For a record with structure ib_sa_xxx_rec, the naming convention + * for the component mask value for field yyy is IB_SA_XXX_REC_YYY (we + * never use different abbreviations or otherwise change the spelling + * of xxx/yyy between ib_sa_xxx_rec.yyy and IB_SA_XXX_REC_YYY). + * + * Reserved rows are indicated with comments to help maintainability. + */ + +/* reserved: 0 */ +/* reserved: 1 */ +#define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) +#define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) +#define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) +#define IB_SA_PATH_REC_SLID IB_SA_COMP_MASK( 5) +#define IB_SA_PATH_REC_RAW_TRAFFIC IB_SA_COMP_MASK( 6) +/* reserved: 7 */ +#define IB_SA_PATH_REC_FLOW_LABEL IB_SA_COMP_MASK( 8) +#define IB_SA_PATH_REC_HOP_LIMIT IB_SA_COMP_MASK( 9) +#define IB_SA_PATH_REC_TRAFFIC_CLASS IB_SA_COMP_MASK(10) +#define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) +#define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) +#define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) +/* reserved: 14 */ +#define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) +#define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) +#define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) +#define IB_SA_PATH_REC_RATE_SELECTOR IB_SA_COMP_MASK(18) +#define IB_SA_PATH_REC_RATE IB_SA_COMP_MASK(19) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(20) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(21) +#define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ib_gid dgid; + union ib_gid sgid; + u16 dlid; + u16 slid; + int raw_traffic; + /* reserved */ + u32 flow_label; + u8 hop_limit; + u8 traffic_class; + int reversible; + u8 numb_path; + u16 pkey; + /* reserved */ + u8 sl; + u8 mtu_selector; + enum ib_mtu mtu; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 preference; +}; + +#define IB_SA_MCMEMBER_REC_MGID IB_SA_COMP_MASK( 0) +#define IB_SA_MCMEMBER_REC_PORT_GID IB_SA_COMP_MASK( 1) +#define IB_SA_MCMEMBER_REC_QKEY IB_SA_COMP_MASK( 2) +#define IB_SA_MCMEMBER_REC_MLID IB_SA_COMP_MASK( 3) +#define IB_SA_MCMEMBER_REC_MTU_SELECTOR IB_SA_COMP_MASK( 4) +#define IB_SA_MCMEMBER_REC_MTU IB_SA_COMP_MASK( 5) +#define IB_SA_MCMEMBER_REC_TRAFFIC_CLASS IB_SA_COMP_MASK( 6) +#define IB_SA_MCMEMBER_REC_PKEY IB_SA_COMP_MASK( 7) +#define IB_SA_MCMEMBER_REC_RATE_SELECTOR IB_SA_COMP_MASK( 8) +#define IB_SA_MCMEMBER_REC_RATE IB_SA_COMP_MASK( 9) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(10) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(11) +#define IB_SA_MCMEMBER_REC_SL IB_SA_COMP_MASK(12) +#define IB_SA_MCMEMBER_REC_FLOW_LABEL IB_SA_COMP_MASK(13) +#define IB_SA_MCMEMBER_REC_HOP_LIMIT IB_SA_COMP_MASK(14) +#define IB_SA_MCMEMBER_REC_SCOPE IB_SA_COMP_MASK(15) +#define IB_SA_MCMEMBER_REC_JOIN_STATE IB_SA_COMP_MASK(16) +#define IB_SA_MCMEMBER_REC_PROXY_JOIN IB_SA_COMP_MASK(17) + +struct ib_sa_mcmember_rec { + union ib_gid mgid; + union ib_gid port_gid; + u32 qkey; + u16 mlid; + u8 mtu_selector; + enum ib_mtu mtu; + u8 traffic_class; + u16 pkey; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 sl; + u32 flow_label; + u8 hop_limit; + u8 scope; + u8 join_state; + int proxy_join; +}; + +struct ib_sa_query; + +void ib_sa_cancel_query(int id, struct ib_sa_query *query); + +int 
ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +/** + * ib_sa_mcmember_rec_set - Start an MCMember set query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:MCMember Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send an MCMember Set query to the SA (eg to join a multicast + * group). The callback function will be called when the query + * completes (or fails); status is 0 for a successful response, -EINTR + * if the query is canceled, -ETIMEDOUT is the query timed out, or + * -EIO if an error occurred sending the query. The resp parameter of + * the callback is only valid if status is 0. + * + * If the return value of ib_sa_mcmember_rec_set() is negative, it is + * an error code. Otherwise it is a query ID that can be used to + * cancel the query. + */ +static inline int +ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + +/** + * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:MCMember Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send an MCMember Delete query to the SA (eg to leave a multicast + * group). The callback function will be called when the query + * completes (or fails); status is 0 for a successful response, -EINTR + * if the query is canceled, -ETIMEDOUT is the query timed out, or + * -EIO if an error occurred sending the query. The resp parameter of + * the callback is only valid if status is 0. + * + * If the return value of ib_sa_mcmember_rec_delete() is negative, it + * is an error code. Otherwise it is a query ID that can be used to + * cancel the query. 
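Since these set/delete wrappers are what IPoIB will actually use for multicast, a consumer-side join looks roughly like this (again just a sketch: my_join_complete(), the GID values and the timeout are placeholders, and join_state = 1 asks for full membership). Leaving the group is the same call shape through ib_sa_mcmember_rec_delete(), defined just below.

        struct ib_sa_mcmember_rec rec = {
                .mgid       = mcast_gid,
                .port_gid   = local_port_gid,
                .join_state = 1,
        };
        struct ib_sa_query *query;
        int id;

        id = ib_sa_mcmember_rec_set(device, port_num, &rec,
                                    IB_SA_MCMEMBER_REC_MGID |
                                    IB_SA_MCMEMBER_REC_PORT_GID |
                                    IB_SA_MCMEMBER_REC_JOIN_STATE,
                                    1000, GFP_KERNEL,
                                    my_join_complete, NULL, &query);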
+ */ +static inline int +ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + + +#endif /* IB_SA_H */ From roland at topspin.com Mon Dec 27 21:51:02 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:02 -0800 Subject: [openib-general] [PATCH][v5][9/24] Add Mellanox HCA low-level driver In-Reply-To: <200412272151.h64HpFQTg9SpyhAM@topspin.com> Message-ID: <200412272151.HKpCNDFOMg5KO6kB@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) is present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API) Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-27 21:48:18.185289416 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-27 21:48:21.258837002 -0800 @@ -7,4 +7,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +source "drivers/infiniband/hw/mthca/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-27 21:48:18.216284854 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-27 21:48:21.219842741 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-12-27 21:48:21.318828171 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver produce a bunch of debug + messages. Select this is you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions. 
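One note on the DEBUG option above before the Makefile: as far as I can tell it only adds -DDEBUG (see the ifdef in the Makefile below), which turns the dev_dbg()-backed mthca_dbg() macro from mthca_dev.h into real output instead of a no-op. A hypothetical call site, just to show the shape (the message text is made up):

        /* compiled away unless CONFIG_INFINIBAND_MTHCA_DEBUG=y */
        mthca_dbg(mdev, "mapped HCR at %p\n", mdev->hcr);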
--- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-12-27 21:48:21.366821107 -0800 @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ + mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ + mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ + mthca_provider.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-12-27 21:48:21.428811982 -0800 @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_allocator.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. 
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-12-27 21:48:21.473805359 -0800 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
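Stepping back to the bitmap allocator at the top of mthca_allocator.c for a second, since the mask/top interplay is easy to misread: objects come out of a bitmap of num entries, but the value returned has a rotating offset OR-ed into it (bounded by mask), so a number that was just freed is unlikely to be handed straight back out. A rough usage sketch with arbitrary sizes (not from the patch):

        struct mthca_alloc alloc;
        u32 obj;

        /* 256-entry bitmap, numbers 0..15 reserved; returned numbers
           cycle through a 4096-value space */
        if (mthca_alloc_init(&alloc, 256, (1 << 12) - 1, 16))
                return -ENOMEM;

        obj = mthca_alloc(&alloc);
        if (obj == -1)
                return -ENOMEM;         /* bitmap exhausted */

        /* ... use obj as a PD/QP/CQ number ... */

        mthca_free(&alloc, obj);
        mthca_alloc_cleanup(&alloc);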
+ * + * $Id: mthca_config_reg.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-12-27 21:48:21.522798147 -0800 @@ -0,0 +1,391 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_dev.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; 
+ struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) \ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct 
mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-12-27 21:48:21.567791525 -0800 @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
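Since the doorbell header below provides three different mthca_write64() implementations, it may be worth spelling out the one calling convention they all share before the code: the caller builds the two big-endian words and passes whatever MTHCA_GET_DOORBELL_LOCK() yields -- NULL on 64-bit and SSE builds (where the store is a single atomic 64-bit write), or the real spinlock in the fallback case. A sketch of a send doorbell ring (the two word values are placeholders):

        u32 doorbell[2];

        doorbell[0] = cpu_to_be32(first_word);    /* e.g. opcode/WQE address bits */
        doorbell[1] = cpu_to_be32(second_word);   /* e.g. QPN/size bits */

        mthca_write64(doorbell,
                      dev->kar + MTHCA_SEND_DOORBELL,
                      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));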
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_doorbell.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-12-27 21:48:21.623783283 -0800 @@ -0,0 +1,936 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_main.c 1396 2004-12-28 04:10:27Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) +{ + int err; + u8 status; + + err = mthca_QUERY_DEV_LIM(mdev, dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + if (dev_lim->min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + 
"kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim->min_page_sz, PAGE_SIZE); + return -ENODEV; + } + if (dev_lim->num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim->num_ports, MTHCA_MAX_PORTS); + return -ENODEV; + } + + mdev->limits.num_ports = dev_lim->num_ports; + mdev->limits.vl_cap = dev_lim->max_vl; + mdev->limits.mtu_cap = dev_lim->max_mtu; + mdev->limits.gid_table_len = dev_lim->max_gids; + mdev->limits.pkey_table_len = dev_lim->max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; + mdev->limits.max_sg = dev_lim->max_sg; + mdev->limits.reserved_qps = dev_lim->reserved_qps; + mdev->limits.reserved_srqs = dev_lim->reserved_srqs; + mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + mdev->limits.reserved_cqs = dev_lim->reserved_cqs; + mdev->limits.reserved_eqs = dev_lim->reserved_eqs; + mdev->limits.reserved_mtts = dev_lim->reserved_mtts; + mdev->limits.reserved_mrws = dev_lim->reserved_mrws; + mdev->limits.reserved_uars = dev_lim->reserved_uars; + mdev->limits.reserved_pds = dev_lim->reserved_pds; + + if (dev_lim->flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_ent, num_sg, fw_pages, cur_order; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, 
aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + fw_pages = mdev->fw.arbel.fw_pages; + num_ent = 0; + + /* + * We allocate in as big chunks as we can, up to a maximum of + * 256 KB per chunk. + */ + cur_order = get_order(1 << 18); + + while (fw_pages > 0) { + while (1 << cur_order > fw_pages) + --cur_order; + + /* + * We allocate with GFP_HIGHUSER because only the + * firmware is going to touch these pages, so there's + * no need for a kernel virtual address. We use + * __GFP_NOWARN because we'll deal with any allocation + * failures ourselves. + */ + mdev->fw.arbel.mem[num_ent].page = alloc_pages(GFP_HIGHUSER | __GFP_NOWARN, + cur_order); + mdev->fw.arbel.mem[num_ent].length = PAGE_SIZE << cur_order; + if (!mdev->fw.arbel.mem[num_ent].page) { + --cur_order; + if (cur_order < 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } else { + ++num_ent; + fw_pages -= 1 << cur_order; + } + } + + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, num_ent, + PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + struct mthca_dev_lim dev_lim; + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + err = -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev 
*mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. 
+ */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar2: + pci_release_region(pdev, 2); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. 
We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
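As an aside on the probe sequence above: the DMA-mask setup follows the usual try-64-bit-then-fall-back-to-32-bit pattern, using the same 2.6-era pci_set_dma_mask()/pci_set_consistent_dma_mask() calls the patch itself uses. A stand-alone sketch of just that pattern -- the helper name example_set_dma_masks is illustrative only, not part of the patch -- would be:

static int example_set_dma_masks(struct pci_dev *pdev)
{
        int err;

        err = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
        if (err) {
                /* no 64-bit streaming DMA; settle for 32 bits */
                err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
                if (err)
                        return err;
        }

        err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK);
        if (err)
                /* likewise for coherent (descriptor) memory */
                err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK);

        return err;
}

The probe path then continues below with the HCA reset that the preceding comment motivates.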
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); From roland at topspin.com Mon Dec 27 21:51:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:04 -0800 Subject: [openib-general] [PATCH][v5][10/24] Add Mellanox HCA low-level driver (midlayer interface) In-Reply-To: <200412272151.HKpCNDFOMg5KO6kB@topspin.com> Message-ID: <200412272151.5KRaOFYNt0hYYQgh@topspin.com> Add midlayer interface code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-12-27 21:48:22.043721469 -0800 @@ -0,0 +1,627 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_provider.c 1397 2004-12-28 05:09:00Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 36)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 30)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 32)); + memcpy(&props->sys_image_guid, out_mad->data + 4, 8); + memcpy(&props->node_guid, out_mad->data + 12, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 16)); + props->lmc = out_mad->data[34] & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 18)); + props->sm_sl = out_mad->data[36] & 0xf; + props->state = out_mad->data[32] & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 20)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 48)); + props->active_width = out_mad->data[31] & 0xf; + props->active_speed = out_mad->data[35] >> 4; + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = 
IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) out_mad->data)[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 8, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), 
GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent <= entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent - 1; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = &dev->pdev->dev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-12-27 21:48:22.091714405 -0800 @@ -0,0 +1,225 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_provider.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_PROVIDER_H +#define MTHCA_PROVIDER_H + +#include +#include + +#define MTHCA_MPT_FLAG_ATOMIC (1 << 14) +#define MTHCA_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define MTHCA_MPT_FLAG_REMOTE_READ (1 << 12) +#define MTHCA_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define MTHCA_MPT_FLAG_LOCAL_READ (1 << 10) + +struct mthca_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct mthca_mr { + struct ib_mr ibmr; + int order; + u32 first_seg; +}; + +struct mthca_pd { + struct ib_pd ibpd; + u32 pd_num; + atomic_t sqp_count; + struct mthca_mr ntmr; +}; + +struct mthca_eq { + struct mthca_dev *dev; + int eqn; + u32 ecr_mask; + u16 msi_x_vector; + u16 msi_x_entry; + int have_irq; + int nent; + int cons_index; + struct mthca_buf_list *page_list; + struct mthca_mr mr; +}; + +struct mthca_av; + +struct mthca_ah { + struct ib_ah ibah; + int on_hca; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct mthca_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct mthca_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. + * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibilty to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized. 
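Concretely, the completion-event access pattern and the destroy-and-wait sequence described above could be sketched as follows. The table type and the lookup/remove helpers (example_cq_table, example_cq_lookup, example_cq_remove) are assumptions made up for this illustration; the lock, refcount and wait fields are the ones declared in struct mthca_cq further down, and the irqsave locking variants are omitted for brevity:

static struct mthca_cq *example_cq_get(struct example_cq_table *table, u32 cqn)
{
        struct mthca_cq *cq;

        spin_lock(&table->lock);
        cq = example_cq_lookup(table, cqn);     /* assumed lookup helper */
        if (cq)
                atomic_inc(&cq->refcount);      /* hold it across the event */
        spin_unlock(&table->lock);

        return cq;
}

static void example_cq_put(struct mthca_cq *cq)
{
        if (atomic_dec_and_test(&cq->refcount))
                wake_up(&cq->wait);             /* last reference gone: wake any destroyer */
}

static void example_cq_destroy(struct example_cq_table *table, struct mthca_cq *cq)
{
        spin_lock(&table->lock);
        example_cq_remove(table, cq);           /* assumed removal helper */
        spin_unlock(&table->lock);

        example_cq_put(cq);                     /* drop the table's reference */
        wait_event(cq->wait, !atomic_read(&cq->refcount));
}

An event handler would pair example_cq_get() with its own lock/unlock of the cq and a final example_cq_put() once it is done polling.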
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ From roland at topspin.com Mon Dec 27 21:51:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:05 -0800 Subject: [openib-general] [PATCH][v5][11/24] Add Mellanox HCA low-level driver (FW commands) In-Reply-To: <200412272151.5KRaOFYNt0hYYQgh@topspin.com> Message-ID: <200412272151.6amOd1o39RpEe1KK@topspin.com> Add firmware command processing code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-12-27 21:48:22.369673490 -0800 @@ -0,0 +1,1573 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_cmd.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
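For concreteness: with HZ = 1000 (a common setting on 2.6 kernels), the disabled PRM-style values below evaluate to 2, 11 and 101 jiffies for classes A, B and C -- roughly 2 ms, 11 ms and 100 ms -- whereas the values actually compiled in, 60 * HZ, are 60,000 jiffies, i.e. a full minute for every class.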
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* __raw_writel may not order writes. */ + wmb(); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = be32_to_cpu(__raw_readl(dev->hcr + HCR_STATUS_OFFSET)) >> 24; + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
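Note the two-level error convention shared by mthca_cmd(), mthca_cmd_box() and mthca_cmd_imm() (defined next): the return value reports a Linux-level failure (couldn't post the command, timed out, interrupted), while *status carries the status byte the firmware itself returned, and callers check both. A purely illustrative caller -- example_ping_fw is not part of the patch, its operands are arbitrary, and it would have to live in this file since the helpers are static -- might look like:

static int example_ping_fw(struct mthca_dev *dev)
{
        u8 status;
        int err;

        /* CMD_NOP and CMD_TIME_CLASS_A come from the definitions above;
         * the zero operands are illustrative only. */
        err = mthca_cmd(dev, 0, 0, 0, CMD_NOP, CMD_TIME_CLASS_A, &status);
        if (err)
                return err;       /* command never completed at the Linux level */
        if (status)
                return -EINVAL;   /* firmware executed it but reported an error */

        return 0;
}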
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more signifant bits than minor + * version, so swap here. + */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + /* + * Arbel page size is always 4 KB; round up number of + * system pages needed. 
+ */ + dev->fw.arbel.fw_pages = + (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> + (PAGE_SHIFT - 12); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_RSZ_SRQ_OFFSET 0x33 +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_SG_RQ_OFFSET 0x55 +#define QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET 0x56 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e +#define QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET 0x90 +#define QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET 0x92 +#define QUERY_DEV_LIM_PBL_SZ_OFFSET 0x96 +#define QUERY_DEV_LIM_BMME_FLAGS_OFFSET 0x97 +#define QUERY_DEV_LIM_RSVD_LKEY_OFFSET 0x98 +#define QUERY_DEV_LIM_LAMR_OFFSET 0x9f +#define QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET 0xa0 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, 
outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + 
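/*
 * Note on the MTHCA_GET() accessors used throughout this function: the
 * macro itself is defined elsewhere in the driver and is not part of
 * this hunk.  Judging from its use on u8/u16/u32/u64 destinations at
 * byte offsets into the mailbox, it presumably copies a big-endian
 * field of sizeof(dest) bytes out of the command outbox.  A minimal
 * sketch of that idea (hypothetical GET_FIELD name, not the driver's
 * real macro) would be:
 *
 *	#define GET_FIELD(dest, source, offset)                     \
 *		do {                                                \
 *			void *__p = (char *) (source) + (offset);   \
 *			switch (sizeof (dest)) {                    \
 *			case 1: (dest) = *(u8 *) __p;        break; \
 *			case 2: (dest) = be16_to_cpup(__p);  break; \
 *			case 4: (dest) = be32_to_cpup(__p);  break; \
 *			case 8: (dest) = be64_to_cpup(__p);  break; \
 *			}                                           \
 *		} while (0)
 */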
MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); + dev_lim->hca.arbel.resize_srq = field & 1; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mtt_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mpt_entry_sz = size; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); + dev_lim->hca.arbel.max_pbl_sz = 1 << (field & 0x3f); + MTHCA_GET(dev_lim->hca.arbel.bmme_flags, outbox, + QUERY_DEV_LIM_BMME_FLAGS_OFFSET); + MTHCA_GET(dev_lim->hca.arbel.reserved_lkey, outbox, + QUERY_DEV_LIM_RSVD_LKEY_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_LAMR_OFFSET); + dev_lim->hca.arbel.lam_required = field & 1; + MTHCA_GET(dev_lim->hca.arbel.max_icm_sz, outbox, + QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET); + + if (dev_lim->hca.arbel.bmme_flags & 1) + mthca_dbg(dev, "Base MM extensions: yes " + "(flags %d, max PBL %d, rsvd L_Key %08x)\n", + dev_lim->hca.arbel.bmme_flags, + dev_lim->hca.arbel.max_pbl_sz, + dev_lim->hca.arbel.reserved_lkey); + else + mthca_dbg(dev, "Base MM extensions: no\n"); + + mthca_dbg(dev, "Max ICM size %lld MB\n", + (unsigned long long) dev_lim->hca.arbel.max_icm_sz >> 20); + } else { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); + } + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 
+#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + 
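/*
 * The MTHCA_PUT() stores in this function pack the INIT_HCA mailbox:
 * a base address and log2 size for each context table (QPC, EEC, SRQC,
 * CQC, EQC, RDB), followed by the multicast, TPT and UAR parameters.
 * A caller -- presumably the device-init path, which is not part of
 * this hunk -- would fill the parameter block and issue the command
 * roughly as follows (illustrative sketch only, names and values are
 * made up):
 *
 *	struct mthca_init_hca_param param;
 *	u8 status;
 *	int err;
 *
 *	memset(&param, 0, sizeof param);
 *	param.qpc_base    = qpc_base;      // table base chosen by the caller
 *	param.log_num_qps = log_num_qps;   // log2 of the QP table size
 *	// ... the EEC/SRQC/CQC/EQC/MC/TPT/UAR fields are filled the same way
 *
 *	err = mthca_INIT_HCA(mdev, &param, &status);
 *	if (err || status)
 *		goto err_out;
 */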
MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
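/*
 * MGID_HASH returns its result as an immediate output (the out_is_imm
 * path through mthca_cmd_imm() above); the u64 immediate is truncated
 * to the 16-bit hash here.  The multicast group code -- not part of
 * this hunk -- would presumably combine the three MGM commands along
 * these lines (illustrative sketch only):
 *
 *	u16 hash;
 *	u8 status;
 *	int err;
 *
 *	err = mthca_MGID_HASH(dev, gid, &hash, &status);
 *	if (err || status)
 *		return err ? err : -EINVAL;
 *
 *	err = mthca_READ_MGM(dev, hash, mgm, &status);
 *	// ... add or remove a QP in the MGM entry, then write it back:
 *	err = mthca_WRITE_MGM(dev, hash, mgm, &status);
 */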
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-12-27 21:48:22.408667751 -0800 @@ -0,0 +1,276 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_cmd.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocaterd resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, + 
MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + +struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; + union { + struct { + int max_avs; + } tavor; + struct { + int resize_srq; + int mtt_entry_sz; + int mpt_entry_sz; + int max_pbl_sz; + u8 bmme_flags; + u32 reserved_lkey; + int lam_required; + u64 max_icm_sz; + } arbel; + } hca; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, u16 token, + u8 status, u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + 
int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) (x), MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ From roland at topspin.com Mon Dec 27 21:51:08 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:08 -0800 Subject: [openib-general] [PATCH][v5][12/24] Add Mellanox HCA low-level driver (EQ) In-Reply-To: <200412272151.6amOd1o39RpEe1KK@topspin.com> Message-ID: <200412272151.abmqZdwtcCCwm2QW@topspin.com> Add event queue code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-12-27 21:48:22.766615062 -0800 @@ -0,0 +1,690 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_eq.c 1382 2004-12-24 02:21:02Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. + */ +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 cqn; + u32 reserved1; + u8 reserved2[3]; + u8 syndrome; + } __attribute__((packed)) cq_err; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + + while (next_eqe_sw(eq)) { + int set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + /* + * cmd_event() may add more commands. + * The card will think the queue has overflowed if + * we don't tell it we've been processing events. 
+ */ + set_ci = 1; + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + mthca_warn(dev, "CQ %s on CQN %08x\n", + eqe->event.cq_err.syndrome == 1 ? + "overrun" : "access violation", + be32_to_cpu(eqe->event.cq_err.cqn)); + break; + + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + mthca_warn(dev, "EQ overrun on EQN %d\n", eq->eqn); + break; + + case MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on EQ %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (set_ci) { + wmb(); /* see comment below */ + set_eq_ci(dev, eq->eqn, eq->cons_index); + set_ci = 0; + } + } + + /* + * This barrier makes sure that all updates to + * ownership bits done by set_eqe_hw() hit memory + * before the consumer index is updated. set_eq_ci() + * allows the HCA to possibly write more EQ entries, + * and we want to avoid the exceedingly unlikely + * possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. + */ + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + /* Make sure EQ size is aligned to a power of 2 size. 
*/ + for (i = 1; i < nent; i <<= 1) + ; /* nothing */ + nent = i; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); 
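/*
 * Teardown order in this function: HW2SW_EQ first hands ownership of
 * the EQ context back to software (so the HCA stops writing events
 * into the buffer), then the MR covering the EQ pages is freed, and
 * only then are the DMA pages themselves unmapped and released.
 */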
+ + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} From roland at topspin.com Mon 
Dec 27 21:51:08 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:08 -0800 Subject: [openib-general] [PATCH][v5][13/24] Add Mellanox HCA low-level driver (initialization) In-Reply-To: <200412272151.abmqZdwtcCCwm2QW@topspin.com> Message-ID: <200412272151.MaHN86hIK0Tt84ws@topspin.com> Add device initialization code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-12-27 21:48:23.120562962 -0800 @@ -0,0 +1,226 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_profile.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-12-27 21:48:23.154557958 -0800 @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_profile.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-12-27 21:48:23.199551335 -0800 @@ -0,0 +1,232 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_reset.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. */ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? 
bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. + */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} From roland at topspin.com Mon Dec 27 21:51:10 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:10 -0800 Subject: [openib-general] [PATCH][v5][14/24] Add Mellanox HCA low-level driver (QP/CQ) In-Reply-To: <200412272151.MaHN86hIK0Tt84ws@topspin.com> Message-ID: <200412272151.qdu1iD71iJs65qnW@topspin.com> Add CQ (completion queue) and QP (queue pair) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-12-27 21:48:23.509505711 -0800 @@ -0,0 +1,836 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_cq.c 1369 2004-12-20 16:17:07Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. + */ +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +}; + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +}; + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & + get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + 
atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & cq->ibcq.cqe); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & cq->ibcq.cqe); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + cq->ibcq.cqe), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & cq->ibcq.cqe; + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + if (*freed) { + wmb(); + inc_cons_index(dev, cq, *freed); + *freed = 0; + } + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? 
"Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? + IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & cq->ibcq.cqe; + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (freed) { + wmb(); + inc_cons_index(dev, cq, freed); + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? 
+ MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + 
spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < ((cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-12-27 21:48:23.540501149 -0800 @@ -0,0 +1,1536 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_qp.c 1355 2004-12-17 15:23:43Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 
4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* immediate data */ +}; + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +}; + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +}; + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +}; + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +}; + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +}; + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +}; + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int 
to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = 
(IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + 
MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + 
qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, "modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) + qp->state = new_state; + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. 
+ * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + 
atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + 
pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + switch (qp->transport) { + case RC: + switch (wr->opcode) { + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.atomic.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.atomic.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + + wqe += sizeof (struct mthca_raddr_seg); + + if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP) { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.swap); + ((struct mthca_atomic_seg *) wqe)->compare = + cpu_to_be64(wr->wr.atomic.compare_add); + } else { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.compare_add); + ((struct mthca_atomic_seg *) wqe)->compare = 0; + } + + wqe += sizeof (struct mthca_atomic_seg); + size += sizeof (struct mthca_raddr_seg) / 16 + + sizeof (struct mthca_atomic_seg); + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + case IB_WR_RDMA_READ: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.rdma.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.rdma.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + wqe += sizeof (struct mthca_raddr_seg); + size += sizeof (struct mthca_raddr_seg) / 16; + break; + + default: + /* No extra segments required for sends */ + break; + } + + case UD: + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + break; + + case MLX: + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + break; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} From roland at topspin.com Mon Dec 27 21:51:12 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:12 -0800 Subject: [openib-general] [PATCH][v5][15/24] Add Mellanox HCA low-level driver (last bits) In-Reply-To: <200412272151.qdu1iD71iJs65qnW@topspin.com> Message-ID: <200412272151.KJdDvki21LuSsIo9@topspin.com> Add code for remaining InfiniBand objects (address vectors, multicast groups, memory regions and protection domains) Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-12-27 21:48:23.889449784 -0800 @@ -0,0 +1,219 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_av.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +}; + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } else { + /* Arbel workaround -- low byte of GID must be 2 */ + av->dgid[3] = cpu_to_be32(2); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + 
MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if (!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-12-27 21:48:23.936442867 -0800 @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mcg.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +}; + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
+ */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + return -EINVAL; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + 
struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-12-27 21:48:23.964438746 -0800 @@ -0,0 +1,396 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mr.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* + * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits. + */ +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + 
MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + 
} + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-12-27 21:48:23.990434920 -0800 @@ -0,0 +1,80 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_pd.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} From roland at topspin.com Mon Dec 27 21:51:13 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:13 -0800 Subject: [openib-general] [PATCH][v5][16/24] Add Mellanox HCA low-level driver (MAD) In-Reply-To: <200412272151.KJdDvki21LuSsIo9@topspin.com> Message-ID: <200412272151.ZZGB77m9dy61uyrb@topspin.com> Add MAD (management datagram) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-12-27 21:48:24.331384733 -0800 @@ -0,0 +1,320 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mad.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? 
IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = dma_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + DMA_TO_DEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). + */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. 
+ */ + if (in_mad->mad_hdr.attr_id == IB_SMP_ATTR_SM_INFO || + ((in_mad->mad_hdr.attr_id & IB_SMP_ATTR_VENDOR_MASK) == + IB_SMP_ATTR_VENDOR_MASK)) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} From roland at topspin.com Mon Dec 27 21:51:14 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:14 -0800 Subject: [openib-general] [PATCH][v5][17/24] IPoIB IPv4 multicast In-Reply-To: <200412272151.ZZGB77m9dy61uyrb@topspin.com> Message-ID: <200412272151.qKxQwjb8RXkS5kEQ@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-12-27 21:48:24.639339403 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-12-27 21:47:47.982735072 -0800 +++ linux-bk/include/net/ip.h 2004-12-27 21:48:24.639339403 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. + */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-12-27 21:47:52.507069119 -0800 +++ linux-bk/net/ipv4/arp.c 2004-12-27 21:48:24.640339256 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Dec 27 21:51:15 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:15 -0800 Subject: [openib-general] [PATCH][v5][18/24] IPoIB IPv6 support In-Reply-To: <200412272151.qKxQwjb8RXkS5kEQ@topspin.com> Message-ID: <200412272151.6vXZnyn99yKN3sSD@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. 
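For a quick check of the byte layout, here is a standalone user-space sketch (not part of the patch; the helper name sketch_ipv6_ib_mc_map and the sample group ff02::1 are only illustrative) that mirrors the ipv6_ib_mc_map() hunk below. The ip_ib_mc_map() helper in the previous patch produces the same layout, with signature byte 0x40 in place of 0x60 and the low 28 bits of the IPv4 group in the last four bytes.

/*
 * Standalone sketch, not part of the patch: reproduces the layout of
 * ipv6_ib_mc_map() so the resulting 20-byte IPoIB hardware address can
 * be printed and inspected in user space.
 */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define INFINIBAND_ALEN 20

static void sketch_ipv6_ib_mc_map(const struct in6_addr *addr, unsigned char *buf)
{
	buf[0] = 0;		/* Reserved */
	buf[1] = 0xff;		/* Multicast QPN */
	buf[2] = 0xff;
	buf[3] = 0xff;
	buf[4] = 0xff;
	buf[5] = 0x12;		/* link local scope */
	buf[6] = 0x60;		/* IPv6 signature */
	buf[7] = 0x1b;
	buf[8] = 0;		/* P_Key, filled in by the driver */
	buf[9] = 0;
	memcpy(buf + 10, addr->s6_addr + 6, 10);
}

int main(void)
{
	struct in6_addr mc;
	unsigned char buf[INFINIBAND_ALEN];
	int i;

	inet_pton(AF_INET6, "ff02::1", &mc);	/* all-nodes group, just a sample */
	sketch_ipv6_ib_mc_map(&mc, buf);
	for (i = 0; i < INFINIBAND_ALEN; ++i)
		printf("%02x%c", buf[i], i < INFINIBAND_ALEN - 1 ? ':' : '\n');
	/* prints 00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:00:00:00:01 */
	return 0;
}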
The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-12-27 21:47:59.669014924 -0800 +++ linux-bk/include/net/if_inet6.h 2004-12-27 21:48:24.976289805 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-12-27 21:47:59.159089982 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-12-27 21:48:24.978289511 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1095,6 +1096,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1794,7 +1801,8 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && - (dev->type != ARPHRD_ARCNET)) { + (dev->type != ARPHRD_ARCNET) && + (dev->type != ARPHRD_INFINIBAND)) { /* Alas, we support only Ethernet autoconfiguration. */ return; } --- linux-bk.orig/net/ipv6/ndisc.c 2004-12-27 21:47:44.031316692 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-12-27 21:48:24.979289364 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Dec 27 21:51:15 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:15 -0800 Subject: [openib-general] [PATCH][v5][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200412272151.6vXZnyn99yKN3sSD@topspin.com> Message-ID: <200412272151.XKAQkznglOHW39xJ@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. 
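To make the hardware-address handling above a bit more concrete, the following is a small user-space sketch (not part of the patch; the sample address is simply the driver's IPv4 broadcast mapping) of how the INFINIBAND_ALEN-byte address that ARP/ND hands back splits into the pieces the driver uses: the first four bytes carry the remote QPN, and the remaining sixteen carry the destination GID (used for the path record lookup on unicasts, or as the multicast GID otherwise).

/*
 * Standalone sketch, not part of the patch: decomposes a 20-byte IPoIB
 * hardware address the same way the driver does, i.e. a big-endian QPN
 * in bytes 0-3 and the 16-byte GID in bytes 4-19.
 */
#include <stdio.h>
#include <stdint.h>

#define INFINIBAND_ALEN 20

int main(void)
{
	/* IPv4 broadcast mapping, as used for dev->broadcast in the driver */
	static const unsigned char ha[INFINIBAND_ALEN] = {
		0x00, 0xff, 0xff, 0xff,
		0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff
	};
	uint32_t qpn;
	int i;

	/* same access as be32_to_cpup((__be32 *) neighbour->ha) in the driver */
	qpn = ((uint32_t) ha[0] << 24) | ((uint32_t) ha[1] << 16) |
	      ((uint32_t) ha[2] << 8)  |  (uint32_t) ha[3];

	printf("QPN 0x%06x, GID ", qpn & 0xffffff);
	for (i = 4; i < INFINIBAND_ALEN; ++i)	/* GID starts at ha + 4 */
		printf("%02x%s", ha[i], i == INFINIBAND_ALEN - 1 ? "\n" : "");
	return 0;
}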
The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-27 21:48:21.258837002 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-27 21:48:25.377230788 -0800 @@ -9,4 +9,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-27 21:48:21.219842741 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-27 21:48:25.347235203 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-12-27 21:48:25.454219455 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. + +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. The output can be turned on via the + data_debug_level module parameter; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-12-27 21:48:25.420224459 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-12-27 21:48:25.497213127 -0800 @@ -0,0 +1,350 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib.h 1358 2004-12-17 22:00:11Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_OPER_UP = 0, + IPOIB_FLAG_ADMIN_UP = 1, + IPOIB_PKEY_ASSIGNED = 2, + IPOIB_PKEY_STOP = 3, + IPOIB_FLAG_SUBINTERFACE = 4, + IPOIB_MCAST_RUN = 5, + IPOIB_STOP_REAPER = 6, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). 
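+ * (For example, neigh_add_path() and unicast_arp_send() in
+ * ipoib_main.c take lock while already running inside tx_lock
+ * from ipoib_start_xmit().)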
+ */ +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct rb_root path_tree; + struct list_head path_list; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned int mcast_mtu; + + struct ipoib_buf *rx_ring; + + spinlock_t tx_lock; + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct neighbour *neighbour; + + struct list_head list; +}; + +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) +{ + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +void ipoib_flush_paths(struct net_device *dev); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct 
ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (data_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-12-27 21:48:25.549205474 -0800 @@ -0,0 +1,287 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib_fs.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + 
return 0; +} + +static struct file_operations ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return __ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-12-27 21:48:25.597198409 -0800 @@ -0,0 +1,678 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib_ib.c 1386 2004-12-27 16:23:17Z roland $ + */ + +#include +#include + +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +int data_debug_level; + +module_param(data_debug_level, int, 0644); +MODULE_PARM_DESC(data_debug_level, + "Enable data path debug tracing if > 0"); +#endif + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, ¶m, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + + 
ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (netif_queue_stopped(dev) && + priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct 
ipoib_dev_priv *priv, + unsigned int wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, ¶m, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len))) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dma_unmap_single(priv->ca->dma_device, addr, skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct 
ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. + */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int pending = 0; + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ++pending; + + return pending; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + unsigned long begin; + struct ipoib_buf *tx_req; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + begin = jiffies; + + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) { + if (time_after(jiffies, begin + 5 * HZ)) { + ipoib_warn(priv, "timing out; %d sends %d receives not completed\n", + priv->tx_head - priv->tx_tail, recvs_pending(dev)); + + /* + * assume the HW is wedged and just free up + * all our pending work requests. 
+ */ + while (priv->tx_tail < priv->tx_head) { + tx_req = &priv->tx_ring[priv->tx_tail & + (IPOIB_TX_RING_SIZE - 1)]; + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(tx_req->skb); + ++priv->tx_tail; + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) { + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[i], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + dev_kfree_skb_any(priv->rx_ring[i].skb); + priv->rx_ring[i].skb = NULL; + } + + goto timeout; + } + + yield(); + } + + ipoib_dbg(priv, "All sends and receives done.\n"); + +timeout: + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + + begin = jiffies; + + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + + if (time_after(jiffies, begin + HZ)) { + ipoib_warn(priv, "timing out; will leak address handles\n"); + break; + } + + yield(); + } + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assigment Interim Support + * + * The following is initial implementation of delayed P_Key assigment + * mechanism. It is using the same approach implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. 
+ */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assigment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-12-27 21:48:25.628193847 -0800 @@ -0,0 +1,1079 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_main.c 1377 2004-12-23 19:57:12Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(gid->raw, path->pathrec.dgid.raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, 
tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + LIST_HEAD(remove_list); + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + list_splice(&priv->path_list, &remove_list); + INIT_LIST_HEAD(&priv->path_list); + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &remove_list, list) { + if (path->query) + ib_sa_cancel_query(path->query_id, path->query); + wait_for_completion(&path->done); + __path_free(dev, path); + } +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct net_device *dev = path->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; + struct sk_buff *skb; + unsigned long flags; + + if (pathrec) + ipoib_dbg(priv, "PathRec LID 0x%04x for GID " IPOIB_GID_FMT "\n", + be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + else + ipoib_dbg(priv, "PathRec status %d for GID " IPOIB_GID_FMT "\n", + status, IPOIB_GID_ARG(path->pathrec.dgid)); + + skb_queue_head_init(&skqueue); + + if (!status) { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the path member record, compare it + * with the rate of our local port (calculated from + * the active link speed and link width) and set an + * inter-packet delay appropriately. 
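+ * For example, a peer reached over a 1X (2.5 Gb/s) path from a
+ * local 4X (10 Gb/s) port would need enough inter-packet delay to
+ * hold the sender to roughly a quarter of the local line rate.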
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .static_rate = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(dev, priv->pd, &av); + } + + spin_lock_irqsave(&priv->lock, flags); + + path->ah = ah; + + if (ah) { + path->pathrec = *pathrec; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + while ((skb = __skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } +} + +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + path = kmalloc(sizeof *path, GFP_ATOMIC); + if (!path) + return NULL; + + path->dev = dev; + path->pathrec.dlid = 0; + + skb_queue_head_init(&path->queue); + + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); + + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + + __path_add(dev, path); + + return path; +} + +static int path_rec_start(struct net_device *dev, + struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(path->pathrec.dgid)); + + path->query_id = + ib_sa_path_rec_get(priv->ca, priv->port, + &path->pathrec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &path->query); + if (path->query_id < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + path->query = NULL; + return path->query_id; + } + + return 0; +} + +static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + return; + } + + skb_queue_head_init(&neigh->queue); + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. 
+ */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) + goto err; + } + + list_add_tail(&neigh->list, &path->neigh_list); + + if (path->pathrec.dlid) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + } else { + neigh->ah = NULL; + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + __skb_queue_tail(&neigh->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + if (!path->query && path_rec_start(dev, path)) + goto err; + } + + spin_unlock(&priv->lock); + return; + +err: + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + list_del(&neigh->list); + kfree(neigh); + neigh->neighbour->ops->destructor = NULL; + + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + spin_unlock(&priv->lock); +} + +static void path_lookup(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) { + neigh_add_path(skb, dev); + return; + } + + /* Add in the P_Key for multicasts */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); +} + +static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); + return; + } + + if (path->pathrec.dlid) { + ipoib_dbg(priv, "Send unicast ARP to %04x\n", + be16_to_cpu(path->pathrec.dlid)); + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); + } else if ((path->query || !path_rec_start(dev, path)) && + skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + unsigned long flags; + + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + + /* + * Check if our queue is stopped. Since we have the LLTX bit + * set, we can't rely on netif_stop_queue() preventing our + * xmit function from being called with a full queue. 
+ */ + if (unlikely(netif_queue_stopped(dev))) { + spin_unlock_irqrestore(&priv->tx_lock, flags); + return NETDEV_TX_BUSY; + } + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + goto out; + } + + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } else { + /* unicast GID -- should be ARP reply */ + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + goto out; + } + + unicast_arp_send(skb, dev, phdr); + } + } + +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. + */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *n) +{ + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; + + ipoib_dbg(priv, + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); + + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? 
I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. + */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. 
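 * Concretely: IPOIB_ENCAP_LEN is the 4-byte encapsulation header and
 * INFINIBAND_ALEN the 20-byte hardware address (4-byte QPN field plus
 * 16-byte GID), so hard_header_len comes out to 24 bytes of headroom.
 * ipoib_hard_header() pushes those 20 address bytes onto the front of any
 * skb that has no neighbour, and ipoib_start_xmit() pulls them back off to
 * recover the destination QPN and GID.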
+ */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->path_list); + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_fs: + ipoib_unregister_debugfs(); + +err_wq: + destroy_workqueue(ipoib_workqueue); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-12-27 21:48:25.681186047 -0800 @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
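 * Note that it is the returned pkey_index, not the P_Key value itself,
 * that gets programmed into the QP.  The remainder of this function then
 * walks the UD QP through the usual INIT -> RTR -> RTS transitions: Q_Key,
 * port and P_Key index for INIT, nothing extra for RTR, and a starting
 * send PSN for RTS.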
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in a INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_reg_phys_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq, + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_qp_destroy failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE || + 
record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE) { + ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +} From roland at topspin.com Mon Dec 27 21:51:17 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:17 -0800 Subject: [openib-general] [PATCH][v5][20/24] Add IPoIB multicast & partition code In-Reply-To: <200412272151.XKAQkznglOHW39xJ@topspin.com> Message-ID: <200412272151.JEARwZ80axXxZD2Q@topspin.com> Add functions for handling IPoIB multicast and multiple partitions. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-12-27 21:48:27.157968669 -0800 @@ -0,0 +1,981 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_multicast.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the MC member record, compare it with + * the rate of our local port (calculated from the + * active link speed and link width) and set an + * inter-packet delay appropriately. 
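 * (Roughly, the delay would be local_rate / group_rate - 1 packet times:
 * a 4x SDR port at 10 Gb/s sending to a group advertising 2.5 Gb/s would
 * want an IPD of 3, while 0 simply means "no throttling".)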
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if (!skb->dst || !skb->dst->neighbour) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof (struct ipoib_pseudoheader)); + } + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + 
IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if (!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, 
priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. 
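 * (Hence the zero timeout and NULL completion callback passed to
 * ib_sa_mcmember_rec_delete() below -- the query is fire-and-forget.)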
+ */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. + */ + mcast = NULL; + } + +out: + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } + + spin_unlock(&priv->lock); +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + 
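	/*
	 * At this point every group we managed to reallocate (including the
	 * broadcast group) has been replaced in the rb-tree by a fresh,
	 * unattached copy, and the old entries sit on remove_list; the
	 * actual SA leave and free happen below, after the lock is dropped.
	 */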
spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark all of the entries that are found or don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct 
net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-12-27 21:48:27.219959544 -0800 @@ -0,0 +1,177 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_vlan.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. + */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Mon Dec 27 21:51:18 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:18 -0800 Subject: [openib-general] [PATCH][v5][21/24] Add InfiniBand userspace MAD support In-Reply-To: <200412272151.JEARwZ80axXxZD2Q@topspin.com> Message-ID: <200412272151.3Lde9MPbD7ODIUdu@topspin.com> Add a driver that provides a character special device for each InfiniBand port. 
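For a sense of how such a device gets used, here is a rough userspace sketch (an illustration only, not part of the patch). It assumes udev has created the node as /dev/umad0, that ib_user_mad.h has been copied somewhere the compiler can find it, and it only fills in the fields the driver below actually looks at; the management class chosen is just an example.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>		/* htons/htonl */
#include <sys/ioctl.h>
#include "ib_user_mad.h"	/* struct ib_user_mad{,_reg_req} + ioctls */

int main(void)
{
	struct ib_user_mad_reg_req req;
	struct ib_user_mad mad;
	int fd;

	fd = open("/dev/umad0", O_RDWR);
	if (fd < 0)
		return 1;

	/* Register a GSI agent, e.g. for the performance management class.
	 * method_mask is left all-zero: we only expect replies to our own
	 * sends, which come back matched by transaction ID. */
	memset(&req, 0, sizeof req);
	req.qpn                = 1;
	req.mgmt_class         = 0x04;
	req.mgmt_class_version = 1;
	if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, &req))
		return 1;
	/* the driver wrote our agent id back into req.id */

	memset(&mad, 0, sizeof mad);
	mad.id         = req.id;
	mad.qpn        = htonl(1);		/* remote GSI QP */
	mad.qkey       = htonl(0x80010000);	/* default GSI Q_Key */
	mad.lid        = htons(1);		/* destination LID */
	mad.timeout_ms = 1000;
	/* ... build the actual MAD payload in mad.data here ... */

	if (write(fd, &mad, sizeof mad) != sizeof mad)	/* post the send */
		return 1;
	if (read(fd, &mad, sizeof mad) != sizeof mad)	/* wait for a reply */
		return 1;

	printf("got response, status %d\n", mad.status);

	ioctl(fd, IB_USER_MAD_UNREGISTER_AGENT, &req.id);
	close(fd);
	return 0;
}
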
This device allows userspace to send and receive MADs via write() and read() (with some control operations implemented as ioctls). All operations are 32/64 clean and have been tested with 32-bit userspace running on a ppc64 kernel. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-27 21:48:20.847897490 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-27 21:48:27.528914067 -0800 @@ -1,6 +1,6 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_umad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -8,3 +8,5 @@ ib_mad-y := mad.o smi.o agent.o ib_sa-y := sa_query.o + +ib_umad-y := user_mad.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/user_mad.c 2004-12-27 21:48:27.576907002 -0800 @@ -0,0 +1,738 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: user_mad.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device class_dev; + struct ib_device *ib_dev; + struct ib_umad_device *umad_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct kref ref; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + spinlock_t recv_lock; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct rw_semaphore agent_mutex; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *send_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet = + (void *) (unsigned long) send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf.mad, sizeof packet->mad.data); + packet->mad.status = 0; + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + if (queue_packet(file, agent, packet)) + kfree(packet); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t 
count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + spin_lock_irq(&file->recv_lock); + + while (list_empty(&file->recv_list)) { + spin_unlock_irq(&file->recv_lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + spin_lock_irq(&file->recv_lock); + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + spin_unlock_irq(&file->recv_lock); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + u8 method; + u64 *tid; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + down_read(&file->agent_mutex); + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + /* + * If userspace is generating a request that will generate a + * response, we need to make sure the high-order part of the + * transaction ID matches the agent being used to send the + * MAD. 
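 * (The MAD layer demultiplexes incoming responses on those high 32 bits,
 * so stamping agent->hi_tid there -- while leaving userspace's low 32 bits
 * of the TID untouched -- is what gets the reply delivered back to this
 * agent's recv_handler, and from there onto this file's receive queue.)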
+ */ + method = ((struct ib_mad_hdr *) packet->mad.data)->method; + + if (!(method & IB_MGMT_METHOD_RESP) && + method != IB_MGMT_METHOD_TRAP_REPRESS && + method != IB_MGMT_METHOD_SEND) { + tid = &((struct ib_mad_hdr *) packet->mad.data)->tid; + *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpup(tid) & 0xffffffff)); + } + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + if (packet->mad.grh_present) { + ah_attr.ah_flags = IB_AH_GRH; + memcpy(ah_attr.grh.dgid.raw, packet->mad.gid, 16); + ah_attr.grh.flow_label = packet->mad.flow_label; + ah_attr.grh.hop_limit = packet->mad.hop_limit; + ah_attr.grh.traffic_class = packet->mad.traffic_class; + } + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = dma_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + DMA_TO_DEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + goto err_up; + } + + up_read(&file->agent_mutex); + + return sizeof packet->mad; + +err_up: + up_read(&file->agent_mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct ib_umad_file *file, unsigned long arg) +{ + struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + down_write(&file->agent_mutex); + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? 
IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + down_write(&file->agent_mutex); + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + spin_lock_init(&file->recv_lock); + init_rwsem(&file->agent_mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_dev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return print_dev_t(buf, port->dev.dev); +} +static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_release_dev(struct kref *ref) +{ + struct ib_umad_device 
*dev = + container_of(ref, struct ib_umad_device, ref); + + kfree(dev); +} + +static void ib_umad_release_port(struct class_device *class_dev) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + cdev_del(&port->dev); + clear_bit(port->devnum, dev_map); + kref_put(&port->umad_dev->ref, ib_umad_release_dev); +} + +static struct class umad_class = { + .name = "infiniband_mad", + .release = ib_umad_release_port +}; + +static ssize_t show_abi_version(struct class *class, char *buf) +{ + return sprintf(buf, "%d\n", IB_USER_MAD_ABI_VERSION); +} +static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + memset(umad_dev, 0, sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port)); + + kref_init(&umad_dev->ref); + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + umad_dev->port[i - s].umad_dev = umad_dev; + kref_get(&umad_dev->ref); + + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev.class = &umad_class; + umad_dev->port[i - s].class_dev.dev = device->dma_device; + snprintf(umad_dev->port[i - s].class_dev.class_id, + BUS_ID_SIZE, "umad%d", umad_dev->port[i - s].devnum); + if (class_device_register(&umad_dev->port[i - s].class_dev)) + goto err_class; + + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_dev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_port)) + goto err_class; + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + +err: + while (--i >= s) + class_device_unregister(&umad_dev->port[i - s].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) + class_device_unregister(&umad_dev->port[i].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static int __init ib_umad_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + ret = 
class_register(&umad_class); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create class infiniband_mad\n"); + goto out_chrdev; + } + + ret = class_create_file(&umad_class, &class_attr_abi_version); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create abi_version attribute\n"); + goto out_class; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + /* Our ioctls are 32/64 clean */ + ret = register_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT, NULL); + ret |= register_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT, NULL); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ioctl32 conversions\n"); + goto out_client; + } + + return 0; + +out_client: + ib_unregister_client(&umad_client); + +out_class: + class_unregister(&umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + unregister_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT); + unregister_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT); + ib_unregister_client(&umad_client); + class_unregister(&umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_user_mad.h 2004-12-27 21:48:27.631898908 -0800 @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_user_mad.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include +#include + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IB_USER_MAD_ABI_VERSION 2 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). 
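+ * In practice this means using only fixed-width __u8/__u16/__u32
+ * fields, laid out so that field offsets are the same whether the
+ * header is compiled for a 32-bit or a 64-bit ABI.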
+ */

+/**
+ * ib_user_mad - MAD packet
+ * @data - Contents of MAD
+ * @id - ID of agent MAD received with/to be sent with
+ * @status - 0 on successful receive, ETIMEDOUT if no response
+ *   received (transaction ID in data[] will be set to TID of original
+ *   request) (ignored on send)
+ * @timeout_ms - Milliseconds to wait for response (unset on receive)
+ * @qpn - Remote QP number received from/to be sent to
+ * @qkey - Remote Q_Key to be sent with (unset on receive)
+ * @lid - Remote lid received from/to be sent to
+ * @sl - Service level received with/to be sent with
+ * @path_bits - Local path bits received with/to be sent with
+ * @grh_present - If set, GRH was received/should be sent
+ * @gid_index - Local GID index to send with (unset on receive)
+ * @hop_limit - Hop limit in GRH
+ * @traffic_class - Traffic class in GRH
+ * @gid - Remote GID in GRH
+ * @flow_label - Flow label in GRH
+ *
+ * All multi-byte quantities are stored in network (big endian) byte order.
+ */
+struct ib_user_mad {
+        __u8    data[256];
+        __u32   id;
+        __u32   status;
+        __u32   timeout_ms;
+        __u32   qpn;
+        __u32   qkey;
+        __u16   lid;
+        __u8    sl;
+        __u8    path_bits;
+        __u8    grh_present;
+        __u8    gid_index;
+        __u8    hop_limit;
+        __u8    traffic_class;
+        __u8    gid[16];
+        __u32   flow_label;
+};
+
+/**
+ * ib_user_mad_reg_req - MAD registration request
+ * @id - Set by the kernel; used to identify agent in future requests.
+ * @qpn - Queue pair number; must be 0 or 1.
+ * @method_mask - The caller will receive unsolicited MADs for any method
+ *   where @method_mask = 1.
+ * @mgmt_class - Indicates which management class of MADs should be received
+ *   by the caller. This field is only required if the user wishes to
+ *   receive unsolicited MADs, otherwise it should be 0.
+ * @mgmt_class_version - Indicates which version of MADs for the given
+ *   management class to receive.
+ * @oui - Indicates IEEE OUI when mgmt_class is a vendor class
+ *   in the range from 0x30 to 0x4f. Otherwise not used.
+ */
+struct ib_user_mad_reg_req {
+        __u32   id;
+        __u32   method_mask[4];
+        __u8    qpn;
+        __u8    mgmt_class;
+        __u8    mgmt_class_version;
+        __u8    oui[3];
+};
+
+#define IB_IOCTL_MAGIC          0x1b
+
+#define IB_USER_MAD_REGISTER_AGENT      _IOWR(IB_IOCTL_MAGIC, 1, \
+                                              struct ib_user_mad_reg_req)
+
+#define IB_USER_MAD_UNREGISTER_AGENT    _IOW(IB_IOCTL_MAGIC, 2, __u32)
+
+#endif /* IB_USER_MAD_H */

From roland at topspin.com Mon Dec 27 21:51:19 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 27 Dec 2004 21:51:19 -0800
Subject: [openib-general] [PATCH][v5][22/24] Document InfiniBand ioctl use
In-Reply-To: <200412272151.3Lde9MPbD7ODIUdu@topspin.com>
Message-ID: <200412272151.zeKZJPoIEBr55elh@topspin.com>

Add the 0x1b ioctl magic number used by ib_umad module to
Documentation/ioctl-number.txt.
Signed-off-by: Roland Dreier

--- linux-bk.orig/Documentation/ioctl-number.txt 2004-12-27 21:47:59.407053483 -0800
+++ linux-bk/Documentation/ioctl-number.txt 2004-12-27 21:48:28.036839302 -0800
@@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem

From roland at topspin.com Mon Dec 27 21:51:19 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 27 Dec 2004 21:51:19 -0800
Subject: [openib-general] [PATCH][v5][23/24] Add InfiniBand Documentation files
In-Reply-To: <200412272151.zeKZJPoIEBr55elh@topspin.com>
Message-ID: <200412272151.S29WkrmlJifc5kHZ@topspin.com>

Add files to Documentation/infiniband that describe the tree under
/sys/class/infiniband, the IPoIB driver and the userspace MAD access
driver.

Signed-off-by: Roland Dreier

--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/Documentation/infiniband/ipoib.txt 2004-12-27 21:48:28.484773367 -0800
@@ -0,0 +1,56 @@
+IP OVER INFINIBAND
+
+  The ib_ipoib driver is an implementation of the IP over InfiniBand
+  protocol as specified by the latest Internet-Drafts issued by the
+  IETF ipoib working group. It is a "native" implementation in the
+  sense of setting the interface type to ARPHRD_INFINIBAND and the
+  hardware address length to 20 (earlier proprietary implementations
+  masqueraded to the kernel as ethernet interfaces).
+
+Partitions and P_Keys
+
+  When the IPoIB driver is loaded, it creates one interface for each
+  port using the P_Key at index 0. To create an interface with a
+  different P_Key, write the desired P_Key into the main interface's
+  /sys/class/net/<intf name>/create_child file. For example:
+
+    echo 0x8001 > /sys/class/net/ib0/create_child
+
+  This will create an interface named ib0.8001 with P_Key 0x8001. To
+  remove a subinterface, use the "delete_child" file:
+
+    echo 0x8001 > /sys/class/net/ib0/delete_child
+
+  The P_Key for any interface is given by the "pkey" file, and the
+  main interface for a subinterface is in "parent."
+
+Debugging Information
+
+  By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
+  to 'y', tracing messages are compiled into the driver. They are
+  turned on by setting the module parameters debug_level and
+  mcast_debug_level to 1. These parameters can be controlled at
+  runtime through files in /sys/module/ib_ipoib/.
+
+  CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs"
+  virtual filesystem. By mounting this filesystem, for example with
+
+    mkdir -p /ipoib_debugfs
+    mount -t ipoib_debugfs none /ipoib_debugfs
+
+  it is possible to get statistics about multicast groups from the
+  files /ipoib_debugfs/ib0_mcg and so on.
+
+  The performance impact of this option is negligible, so it
+  is safe to enable this option with debug_level set to 0 for normal
+  operation.
+
+  CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
+  the data path when data_debug_level is set to 1. However, even with
+  the output disabled, enabling this configuration option will affect
+  performance, because it adds tests to the fast path.
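  As a minimal illustration (assuming the parameter files under
  /sys/module/ib_ipoib/ are writable on the running kernel), the
  tracing described above can be toggled at runtime with:

    echo 1 > /sys/module/ib_ipoib/debug_level
    echo 0 > /sys/module/ib_ipoib/mcast_debug_level

  where writing 1 turns the corresponding messages on and 0 turns
  them off again.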
+ +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-12-27 21:48:28.513769099 -0800 @@ -0,0 +1,64 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + rate - Port data rate (active width * active speed) + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-12-27 21:48:28.543764684 -0800 @@ -0,0 +1,81 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. 
For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%k" + + can be used. This will create a device node named + + /dev/infiniband/umad0 + + for the first port, and so on. The InfiniBand device and port + associated with this device can be determined from the files + + /sys/class/infiniband_mad/umad0/ibdev + /sys/class/infiniband_mad/umad0/port From roland at topspin.com Mon Dec 27 21:51:20 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:20 -0800 Subject: [openib-general] [PATCH][v5][24/24] InfiniBand MAINTAINERS entry In-Reply-To: <200412272151.S29WkrmlJifc5kHZ@topspin.com> Message-ID: <200412272151.hCBo2fMh4Io7Xy87@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier --- linux-bk.orig/MAINTAINERS 2004-12-27 21:47:44.140300651 -0800 +++ linux-bk/MAINTAINERS 2004-12-27 21:48:28.966702428 -0800 @@ -1081,6 +1081,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From roland at topspin.com Mon Dec 27 22:10:46 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 22:10:46 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/speed In-Reply-To: <52fz1rj809.fsf@topspin.com> (Roland Dreier's message of "Mon, 27 Dec 2004 21:40:06 -0800") References: <52fz1rj809.fsf@topspin.com> Message-ID: <528y7jj6l5.fsf@topspin.com> By, the way does anyone know what IPD is supposed to be used when the injection port's rate is not a multiple of the rate for the path? For example, suppose an 8X DDR port is sending to a 12X path -- 12 doesn't divide 16 evenly. I would guess that the correct thing to do is round up, so 8X DDR -> 12X uses an IPD of 1, 8X QDR -> 12X uses an IPD of 2, etc. - R. From davem at davemloft.net Mon Dec 27 22:54:17 2004 From: davem at davemloft.net (David S. 
Miller) Date: Mon, 27 Dec 2004 22:54:17 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <200412272150.IBRnA4AvjendsF8x@topspin.com> References: <200412272150.IBRnA4AvjendsF8x@topspin.com> Message-ID: <20041227225417.3ac7a0a6.davem@davemloft.net> On Mon, 27 Dec 2004 21:50:47 -0800 Roland Dreier wrote: > >>>>> "David" == David S Miller writes: > > David> Send it all over. > > OK, you asked for it... here's our latest tree, which should > incorporate all the feedback I've seen. W00t :-) All applied, thanks Roland. I'll run it through some build tests then toss it upstream. From shaharf at voltaire.com Tue Dec 28 00:39:30 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 10:39:30 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: > Michael> Could you please remove openib/src/userspace/osm.old as a > Michael> first step? Its really painful to get two copies of > Michael> opensm each time. If you thjink you need it for > Michael> reference, you could always just move it somewhere ... > > I agree osm.old should be deleted. No need to move it -- since we're > using a revision control system, one can always check out any specific > revision from the past that's of interest. > > - R. Osm.old is deleted. Shahar From Diego at Mellanox.com Tue Dec 28 04:43:25 2004 From: Diego at Mellanox.com (Diego Crupnicoff) Date: Tue, 28 Dec 2004 04:43:25 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/ speed Message-ID: <25AE7F432672D511B8DC00B0D0DF11DA029850E9@MTIEX01> I think the choice of IPD is beyond the scope of the IB spec. The spec provides a mechanism to regulate the injection rate but does not mandate its use. Your assumption below (rounding up) makes perfect sense. Rounding down would defeat the entire purpose of the IPD mechanism. Diego > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Tuesday, December 28, 2004 3:11 AM > To: openib-general at openib.org > Subject: Re: [openib-general] [PATCH] Add support for > querying port width/speed > > > By, the way does anyone know what IPD is supposed to be used > when the injection port's rate is not a multiple of the rate > for the path? For example, suppose an 8X DDR port is sending > to a 12X path -- 12 doesn't divide 16 evenly. > > I would guess that the correct thing to do is round up, so 8X DDR > -> 12X uses an IPD of 1, 8X QDR -> 12X uses an IPD of 2, etc. > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-> general > > To > unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Dec 28 05:03:12 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Dec 2004 08:03:12 -0500 Subject: [openib-general] Some CM Comments Message-ID: <1104238992.4054.3823.camel@localhost.localdomain> Hi Sean, I've been looking over ib_cm.h and have some comments: 1.IBA 1.2 CM is still class version 2 and is upward compatible with 1.1 CM. The only difference appears to be SRQ component added to REQ and REP. 2. The following CM states appear to be missing: peer compare reject retry timeout DREQ timeout (is an event) It is also a state in the spec RTU timeout (nor is it an event) 3. There is no REJ received CM event type. Shouldn't there be ? 4. 
Should there be an API to obtain local/remote comm IDs ? 5. REQ message is missing transport service type (connection type). Also, what about ack timeout ? Also, SRQ. Is starting PSN chosen internally by the CM ? 6. SRQ should be added for REP. Is there a danger to have the client set the CM response timeout ? Also, is the lack of remote response timeout for LAP in inconsistent with this ? Thanks. -- Hal From hadi at cyberus.ca Tue Dec 28 05:31:57 2004 From: hadi at cyberus.ca (jamal) Date: 28 Dec 2004 08:31:57 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <5cac192f04122408102129af43@mail.gmail.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> <5cac192f0412230110628749e3@mail.gmail.com> <41CAF444.3000305@trash.net> <5cac192f04122408102129af43@mail.gmail.com> Message-ID: <1104240717.1100.66.camel@jzny.localdomain> On Fri, 2004-12-24 at 11:10, Eric Lemoine wrote: > Yes but requiring drivers to release a lock that they should not even > be aware of doesn't sound good. Another way would be to keep > dev->queue_lock grabbed when entering start_xmit() and let the driver > drop it (and re-acquire it before it returns) only if it wishes so. > Although I don't like this too much either, that's the best way I can > think of up to now... I am not a big fan of that patch either, but i cant think of a cleaner way to do it. The violation already happens with the LLTX flag. So maybe a big warning that says "Do this only if you driver is LLTX enabled". The other way to do it is put a check to see if LLTX is enabled before releasing that lock - but why the extra cycles? Driver writer should know. cheers, jamal From mst at mellanox.co.il Tue Dec 28 06:45:28 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 16:45:28 +0200 Subject: [openib-general] opensm error message In-Reply-To: References: Message-ID: <20041228144528.GB27036@mellanox.co.il> Hi! ./osm/opensm/opensm ------------------------------------------------- OpenSM Rev:1.6.0-rc2 Command Line Arguments: Log File: /tmp/osm.log ------------------------------------------------- warn: [22072] umad_init: wrong ABI version: /sys/class/infiniband_mad/abi_version is 134705591 but library ABI is 2 using default guid 0xc900012a3981 Error from osm_opensm_bind (0x2A) swlab118:/usr/src/openib/src/userspace # cat /sys/class/infiniband_mad/abi_version cat: /sys/class/infiniband_mad/abi_version: No such file or directory So the file does not exist, but the log indicates bad ABI. You probably dont check the return status of open(). mst From mst at mellanox.co.il Mon Dec 27 17:30:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 03:30:51 +0200 Subject: [openib-general] opensm failure In-Reply-To: <20041228144528.GB27036@mellanox.co.il> References: <20041228144528.GB27036@mellanox.co.il> Message-ID: <20041228013051.GA22735@mellanox.co.il> Hi! 
# ./osm/opensm/opensm -V ------------------------------------------------- OpenSM Rev:1.6.0-rc2 Command Line Arguments: Big V selected Log File: /tmp/osm.log ------------------------------------------------- using default guid 0xc900012a3981 Error from osm_opensm_bind (0x2A) MST -------------- next part -------------- [1104197291:000962479][B7E78460] -> OpenSM Rev:1.6.0-rc2 [1104197291:000962736][B7E78460] -> osm_opensm_init: [ [1104197291:000962753][B7E78460] -> osm_opensm_init: Forcing single threaded dispatcher. [1104197291:000962828][B7E78460] -> osm_vendor_new: [ [1104197291:000962839][B7E78460] -> osm_vendor_init: [ [1104197291:000962897][B7E78460] -> osm_vendor_init: ] [1104197291:000962905][B7E78460] -> osm_vendor_new: ] [1104197291:000962915][B7E78460] -> osm_mad_pool_init: [ [1104197291:000963039][B7E78460] -> osm_mad_pool_init: ] [1104197291:000963048][B7E78460] -> osm_vl15_init: [ [1104197291:000963088][B7E78460] -> osm_vl15_init: ] [1104197291:000963099][B7E78460] -> osm_sm_init: [ [1104197291:000963121][B7E78460] -> osm_sm_mad_ctrl_init: [ [1104197291:000963130][B7E78460] -> osm_sm_mad_ctrl_init: ] [1104197291:000963140][B7E78460] -> osm_req_init: [ [1104197291:000963147][B7E78460] -> osm_req_init: ] [1104197291:000963154][B7E78460] -> osm_req_ctrl_init: [ [1104197291:000963162][B7E78460] -> osm_req_ctrl_init: ] [1104197291:000963169][B7E78460] -> osm_resp_init: [ [1104197291:000963176][B7E78460] -> osm_resp_init: ] [1104197291:000963185][B7E78460] -> osm_ni_rcv_init: [ [1104197291:000963192][B7E78460] -> osm_ni_rcv_init: ] [1104197291:000963199][B7E78460] -> osm_ni_rcv_ctrl_init: [ [1104197291:000963207][B7E78460] -> osm_ni_rcv_ctrl_init: ] [1104197291:000963214][B7E78460] -> osm_pi_rcv_init: [ [1104197291:000963221][B7E78460] -> osm_pi_rcv_init: ] [1104197291:000963228][B7E78460] -> osm_pi_rcv_ctrl_init: [ [1104197291:000963235][B7E78460] -> osm_pi_rcv_ctrl_init: ] [1104197291:000963242][B7E78460] -> osm_si_rcv_init: [ [1104197291:000963249][B7E78460] -> osm_si_rcv_init: ] [1104197291:000963255][B7E78460] -> osm_si_rcv_ctrl_init: [ [1104197291:000963262][B7E78460] -> osm_si_rcv_ctrl_init: ] [1104197291:000963269][B7E78460] -> osm_nd_rcv_init: [ [1104197291:000963276][B7E78460] -> osm_nd_rcv_init: ] [1104197291:000963283][B7E78460] -> osm_nd_rcv_ctrl_init: [ [1104197291:000963293][B7E78460] -> osm_nd_rcv_ctrl_init: ] [1104197291:000963301][B7E78460] -> osm_lid_mgr_init: [ [1104197291:000963323][B4E6FBB0] -> __osm_vl15_poller: [ [1104197291:000963347][B7E78460] -> osm_lid_mgr_init: ] [1104197291:000963365][B7E78460] -> osm_ucast_mgr_init: [ [1104197291:000963373][B7E78460] -> osm_ucast_mgr_init: ] [1104197291:000963381][B7E78460] -> osm_link_mgr_init: [ [1104197291:000963388][B7E78460] -> osm_link_mgr_init: ] [1104197291:000963421][B7E78460] -> osm_state_mgr_init: [ [1104197291:000963431][B7E78460] -> osm_state_mgr_init: ] [1104197291:000963438][B7E78460] -> osm_state_mgr_ctrl_init: [ [1104197291:000963446][B7E78460] -> osm_state_mgr_ctrl_init: ] [1104197291:000963454][B7E78460] -> osm_drop_mgr_init: [ [1104197291:000963461][B7E78460] -> osm_drop_mgr_init: ] [1104197291:000963468][B7E78460] -> osm_lft_rcv_init: [ [1104197291:000963475][B7E78460] -> osm_lft_rcv_init: ] [1104197291:000963482][B7E78460] -> osm_lft_rcv_ctrl_init: [ [1104197291:000963498][B7E78460] -> osm_lft_rcv_ctrl_init: ] [1104197291:000963506][B7E78460] -> osm_mft_rcv_init: [ [1104197291:000963512][B7E78460] -> osm_mft_rcv_init: ] [1104197291:000963520][B7E78460] -> osm_mft_rcv_ctrl_init: [ 
[1104197291:000963528][B7E78460] -> osm_mft_rcv_ctrl_init: ] [1104197291:000963536][B7E78460] -> osm_sweep_fail_ctrl_init: [ [1104197291:000963545][B7E78460] -> osm_sweep_fail_ctrl_init: ] [1104197291:000963552][B7E78460] -> osm_sminfo_rcv_init: [ [1104197291:000963559][B7E78460] -> osm_sminfo_rcv_init: ] [1104197291:000963566][B7E78460] -> osm_sminfo_rcv_ctrl_init: [ [1104197291:000963574][B7E78460] -> osm_sminfo_rcv_ctrl_init: ] [1104197291:000963581][B7E78460] -> osm_trap_rcv_init: [ [1104197291:000963589][B7E78460] -> cl_event_wheel_init: [ [1104197291:000963598][B7E78460] -> cl_event_wheel_init: ] [1104197291:000963606][B7E78460] -> osm_trap_rcv_init: ] [1104197291:000963613][B7E78460] -> osm_trap_rcv_ctrl_init: [ [1104197291:000963621][B7E78460] -> osm_trap_rcv_ctrl_init: ] [1104197291:000963631][B7E78460] -> osm_sm_state_mgr_init: [ [1104197291:000963640][B7E78460] -> osm_sm_state_mgr_init: ] [1104197291:000963721][B7E78460] -> osm_mcast_mgr_init: [ [1104197291:000963730][B7E78460] -> osm_mcast_mgr_init: ] [1104197291:000963737][B7E78460] -> osm_slvl_rcv_init: [ [1104197291:000963744][B7E78460] -> osm_slvl_rcv_init: ] [1104197291:000963751][B7E78460] -> osm_slvl_rcv_ctrl_init: [ [1104197291:000963759][B7E78460] -> osm_slvl_rcv_ctrl_init: ] [1104197291:000963766][B7E78460] -> osm_vla_rcv_init: [ [1104197291:000963773][B7E78460] -> osm_vla_rcv_init: ] [1104197291:000963780][B7E78460] -> osm_vla_rcv_ctrl_init: [ [1104197291:000963788][B7E78460] -> osm_vla_rcv_ctrl_init: ] [1104197291:000963795][B7E78460] -> osm_pkey_rcv_init: [ [1104197291:000963802][B7E78460] -> osm_pkey_rcv_init: ] [1104197291:000963809][B7E78460] -> osm_pkey_rcv_ctrl_init: [ [1104197291:000963817][B7E78460] -> osm_pkey_rcv_ctrl_init: ] [1104197291:000963852][B7E78460] -> osm_sm_init: ] [1104197291:000963860][B7E78460] -> osm_sa_init: [ [1104197291:000963870][B7E78460] -> osm_sa_resp_init: [ [1104197291:000963878][B7E78460] -> osm_sa_resp_init: ] [1104197291:000963887][B7E78460] -> osm_sa_mad_ctrl_init: [ [1104197291:000963895][B7E78460] -> osm_sa_mad_ctrl_init: ] [1104197291:000963941][B7E78460] -> osm_cpi_rcv_init: [ [1104197291:000963948][B7E78460] -> osm_cpi_rcv_init: ] [1104197291:000963955][B7E78460] -> osm_cpi_rcv_ctrl_init: [ [1104197291:000963963][B7E78460] -> osm_cpi_rcv_ctrl_init: ] [1104197291:000963973][B7E78460] -> osm_nr_rcv_init: [ [1104197291:000963993][B7E78460] -> osm_nr_rcv_init: ] [1104197291:000964004][B7E78460] -> osm_nr_rcv_ctrl_init: [ [1104197291:000964012][B7E78460] -> osm_nr_rcv_ctrl_init: ] [1104197291:000964019][B7E78460] -> osm_pir_rcv_init: [ [1104197291:000964034][B7E78460] -> osm_pir_rcv_init: ] [1104197291:000964042][B7E78460] -> osm_pir_rcv_ctrl_init: [ [1104197291:000964050][B7E78460] -> osm_pir_rcv_ctrl_init: ] [1104197291:000964057][B7E78460] -> osm_lr_rcv_init: [ [1104197291:000964075][B7E78460] -> osm_lr_rcv_init: ] [1104197291:000964082][B7E78460] -> osm_lr_rcv_ctrl_init: [ [1104197291:000964090][B7E78460] -> osm_lr_rcv_ctrl_init: ] [1104197291:000964097][B7E78460] -> osm_pr_rcv_init: [ [1104197291:000964114][B7E78460] -> osm_pr_rcv_init: ] [1104197291:000964122][B7E78460] -> osm_pr_rcv_ctrl_init: [ [1104197291:000964129][B7E78460] -> osm_pr_rcv_ctrl_init: ] [1104197291:000964136][B7E78460] -> osm_smir_rcv_init: [ [1104197291:000964143][B7E78460] -> osm_smir_rcv_init: ] [1104197291:000964150][B7E78460] -> osm_smir_ctrl_init: [ [1104197291:000964158][B7E78460] -> osm_smir_ctrl_init: ] [1104197291:000964165][B7E78460] -> osm_mcmr_rcv_init: [ [1104197291:000964179][B7E78460] -> 
osm_mcmr_rcv_init: ] [1104197291:000964187][B7E78460] -> osm_mcmr_rcv_ctrl_init: [ [1104197291:000964195][B7E78460] -> osm_mcmr_rcv_ctrl_init: ] [1104197291:000964202][B7E78460] -> osm_sr_rcv_init: [ [1104197291:000964236][B7E78460] -> osm_sr_rcv_init: ] [1104197291:000964244][B7E78460] -> osm_sr_rcv_ctrl_init: [ [1104197291:000964251][B7E78460] -> osm_sr_rcv_ctrl_init: ] [1104197291:000964259][B7E78460] -> osm_infr_rcv_init: [ [1104197291:000964266][B7E78460] -> osm_infr_rcv_init: ] [1104197291:000964273][B7E78460] -> osm_infr_rcv_ctrl_init: [ [1104197291:000964280][B7E78460] -> osm_infr_rcv_ctrl_init: ] [1104197291:000964290][B7E78460] -> osm_vlarb_rec_rcv_init: [ [1104197291:000964301][B7E78460] -> osm_vlarb_rec_rcv_init: ] [1104197291:000964315][B7E78460] -> osm_vlarb_rec_rcv_ctrl_init: [ [1104197291:000964325][B7E78460] -> osm_vlarb_rec_rcv_ctrl_init: ] [1104197291:000964336][B7E78460] -> osm_slvl_rec_rcv_init: [ [1104197291:000964348][B466EBB0] -> __osm_sm_sweeper: [ [1104197291:000964363][B466EBB0] -> __osm_sm_sweeper: Masking ^C Signals. [1104197291:000964376][B7E78460] -> osm_slvl_rec_rcv_init: ] [1104197291:000964387][B7E78460] -> osm_slvl_rec_rcv_ctrl_init: [ [1104197291:000964396][B7E78460] -> osm_slvl_rec_rcv_ctrl_init: ] [1104197291:000964403][B7E78460] -> osm_pkey_rec_rcv_init: [ [1104197291:000964419][B7E78460] -> osm_pkey_rec_rcv_init: ] [1104197291:000964427][B7E78460] -> osm_pkey_rec_rcv_ctrl_init: [ [1104197291:000964442][B7E78460] -> osm_pkey_rec_rcv_ctrl_init: ] [1104197291:000964451][B7E78460] -> osm_lftr_rcv_init: [ [1104197291:000964468][B7E78460] -> osm_lftr_rcv_init: ] [1104197291:000964476][B7E78460] -> osm_lftr_rcv_ctrl_init: [ [1104197291:000964484][B7E78460] -> osm_lftr_rcv_ctrl_init: ] [1104197291:000964491][B7E78460] -> osm_sa_init: ] [1104197291:000964499][B7E78460] -> osm_opensm_create_mcgroups: [ [1104197291:000964506][B7E78460] -> osm_sa_create_template_record_ipoib: [ [1104197291:000964518][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: [ [1104197291:000964526][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc000. [1104197291:000964534][B7E78460] -> __validate_requested_mgid: [ [1104197291:000964543][B7E78460] -> __validate_requested_mgid: MGID Signed as 0x401B. [1104197291:000964551][B7E78460] -> __validate_requested_mgid: Skeeping MGID Validation for IPoIB Signed (0x401B) MGIDs. [1104197291:000964558][B7E78460] -> __validate_requested_mgid: ] [1104197291:000964568][B7E78460] -> osm_mgrp_send_create_notice: [ [1104197291:000964580][B7E78460] -> osm_report_notice: [ [1104197291:000964600][B7E78460] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0xfe80000000000000,0x0000000000000000 [1104197291:000964616][B7E78460] -> osm_report_notice: ] [1104197291:000964623][B7E78460] -> osm_mgrp_send_create_notice: ] [1104197291:000964631][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: ] [1104197291:000964638][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: [ [1104197291:000964646][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc001. [1104197291:000964653][B7E78460] -> __validate_requested_mgid: [ [1104197291:000964660][B7E78460] -> __validate_requested_mgid: MGID Signed as 0x401B. [1104197291:000964668][B7E78460] -> __validate_requested_mgid: Skeeping MGID Validation for IPoIB Signed (0x401B) MGIDs. 
[1104197291:000964675][B7E78460] -> __validate_requested_mgid: ] [1104197291:000964683][B7E78460] -> osm_mgrp_send_create_notice: [ [1104197291:000964690][B7E78460] -> osm_report_notice: [ [1104197291:000964699][B7E78460] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0xfe80000000000000,0x0000000000000000 [1104197291:000964709][B7E78460] -> osm_report_notice: ] [1104197291:000964716][B7E78460] -> osm_mgrp_send_create_notice: ] [1104197291:000964723][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: ] [1104197291:000964731][B7E78460] -> osm_sa_create_template_record_ipoib: ] [1104197291:000964738][B7E78460] -> osm_opensm_create_mcgroups: ] [1104197291:000964749][B7E78460] -> updn_construct: [ [1104197291:000964756][B7E78460] -> updn_construct: ] [1104197291:000964763][B7E78460] -> updn_init: [ [1104197291:000964775][B7E78460] -> updn_init: ] [1104197291:000964782][B7E78460] -> osm_opensm_init: ] [1104197291:000964802][B7E78460] -> osm_vendor_get_all_port_attr: [ [1104197291:000965856][B7E78460] -> osm_vendor_get_all_port_attr: assign CA 0x80980f8ort 1 guid (0xc900012a3981) as the default port. [1104197291:000965866][B7E78460] -> osm_vendor_get_all_port_attr: ] [1104197291:000965957][B7E78460] -> osm_opensm_bind: [ [1104197291:000965965][B7E78460] -> osm_sm_bind: [ [1104197291:000965973][B7E78460] -> osm_sm_mad_ctrl_bind: [ [1104197291:000965981][B7E78460] -> osm_sm_mad_ctrl_bind: Binding to port 0xc900012a3981. [1104197291:000965989][B7E78460] -> osm_vendor_bind: [ [1104197291:000965996][B7E78460] -> osm_vendor_bind: Binding to port 0xc900012a3981. [1104197291:000966004][B7E78460] -> osm_vendor_open_port: [ [1104197291:000968284][B7E78460] -> osm_vl15_init: [ [1104197291:000968335][B7E78460] -> osm_vl15_init: ] [1104197291:000968344][B7E78460] -> osm_vendor_open_port: ] [1104197291:000968351][B7E78460] -> osm_vendor_bind: Unable to Open Port 0xc900012a3981. [1104197291:000968359][B7E78460] -> osm_vendor_bind: ] [1104197291:000968366][B7E78460] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed. [1104197291:000968373][B7E78460] -> osm_sm_mad_ctrl_bind: ] [1104197291:000968382][B7E78460] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR). [1104197291:000968397][B7E78460] -> osm_sm_bind: ] [1104197291:000968405][B7E78460] -> osm_opensm_bind: ] [1104197291:000968522][B7E78460] -> osm_vl15_shutdown: [ [1104197291:000968531][B7E78460] -> osm_vl15_shutdown: ] [1104197291:000968551][B7E78460] -> osm_sm_destroy: [ [1104197291:000968572][B466EBB0] -> __osm_sm_sweeper: Off schedule sweep signalled. 
[1104197291:000968586][B466EBB0] -> __osm_sm_sweeper: ] [1104197291:000968598][B3E6DBB0] -> umad_receiver: [ [1104197291:000968619][B7E78460] -> osm_trap_rcv_destroy: [ [1104197291:000968628][B7E78460] -> cl_event_wheel_destroy: [ [1104197291:000968635][B7E78460] -> cl_event_wheel_dump: [ [1104197291:000968642][B7E78460] -> cl_event_wheel_dump: event_wheel ptr:0x8093f18 [1104197291:000968672][B7E78460] -> cl_event_wheel_dump: ] [1104197291:000968686][B7E78460] -> cl_event_wheel_destroy: ] [1104197291:000968694][B7E78460] -> osm_trap_rcv_destroy: ] [1104197291:000968701][B7E78460] -> osm_sminfo_rcv_destroy: [ [1104197291:000968708][B7E78460] -> osm_sminfo_rcv_destroy: ] [1104197291:000968716][B7E78460] -> osm_ni_rcv_destroy: [ [1104197291:000968723][B7E78460] -> osm_ni_rcv_destroy: ] [1104197291:000968730][B7E78460] -> osm_pi_rcv_destroy: [ [1104197291:000968737][B7E78460] -> osm_pi_rcv_destroy: ] [1104197291:000968745][B7E78460] -> osm_si_rcv_destroy: [ [1104197291:000968751][B7E78460] -> osm_si_rcv_destroy: ] [1104197291:000968759][B7E78460] -> osm_nd_rcv_destroy: [ [1104197291:000968766][B7E78460] -> osm_nd_rcv_destroy: ] [1104197291:000968773][B7E78460] -> osm_lid_mgr_destroy: [ [1104197291:000968781][B7E78460] -> osm_lid_mgr_destroy: ] [1104197291:000968788][B7E78460] -> osm_ucast_mgr_destroy: [ [1104197291:000968794][B7E78460] -> osm_ucast_mgr_destroy: ] [1104197291:000968801][B7E78460] -> osm_link_mgr_destroy: [ [1104197291:000968808][B7E78460] -> osm_link_mgr_destroy: ] [1104197291:000968815][B7E78460] -> osm_drop_mgr_destroy: [ [1104197291:000968821][B7E78460] -> osm_drop_mgr_destroy: ] [1104197291:000968829][B7E78460] -> osm_lft_rcv_destroy: [ [1104197291:000968836][B7E78460] -> osm_lft_rcv_destroy: ] [1104197291:000968843][B7E78460] -> osm_mft_rcv_destroy: [ [1104197291:000968850][B7E78460] -> osm_mft_rcv_destroy: ] [1104197291:000968858][B7E78460] -> osm_slvl_rcv_destroy: [ [1104197291:000968864][B7E78460] -> osm_slvl_rcv_destroy: ] [1104197291:000968872][B7E78460] -> osm_vla_rcv_destroy: [ [1104197291:000968878][B7E78460] -> osm_vla_rcv_destroy: ] [1104197291:000968886][B7E78460] -> osm_pkey_rcv_destroy: [ [1104197291:000968893][B7E78460] -> osm_pkey_rcv_destroy: ] [1104197291:000968901][B7E78460] -> osm_state_mgr_destroy: [ [1104197291:000968909][B7E78460] -> osm_state_mgr_destroy: ] [1104197291:000968917][B7E78460] -> osm_sm_state_mgr_destroy: [ [1104197291:000968924][B7E78460] -> osm_sm_state_mgr_destroy: ] [1104197291:000968932][B7E78460] -> osm_mcast_mgr_destroy: [ [1104197291:000968939][B7E78460] -> osm_mcast_mgr_destroy: ] [1104197291:000969031][B7E78460] -> osm_sm_destroy: ] [1104197291:000969039][B7E78460] -> osm_sa_destroy: [ [1104197291:000969047][B7E78460] -> osm_nr_rcv_destroy: [ [1104197291:000969056][B7E78460] -> osm_nr_rcv_destroy: ] [1104197291:000969074][B7E78460] -> osm_pir_rcv_destroy: [ [1104197291:000969082][B7E78460] -> osm_pir_rcv_destroy: ] [1104197291:000969089][B7E78460] -> osm_lr_rcv_destroy: [ [1104197291:000969097][B7E78460] -> osm_lr_rcv_destroy: ] [1104197291:000969104][B7E78460] -> osm_pr_rcv_destroy: [ [1104197291:000969112][B7E78460] -> osm_pr_rcv_destroy: ] [1104197291:000969119][B7E78460] -> osm_smir_rcv_destroy: [ [1104197291:000969126][B7E78460] -> osm_smir_rcv_destroy: ] [1104197291:000969133][B7E78460] -> osm_mcmr_rcv_destroy: [ [1104197291:000969141][B7E78460] -> osm_mcmr_rcv_destroy: ] [1104197291:000969149][B7E78460] -> osm_sr_rcv_destroy: [ [1104197291:000969198][B7E78460] -> osm_sr_rcv_destroy: ] [1104197291:000969207][B7E78460] -> 
osm_infr_rcv_destroy: [ [1104197291:000969214][B7E78460] -> osm_infr_rcv_destroy: ] [1104197291:000969221][B7E78460] -> osm_vlarb_rec_rcv_destroy: [ [1104197291:000969240][B7E78460] -> osm_vlarb_rec_rcv_destroy: ] [1104197291:000969249][B7E78460] -> osm_slvl_rec_rcv_destroy: [ [1104197291:000969257][B7E78460] -> osm_slvl_rec_rcv_destroy: ] [1104197291:000969265][B7E78460] -> osm_pkey_rec_rcv_destroy: [ [1104197291:000969272][B7E78460] -> osm_pkey_rec_rcv_destroy: ] [1104197291:000969280][B7E78460] -> osm_lftr_rcv_destroy: [ [1104197291:000969290][B7E78460] -> osm_lftr_rcv_destroy: ] [1104197291:000969298][B7E78460] -> osm_sa_destroy: ] [1104197291:000969345][B3E6DBB0] -> umad_receiver: recv error Success From roland at topspin.com Tue Dec 28 08:28:58 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 08:28:58 -0800 Subject: [openib-general] Some CM Comments In-Reply-To: <1104238992.4054.3823.camel@localhost.localdomain> (Hal Rosenstock's message of "28 Dec 2004 08:03:12 -0500") References: <1104238992.4054.3823.camel@localhost.localdomain> Message-ID: <523bxqjsj9.fsf@topspin.com> Hal> 4. Should there be an API to obtain local/remote comm IDs ? What would the consumer do with the comm ID? - R. From shaharf at voltaire.com Tue Dec 28 08:34:33 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 18:34:33 +0200 Subject: [openib-general] RE: opensm failure Message-ID: > Hi! > > > # ./osm/opensm/opensm -V --------------------------------- > ---------------- > OpenSM Rev:1.6.0-rc2 > Command Line Arguments: > Big V selected > Log File: /tmp/osm.log > ------------------------------------------------- > using default guid 0xc900012a3981 > > Error from osm_opensm_bind (0x2A) > > MST Please update your SM, and run it again with "-V -d5" and send me also the stdout/stderr of the SM. Shahar From shaharf at voltaire.com Tue Dec 28 08:53:12 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 18:53:12 +0200 Subject: [openib-general] usermode tree changes Message-ID: I moved all osm and umad related directories to usermode/management. I also fixed some minor bugs in the umad libraries: - Minor make files error fix (missing .depend warning) - force terminating null at sys_read_string - fix sys_read_uint return types - add warning message on missing ABI sys files (i.e. ib_umad not loaded). Shahar -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: patch Type: application/octet-stream Size: 4305 bytes Desc: patch URL: From mst at mellanox.co.il Tue Dec 28 09:07:39 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 19:07:39 +0200 Subject: [openib-general] usermode tree changes In-Reply-To: References: Message-ID: <20041228170739.GA27321@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "[openib-general] usermode tree changes": > I moved all osm and umad related directories to usermode/management. Good idea. But I still see userspace/osm/ at 1406. Cant something be done about the necessity to set the openib root? Assuming I check everything out properly from svn, why cant each component set OPENIB_ROOT inside the makefile to ../../ or something? MST From mst at mellanox.co.il Tue Dec 28 09:09:55 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 28 Dec 2004 19:09:55 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228170955.GB27321@mellanox.co.il> Quoting r. shaharf (shaharf at voltaire.com) "RE: opensm failure": > > Hi! > Please update your SM, and run it again with "-V -d5" and send me also > the stdout/stderr of the SM. > > Shahar Here. ------------------------------------------------- OpenSM Rev:1.6.0-rc2 Command Line Arguments: Big V selected d level = 0x5 Log File: /tmp/osm.log ------------------------------------------------- warn: [32666] umad_init: warn: [32666] umad_get_cas_names: max 10 warn: [32666] umad_get_cas_names: return 1 ca warn: [32666] umad_get_ca_portguids: ca name mthca0 max port guids warn: [32666] umad_get_ca: ca_name mthca0 warn: [32666] umad_get_ca: opened mthca0 warn: [32666] umad_get_ca_portguids: mthca0: 3 ports using default guid 0xc900012a3981 warn: [32666] umad_get_ca_portguids: ca name mthca0 max port guids warn: [32666] umad_get_ca: ca_name mthca0 warn: [32666] umad_get_ca: opened mthca0 warn: [32666] umad_get_ca_portguids: mthca0: 3 ports warn: [32666] umad_get_ca: ca_name mthca0 warn: [32666] umad_get_ca: opened mthca0 warn: [32666] umad_get_port: ca_name mthca0 portnum 1 warn: [32666] umad_open_port: ca mthca0 port 1 warn: [32666] umad_open_port: open /dev/infiniband/mthca0/ports/1/mad failed Error from osm_opensm_bind (0x2A) warn: [32666] umad_done: warn: [32666] umad_recv: portid -5 umad 0x80970e8 timeout 0 -------------- next part -------------- [1104201592:000212673][40188080] -> OpenSM Rev:1.6.0-rc2 [1104201592:000212887][40188080] -> osm_opensm_init: [ [1104201592:000212905][40188080] -> osm_opensm_init: Forcing single threaded dispatcher. [1104201592:000213005][40188080] -> osm_vendor_new: [ [1104201592:000213016][40188080] -> osm_vendor_init: [ [1104201592:000213309][40188080] -> osm_vendor_init: ] [1104201592:000213319][40188080] -> osm_vendor_new: ] [1104201592:000213340][40188080] -> osm_mad_pool_init: [ [1104201592:000213455][40188080] -> osm_mad_pool_init: ] [1104201592:000213464][40188080] -> osm_vl15_init: [ [1104201592:000213499][40188080] -> osm_vl15_init: ] [1104201592:000213511][40188080] -> osm_sm_init: [ [1104201592:000213530][40188080] -> osm_sm_mad_ctrl_init: [ [1104201592:000213540][40188080] -> osm_sm_mad_ctrl_init: ] [1104201592:000213550][40188080] -> osm_req_init: [ [1104201592:000213557][40188080] -> osm_req_init: ] [1104201592:000213564][40188080] -> osm_req_ctrl_init: [ [1104201592:000213571][40188080] -> osm_req_ctrl_init: ] [1104201592:000213579][40188080] -> osm_resp_init: [ [1104201592:000213586][40188080] -> osm_resp_init: ] [1104201592:000213595][40188080] -> osm_ni_rcv_init: [ [1104201592:000213602][40188080] -> osm_ni_rcv_init: ] [1104201592:000213609][40188080] -> osm_ni_rcv_ctrl_init: [ [1104201592:000213617][40188080] -> osm_ni_rcv_ctrl_init: ] [1104201592:000213624][40188080] -> osm_pi_rcv_init: [ [1104201592:000213631][40188080] -> osm_pi_rcv_init: ] [1104201592:000213638][40188080] -> osm_pi_rcv_ctrl_init: [ [1104201592:000213645][40188080] -> osm_pi_rcv_ctrl_init: ] [1104201592:000213652][40188080] -> osm_si_rcv_init: [ [1104201592:000213659][40188080] -> osm_si_rcv_init: ] [1104201592:000213666][40188080] -> osm_si_rcv_ctrl_init: [ [1104201592:000213673][40188080] -> osm_si_rcv_ctrl_init: ] [1104201592:000213680][40188080] -> osm_nd_rcv_init: [ [1104201592:000213687][40188080] -> osm_nd_rcv_init: ] [1104201592:000213694][40188080] -> osm_nd_rcv_ctrl_init: [ 
[1104201592:000213701][40188080] -> osm_nd_rcv_ctrl_init: ] [1104201592:000213708][40188080] -> osm_lid_mgr_init: [ [1104201592:000213742][40188080] -> osm_lid_mgr_init: ] [1104201592:000213753][40188080] -> osm_ucast_mgr_init: [ [1104201592:000213760][40188080] -> osm_ucast_mgr_init: ] [1104201592:000213767][40188080] -> osm_link_mgr_init: [ [1104201592:000213774][40188080] -> osm_link_mgr_init: ] [1104201592:000213783][40188080] -> osm_state_mgr_init: [ [1104201592:000213791][40188080] -> osm_state_mgr_init: ] [1104201592:000213798][40188080] -> osm_state_mgr_ctrl_init: [ [1104201592:000213805][40188080] -> osm_state_mgr_ctrl_init: ] [1104201592:000213812][40188080] -> osm_drop_mgr_init: [ [1104201592:000213819][40188080] -> osm_drop_mgr_init: ] [1104201592:000213826][40188080] -> osm_lft_rcv_init: [ [1104201592:000213833][40188080] -> osm_lft_rcv_init: ] [1104201592:000213840][40188080] -> osm_lft_rcv_ctrl_init: [ [1104201592:000213853][40188080] -> osm_lft_rcv_ctrl_init: ] [1104201592:000213860][40188080] -> osm_mft_rcv_init: [ [1104201592:000213867][40188080] -> osm_mft_rcv_init: ] [1104201592:000213874][40188080] -> osm_mft_rcv_ctrl_init: [ [1104201592:000213885][40188080] -> osm_mft_rcv_ctrl_init: ] [1104201592:000213893][40188080] -> osm_sweep_fail_ctrl_init: [ [1104201592:000213900][40188080] -> osm_sweep_fail_ctrl_init: ] [1104201592:000213911][40188080] -> osm_sminfo_rcv_init: [ [1104201592:000213918][40188080] -> osm_sminfo_rcv_init: ] [1104201592:000213925][40188080] -> osm_sminfo_rcv_ctrl_init: [ [1104201592:000213933][40188080] -> osm_sminfo_rcv_ctrl_init: ] [1104201592:000213940][40188080] -> osm_trap_rcv_init: [ [1104201592:000213948][40188080] -> cl_event_wheel_init: [ [1104201592:000213958][40188080] -> cl_event_wheel_init: ] [1104201592:000213966][40188080] -> osm_trap_rcv_init: ] [1104201592:000213973][40188080] -> osm_trap_rcv_ctrl_init: [ [1104201592:000213980][40188080] -> osm_trap_rcv_ctrl_init: ] [1104201592:000213988][40188080] -> osm_sm_state_mgr_init: [ [1104201592:000213995][40188080] -> osm_sm_state_mgr_init: ] [1104201592:000214003][40188080] -> osm_mcast_mgr_init: [ [1104201592:000214022][40188080] -> osm_mcast_mgr_init: ] [1104201592:000214031][40188080] -> osm_slvl_rcv_init: [ [1104201592:000214038][40188080] -> osm_slvl_rcv_init: ] [1104201592:000214045][40188080] -> osm_slvl_rcv_ctrl_init: [ [1104201592:000214053][40188080] -> osm_slvl_rcv_ctrl_init: ] [1104201592:000214060][40188080] -> osm_vla_rcv_init: [ [1104201592:000214067][40188080] -> osm_vla_rcv_init: ] [1104201592:000214074][40188080] -> osm_vla_rcv_ctrl_init: [ [1104201592:000214082][40188080] -> osm_vla_rcv_ctrl_init: ] [1104201592:000214089][40188080] -> osm_pkey_rcv_init: [ [1104201592:000214096][40188080] -> osm_pkey_rcv_init: ] [1104201592:000214103][40188080] -> osm_pkey_rcv_ctrl_init: [ [1104201592:000214110][40188080] -> osm_pkey_rcv_ctrl_init: ] [1104201592:000214156][40F91BB0] -> __osm_vl15_poller: [ [1104201592:000214186][41192BB0] -> __osm_sm_sweeper: [ [1104201592:000214195][41192BB0] -> __osm_sm_sweeper: Masking ^C Signals. 
[1104201592:000214217][40188080] -> osm_sm_init: ] [1104201592:000214225][40188080] -> osm_sa_init: [ [1104201592:000214234][40188080] -> osm_sa_resp_init: [ [1104201592:000214242][40188080] -> osm_sa_resp_init: ] [1104201592:000214252][40188080] -> osm_sa_mad_ctrl_init: [ [1104201592:000214260][40188080] -> osm_sa_mad_ctrl_init: ] [1104201592:000214267][40188080] -> osm_cpi_rcv_init: [ [1104201592:000214273][40188080] -> osm_cpi_rcv_init: ] [1104201592:000214281][40188080] -> osm_cpi_rcv_ctrl_init: [ [1104201592:000214288][40188080] -> osm_cpi_rcv_ctrl_init: ] [1104201592:000214298][40188080] -> osm_nr_rcv_init: [ [1104201592:000214319][40188080] -> osm_nr_rcv_init: ] [1104201592:000214339][40188080] -> osm_nr_rcv_ctrl_init: [ [1104201592:000214347][40188080] -> osm_nr_rcv_ctrl_init: ] [1104201592:000214354][40188080] -> osm_pir_rcv_init: [ [1104201592:000214370][40188080] -> osm_pir_rcv_init: ] [1104201592:000214378][40188080] -> osm_pir_rcv_ctrl_init: [ [1104201592:000214385][40188080] -> osm_pir_rcv_ctrl_init: ] [1104201592:000214392][40188080] -> osm_lr_rcv_init: [ [1104201592:000214402][40188080] -> osm_lr_rcv_init: ] [1104201592:000214409][40188080] -> osm_lr_rcv_ctrl_init: [ [1104201592:000214416][40188080] -> osm_lr_rcv_ctrl_init: ] [1104201592:000214423][40188080] -> osm_pr_rcv_init: [ [1104201592:000214438][40188080] -> osm_pr_rcv_init: ] [1104201592:000214447][40188080] -> osm_pr_rcv_ctrl_init: [ [1104201592:000214454][40188080] -> osm_pr_rcv_ctrl_init: ] [1104201592:000214461][40188080] -> osm_smir_rcv_init: [ [1104201592:000214468][40188080] -> osm_smir_rcv_init: ] [1104201592:000214475][40188080] -> osm_smir_ctrl_init: [ [1104201592:000214482][40188080] -> osm_smir_ctrl_init: ] [1104201592:000214490][40188080] -> osm_mcmr_rcv_init: [ [1104201592:000214506][40188080] -> osm_mcmr_rcv_init: ] [1104201592:000214514][40188080] -> osm_mcmr_rcv_ctrl_init: [ [1104201592:000214521][40188080] -> osm_mcmr_rcv_ctrl_init: ] [1104201592:000214528][40188080] -> osm_sr_rcv_init: [ [1104201592:000214560][40188080] -> osm_sr_rcv_init: ] [1104201592:000214568][40188080] -> osm_sr_rcv_ctrl_init: [ [1104201592:000214575][40188080] -> osm_sr_rcv_ctrl_init: ] [1104201592:000214582][40188080] -> osm_infr_rcv_init: [ [1104201592:000214589][40188080] -> osm_infr_rcv_init: ] [1104201592:000214596][40188080] -> osm_infr_rcv_ctrl_init: [ [1104201592:000214604][40188080] -> osm_infr_rcv_ctrl_init: ] [1104201592:000214611][40188080] -> osm_vlarb_rec_rcv_init: [ [1104201592:000214620][40188080] -> osm_vlarb_rec_rcv_init: ] [1104201592:000214628][40188080] -> osm_vlarb_rec_rcv_ctrl_init: [ [1104201592:000214636][40188080] -> osm_vlarb_rec_rcv_ctrl_init: ] [1104201592:000214643][40188080] -> osm_slvl_rec_rcv_init: [ [1104201592:000214652][40188080] -> osm_slvl_rec_rcv_init: ] [1104201592:000214660][40188080] -> osm_slvl_rec_rcv_ctrl_init: [ [1104201592:000214667][40188080] -> osm_slvl_rec_rcv_ctrl_init: ] [1104201592:000214675][40188080] -> osm_pkey_rec_rcv_init: [ [1104201592:000214690][40188080] -> osm_pkey_rec_rcv_init: ] [1104201592:000214698][40188080] -> osm_pkey_rec_rcv_ctrl_init: [ [1104201592:000214713][40188080] -> osm_pkey_rec_rcv_ctrl_init: ] [1104201592:000214721][40188080] -> osm_lftr_rcv_init: [ [1104201592:000214736][40188080] -> osm_lftr_rcv_init: ] [1104201592:000214744][40188080] -> osm_lftr_rcv_ctrl_init: [ [1104201592:000214752][40188080] -> osm_lftr_rcv_ctrl_init: ] [1104201592:000214758][40188080] -> osm_sa_init: ] [1104201592:000214766][40188080] -> osm_opensm_create_mcgroups: [ 
[1104201592:000214774][40188080] -> osm_sa_create_template_record_ipoib: [ [1104201592:000214786][40188080] -> osm_mcmr_rcv_create_new_mgrp: [ [1104201592:000214794][40188080] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc000. [1104201592:000214802][40188080] -> __validate_requested_mgid: [ [1104201592:000214811][40188080] -> __validate_requested_mgid: MGID Signed as 0x401B. [1104201592:000214820][40188080] -> __validate_requested_mgid: Skeeping MGID Validation for IPoIB Signed (0x401B) MGIDs. [1104201592:000214828][40188080] -> __validate_requested_mgid: ] [1104201592:000214838][40188080] -> osm_mgrp_send_create_notice: [ [1104201592:000214849][40188080] -> osm_report_notice: [ [1104201592:000214863][40188080] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0xfe80000000000000,0x0000000000000000 [1104201592:000214877][40188080] -> osm_report_notice: ] [1104201592:000214887][40188080] -> osm_mgrp_send_create_notice: ] [1104201592:000214894][40188080] -> osm_mcmr_rcv_create_new_mgrp: ] [1104201592:000214901][40188080] -> osm_mcmr_rcv_create_new_mgrp: [ [1104201592:000214908][40188080] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc001. [1104201592:000214915][40188080] -> __validate_requested_mgid: [ [1104201592:000214923][40188080] -> __validate_requested_mgid: MGID Signed as 0x401B. [1104201592:000214930][40188080] -> __validate_requested_mgid: Skeeping MGID Validation for IPoIB Signed (0x401B) MGIDs. [1104201592:000214937][40188080] -> __validate_requested_mgid: ] [1104201592:000214945][40188080] -> osm_mgrp_send_create_notice: [ [1104201592:000214953][40188080] -> osm_report_notice: [ [1104201592:000214962][40188080] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0xfe80000000000000,0x0000000000000000 [1104201592:000214971][40188080] -> osm_report_notice: ] [1104201592:000214978][40188080] -> osm_mgrp_send_create_notice: ] [1104201592:000214985][40188080] -> osm_mcmr_rcv_create_new_mgrp: ] [1104201592:000214992][40188080] -> osm_sa_create_template_record_ipoib: ] [1104201592:000215000][40188080] -> osm_opensm_create_mcgroups: ] [1104201592:000215011][40188080] -> updn_construct: [ [1104201592:000215019][40188080] -> updn_construct: ] [1104201592:000215026][40188080] -> updn_init: [ [1104201592:000215038][40188080] -> updn_init: ] [1104201592:000215046][40188080] -> osm_opensm_init: ] [1104201592:000215070][40188080] -> osm_vendor_get_all_port_attr: [ [1104201592:000216450][40188080] -> osm_vendor_get_all_port_attr: assign CA 0x8098100ort 1 guid (0xc900012a3981) as the default port. [1104201592:000216461][40188080] -> osm_vendor_get_all_port_attr: ] [1104201592:000216544][40188080] -> osm_opensm_bind: [ [1104201592:000216553][40188080] -> osm_sm_bind: [ [1104201592:000216560][40188080] -> osm_sm_mad_ctrl_bind: [ [1104201592:000216568][40188080] -> osm_sm_mad_ctrl_bind: Binding to port 0xc900012a3981. [1104201592:000216576][40188080] -> osm_vendor_bind: [ [1104201592:000216583][40188080] -> osm_vendor_bind: Binding to port 0xc900012a3981. [1104201592:000216591][40188080] -> osm_vendor_open_port: [ [1104201592:000219904][40188080] -> osm_vl15_init: [ [1104201592:000219967][40188080] -> osm_vl15_init: ] [1104201592:000219976][40188080] -> osm_vendor_open_port: ] [1104201592:000219985][40188080] -> osm_vendor_bind: Unable to Open Port 0xc900012a3981. [1104201592:000219992][40188080] -> osm_vendor_bind: ] [1104201592:000220000][40188080] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed. 
[1104201592:000220009][40188080] -> osm_sm_mad_ctrl_bind: ] [1104201592:000220017][40188080] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR). [1104201592:000220037][40188080] -> osm_sm_bind: ] [1104201592:000220045][40188080] -> osm_opensm_bind: ] [1104201592:000220170][40188080] -> osm_vl15_shutdown: [ [1104201592:000220179][40188080] -> osm_vl15_shutdown: ] [1104201592:000220259][40188080] -> osm_sm_destroy: [ [1104201592:000220287][41393BB0] -> umad_receiver: [ [1104201592:000220414][41393BB0] -> umad_receiver: recv error Success [1104201592:000220450][41192BB0] -> __osm_sm_sweeper: Off schedule sweep signalled. [1104201592:000220460][41192BB0] -> __osm_sm_sweeper: ] [1104201592:000220477][40188080] -> osm_trap_rcv_destroy: [ [1104201592:000220486][40188080] -> cl_event_wheel_destroy: [ [1104201592:000220493][40188080] -> cl_event_wheel_dump: [ [1104201592:000220501][40188080] -> cl_event_wheel_dump: event_wheel ptr:0x8093f18 [1104201592:000220508][40188080] -> cl_event_wheel_dump: ] [1104201592:000220523][40188080] -> cl_event_wheel_destroy: ] [1104201592:000220530][40188080] -> osm_trap_rcv_destroy: ] [1104201592:000220539][40188080] -> osm_sminfo_rcv_destroy: [ [1104201592:000220547][40188080] -> osm_sminfo_rcv_destroy: ] [1104201592:000220555][40188080] -> osm_ni_rcv_destroy: [ [1104201592:000220562][40188080] -> osm_ni_rcv_destroy: ] [1104201592:000220569][40188080] -> osm_pi_rcv_destroy: [ [1104201592:000220576][40188080] -> osm_pi_rcv_destroy: ] [1104201592:000220583][40188080] -> osm_si_rcv_destroy: [ [1104201592:000220590][40188080] -> osm_si_rcv_destroy: ] [1104201592:000220598][40188080] -> osm_nd_rcv_destroy: [ [1104201592:000220604][40188080] -> osm_nd_rcv_destroy: ] [1104201592:000220611][40188080] -> osm_lid_mgr_destroy: [ [1104201592:000220620][40188080] -> osm_lid_mgr_destroy: ] [1104201592:000220627][40188080] -> osm_ucast_mgr_destroy: [ [1104201592:000220634][40188080] -> osm_ucast_mgr_destroy: ] [1104201592:000220641][40188080] -> osm_link_mgr_destroy: [ [1104201592:000220648][40188080] -> osm_link_mgr_destroy: ] [1104201592:000220655][40188080] -> osm_drop_mgr_destroy: [ [1104201592:000220661][40188080] -> osm_drop_mgr_destroy: ] [1104201592:000220669][40188080] -> osm_lft_rcv_destroy: [ [1104201592:000220676][40188080] -> osm_lft_rcv_destroy: ] [1104201592:000220683][40188080] -> osm_mft_rcv_destroy: [ [1104201592:000220690][40188080] -> osm_mft_rcv_destroy: ] [1104201592:000220698][40188080] -> osm_slvl_rcv_destroy: [ [1104201592:000220704][40188080] -> osm_slvl_rcv_destroy: ] [1104201592:000220712][40188080] -> osm_vla_rcv_destroy: [ [1104201592:000220719][40188080] -> osm_vla_rcv_destroy: ] [1104201592:000220727][40188080] -> osm_pkey_rcv_destroy: [ [1104201592:000220733][40188080] -> osm_pkey_rcv_destroy: ] [1104201592:000220741][40188080] -> osm_state_mgr_destroy: [ [1104201592:000220750][40188080] -> osm_state_mgr_destroy: ] [1104201592:000220758][40188080] -> osm_sm_state_mgr_destroy: [ [1104201592:000220765][40188080] -> osm_sm_state_mgr_destroy: ] [1104201592:000220773][40188080] -> osm_mcast_mgr_destroy: [ [1104201592:000220780][40188080] -> osm_mcast_mgr_destroy: ] [1104201592:000220834][40188080] -> osm_sm_destroy: ] [1104201592:000220843][40188080] -> osm_sa_destroy: [ [1104201592:000220852][40188080] -> osm_nr_rcv_destroy: [ [1104201592:000220861][40188080] -> osm_nr_rcv_destroy: ] [1104201592:000220870][40188080] -> osm_pir_rcv_destroy: [ [1104201592:000220880][40188080] -> osm_pir_rcv_destroy: ] [1104201592:000220888][40188080] 
-> osm_lr_rcv_destroy: [ [1104201592:000220897][40188080] -> osm_lr_rcv_destroy: ] [1104201592:000220904][40188080] -> osm_pr_rcv_destroy: [ [1104201592:000220912][40188080] -> osm_pr_rcv_destroy: ] [1104201592:000220919][40188080] -> osm_smir_rcv_destroy: [ [1104201592:000220926][40188080] -> osm_smir_rcv_destroy: ] [1104201592:000220934][40188080] -> osm_mcmr_rcv_destroy: [ [1104201592:000220941][40188080] -> osm_mcmr_rcv_destroy: ] [1104201592:000220949][40188080] -> osm_sr_rcv_destroy: [ [1104201592:000220969][40188080] -> osm_sr_rcv_destroy: ] [1104201592:000220978][40188080] -> osm_infr_rcv_destroy: [ [1104201592:000220985][40188080] -> osm_infr_rcv_destroy: ] [1104201592:000221001][40188080] -> osm_vlarb_rec_rcv_destroy: [ [1104201592:000221010][40188080] -> osm_vlarb_rec_rcv_destroy: ] [1104201592:000221018][40188080] -> osm_slvl_rec_rcv_destroy: [ [1104201592:000221026][40188080] -> osm_slvl_rec_rcv_destroy: ] [1104201592:000221034][40188080] -> osm_pkey_rec_rcv_destroy: [ [1104201592:000221042][40188080] -> osm_pkey_rec_rcv_destroy: ] [1104201592:000221050][40188080] -> osm_lftr_rcv_destroy: [ [1104201592:000221057][40188080] -> osm_lftr_rcv_destroy: ] [1104201592:000221064][40188080] -> osm_sa_destroy: ] From mst at mellanox.co.il Tue Dec 28 09:32:47 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 19:32:47 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: <20041228170955.GB27321@mellanox.co.il> References: <20041228170955.GB27321@mellanox.co.il> Message-ID: <20041228173247.GC27321@mellanox.co.il> Hello! Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] Re: opensm failure": > warn: [32666] umad_open_port: open /dev/infiniband/mthca0/ports/1/mad failed Oh. Is that a udev problem then? From shaharf at voltaire.com Tue Dec 28 09:34:06 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 19:34:06 +0200 Subject: [openib-general] Re: opensm failure Message-ID: > > Hello! > Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] Re: > opensm failure": > > warn: [32666] umad_open_port: open /dev/infiniband/mthca0/ports/1/mad > failed > > Oh. Is that a udev problem then? Yes. I Forgot to modify umad.c for the new udev scheme. I will do it soon. Shahar From mst at mellanox.co.il Tue Dec 28 09:45:32 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 19:45:32 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228174532.GD27321@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > Hello! > > Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] > Re: > > opensm failure": > > > warn: [32666] umad_open_port: open > /dev/infiniband/mthca0/ports/1/mad > > failed > > > > Oh. Is that a udev problem then? > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > soon. > > Shahar I now have /dev/infiniband/umad0 and /dev/infiniband/umad1, as per src/linux-kernel/docs/user_mad.txt by the way, should not user_mad.txt move out of linux-kernel/docs? From mst at mellanox.co.il Tue Dec 28 09:58:24 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 19:58:24 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228175824.GE27321@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > Hello! > > Quoting r. Michael S. 
Tsirkin (mst at mellanox.co.il) "[openib-general] > Re: > > opensm failure": > > > warn: [32666] umad_open_port: open > /dev/infiniband/mthca0/ports/1/mad > > failed > > > > Oh. Is that a udev problem then? > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > soon. > > Shahar Thanks, I worked around this by creating a softlink and now ip over ib seems to work. mst From mst at mellanox.co.il Tue Dec 28 10:09:41 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 20:09:41 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228180941.GA27507@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > Hello! > > Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] > Re: > > opensm failure": > > > warn: [32666] umad_open_port: open > /dev/infiniband/mthca0/ports/1/mad > > failed > > > > Oh. Is that a udev problem then? > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > soon. > > Shahar 1. Why cant I give the udev device path on a command line to opensm? that would solve this once and for all, and let the user play with udev names all he wants. mst From mst at mellanox.co.il Tue Dec 28 10:21:12 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 20:21:12 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: <20041228180941.GA27507@mellanox.co.il> References: <20041228180941.GA27507@mellanox.co.il> Message-ID: <20041228182112.GA27551@mellanox.co.il> Hello! Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "Re: [openib-general] Re: opensm failure": > Hello! > Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > > > Hello! > > > Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] > > Re: > > > opensm failure": > > > > warn: [32666] umad_open_port: open > > /dev/infiniband/mthca0/ports/1/mad > > > failed > > > > > > Oh. Is that a udev problem then? > > > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > > soon. > > > > Shahar > > Why cant I give the udev device path on a command line to opensm? > that would solve this once and for all, and let the user play with > udev names all he wants. What I had in mind was something like: ./opensm --dev /devinfiniband/umad0 to work on port 1 only and ./opensm --dev /dev/infiniband/umad* to work on all ports of all devices. /dev/infiniband/umad* could be a default (maybe set at configure time or by varibale in makefile), but otherwise I really think *forcing* a specific udev scheme is a problem, since this makes udev scheme part of abi, but not checkable by abiversion. mst From shaharf at voltaire.com Tue Dec 28 10:32:07 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 20:32:07 +0200 Subject: [openib-general] Re: opensm failure Message-ID: > > > > Oh. Is that a udev problem then? > > > > > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > > > soon. > > > > > > Shahar > > > > Why cant I give the udev device path on a command line to opensm? > > that would solve this once and for all, and let the user play with > > udev names all he wants. > > What I had in mind was something like: > > ./opensm --dev /devinfiniband/umad0 > > to work on port 1 only > > and > > ./opensm --dev /dev/infiniband/umad* > > to work on all ports of all devices. 
> > /dev/infiniband/umad* could be a default (maybe set at configure time > or by varibale in makefile), but otherwise I really think *forcing* a > specific > udev scheme is a problem, since this makes udev scheme part of abi, > but not checkable by abiversion. > > mst You are correct about the udev scheme. I suggest that the ABI version will change on any modification (Roland?) Specifying dev seems like a bad idea. You will have to find what port is active, reverse map it by scanning /sys/class/infiniband_mad/*/{ibdev,port} and then find the correct umad dev file and pass its path to opensm. I prefer letting the library do that. That's what I have done and committed. Please check again. Shahar -------------- next part -------------- A non-text attachment was scrubbed... Name: patch-umad Type: application/octet-stream Size: 4477 bytes Desc: patch-umad URL: From roland.list at gmail.com Tue Dec 28 10:41:45 2004 From: roland.list at gmail.com (Roland Dreier) Date: Tue, 28 Dec 2004 10:41:45 -0800 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: > You are correct about the udev scheme. I suggest that the ABI version > will change on any modification (Roland?) The udev rules are not part of the kernel ABI. udev and indeed all dev/ naming issues are pure userspace and as such are totally out of the kernel's control. Eventually a standard location for umad* devices may emerge but for now I think making any programs that use the device nodes as flexible as possible is probably a good idea. - Roland From mst at mellanox.co.il Tue Dec 28 11:00:00 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 21:00:00 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228190000.GA27601@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > > Oh. Is that a udev problem then? > > > > > > > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do > it > > > > soon. > > > > > > > > Shahar > > > > > > Why cant I give the udev device path on a command line to opensm? > > > that would solve this once and for all, and let the user play with > > > udev names all he wants. > > > > What I had in mind was something like: > > > > ./opensm --dev /devinfiniband/umad0 > > > > to work on port 1 only > > > > and > > > > ./opensm --dev /dev/infiniband/umad* > > > > to work on all ports of all devices. > > > > /dev/infiniband/umad* could be a default (maybe set at configure time > > or by varibale in makefile), but otherwise I really think *forcing* a > > specific > > udev scheme is a problem, since this makes udev scheme part of abi, > > but not checkable by abiversion. > > > > mst > > You are correct about the udev scheme. I suggest that the ABI version > will change on any modification (Roland?) Can't be done. And the freedom to set rules is the whole idea of udev. > Specifying dev seems like a bad idea. You will have to find what port is > active, reverse map it by scanning > /sys/class/infiniband_mad/*/{ibdev,port} and then find the correct umad > dev file and pass its path to opensm. It's not that hard :) Since opensm now just needs all devices, it could be as simple as giving /dev/infiniband/umad* to opensm. And, if it becomes more relevant to have opensm on separate ports only, user shall be free to change the udev scheme to make port/device mapping more explicit. > I prefer letting the library do that. That's what I have done and > committed.
Please check again. > > Shahar > Works with this rule in udev.rules KERNEL="umad*", NAME="infiniband/%k" mst From roland.list at gmail.com Tue Dec 28 11:07:41 2004 From: roland.list at gmail.com (Roland Dreier) Date: Tue, 28 Dec 2004 11:07:41 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/ speed In-Reply-To: <25AE7F432672D511B8DC00B0D0DF11DA029850E9@MTIEX01> References: <25AE7F432672D511B8DC00B0D0DF11DA029850E9@MTIEX01> Message-ID: > I think the choice of IPD is beyond the scope of the IB spec. The spec > provides a mechanism to regulate the injection rate but does not mandate its > use. Your assumption below (rounding up) makes perfect sense. Rounding down > would defeat the entire purpose of the IPD mechanism. Thanks. By the way, I realized that this is not a totally unrealistic situation -- the question arises for a 12X port injecting into a fabric with 4X DDR links, which should be possible in the not-too-distant future. - Roland From roland at topspin.com Tue Dec 28 11:22:35 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 11:22:35 -0800 Subject: [openib-general] Problem with userspace tree structure References: Message-ID: <523bxqut1g.fsf@topspin.com> I assume libmthca was moved to userspace/management by mistake. I moved it back to the top level userspace directory. Thanks, Roland From roland at topspin.com Tue Dec 28 11:48:13 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 11:48:13 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <20041227225417.3ac7a0a6.davem@davemloft.net> (David S. Miller's message of "Mon, 27 Dec 2004 22:54:17 -0800") References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> Message-ID: <52pt0unr0i.fsf@topspin.com> David> W00t :-) All applied, thanks Roland. David> I'll run it through some build tests then toss it upstream. Very cool, thanks a lot. Let me know if you see any build failures -- I test on about 6 or 7 different archs/configs but the bug gods always seem to hide problems from me. Speaking of build failures, one of my test builds is cross-compiling for sparc64 with gcc 3.4.2, which adds __attribute__((warn_unused_result)) to copy_to_user() et al. The -Werror in the arch/sparc64 means the build fails with linux-2.6.10/arch/sparc64/kernel/sys_sparc32.c:1686: warning: ignoring return value of `copy_to_user', declared with attribute warn_unused_result Of course binfmt_elf.c and compat_ioctl.c still have issues but those probably get more visibility... Thanks, Roland Check copy_to_user() return value in sys_sparc32.c and sys_sunos32.c. 
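(Illustrative aside, not part of the patch that follows: this is roughly the gcc 3.4 behaviour being described above. A call whose result is ignored trips the warn_unused_result attribute, and sparc64's -Werror then turns the resulting warning into the build failure quoted in the message. The function name here is made up for the example.)

extern int must_check_copy(void) __attribute__((warn_unused_result));

int caller(void)
{
	must_check_copy();		/* warning: ignoring return value of 'must_check_copy' */
	if (must_check_copy())		/* no warning: the return value is actually used */
		return -1;
	return 0;
}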
Signed-off-by: Roland Dreier Index: linux-2.6.10/arch/sparc64/kernel/sys_sparc32.c =================================================================== --- linux-2.6.10.orig/arch/sparc64/kernel/sys_sparc32.c 2004-12-24 13:35:00.000000000 -0800 +++ linux-2.6.10/arch/sparc64/kernel/sys_sparc32.c 2004-12-28 11:46:00.190457463 -0800 @@ -1683,7 +1683,8 @@ put_user(oldlen, (u32 __user *)(unsigned long) tmp.oldlenp)) error = -EFAULT; } - copy_to_user(args->__unused, tmp.__unused, sizeof(tmp.__unused)); + if (copy_to_user(args->__unused, tmp.__unused, sizeof(tmp.__unused))) + error = -EFAULT; } return error; #endif Index: linux-2.6.10/arch/sparc64/kernel/sys_sunos32.c =================================================================== --- linux-2.6.10.orig/arch/sparc64/kernel/sys_sunos32.c 2004-12-24 13:35:00.000000000 -0800 +++ linux-2.6.10/arch/sparc64/kernel/sys_sunos32.c 2004-12-28 11:47:03.954923634 -0800 @@ -291,7 +291,8 @@ put_user(ino, &dirent->d_ino); put_user(namlen, &dirent->d_namlen); put_user(reclen, &dirent->d_reclen); - copy_to_user(dirent->d_name, name, namlen); + if (copy_to_user(dirent->d_name, name, namlen)) + return -EFAULT; put_user(0, dirent->d_name + namlen); dirent = (void __user *) dirent + reclen; buf->curr = dirent; @@ -371,7 +372,8 @@ put_user(ino, &dirent->d_ino); put_user(namlen, &dirent->d_namlen); put_user(reclen, &dirent->d_reclen); - copy_to_user(dirent->d_name, name, namlen); + if (copy_to_user(dirent->d_name, name, namlen)) + return -EFAULT; put_user(0, dirent->d_name + namlen); dirent = (void __user *) dirent + reclen; buf->curr = dirent; From mst at mellanox.co.il Tue Dec 28 11:53:28 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 21:53:28 +0200 Subject: [openib-general] make.inc In-Reply-To: References: Message-ID: <20041228195328.GA27727@mellanox.co.il> I got tired of editing make.inc, and I made this patch to figure it out on the fly. Index: libcommon/Makefile =================================================================== --- libcommon/Makefile (revision 1408) +++ libcommon/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../make.inc +OPENIB_MANAGEMENT=.. +include ${OPENIB_MANAGEMENT}/make.inc SRCS:=$(wildcard *.c) LIB_OBJS:=$(SRCS:.c=.lo) Index: make.inc =================================================================== --- make.inc (revision 1408) +++ make.inc (working copy) @@ -2,9 +2,12 @@ # Openib usermode common make file variables # -# OPENIB_ROOT: root of OPIBIB src tree -OPENIB_ROOT=/home/openib/svn/gen2/trunk/src/ +OPENIB_ROOT:=$(shell pwd)/${OPENIB_MANAGEMENT}/../.. +#Uncomment to override the default above +# OPENIB_ROOT: root of OPENIB src tree +#OPENIB_ROOT=/usr/src/openib/src + # OPENIB_KERN: kenel used to compile modules OPENIB_KERN=$(OPENIB_ROOT)/linux-2.6 Index: diags/net/smpdump/Makefile =================================================================== --- diags/net/smpdump/Makefile (revision 1408) +++ diags/net/smpdump/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../../make.inc +OPENIB_MANAGEMENT=../../.. +include ${OPENIB_MANAGEMENT}/make.inc BIN_OBJS=smpdump.o BIN_LIBS=$(OPENIB_USR_LIB)/libcommon.la $(OPENIB_USR_LIB)/libumad.la Index: diags/host/ibstat/Makefile =================================================================== --- diags/host/ibstat/Makefile (revision 1408) +++ diags/host/ibstat/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../../make.inc +OPENIB_MANAGEMENT=../../.. 
+include ${OPENIB_MANAGEMENT}/make.inc BIN_OBJS=ibstat.o BIN_LIBS=$(OPENIB_USR_LIB)/libcommon.la $(OPENIB_USR_LIB)/libumad.la Index: diags/host/scripts/Makefile =================================================================== --- diags/host/scripts/Makefile (revision 1408) +++ diags/host/scripts/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../../make.inc +OPENIB_MANAGEMENT=../../.. +include ${OPENIB_MANAGEMENT}/make.inc SCRIPT_TARGET=ibstatus Index: util/mad_test/Makefile =================================================================== --- util/mad_test/Makefile (revision 1408) +++ util/mad_test/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../make.inc +OPENIB_MANAGEMENT=../.. +include ${OPENIB_MANAGEMENT}/make.inc BIN_OBJS=mad_test.o BIN_LIBS=$(OPENIB_USR_LIB)/libcommon.la Index: libumad/Makefile =================================================================== --- libumad/Makefile (revision 1408) +++ libumad/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../make.inc +OPENIB_MANAGEMENT=.. +include ${OPENIB_MANAGEMENT}/make.inc SRCS=$(wildcard *.c) LIB_OBJS=$(SRCS:.c=.lo) Index: osm/opensm/Makefile =================================================================== --- osm/opensm/Makefile (revision 1408) +++ osm/opensm/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../make.inc +OPENIB_MANAGEMENT=../.. +include ${OPENIB_MANAGEMENT}/make.inc # We need to provide a filtered C source files to avoid inclussion # of the other (non selected) vendor sources: Index: osm/complib/Makefile =================================================================== --- osm/complib/Makefile (revision 1408) +++ osm/complib/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../make.inc +OPENIB_MANAGEMENT=../.. +include ${OPENIB_MANAGEMENT}/make.inc SRCS=$(wildcard *.c) LIB_OBJS=$(SRCS:.c=.lo) From roland at topspin.com Tue Dec 28 12:35:22 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 12:35:22 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/speed In-Reply-To: <52fz1rj809.fsf@topspin.com> (Roland Dreier's message of "Mon, 27 Dec 2004 21:40:06 -0800") References: <52fz1rj809.fsf@topspin.com> Message-ID: <52652mnotx.fsf@topspin.com> Here's an updated patch that actually uses this information to set static rate in IPoIB. Seems to work for me... Any comments/suggestions, or does it seem OK to commit? - R. Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1408) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -294,10 +294,18 @@ struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), .sl = pathrec->sl, - .static_rate = 0, .port_num = priv->port }; + if (ib_sa_rate_enum_to_int(pathrec->rate) > 0) + av.static_rate = (2 * priv->local_rate - + ib_sa_rate_enum_to_int(pathrec->rate) - 1) / + (priv->local_rate ? priv->local_rate : 1); + + ipoib_dbg(priv, "static_rate %d for local port %dX, path %dX\n", + av.static_rate, priv->local_rate, + ib_sa_rate_enum_to_int(pathrec->rate)); + ah = ipoib_create_ah(dev, priv->pd, &av); } Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 1408) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -238,19 +238,10 @@ } { - /* - * For now we set static_rate to 0. 
This is not - * really correct: we should look at the rate - * component of the MC member record, compare it with - * the rate of our local port (calculated from the - * active link speed and link width) and set an - * inter-packet delay appropriately. - */ struct ib_ah_attr av = { .dlid = be16_to_cpu(mcast->mcmember.mlid), .port_num = priv->port, .sl = mcast->mcmember.sl, - .static_rate = 0, .ah_flags = IB_AH_GRH, .grh = { .flow_label = be32_to_cpu(mcast->mcmember.flow_label), @@ -262,6 +253,15 @@ av.grh.dgid = mcast->mcmember.mgid; + if (ib_sa_rate_enum_to_int(mcast->mcmember.rate) > 0) + av.static_rate = (2 * priv->local_rate - + ib_sa_rate_enum_to_int(mcast->mcmember.rate) - 1) / + (priv->local_rate ? priv->local_rate : 1); + + ipoib_dbg_mcast(priv, "static_rate %d for local port %dX, mcmember %dX\n", + av.static_rate, priv->local_rate, + ib_sa_rate_enum_to_int(mcast->mcmember.rate)); + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); if (!mcast->ah) { ipoib_warn(priv, "ib_address_create failed\n"); @@ -506,6 +506,17 @@ else memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) { + priv->local_lid = attr.lid; + priv->local_rate = attr.active_speed * + ib_width_enum_to_int(attr.active_width); + } else + ipoib_warn(priv, "ib_query_port failed\n"); + } + if (!priv->broadcast) { priv->broadcast = ipoib_mcast_alloc(dev, 1); if (!priv->broadcast) { @@ -554,15 +565,6 @@ return; } - { - struct ib_port_attr attr; - - if (!ib_query_port(priv->ca, priv->port, &attr)) - priv->local_lid = attr.lid; - else - ipoib_warn(priv, "ib_query_port failed\n"); - } - priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - IPOIB_ENCAP_LEN; dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1408) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -143,6 +143,7 @@ union ib_gid local_gid; u16 local_lid; + u8 local_rate; unsigned int admin_mtu; unsigned int mcast_mtu; Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1408) +++ infiniband/include/ib_verbs.h (working copy) @@ -144,13 +144,6 @@ } } -enum ib_static_rate { - IB_STATIC_RATE_FULL = 0, - IB_STATIC_RATE_12X_TO_4X = 2, - IB_STATIC_RATE_4X_TO_1X = 3, - IB_STATIC_RATE_12X_TO_1X = 11 -}; - enum ib_port_state { IB_PORT_NOP = 0, IB_PORT_DOWN = 1, @@ -182,6 +175,24 @@ IB_PORT_BOOT_MGMT_SUP = (1<<9) }; +enum ib_port_width { + IB_WIDTH_1X = 1, + IB_WIDTH_4X = 2, + IB_WIDTH_8X = 4, + IB_WIDTH_12X = 8 +}; + +static inline int ib_width_enum_to_int(enum ib_port_width width) +{ + switch (width) { + case IB_WIDTH_1X: return 1; + case IB_WIDTH_4X: return 4; + case IB_WIDTH_8X: return 8; + case IB_WIDTH_12X: return 12; + default: return -1; + } +} + struct ib_port_attr { enum ib_port_state state; enum ib_mtu max_mtu; @@ -199,6 +210,8 @@ u8 sm_sl; u8 subnet_timeout; u8 init_type_reply; + u8 active_width; + u8 active_speed; }; enum ib_device_modify_flags { Index: infiniband/include/ib_sa.h =================================================================== --- infiniband/include/ib_sa.h (revision 1408) +++ infiniband/include/ib_sa.h (working copy) @@ -59,6 +59,34 @@ IB_SA_BEST = 3 }; +enum ib_sa_rate { + IB_SA_RATE_2_5_GBPS = 2, + IB_SA_RATE_5_GBPS = 5, + IB_SA_RATE_10_GBPS = 3, + IB_SA_RATE_20_GBPS = 6, + 
IB_SA_RATE_30_GBPS = 4, + IB_SA_RATE_40_GBPS = 7, + IB_SA_RATE_60_GBPS = 8, + IB_SA_RATE_80_GBPS = 9, + IB_SA_RATE_120_GBPS = 10 +}; + +static inline int ib_sa_rate_enum_to_int(enum ib_sa_rate rate) +{ + switch (rate) { + case IB_SA_RATE_2_5_GBPS: return 1; + case IB_SA_RATE_5_GBPS: return 2; + case IB_SA_RATE_10_GBPS: return 4; + case IB_SA_RATE_20_GBPS: return 8; + case IB_SA_RATE_30_GBPS: return 12; + case IB_SA_RATE_40_GBPS: return 16; + case IB_SA_RATE_60_GBPS: return 24; + case IB_SA_RATE_80_GBPS: return 32; + case IB_SA_RATE_120_GBPS: return 48; + default: return -1; + } +} + typedef u64 __bitwise ib_sa_comp_mask; #define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) Index: infiniband/core/sysfs.c =================================================================== --- infiniband/core/sysfs.c (revision 1408) +++ infiniband/core/sysfs.c (working copy) @@ -171,12 +171,41 @@ return sprintf(buf, "0x%08x\n", attr.port_cap_flags); } +static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + char *speed = ""; + int rate; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + switch (attr.active_speed) { + case 2: speed = " DDR"; break; + case 4: speed = " QDR"; break; + } + + printk(KERN_ERR "width %d speed %d\n", attr.active_width, attr.active_speed); + + rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed; + if (rate < 0) + return -EINVAL; + + return sprintf(buf, "%d%s Gb/sec (%dX%s)\n", + rate / 10, rate % 10 ? ".5" : "", + ib_width_enum_to_int(attr.active_width), speed); +} + static PORT_ATTR_RO(state); static PORT_ATTR_RO(lid); static PORT_ATTR_RO(lid_mask_count); static PORT_ATTR_RO(sm_lid); static PORT_ATTR_RO(sm_sl); static PORT_ATTR_RO(cap_mask); +static PORT_ATTR_RO(rate); static struct attribute *port_default_attrs[] = { &port_attr_state.attr, @@ -185,6 +214,7 @@ &port_attr_sm_lid.attr, &port_attr_sm_sl.attr, &port_attr_cap_mask.attr, + &port_attr_rate.attr, NULL }; Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 1408) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -115,14 +115,16 @@ } props->lid = be16_to_cpup((u16 *) (out_mad->data + 16)); - props->lmc = (*(u8 *) (out_mad->data + 34)) & 0x7; + props->lmc = out_mad->data[34] & 0x7; props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 18)); - props->sm_sl = (*(u8 *) (out_mad->data + 36)) & 0xf; - props->state = (*(u8 *) (out_mad->data + 32)) & 0xf; + props->sm_sl = out_mad->data[36] & 0xf; + props->state = out_mad->data[32] & 0xf; props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 20)); props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 48)); + props->active_width = out_mad->data[31] & 0xf; + props->active_speed = out_mad->data[35] >> 4; out: kfree(in_mad); Index: docs/sysfs.txt =================================================================== --- docs/sysfs.txt (revision 1408) +++ docs/sysfs.txt (working copy) @@ -21,6 +21,7 @@ cap_mask - Port capability mask lid - Port LID lid_mask_count - Port LID mask count + rate - Port data rate (active width * active speed) sm_lid - Subnet manager LID for port's subnet sm_sl - Subnet manager SL for port's subnet state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) 
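As a worked example (illustrative; the numbers follow directly from rate_show() above, which computes 25 * width * speed in units of 0.1 Gb/sec): a plain 4X SDR port (active_width 4X, active_speed 1) gives 25 * 4 * 1 = 100 and is displayed as "10 Gb/sec (4X)", while a 4X DDR port (active_speed 2) gives 200 and would be displayed as "20 Gb/sec (4X DDR)". The new rate attribute appears alongside the other per-port files listed in docs/sysfs.txt above.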
From davem at davemloft.net Tue Dec 28 14:17:10 2004 From: davem at davemloft.net (David S. Miller) Date: Tue, 28 Dec 2004 14:17:10 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <52pt0unr0i.fsf@topspin.com> References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> <52pt0unr0i.fsf@topspin.com> Message-ID: <20041228141710.4daebcfb.davem@davemloft.net> On Tue, 28 Dec 2004 11:48:13 -0800 Roland Dreier wrote: > Speaking of build failures, one of my test builds is cross-compiling > for sparc64 with gcc 3.4.2, which adds __attribute__((warn_unused_result)) > to copy_to_user() et al. The -Werror in the arch/sparc64 means the > build fails with Thanks, I'll check that out. I believe that you didn't test the sparc64 build of the infiniband stuff because arch/sparc64/Kconfig needs to explicitly include the infiniband Kconfig since it does not use drivers/Kconfig. You didn't send me any such changes. There are a few platforms which also are in this situation. I added the sparc64 one to my tree while integrating your changes, but the others need to be attended to if you wish infiniband to be configurable on them. From roland at topspin.com Tue Dec 28 15:24:43 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 15:24:43 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <20041228141710.4daebcfb.davem@davemloft.net> (David S. Miller's message of "Tue, 28 Dec 2004 14:17:10 -0800") References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> <52pt0unr0i.fsf@topspin.com> <20041228141710.4daebcfb.davem@davemloft.net> Message-ID: <52pt0uhupw.fsf@topspin.com> David> I believe that you didn't test the sparc64 build of the David> infiniband stuff because arch/sparc64/Kconfig needs to David> explicitly include the infiniband Kconfig since it does not David> use drivers/Kconfig. You didn't send me any such changes. Actually I did test the build (and Tom Duffy at Sun has actually run the drivers on his system), but I forgot to include the required Kconfig change -- I just have it in my local test tree. David> There are a few platforms which also are in this situation. David> I added the sparc64 one to my tree while integrating your David> changes, but the others need to be attended to if you wish David> infiniband to be configurable on them. I think sparc64 is the only such platform where InfiniBand is likely to be of much interest. However I'll check out all of arch/ and send patches to hook up drivers/infiniband/ to the relevant maintainers once IB makes it upstream. Thanks, Roland From halr at voltaire.com Tue Dec 28 15:59:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Dec 2004 18:59:41 -0500 Subject: [openib-general] Some CM Comments In-Reply-To: <523bxqjsj9.fsf@topspin.com> References: <1104238992.4054.3823.camel@localhost.localdomain> <523bxqjsj9.fsf@topspin.com> Message-ID: <1104278381.4054.5322.camel@localhost.localdomain> On Tue, 2004-12-28 at 11:28, Roland Dreier wrote: > Hal> 4. Should there be an API to obtain local/remote comm IDs ? > > What would the consumer do with the comm ID? This would be for debug purposes (correlating host state, etc. with CM messages on the wire). 
-- Hal From shaeffer at neuralscape.com Tue Dec 28 17:28:17 2004 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Tue, 28 Dec 2004 17:28:17 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <52pt0uhupw.fsf@topspin.com> References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> <52pt0unr0i.fsf@topspin.com> <20041228141710.4daebcfb.davem@davemloft.net> <52pt0uhupw.fsf@topspin.com> Message-ID: <20041229012817.GA18863@synapse.neuralscape.com> On Tue, Dec 28, 2004 at 03:24:43PM -0800, Roland Dreier wrote: > > I think sparc64 is the only such platform where InfiniBand is likely > to be of much interest. However I'll check out all of arch/ and send > patches to hook up drivers/infiniband/ to the relevant maintainers > once IB makes it upstream. > Hi Roland, I am interested in Infiniband with x86_64 Opterons. Thanks, Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From roland at topspin.com Tue Dec 28 17:36:12 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 17:36:12 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <20041229012817.GA18863@synapse.neuralscape.com> (Karen Shaeffer's message of "Tue, 28 Dec 2004 17:28:17 -0800") References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> <52pt0unr0i.fsf@topspin.com> <20041228141710.4daebcfb.davem@davemloft.net> <52pt0uhupw.fsf@topspin.com> <20041229012817.GA18863@synapse.neuralscape.com> Message-ID: <52hdm5j377.fsf@topspin.com> Roland> I think sparc64 is the only such platform where InfiniBand Roland> is likely to be of much interest. However I'll check out Roland> all of arch/ and send patches to hook up Roland> drivers/infiniband/ to the relevant maintainers once IB Roland> makes it upstream. Karen> I am interested in Infiniband with x86_64 Opterons. OK, the current code should work well for you -- x86_64 is probably the most-tested architecture. "such platform[s]" in my comment above referred to architectures where arch/xxx/Kconfig does _not_ include drivers/Kconfig; arch/x86_64/Kconfig does include that file. So no change is required to use the current IB patches on x86_64. I believe the only architectures that both support PCI and do not include drivers/Kconfig in their arch Kconfig are arm, sparc, sparc64 and v850. Perhaps I'm wrong, but of those four architectures, sparc64 seems to be the only one where there would be any interest in using IB. - Roland From boas1 at llnl.gov Tue Dec 28 17:36:59 2004 From: boas1 at llnl.gov (Bill Boas) Date: Tue, 28 Dec 2004 17:36:59 -0800 Subject: [openib-general] OpenIB Development Workshop Feb 6-8 2005, Sonoma CA Message-ID: <6.1.2.0.2.20041228164304.0c07dfc8@mail-lc.llnl.gov> The OpenIB Alliance is holding its first Development Workshop at The Lodge at Sonoma from Febrary 6-8, 2005. The program will start at 6 PM Sunday evening and end by 5PM on the 8th. Everyone interested in the work of the OpenIB Alliance is invited. Please set aside these dates. We apologize for the short notice. More details are forthcoming. 
The program will have these major elements: * developers describing the kernel-level software submitted to kernel.org and its current status; * developers discussing the proposed architecture and plans for additional modules to complete a full Lunux open software stack for enterprise and high performance computing deployments including underlying capabilities for Infiniband and other RDMA architectures such as IT-API, DAT, and iWARP; * customers, vendors and developers discussing their requirements and road map for the OpenIB software development, quality assurance and distribution beginning in 2005 for subsequent inclusion in Linux distributions and integrated into products and systems such as servers, blades, clusters and data center infrastructures; * the DOE Labs and their partners describing the OpenIB PathForward project - http://www.llnl.gov/pao/news/news_releases/2004/NR-04-11-03.html; * OpenIB Alliance membership Annual General and Board meetings. The Alliance web site, will be set up for workshop registration, provide more details of the program, enable room reservations at the Lodge, costs, etc. Please do not contact the Lodge directly. Please stay tuned to www.openib.org ------------------------------------------------------------------ Bill Boas bboas at llnl.gov ICCD LLNL, B-113, R-2018 Wk: 925-422-4110 7000 East Ave Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From shaharf at voltaire.com Wed Dec 29 02:56:11 2004 From: shaharf at voltaire.com (shaharf) Date: Wed, 29 Dec 2004 12:56:11 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: > > I assume libmthca was moved to userspace/management by mistake. I > moved it back to the top level userspace directory. > > Thanks, > Roland Oops, sorry. From mst at mellanox.co.il Wed Dec 29 05:43:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 29 Dec 2004 15:43:51 +0200 Subject: [openib-general] ip over ib throughtput Message-ID: <20041229134351.GA3486@mellanox.co.il> Hi! What kind of performance do people see with ip over ib on gen2? I see about 100Mbyte/sec at 99% CPU utilisation on send, on an express card, Xeon 2.8GHz, SSE doorbells enabled. MST From mst at mellanox.co.il Wed Dec 29 05:53:47 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 29 Dec 2004 15:53:47 +0200 Subject: [openib-general] ip over ib throughtput In-Reply-To: <20041229134351.GA3486@mellanox.co.il> References: <20041229134351.GA3486@mellanox.co.il> Message-ID: <20041229135347.GB3486@mellanox.co.il> Hello! Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] ip over ib throughtput": > Hi! > What kind of performance do people see with ip over ib on gen2? > I see about 100Mbyte/sec at 99% CPU utilisation on send, > on an express card, Xeon 2.8GHz, SSE doorbells enabled. At 159 Mbyte/sec without SSE doorbells. The test is netperf-2.2pl5, BTW. mst From halr at voltaire.com Wed Dec 29 09:12:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2004 12:12:30 -0500 Subject: [openib-general] [PATCH] Add support for querying port width/speed In-Reply-To: <528y7jj6l5.fsf@topspin.com> References: <52fz1rj809.fsf@topspin.com> <528y7jj6l5.fsf@topspin.com> Message-ID: <1104340350.4054.6458.camel@localhost.localdomain> On Tue, 2004-12-28 at 01:10, Roland Dreier wrote: > By, the way does anyone know what IPD is supposed to be used when the > injection port's rate is not a multiple of the rate for the path? 
For > example, suppose an 8X DDR port is sending to a 12X path -- 12 doesn't > divide 16 evenly. > > I would guess that the correct thing to do is round up, so 8X DDR > -> 12X uses an IPD of 1, 8X QDR -> 12X uses an IPD of 2, etc. Yes, rounding up causes the time to wait before scheduling the next packet to be larger which slows the injection rate (which is what you would want). There is also the recommendation on the HCA that if the IPD value is not supported then to use the next larger one supported (IBA 1.2 informative text in paragraph after C9-226). This might result in a slightly larger slowdown than required for some cases. To do the optimal thing here, you might need to determine which optional IPDs are supported. However, I'm not sure if there is a way to determine the optional IPDs supported by the HCA. -- Hal From mshefty at ichips.intel.com Wed Dec 29 11:25:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 11:25:43 -0800 Subject: [openib-general] Opensm - retries added In-Reply-To: References: Message-ID: <41D304B7.9030500@ichips.intel.com> shaharf wrote: > I added very simple retries mechanism to the SM. It is not > really checked, because I don't have a controlled way to loose packets. You can simulate this by modifying validate_mad in mad.c to return occasionally. E.g. declare a local static variable as a counter and return 0 whenever variable mod drop rate = 0. - Sean From mshefty at ichips.intel.com Wed Dec 29 11:28:52 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 11:28:52 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/speed In-Reply-To: <52652mnotx.fsf@topspin.com> References: <52fz1rj809.fsf@topspin.com> <52652mnotx.fsf@topspin.com> Message-ID: <41D30574.6010000@ichips.intel.com> Roland Dreier wrote: > Here's an updated patch that actually uses this information to set > static rate in IPoIB. Seems to work for me... > > Any comments/suggestions, or does it seem OK to commit? Looks fine to commit by me. - Sean From mshefty at ichips.intel.com Wed Dec 29 11:58:00 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 11:58:00 -0800 Subject: [openib-general] Some CM Comments In-Reply-To: <1104278381.4054.5322.camel@localhost.localdomain> References: <1104238992.4054.3823.camel@localhost.localdomain> <523bxqjsj9.fsf@topspin.com> <1104278381.4054.5322.camel@localhost.localdomain> Message-ID: <41D30C48.8070707@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-12-28 at 11:28, Roland Dreier wrote: > >> Hal> 4. Should there be an API to obtain local/remote comm IDs ? >> >>What would the consumer do with the comm ID? > > > This would be for debug purposes (correlating host state, etc. with CM > messages on the wire). I don't think that we should add an API for this purpose. We could add this to the ib_cm_id structure if it's useful outside of the CM. - Sean From mshefty at ichips.intel.com Wed Dec 29 12:25:03 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 12:25:03 -0800 Subject: [openib-general] Re: Some CM Comments In-Reply-To: <1104238992.4054.3823.camel@localhost.localdomain> References: <1104238992.4054.3823.camel@localhost.localdomain> Message-ID: <41D3129F.7060209@ichips.intel.com> Hal Rosenstock wrote: > 2. The following CM states appear to be missing: > peer compare > reject retry > timeout > DREQ timeout (is an event) It is also a state in the spec > RTU timeout (nor is it an event) > 3. There is no REJ received CM event type. 
Shouldn't there be ? I'll add in needed CM states when I get to the relevant portions of the code. I may have left out some of the states, since they are temporary and would transition the connection object immediately to a new state. E.g. I don't think that we really need a peer compare state. > 5. REQ message is missing transport service type (connection type). > Also, what about ack timeout ? Also, SRQ. Is starting PSN chosen > internally by the CM ? The CM can get the transport service type from the QP type. I have a note to add in SRQ based on our previous discussions, but we need to add in verb support as well. I need to think about the PSN. The Topspin CM generated the PSN internal to the CM, but I'm not certain that it makes sense to still do this. > Is there a danger to have the client set the CM response timeout ? Also, > is the lack of remote response timeout for LAP in inconsistent with this > ? Since the client is responsible for generating the response, I thought that it made more sense for it to set that value. We could have the CM set a reasonable upper bound to what a client can provide. The lack of a remote response timeout for LAP does seem inconsistent. It looks like the code could use the value from the REQ, but it may be better to expose this field. Thanks for the feedback. Once there's more code, most of these issues should get flushed out. - Sean From roland at topspin.com Wed Dec 29 13:31:03 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Dec 2004 13:31:03 -0800 Subject: [openib-general] w00t Message-ID: <52ekh8g5bc.fsf@topspin.com> In case you aren't a junkie doing bk pulls every 15 minutes, I'm extremely happy to say that Linus's tree now contains a drivers/infiniband directory! - R. From libor at topspin.com Wed Dec 29 13:47:45 2004 From: libor at topspin.com (Libor Michalek) Date: Wed, 29 Dec 2004 13:47:45 -0800 Subject: [openib-general] w00t In-Reply-To: <52ekh8g5bc.fsf@topspin.com>; from roland@topspin.com on Wed, Dec 29, 2004 at 01:31:03PM -0800 References: <52ekh8g5bc.fsf@topspin.com> Message-ID: <20041229134745.A13183@topspin.com> On Wed, Dec 29, 2004 at 01:31:03PM -0800, Roland Dreier wrote: > In case you aren't a junkie doing bk pulls every 15 minutes, I'm > extremely happy to say that Linus's tree now contains a > drivers/infiniband directory! Nice. Any thoughts on what's going to be the procedure for submitting fixes that are accepted into the OpenIB source base to LKML? -Libor From roland at topspin.com Wed Dec 29 14:00:19 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Dec 2004 14:00:19 -0800 Subject: [openib-general] w00t In-Reply-To: <20041229134745.A13183@topspin.com> (Libor Michalek's message of "Wed, 29 Dec 2004 13:47:45 -0800") References: <52ekh8g5bc.fsf@topspin.com> <20041229134745.A13183@topspin.com> Message-ID: <521xd8g3yk.fsf@topspin.com> Libor> Nice. Any thoughts on what's going to be the procedure Libor> for submitting fixes that are accepted into the OpenIB Libor> source base to LKML? I'm planning on pulling patches out of subversion and submitting them upstream on a semi-regular basis. - R. 
From mshefty at ichips.intel.com Wed Dec 29 15:37:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 15:37:11 -0800 Subject: [openib-general] w00t In-Reply-To: <52ekh8g5bc.fsf@topspin.com> References: <52ekh8g5bc.fsf@topspin.com> Message-ID: <41D33FA7.1010906@ichips.intel.com> Roland Dreier wrote: > In case you aren't a junkie doing bk pulls every 15 minutes, I'm > extremely happy to say that Linus's tree now contains a > drivers/infiniband directory! Excellent job! From boas1 at llnl.gov Wed Dec 29 18:04:44 2004 From: boas1 at llnl.gov (Bill Boas) Date: Wed, 29 Dec 2004 18:04:44 -0800 Subject: [openib-general] w00t In-Reply-To: <52ekh8g5bc.fsf@topspin.com> References: <52ekh8g5bc.fsf@topspin.com> Message-ID: <6.1.2.0.2.20041229180304.01f1af70@mail-lc.llnl.gov> Roland and those working with you, Great news. Thank you for all your work on OpenIB. Bill. At 01:31 PM 12/29/2004, Roland Dreier wrote: >In case you aren't a junkie doing bk pulls every 15 minutes, I'm >extremely happy to say that Linus's tree now contains a >drivers/infiniband directory! > > - R. >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general Bill Boas bboas at llnl.gov ICCD LLNL, B-113, R-2018 Wk: 925-422-4110 7000 East Ave Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From roland at topspin.com Wed Dec 29 21:14:02 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Dec 2004 21:14:02 -0800 Subject: [openib-general] InfiniBand MAD device number request Message-ID: <52brcce5b9.fsf@topspin.com> The InfiniBand drivers just merged into the 2.6 tree add a character device for userspace to send and receive InfiniBand "management datagrams" (MADs) for each port of each InfiniBand device. Current documentation recommends that these character devices be named /dev/infiniband/umad0, /dev/infiniband/umad1, etc. Please assign a major number for these devices. Thanks, Roland Dreier OpenIB Alliance www.openib.org From tziporet at mellanox.co.il Wed Dec 29 23:50:38 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 30 Dec 2004 09:50:38 +0200 Subject: [openib-general] w00t Message-ID: <506C3D7B14CDD411A52C00025558DED6064BECAB@mtlex01.yok.mtl.com> These are great news. Thanks to all developers involved and mainly to Roland. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Wednesday, December 29, 2004 11:31 PM To: openib-general at openib.org Subject: [openib-general] w00t In case you aren't a junkie doing bk pulls every 15 minutes, I'm extremely happy to say that Linus's tree now contains a drivers/infiniband directory! - R. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaharf at voltaire.com Thu Dec 30 01:03:34 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 30 Dec 2004 11:03:34 +0200 Subject: [openib-general] Opensm - retries added Message-ID: > shaharf wrote: > > I added very simple retries mechanism to the SM. It is not > > really checked, because I don't have a controlled way to loose packets. 
> > You can simulate this by modifying validate_mad in mad.c to return > occasionally. E.g. declare a local static variable as a counter and > return 0 whenever variable mod drop rate = 0. > > - Sean Thanks. I will do that. Shahar From mst at mellanox.co.il Thu Dec 30 08:46:08 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 30 Dec 2004 18:46:08 +0200 Subject: [openib-general] RFC: process_mad extension Message-ID: <20041230164608.GA12143@mellanox.co.il> Hello! ib_verbs.h has: int (*process_mad)(struct ib_device *device, int process_mad_flags, u8 port_num, u16 source_lid, struct ib_mad *in_mad, struct ib_mad *out_mad); This function serves both GSI and SMA MADs. It is used internally by the ib core, and I think its normally not used by ULPs. process_mad_flags just says whether to enable the MKey check. The Mkey check is enabled for incoming mad packets that are being returned to the HCA. When enabled, the HCA is expected to generate mkey violation trap when appropriate. I see some issues with this interface: 1. I think for GSI MADs, the HCA (even Tavor) needs more information to generate traps in case of errors. Consider for example Trap 259 - Bkey violation. Per IB spec 1.2, the trap has the following payload. Bad B_Key, from // attempted with and However, the GIDADDR/QP is never passed to process_mad and so is not available to the HCA. 2. For BMA MADs, it seems an additional flag would be needed to enable/ disable the bkey check for the MAD, same as we do for the mkey? 3. The Arbel memfree (aka native) mode hardware needs the Grh, and some bits from the completion of the incoming MAD, to process the MAD. Again, this information is not passed to process_mad. I think it makes sence to address these issues sooner rather than later :) I propose extending the interface in the following way: int (*process_mad)(struct ib_device *device, int process_mad_flags, u8 port_num, struct ib_wc* in_wc, struct ib_grh* in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); I think its in some sence nicer than passing the slid directly since when sm builds packets (with mkey check disabled) slid may not be valid at all. The new parameters can be NULL in this case, then no trap can be generated. If you guys agree, I'll prepare a patch fixing the core and mthca. MST From roland at topspin.com Thu Dec 30 09:09:36 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 30 Dec 2004 09:09:36 -0800 Subject: [openib-general] RFC: process_mad extension In-Reply-To: <20041230164608.GA12143@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 30 Dec 2004 18:46:08 +0200") References: <20041230164608.GA12143@mellanox.co.il> Message-ID: <52652jd86n.fsf@topspin.com> Michael> 3. The Arbel memfree (aka native) mode hardware needs the Michael> Grh, and some bits from the completion of the incoming Michael> MAD, to process the MAD. Again, this information is not Michael> passed to process_mad. I'm looking at version 0.60 of the memfree PRM. I don't see any requirement for this information if the "extended info" flag isn't set when calling MAD_IFC. Is the PRM inaccurate? Michael> If you guys agree, I'll prepare a patch fixing the core Michael> and mthca. This seems good to me. Fixing this up was on my list of things to do but I never got to it yet. 
- Roland From tduffy at sun.com Thu Dec 30 21:28:57 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 30 Dec 2004 21:28:57 -0800 Subject: [openib-general] w00t In-Reply-To: <52ekh8g5bc.fsf@topspin.com> References: <52ekh8g5bc.fsf@topspin.com> Message-ID: <1104470937.17172.3.camel@duffman> On Wed, 2004-12-29 at 13:31 -0800, Roland Dreier wrote: > In case you aren't a junkie doing bk pulls every 15 minutes, I'm > extremely happy to say that Linus's tree now contains a > drivers/infiniband directory! This is REALLY good news. Congrats to all. Have a good new year. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Fri Dec 31 09:56:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Dec 2004 12:56:28 -0500 Subject: [openib-general] tvflash HCA numbering with multiple HCAs Message-ID: <1104515787.4662.170.camel@localhost.localdomain> Hi, Has flashing HCAs when there are multiple HCAs in the host been tried ? There seems to be some confusion on the HCA numbering in tvflash as shown by the GUIDs below. I needed to tvflash -h 1 to update mthca0. -- Hal tvflash -v /home/mellanox/fw-23108-a1-rel.bin New Node GUID = 005442b100004900 New Port1 GUID = 005442b100004901 New Port2 GUID = 005442b100004902 Programming HCA Microcode... Flash Image Size = 342956 Erasing [==================================================================] Writing [==================================================================] Verifying [==================================================================] Flash verify passed! CA 'mthca0': CA type: MT23108 Numer of ports: 2 Firmware version: 3.0.0 Hardware version: a1 Node GUID: 0x0008f10403961354 System image GUID: 0x0008f10403961357